Thermal Image Super‐Resolution Methods Using Neural
Networks
Andrii Didenko, Andrii Oliinyk and Sergey Subbotin
National University “Zaporizhzhia Polytechnic”, Zhukovskoho street 64, Zaporizhzhia, 69063, Ukraine


                Abstract
                Thermography has gained popularity in many fields. Since the human eye cannot perceive
                the thermal spectrum and sees poorly in low light, thermal image analysis has become an
                integral part of medicine, manufacturing, construction, and other industries. Most thermal
                cameras produce low-resolution images, which complicates the analysis of the original
                thermogram. Therefore, the problem of improving the quality of thermal images is relevant
                today. With the development of artificial intelligence and deep learning technologies, new
                super-resolution methods are emerging. Such methods can also be adapted for thermal image
                processing. This work examines the performance of modern super-resolution methods in the
                thermal vision domain.

                Keywords
                Machine learning, thermography, deep learning, thermal image, neural network, super-
                resolution, image processing, computer vision

1. Introduction
    At all temperatures above absolute zero, every object emits energy from its surface in the form of a
spectrum of different wavelengths and intensities. The radiation we call visible light forms a very small
part of the electromagnetic spectrum. The infrared or thermal wavelength region extends over the range
0.7 to 1000 µm [1]. Infrared radiation can carry a lot of useful information about the object being
studied. However, it is invisible to the human eye, so special equipment is needed to analyze infrared
data.
    Infrared thermography (IRT) has many applications in different fields. It is a fast and non-invasive
diagnostic method that is widely used in medicine to detect abnormal body temperature and to diagnose
breast cancer, diabetic neuropathy, etc. [2]. It is also used in the architectural and civil engineering
fields [3]. In manufacturing, for example, a combination of IRT and deep learning helps to detect cracks
in steel plates [4]. IRT is also applied to the condition monitoring of electrical equipment
using machine learning algorithms [5]. Moreover, it can be used for field phenomics of different
tree species using unmanned aerial vehicles (UAVs) [6].
    Despite having a lot of applications in different fields, IRT has its own drawbacks. The main issue
with thermal imaging is the resolution of the output thermogram. While modern RGB cameras are able
to produce high-resolution images, the majority of thermal cameras produce images of low spatial
resolution. Besides, thermal cameras that can produce high-resolution thermograms cost much more
than regular ones and are not affordable to ordinary users. This creates a need for thermal image super-
resolution methods.
    The goal of Super-Resolution (SR) methods is to recover a high-resolution image from one or more
low-resolution input images [7]. A high-resolution image provides a higher pixel density and thus more
detail about the original scene. The need for high-quality images often arises in the field of computer
vision to improve the efficiency of pattern recognition and image analysis. SR methods have been
developing for decades and have come a long way from classical approaches to modern deep learning
models.

The Sixth International Workshop on Computer Modeling and Intelligent Systems (CMIS-2023), May 3, 2023, Zaporizhzhia, Ukraine
EMAIL: an232did@gmail.com (A. Didenko); olejnikaa@gmail.com (A. Oliinyk); subbotin@zntu.edu.ua (S. Subbotin)
ORCID: 0009-0009-9236-3936 (A. Didenko); 0000-0002-6740-6078 (A. Oliinyk); 0000-0001-5814-8268 (S. Subbotin)
© 2023 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

   The purpose of this work is the implementation and analysis of different modern SR methods in the
thermal image domain.

2. Related Work
   SR methods can be divided into several groups: interpolation methods, Convolutional Neural
Networks (CNNs), Generative Adversarial Networks (GANs), and Transformers. Each of these
groups has its own advantages and disadvantages, but in practice neural-network-based methods show
better performance, which usually outweighs their disadvantages.

2.1.    Interpolation Methods

    Interpolation methods are the simplest among SR methods. They include nearest neighbor
interpolation, bilinear interpolation, and bicubic interpolation [8]. In the nearest neighbor method, each
interpolated output pixel is assigned the value of the nearest sample point in the input image, while
bilinear interpolation estimates the value at an arbitrary position as the weighted average of the four
pixels closest to the specified input coordinates. Bicubic interpolation improves on both by fitting
cubic polynomials or applying cubic convolution over a larger pixel neighborhood [8]. These methods are
available in most image processing software. Despite their ease of implementation and clarity, the main
disadvantage of interpolation methods is the quality of the output image, so they are usually used as
baselines for more advanced SR methods.
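
    The two simplest interpolation schemes described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the implementation used in the experiments: the function names, the list-of-rows image representation, and the pixel-centre coordinate convention are assumptions made for this example.

```python
def upscale_nearest(img, factor):
    """Nearest-neighbor upscaling of a 2-D image stored as a list of rows."""
    h, w = len(img), len(img[0])
    # Every output pixel copies the input pixel whose cell it falls into.
    return [[img[r // factor][c // factor] for c in range(w * factor)]
            for r in range(h * factor)]


def upscale_bilinear(img, factor):
    """Bilinear upscaling: weighted average of the four closest input pixels."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h * factor):
        # Map the output pixel centre back to input coordinates (assumed
        # convention) and clamp to the image border.
        y = min(max((r + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        wy = y - y0
        row = []
        for c in range(w * factor):
            x = min(max((c + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            wx = x - x0
            top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
            bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


if __name__ == "__main__":
    tiny = [[0, 10], [20, 30]]
    for line in upscale_bilinear(tiny, 2):
        print(line)
```

    Nearest neighbor only repeats existing values, which produces blocky edges, whereas bilinear interpolation blends neighbors and therefore smooths them. Neither can restore detail that is absent from the input, which is why both serve only as baselines.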

2.2.    CNNs
    The CNN is the main type of neural network used for image processing, and it has also found
application in SR tasks. The idea of using CNNs instead of classical approaches was first
introduced in [9]. In this method, the image is first upscaled using bicubic interpolation and then
processed by a network consisting of three parts: image patch extraction, nonlinear mapping, and
reconstruction. Patch extraction represents each image patch as a multidimensional feature vector.
Nonlinear mapping transforms each multidimensional vector into another multidimensional vector. At
the reconstruction stage, the patch representations are aggregated to generate the final
high-resolution image [9].
    Method [10] improves on the previous method by speeding up image super-resolution.
Unlike method [9], the authors do not enlarge the image with bicubic interpolation before passing
it to the neural network but work with the original low-resolution image. In addition, the size of the
convolutional layer filter is reduced to 5. Then, the number of filters is reduced to lower the
computational load. At the nonlinear mapping stage, several convolutional layers are used instead of
one. The next stage increases the number of filters again, which improves the quality of image
processing. The last step is a deconvolution layer, which produces the enlarged image.
    Method [11] uses a neural network with an autoencoder architecture for image restoration, which
includes the SR task. The authors propose a deep fully convolutional auto-encoder network, which is an
encoding-decoding framework with symmetric convolutional-deconvolutional layers. The network is
composed of multiple layers of convolution and deconvolution operators, learning end-to-end mappings
from corrupted images to the original ones [11].
    The ideas of method [9] were applied to thermal image SR in [12]. The authors examined
the use of datasets from different domains to train the model. The results show that a model trained to
upscale regular RGB images is better at enhancing thermal images than a model trained on thermal
images.
    In [13], the authors also studied the influence of the dataset domain on the quality of training and
model results. The authors conducted experiments with the grayscale, HSL, HSI, and HSV color models,
using a model architecture that was also inspired by method [9]. According to the results, the network
trained on grayscale images outperformed those trained on the lightness and intensity domains, but the
network trained on the brightness domain performed better than the grayscale-based network [13].
    In [14], in contrast to the above methods, the authors showed that training a model on a dataset of
thermal images gives better results than training on a dataset of RGB images.
    In [15], the authors use a progressive upscaling strategy with asymmetrical residual learning and
also compare their work with SR methods that are commonly used for the RGB image SR task.

2.3.    GANs

    GANs [16], as their name suggests, were originally used to generate images. A GAN consists of two
models: a generative model G that captures the data distribution, and a discriminative model D that
estimates the probability that a sample came from the training data rather than G. The training procedure
for G is to maximize the probability of D making a mistake [16].
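
    The two-player game described above is usually written as the minimax objective introduced in [16]. In the formulation below, the symbols p_data (the distribution of real samples), p_z (the prior over the generator's input noise z), and V (the value function) follow that paper and are not otherwise used in this work.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

    D is trained to increase both terms, while G is trained to decrease the second one, that is, to make D assign a high probability of being real to generated samples.
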
    The GAN architecture is widely used in SR tasks. For example, in [17] the authors proposed a novel
model for SR called SRGAN. This method uses a residual network called SRResNet as the generator.
In addition, the authors propose a perceptual loss function that consists of an adversarial loss and a
content loss.
    ESRGAN [18] improves on SRGAN by refining both its architecture and its loss function, which
increases model performance. In particular, the authors introduced a special block in the generator
model, called the Residual-in-Residual Dense Block (RRDB). They also modified the perceptual loss
by using features before the activation layers.
    GANs have also been applied to the thermal image SR task. For example, method [19] uses CycleGAN
[20]. The generator is a ResNet6 model, the discriminator is a PatchGAN, and the whole model is trained
in an unsupervised way. The authors also released a dataset for training thermal image SR models,
which consists of low-resolution, middle-resolution, and high-resolution thermal images.

2.4.    Transformers

    Recently, the Transformer neural network architecture has been increasingly used in deep learning
to solve different tasks. The Transformer was first introduced for natural language processing (NLP)
tasks [21], but then showed impressive performance in other domains. The core of the Transformer is the
attention mechanism, which helps the neural network to handle long sequences and focus on specific
parts of the input.
    In [22], the authors propose to use a Transformer, namely its encoder, to classify images while
preserving the original architecture as much as possible. This architecture is called a Vision Transformer
(ViT). To do this, the image is divided into patches (in [22], a patch is 16x16 pixels), and the
resulting sequence of patches is fed to the Transformer's input. To preserve information about the
location of the patches, information about their position relative to each other (positional encoding) is
added to the input sequence. In order for the model to learn to classify images, a learnable
representation of the image class is also added to the input sequence. When pre-trained on large
datasets, ViT shows results comparable to or better than those of CNN models [22].
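
    The patch-splitting step can be illustrated with a short sketch. It is a simplified illustration, not the implementation from [22]: the function names and the list-of-rows image format are assumptions, and the linear projection of patches and the positional encoding are omitted.

```python
def split_into_patches(img, patch=16):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches."""
    h, w = len(img), len(img[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    # Patches are collected row by row, left to right, which fixes the
    # order of tokens in the input sequence.
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([img[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def vit_sequence_length(height, width, patch=16):
    """Number of input tokens: one per patch plus one class token."""
    return (height // patch) * (width // patch) + 1


if __name__ == "__main__":
    image = [[0] * 224 for _ in range(224)]  # hypothetical 224x224 input
    print(len(split_into_patches(image)), vit_sequence_length(224, 224))
```

    For a 224x224 image and 16x16 patches, this yields 14 x 14 = 196 patch tokens plus one class token, so the encoder receives a sequence of 197 tokens.
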
    Despite having good performance, ViT has several important drawbacks. One of these drawbacks is
the processing of high-resolution images, as the computational complexity increases quadratically with
image size. In addition, the architecture of a ViT is poorly suited for solving other computer vision
tasks, such as segmentation, since in this task it is important to distinguish image features at different
scales.
    To solve these problems, the Shifted-window Transformer (Swin Transformer) [23] was introduced.
The Swin Transformer has linear computational complexity with respect to image size because it computes
self-attention not across all patches of the image but only among the patches within a local window.
Then, to model connections between windows, shifted window attention is used. It can thus serve as a
general-purpose backbone for both image classification and dense recognition tasks [23].
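
    The difference between quadratic and linear complexity can be made concrete with the cost estimates given in [23] for global multi-head self-attention and window-based self-attention on a feature map of h x w patches with C channels and window size M. The sketch below evaluates them; the function names and the example sizes are illustrative assumptions, and the softmax cost is ignored, as in [23].

```python
def global_attention_cost(h, w, c):
    """Cost estimate for global self-attention: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def window_attention_cost(h, w, c, m=7):
    """Cost estimate for window self-attention: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


if __name__ == "__main__":
    # Hypothetical feature-map sizes; doubling each side quadruples the patches.
    for side in (56, 112, 224):
        g = global_attention_cost(side, side, 96)
        l = window_attention_cost(side, side, 96)
        print(f"{side}x{side}: global {g:.3e}, window {l:.3e}")
```

    When the number of patches grows fourfold, the window-based estimate grows exactly fourfold, while the global estimate grows faster because of the (hw)^2 term.
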
   Transformers can also be used for solving SR tasks. For example, SwinIR [24] is a Transformer-based
image restoration model mostly inspired by the Swin Transformer. SwinIR consists of three parts:
shallow feature extraction, deep feature extraction, and high-quality (HQ) image reconstruction. In
turn, the deep feature extraction block is a stack of residual Swin Transformer blocks (RSTB) [24].
This architecture makes it possible to achieve excellent results in different image restoration tasks,
such as SR, denoising, and artifact reduction.

3. Method

   As each method has its own advantages and disadvantages, it was decided to conduct experiments
with classical algorithms as baselines, CNNs, GANs, and Transformer-based neural networks and then
compare the results of these methods.
   In particular, the following neural networks are examined in this paper: SRGAN, SRResNet (the
generator of SRGAN), ESRGAN, RRDBNet (the generator of ESRGAN), SwinIR trained on the thermal image
and RGB image domains, and SwinIRGAN (a combination of SwinIR as the generator and the SRGAN
discriminator).

4. Experiments

   Each of the selected models was already pretrained on its own domain. The models were trained
and evaluated with the help of BasicSR [25], a deep learning framework for image restoration
tasks such as denoising and SR. Each model was trained to upscale thermal images by a factor of 2.

4.1.    Dataset

    Since the task of thermal image SR is quite specific, the number of open-source datasets for this task
is quite low.
    One of the largest and most popular is the FLIR dataset [26]. This dataset offers annotated thermal
and RGB images for object detection systems. The dataset contains more than 26,000 annotated frames
and 15 different object categories. In total, the dataset contains almost 10,000 thermal and more than
9,000 RGB images with a size of 640 by 480 pixels [26]. Figure 1 shows an example of data from the FLIR
dataset.


Figure 1: Example data from FLIR [26] dataset

   One of the most popular datasets for the RGB SR problem is DIV2K [27]. This dataset
contains 1000 high-resolution images (1500-2000 pixels per image side) divided into training, test, and
validation sets. The dataset also contains versions of the images downscaled by factors of 2, 3, 4, and 8.
   In this work, both of these datasets are used. In particular, DIV2K is used to examine how a model
trained on grayscale versions of its images performs in the thermal image domain.

4.2.    Data Preprocessing

    To use the FLIR dataset for the thermal image SR task, only thermal images (single channel) were
selected, and a version of the dataset with thermal images downscaled by a factor of 2 (320 by 240 pixels)
was created.
    To use the DIV2K dataset for the thermal image SR task, a version of the dataset with images
in grayscale format was created. Thus, the images have one channel with a pixel range of 0-255.
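
    Both preprocessing steps can be sketched in a few lines. The text does not state which downscaling filter or grayscale conversion was used, so the 2x2 block averaging and the ITU-R BT.601 luma weights below are assumptions chosen for illustration, as are the function names.

```python
def downscale_by_2(img):
    """Halve each image side by averaging non-overlapping 2x2 blocks.

    Assumed filter: the original preprocessing may have used a different one.
    """
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]


def rgb_to_gray(pixel):
    """Convert one (R, G, B) pixel to a single value in the range 0-255.

    Uses ITU-R BT.601 luma weights (assumed, not stated in the text).
    """
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


if __name__ == "__main__":
    thermal = [[0] * 640 for _ in range(480)]  # placeholder 640x480 frame
    low_res = downscale_by_2(thermal)
    print(len(low_res[0]), "by", len(low_res))  # width by height
    print(rgb_to_gray((255, 128, 0)))
```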

4.3.    Evaluation

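   To evaluate the trained models, the SSIM [28] (structural similarity index measure) and PSNR (peak
signal-to-noise ratio) metrics were used. SSIM is defined as follows (1):

   $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}, \qquad (1)$$

   where x, y are windows of the image of size N×N;
   $\mu_x$ – mean of x;
   $\mu_y$ – mean of y;
   $\sigma_x^2$ – variance of x;
   $\sigma_y^2$ – variance of y;
   $\sigma_{xy}$ – covariance of x and y;
   $c_1$, $c_2$ – stabilization variables.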
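
    Equation (1) can be evaluated for a single pair of windows as sketched below. This is a minimal illustration, not the evaluation code used in the experiments: the windows are passed as flat lists, and the stabilization variables are set to c1 = (0.01 * MAX)^2 and c2 = (0.03 * MAX)^2, a commonly used choice that is assumed here because the text does not specify their values.

```python
def ssim_window(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM between two equally sized windows given as flat lists of pixels."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilization variables (assumed defaults, see the note above).
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    window = [10, 20, 30, 40]
    print(ssim_window(window, window))
    print(ssim_window(window, window[::-1]))
```

    Identical windows give a value of 1, and the value decreases as the windows diverge in mean, contrast, or structure. A full-image score is typically obtained by averaging this quantity over local windows.
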
   PSNR is defined as follows (2):

   $$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX_I^2}{MSE}\right) = 20 \log_{10}\left(\frac{MAX_I}{\sqrt{MSE}}\right), \qquad (2)$$

   where $MAX_I$ is the maximum pixel value of image I;
   MSE – mean squared error.
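
    Equation (2) translates directly into code. The sketch below is a minimal illustration, assuming 8-bit images (a maximum pixel value of 255) passed as flat lists; the function name is chosen for this example.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat lists of pixels."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: the ratio is unbounded
    return 10 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([100, 100], [101, 99]))  # MSE of 1
```

    With an MSE of 1, the result equals 20 log10(255), which shows that the two forms of Equation (2) agree.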

4.4.    Experiment Setup

   Before conducting experiments, a set of hyperparameters was chosen.
   Adam [29] with a learning rate of 0.0002 was chosen as the optimization algorithm for all models. All
models were trained for 100,000 iterations. The batch size was 16 for SwinIR,
ESRGAN, and RRDBNet and 64 for SRGAN and SRResNet.
   All models were trained not on whole images but on square patches randomly selected
from each image. This speeds up training and forces the neural network to pay attention to high-
frequency features. For all models, a patch of 128×128 pixels was chosen, except for the classic
SwinIR model (96×96 pixels).
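
    Patch-based training on paired data requires that the low-resolution and high-resolution crops cover the same scene region. The sketch below shows one way to do this. It is an illustration under stated assumptions rather than the code used in the experiments: the function name is hypothetical, and the patch size is assumed to refer to the high-resolution image, which the text does not specify.

```python
import random


def random_paired_crop(hr, lr, hr_patch=128, scale=2, rng=random):
    """Cut aligned random patches from a high-/low-resolution image pair.

    hr and lr are lists of rows, with hr exactly `scale` times larger than lr.
    """
    lr_patch = hr_patch // scale
    lr_h, lr_w = len(lr), len(lr[0])
    # Choose the crop origin on the low-resolution grid so that the
    # high-resolution origin is an exact multiple of the scale.
    top = rng.randint(0, lr_h - lr_patch)
    left = rng.randint(0, lr_w - lr_patch)
    lr_crop = [row[left:left + lr_patch] for row in lr[top:top + lr_patch]]
    hr_crop = [row[left * scale:(left + lr_patch) * scale]
               for row in hr[top * scale:(top + lr_patch) * scale]]
    return hr_crop, lr_crop


if __name__ == "__main__":
    hr_image = [[0] * 640 for _ in range(480)]
    lr_image = [[0] * 320 for _ in range(240)]
    hr_p, lr_p = random_paired_crop(hr_image, lr_image)
    print(len(hr_p), len(hr_p[0]), len(lr_p), len(lr_p[0]))
```
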
4.5.    Results

   The quantitative comparison is listed in Table 1.

Table 1
Results of thermal image super-resolution
               Model                             SSIM                           PSNR (dB)
         Nearest Neighbor                       0.7389                           30.3350
      Bilinear Interpolation                    0.7557                           31.1910
      Bicubic Interpolation                     0.7672                           31.5820
              SRGAN                             0.6617                           28.3197
            SwinIRGAN                           0.6751                           29.6528
              ESRGAN                            0.6853                           29.8002
          SwinIR (DIV2K)                        0.7642                           31.4355
             RRDBNet                            0.7829                           32.3622
             SRResNet                           0.7828                           32.3471
           SwinIR (FLIR)                        0.7833                           32.3688

Qualitative results are shown in Figure 2, Figure 3 and Figure 4.


Figure 2: Qualitative comparison on FLIR sample (small objects), scaling factor 2
Figure 3: Qualitative comparison on FLIR sample (pedestrian), scaling factor 2


Figure 4: Qualitative comparison on FLIR sample (car), scaling factor 2

   Results of SSIM and PSNR metrics during training are shown in Figure 5 and Figure 6 respectively.
Figure 5: SSIM metric on the validation set during training


Figure 6: PSNR metric on the validation set during training

    From the obtained results, it can be concluded that SwinIR trained on the FLIR dataset shows the best
results among all models. Additionally, the CNNs SRResNet and RRDBNet trained on their own perform
better than the same networks trained as GAN generators. As can be seen from the qualitative results,
the GANs produce a lot of noise in the upscaled images, which leads to poor quantitative results, while
the CNNs and Transformers almost completely denoise the upscaled images. This may be explained by the
fact that GANs are sensitive to the input data and the selected training hyperparameters. It is also
visible that SwinIR trained on the FLIR dataset performs better than SwinIR trained on the grayscale
DIV2K dataset.
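
    The comparison with the interpolation baselines can be made explicit by computing each model's gain over bicubic interpolation from the values in Table 1. The sketch below does this; the dictionary layout and the function name are choices made for this example.

```python
# (SSIM, PSNR in dB) for each model, copied from Table 1.
RESULTS = {
    "Nearest Neighbor": (0.7389, 30.3350),
    "Bilinear Interpolation": (0.7557, 31.1910),
    "Bicubic Interpolation": (0.7672, 31.5820),
    "SRGAN": (0.6617, 28.3197),
    "SwinIRGAN": (0.6751, 29.6528),
    "ESRGAN": (0.6853, 29.8002),
    "SwinIR (DIV2K)": (0.7642, 31.4355),
    "RRDBNet": (0.7829, 32.3622),
    "SRResNet": (0.7828, 32.3471),
    "SwinIR (FLIR)": (0.7833, 32.3688),
}


def rank_against(baseline, results=RESULTS):
    """Return (model, SSIM gain, PSNR gain) tuples sorted by PSNR gain, best first."""
    base_ssim, base_psnr = results[baseline]
    rows = [(name, ssim - base_ssim, psnr - base_psnr)
            for name, (ssim, psnr) in results.items() if name != baseline]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for name, d_ssim, d_psnr in rank_against("Bicubic Interpolation"):
        print(f"{name:24s} SSIM {d_ssim:+.4f}  PSNR {d_psnr:+.4f} dB")
```

    With the Table 1 values, only SwinIR (FLIR), RRDBNet, and SRResNet exceed the bicubic baseline on both metrics, while the GAN-based models and SwinIR trained on DIV2K fall below it.
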
5. Conclusion

   This paper implements modern SR methods and examines their results on the thermal image SR
problem. The aim was to analyze the performance of popular SR methods on the upscaling of
thermal images.
   The main problem of this task is the lack of datasets for thermal image SR. Current datasets do not
have enough training data, the thermal images in them are smaller than the images in RGB datasets for
SR tasks, and the quality of the thermal images is worse than that of the RGB images in such training
sets.
   According to the obtained results of thermal image upscaling, Transformer and
CNN models can perform better than GANs and classical algorithms.
   In addition, the following steps can be taken to improve the quality of thermal image SR: increase
the size of the dataset and the training batch to improve the generalization ability of the models;
collect a dataset of thermal images with higher quality and less noise to improve the results of SR
methods; tune training hyperparameters such as the optimization algorithm, learning rate, and loss
function; and improve the architecture of the models used or modify them for the task of thermal
image SR.

6. References

[1] J. M. Hart, A practical guide to infra-red thermography for building surveys, Building Research
     Establishment, Garston, 1991.
[2] B.B. Lahiri, S. Bagavathiappan, T. Jayakumar, John Philip, Medical applications of infrared
     thermography: A review, Infrared Physics & Technology, vol. 55(4), (2012) pp. 221-235. doi:
     10.1016/j.infrared.2012.03.007.
[3] C. Meola, Infrared Thermography in the Architectural Field, The Scientific World Journal, vol.
     2013, (2013). doi: 10.1155/2013/323948.
[4] J. Yang, W. Wang, G. Lin, Q. Li, Y. Sun and Y. Sun, Infrared Thermal Imaging-Based Crack
     Detection Using Deep Learning, IEEE Access, vol. 7, (2019) pp. 182060-182077. doi:
     10.1109/ACCESS.2019.2958264.
[5] M. Najafi, Y. Baleghi, S. A. Gholamian and S. Mehdi Mirimani, Fault Diagnosis of Electrical
     Equipment through Thermal Imaging and Interpretable Machine Learning Applied on a Newly-
     introduced Dataset, 2020 6th Iranian Conference on Signal Processing and Intelligent Systems
     (ICSPIS), (2020) pp. 1-7. doi: 10.1109/ICSPIS51611.2020.9349599.
[6] R. Ludovisi, F. Tauro, R. Salvati, S. Khoury, G. Scarascia Mugnozza, A. Harfouche, UAV-Based
     Thermal Imaging for High-Throughput Field Phenotyping of Black Poplar Response to Drought,
     Front. Plant Sci, vol. 8, (2017) doi: 10.3389/fpls.2017.01681.
[7] D. Glasner, S. Bagon, M. Irani, Super-resolution from a single image, 2009 IEEE 12th International
     Conference on Computer Vision, (2009) pp. 349-356. doi: 10.1109/ICCV.2009.5459271.
[8] S. Fadnavis, Image Interpolation Techniques in Digital Image Processing: An Overview,
     International Journal Of Engineering Research and Application, vol. 4(10), (2014) pp. 70-73
[9] C. Dong, C. C. Loy, K. He and X. Tang, Image Super-Resolution Using Deep Convolutional
     Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38(2), (2016) pp.
     295-307. doi: 10.1109/TPAMI.2015.2439281.
[10] C. Dong, C. C. Loy, and X. Tang, Accelerating the super-resolution convolutional neural network,
     European Conference on Computer Vision, vol. 9906, (2016) pp. 391–407. doi: 10.1007/978-3-
     319-46475-6_25.
[11] X. Mao, C. Shen, and Y. Yang, Image restoration using convolutional auto-encoders with
     symmetric skip connections, Advances in Neural Information Processing Systems, (2016).
[12] Y. Choi, N. Kim, S. Hwang and I. S. Kweon, Thermal Image Enhancement using Convolutional
     Neural Network, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems
     (IROS), (2016) pp. 223-230. doi: 10.1109/IROS.2016.7759059.
[13] K. Lee, Junhyeop Lee, Joosung Lee, S. Hwang, and S. Lee, Brightness-based convolutional neural
     network for thermal image enhancement, IEEE Access, vol. 5, (2017) pp. 26867–26879. doi:
     10.1109/ACCESS.2017.2769687.
[14] R. E Rivadeneira, P. L. Suarez, A. D. Sappa, and B. X. Vintimilla, Thermal image superresolution
     through deep convolutional neural network, International Conference on Image Analysis and
     Recognition, (2019) pp. 417–426. doi: 10.1007/978-3-030-27272-2_37.
[15] V. Chudasama et al., TherISuRNet - A Computationally Efficient Thermal Image Super-Resolution
     Network, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
     (CVPRW), (2020) pp. 388-397. doi: 10.1109/CVPRW50498.2020.00051.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
     Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems
     (NIPS), vol. 3(11), (2014) pp. 2672–2680. doi: 10.1145/3422622.
[17] C. Ledig, L. Theis, F. Huszár, et al., Photo-Realistic Single Image Super-Resolution Using a
     Generative Adversarial Network, 2017 IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), (2017) pp. 105-114. doi: 10.1109/CVPR.2017.19.
[18] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy, ESRGAN: Enhanced
     Super-Resolution Generative Adversarial Networks, European Conference on Computer Vision
     2018 Workshop, vol. 11133 (2019). pp. 63-79. doi: 10.1007/978-3-030-11021-5_5.
[19] R. E. Rivadeneira, A. D. Sappa, and B. X. Vintimilla, Thermal image super-resolution: a novel
     architecture and dataset, International Conference on Computer Vision Theory and Applications,
     (2020) pp 1–2. doi: 10.5220/0009173601110119.
[20] J. -Y. Zhu, T. Park, P. Isola and A. A. Efros, Unpaired Image-to-Image Translation Using Cycle-
     Consistent Adversarial Networks, 2017 IEEE International Conference on Computer Vision
     (ICCV), (2017) pp. 2242-2251. doi: 10.1109/ICCV.2017.244.
[21] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and
     I. Polosukhin, Attention is all you need, Proceedings of the 31st International Conference on Neural
     Information Processing Systems, (2017) pp. 6000–6010. doi: 10.48550/arXiv.1706.03762.
[22] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
     M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words:
     Transformers for image recognition at scale, International Conference on Learning
     Representations, (2021). doi: 10.48550/arXiv.2010.11929.
[23] Z. Liu, Y. T. Lin, Y. Cao, H. Hu, B. N. Guo, Swin Transformer: Hierarchical Vision Transformer
     using Shifted Windows, 2021 IEEE/CVF International Conference on Computer Vision (ICCV),
     (2021) pp. 9992-10002. doi: 10.1109/ICCV48922.2021.00986.
[24] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool and R. Timofte, SwinIR: Image Restoration Using
     Swin Transformer, 2021 IEEE/CVF International Conference on Computer Vision Workshops
     (ICCVW), (2021) pp. 1833-1844. doi: 10.1109/ICCVW54120.2021.00210.
[25] Xintao Wang, Liangbin Xie, Ke Yu, Kelvin C.K. Chan, Chen Change Loy and Chao Dong.
     BasicSR: Open Source Image and Video Restoration Toolbox, 2022. URL:
     https://github.com/xinntao/BasicSR.
[26] FREE Teledyne FLIR Thermal Dataset for Algorithm Training. URL:
     https://www.flir.eu/oem/adas/adas-dataset-form
[27] E. Agustsson and R. Timofte, NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset
     and Study, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops
     (CVPRW), (2017) pp. 1122-1131. doi: 10.1109/CVPRW.2017.150.
[28] Z. Wang, E. P. Simoncelli and A. C. Bovik, Multiscale structural similarity for image quality
     assessment, The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2,
     (2003) pp. 1398-1402. doi: 10.1109/ACSSC.2003.1292216.
[29] D. P. Kingma and J. Lei Ba, Adam: A Method for Stochastic Optimization. Proceedings of the 3rd
     International Conference on Learning Representations (ICLR 2015), (2015), pp 1–13. doi:
     10.48550/arXiv.1412.6980.