<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLIPTraVeLGAN for Semantically Robust Unpaired Image Translation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky av., 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, a novel approach to semantically robust unpaired image translation is presented. CLIPTraVeLGAN replaces the Siamese network in TraVeLGAN with a contrastively pretrained language-image model (CLIP) with frozen weights. This approach significantly simplifies the model selection and training process of TraVeLGAN, making it more robust and easier to use.</p>
      </abstract>
      <kwd-group>
        <kwd>Image-to-image translation</kwd>
        <kwd>GAN</kwd>
        <kwd>CLIP</kwd>
        <kwd>Knowledge transfer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Generative deep learning has become one of the most promising directions in information
technology, within which modern generative neural network models are being actively developed. A
large share of the research in this rapidly growing field of machine learning is focused on generative
adversarial networks [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The defining property of generative adversarial
networks (GANs) is unsupervised learning, thanks to which GANs demonstrate a wide
range of creative abilities and, in particular, the ability to generate images. An interesting family of tasks in this
direction is image-to-image translation (I2I), which consists of translating an image from one
subject domain to another while preserving its main content. In recent years, many GAN
models with different architectural variations have been developed to solve this type of problem.
Their classification, advantages and disadvantages, analysis methods and
applications for image-to-image translation problems are reviewed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Semantic robustness is an essential aspect of image translation. In the context of unpaired image
translation, semantic robustness is particularly important to ensure that the translations are accurate
and meaningful. Image translation from one domain to another is a challenging task that requires the
generated image to belong to the target domain while also retaining the individuality of the input
image. TraVeLGAN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was designed to address this task by using a Siamese network to encode the
high-level semantics that characterizes the domains. However, the Siamese network selection and
cooperative training process are complex. In this paper, we propose a novel approach to simplify the
training process of TraVeLGAN. Our approach is based on the use of a contrastively pretrained
language-image model (CLIP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) with frozen weights instead of the Siamese network. We call this
new model CLIPTraVeLGAN.
      </p>
      <p>
        Our approach aims to solve the problems associated with cooperative learning in TraVeLGAN. In
TraVeLGAN, the generator and the Siamese network have the same goal, which leads to additional
difficulties in determining the effectiveness of each solution. Our approach eliminates the need for
choosing and training a Siamese network and thus avoids these difficulties. At the same time, the
generator still receives the TraVeL loss as proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This makes our approach simpler and
more straightforward, while still ensuring the high-level semantics are captured in the generated
image.
      </p>
      <p>The transfer of knowledge from CLIP to CLIPTraVeLGAN enables the generator to understand
the relationships between words and images without any additional training. In this paper, we present
the results of our experiments and compare CLIPTraVeLGAN with the original TraVeLGAN. Our
results show that CLIPTraVeLGAN outperforms TraVeLGAN in terms of both stability and quality
of the generated images. This paper is a contribution to the field of image translation and provides a
promising new direction for further research.</p>
      <p>The effectiveness of our proposed method, CLIPTraVeLGAN, is evaluated on a benchmark
dataset for unpaired image translation. The results are compared with other methods to demonstrate
the performance of our method.</p>
      <p>Our code is available at www.kaggle.com/code/unfriendlyai/cliptravelgan-gta-cityscapes</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. TraVeLGAN</title>
      <p>
        The field of image translation has seen significant advancements in recent years with the
introduction of Generative Adversarial Networks. One such approach, CycleGAN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], introduced the
idea of using a cycle consistency loss to preserve the content and structure of the input image during
translation. Despite its success, CycleGAN has some limitations, such as the assumption of cyclic
consistency, which can result in blurry translations and mode collapse. To address these limitations,
TraVeLGAN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was introduced as a competitor to CycleGAN for one-sided image translation: the
individual features and semantic content of the input image should be reflected in the generated
image.
      </p>
      <p>Thus, this task of unpaired image translation consists of two components: the generated image
must be a member of the target domain and have the individuality of the input image.</p>
      <p>Membership in the target domain. The generator must ensure that G_XY(X) ∈ Y. To do this, a
standard GAN architecture is used, in which a discriminator D_Y tries to distinguish generated images
from real samples from Y. The goal of optimizing the generator parameters is to maximize D_Y(G_XY(X)),
and the goal of optimizing the discriminator parameters is, conversely, to minimize D_Y(G_XY(X)) and
maximize D_Y(Y). The two networks compete with each other.</p>
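      <p>As an illustration, this competition can be written as a pair of standard non-saturating GAN losses. The following TensorFlow sketch uses illustrative names (g_xy produces G_XY(X), d_y outputs raw logits) and is an assumed formulation rather than the exact losses of the original implementation:</p>
      <preformat>
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(d_y, real_y, fake_y):
    # D_Y is pushed to score real samples from Y as real and G_XY(X) as fake.
    real_logits = d_y(real_y, training=True)
    fake_logits = d_y(fake_y, training=True)
    return bce(tf.ones_like(real_logits), real_logits) + bce(tf.zeros_like(fake_logits), fake_logits)

def generator_adversarial_loss(d_y, fake_y):
    # G_XY is pushed to make D_Y classify the translated images as real.
    fake_logits = d_y(fake_y, training=True)
    return bce(tf.ones_like(fake_logits), fake_logits)
      </preformat>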
      <p>
        Individuality. If x_i, x_j ∈ X, i ≠ j, then there must be a relationship between x_i and G_XY(x_i) which
explains why G_XY(x_i) is the representation in the domain Y of the image x_i, and not of x_j. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a
Siamese network S is trained for this purpose, which transforms images of both subject areas into
vectors in some hidden space. To train S, pairwise differences between images are computed for
each batch of input images: V_ij = S(x_i) – S(x_j) and Z_ij = S(G_XY(x_i)) – S(G_XY(x_j)). The goal of
optimizing the parameters of the Siamese network S and the generator G_XY is to maximize the cosine
similarity between V_ij and Z_ij for all pairs with i ≠ j. The network S serves as evidence that there is an
explanation according to which the generated image G_XY(x_i) differs from any G_XY(x_j) as much as the
samples x_i and x_j differ from each other. Since the networks S and G_XY share the same goal, they do
not compete but help each other - they cooperate.
      </p>
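      <p>A minimal TensorFlow sketch of this pairwise objective is given below; the encoder s, the tensors and the function names are illustrative and do not reproduce the original TraVeLGAN code:</p>
      <preformat>
import tensorflow as tf

def pairwise_differences(emb):
    # All ordered pairwise differences V_ij = S(x_i) - S(x_j) within a batch of embeddings.
    return emb[:, None, :] - emb[None, :, :]

def travel_loss(s, x, fake_y):
    v = pairwise_differences(s(x))        # transformation vectors between source images
    z = pairwise_differences(s(fake_y))   # transformation vectors between translated images
    v = tf.math.l2_normalize(v, axis=-1)
    z = tf.math.l2_normalize(z, axis=-1)
    cos_sim = tf.reduce_sum(v * z, axis=-1)       # cosine similarity of V_ij and Z_ij
    mask = 1.0 - tf.eye(tf.shape(cos_sim)[0])     # exclude the i == j pairs
    # S and G_XY cooperate: both are updated to maximize this similarity,
    # so the loss to minimize is its negative mean over all pairs with i != j.
    return -tf.reduce_sum(cos_sim * mask) / tf.reduce_sum(mask)
      </preformat>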
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduce the concept of a transformation vector between two points. In natural
language processing, words are represented by points in a space in which the vector that transforms
the word "man" into the word "woman" is very similar to the vector that transforms the word "king"
into the word "queen". The Siamese network S likewise represents images as points in a certain space.
In the image translation task, instead of changing the gender of a word, the transformation vector can
change the background colour, size or shape of the image. The main idea is that the vector that
transforms the point of one original image S(x_i) ("man") into the point of another original image
S(x_j) ("woman") should also transform the point of the generated image S(G_XY(x_i)) ("king") into
the point of the generated image S(G_XY(x_j)) ("queen").
      </p>
      <p>TraVeLGAN used an additional Siamese network to encode high-level semantics between the
source and target domains. This idea seemed like a breakthrough, as TraVeLGAN was reported to
outperform CycleGAN in terms of translation quality. However, TraVeLGAN has not
received much development due to the difficulties in choosing the architecture of the Siamese
network and the parameters of its training. This results in a large set of possible solutions and makes it
difficult to determine the effectiveness of each of them.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Contrastive language-image pretraining</title>
      <p>
        CLIP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was introduced as a language-image model for the transfer of knowledge without any
further training. After pretraining, the model can be used for almost any purpose with any images without
any tuning. Trained on a dataset of 400 million image-caption pairs collected from the Web (WIT), the model
can successfully classify images with text class labels for a wide range of tasks, even quite far from its
training set, such as geolocation or car-brand recognition. Zero-shot CLIP trained on WIT matches or exceeds
the accuracy on ImageNet of a ResNet50 trained on ImageNet itself. The weakest performance of knowledge
transfer without additional training is observed on very specialized data sets, such as classification of
satellite images, medical images, and object counting in synthetic images.
      </p>
      <p>The authors discovered an unexpected property: on many data sets, knowledge transfer without
any additional training performs better than adding a logistic regression on top of the frozen network and
training it on only a few (around 4) labelled examples per class from the new data set. Even worse results
were obtained when CLIP was not frozen and all of its layers were fitted on the new data set.</p>
      <p>The internal representation of CLIP. One of the side effects of CLIP is that its encoders learn an
internal representation of images in a shared space with the internal representation of natural language
texts. Although there is no consensus in the scientific community on what a "perfect"
representation is, one common way of testing the quality of a representation is to train a linear
classifier attached to a frozen model and to measure the performance of that model on different
datasets. According to the results of the experiments, all CLIP models, regardless of encoder type and
size, outperform other known models in this test.</p>
      <p>Natural language encodes semantic content and hierarchical relations between concepts in
words. Contrastive learning of a visual model using natural language texts as a learning cue leads to
learning and generalizing the kind of knowledge about image elements that is expressed in
image-relevant texts in human language. The extent to which the visual model learns the hierarchy of
concepts that exists in human language requires separate research. Currently, CLIP reflects the
meaning of an image in its hidden representation most effectively among well-known models.</p>
      <p>The vector into which CLIP transforms images is the best choice for finding similar images. Other
options for using CLIP in the search task are finding images that are most relevant to the content of
some text and finding the text that most relevantly describes the image.</p>
      <p>Overall, CLIP's powerful internal representation of images and text makes it a valuable tool for a
wide range of applications, with potential future uses that have yet to be imagined.</p>
      <p>In our work, CLIPTraVeLGAN, we use CLIP as a means of preserving high-level semantics
between the source and target domains in unpaired image-to-image translation.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3. Semantic robustness</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the concept of semantic robustness of unpaired image translation was introduced, and the
reasons for the conflict between conformity to the target domain and accuracy of the translation, as well as
the reasons for hallucinating objects that are absent in the input image, were highlighted. The SRUNIT
model is proposed to make the translation semantically robust; it is trained simultaneously with the
generator and the discriminator, similarly to TraVeLGAN's Siamese network. CLIP is not used. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the
use of Vector Symbolic Architectures was proposed to improve the semantic robustness of unpaired
image translation, which showed even better semantic translation accuracy than
SRUNIT. CLIP is not used there either.
      </p>
      <p>
        The robustness of CLIP under natural shifts of the data distribution was tested in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. When a model is
trained on one data set and its accuracy is then measured on an updated (sometimes
synthetically perturbed) version of it, the accuracy drops significantly. CLIP is more
robust to such distribution shift than models pretrained on ImageNet. This property is
especially important for ensuring semantic robustness when translating images between
domains.
      </p>
      <p>The use of CLIP in our work adds the ability to preserve the high-level semantics between the
source and target domains, making the translations semantically robust.</p>
      <p>In conclusion, our work, CLIPTraVeLGAN, builds upon the idea of TraVeLGAN and adds the
advantages of CLIP to improve the quality of unpaired image translation while preserving the
high-level semantics between the source and target domains.</p>
    </sec>
    <sec id="sec-6">
      <title>3. Method</title>
      <p>The core of the CLIPTraVeLGAN approach is the use of a pre-trained language-image model
(CLIP) as the Siamese network in the TraVeLGAN setup. The proposed CLIPTraVeLGAN model is
composed of a generator, a discriminator and a pretrained CLIP model. The generator takes an image
from one domain and generates an image that belongs to the other domain. The discriminator is
responsible for distinguishing between real and fake images. In CLIPTraVeLGAN, we replace the
Siamese network of TraVeLGAN with the pre-trained language-image model CLIP, which is used to
encode the high-level semantics of the input and target domains.</p>
      <p>We train the CLIPTraVeLGAN model using the adversarial loss and the TraVeL loss. The
adversarial loss is used to ensure that the generated image belongs to the target domain, while the
TraVeL loss encourages the generator to preserve the high-level semantics of the input image. Thus,
the final objective terms of the generator are:</p>
      <p>L_G = L_adv + λ L_TraVeL ,    (1)
where L_adv is the adversarial loss, L_TraVeL is the TraVeL loss, and λ controls the relative importance of the TraVeL loss.</p>
      <p>The TraVeL loss is the same as in TraVeLGAN:
L_TraVeL = Σ_(i ≠ j) Dist[ S(x_i) − S(x_j), S(G_XY(x_i)) − S(G_XY(x_j)) ] ,    (2)
where S is the frozen CLIP image encoder and Dist is a distance metric, such as cosine similarity.</p>
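      <p>As an illustration, (1) and (2) combine into the following generator objective. This sketch reuses the illustrative travel_loss and generator_adversarial_loss helpers shown earlier and assumes clip_embed computes frozen CLIP image embeddings:</p>
      <preformat>
def generator_loss(d_y, clip_embed, x, fake_y, lam):
    # Adversarial term of (1): translated images must look like members of Y.
    adv = generator_adversarial_loss(d_y, fake_y)
    # TraVeL term of (2): pairwise differences of frozen CLIP embeddings must be preserved.
    # CLIP stays frozen, so only the generator receives a gradient from this term.
    trav = travel_loss(clip_embed, x, fake_y)
    return adv + lam * trav
      </preformat>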
      <p>One advantage of our approach is that it eliminates the need for choosing and training a Siamese
network, which can be complex and time-consuming. Instead, the transfer of knowledge from CLIP to
CLIPTraVeLGAN enables the generator to understand the relationships between images without any
additional training. This makes our approach simpler and more straightforward, while still ensuring
the high-level semantics are captured in the generated image.</p>
      <p>In this context, the use of CLIP in CLIPTraVeLGAN adds the ability to preserve high-level
semantics between the source and target domains, making the translations semantically robust.
Therefore, our work builds upon the idea of TraVeLGAN and leverages the advantages of CLIP to
improve the quality of unpaired image translation while maintaining semantic robustness.</p>
    </sec>
    <sec id="sec-7">
      <title>4. Experiment</title>
      <p>
        For the experiments we use the Kaggle environment with TensorFlow and TPU support. Pretrained CLIP
weights (openai/clip-vit-large-patch14) were loaded with the transformers package from https://huggingface.co.
The dataset contains images from two different domains that are not aligned with each other. We
preprocess the dataset by resizing all images to a fixed size of 256x256 and normalising the pixel
values to lie in the range [-1, 1]. The CLIP model receives a central 224x224 crop of the
images, preprocessed according to its configuration.
      </p>
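      <p>A minimal sketch of this setup is shown below. It assumes the TensorFlow port of CLIP from the transformers package (TFCLIPModel) together with CLIP's published normalization constants; the helper name clip_embed and the exact preprocessing order are illustrative:</p>
      <preformat>
import tensorflow as tf
from transformers import TFCLIPModel

clip = TFCLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip.trainable = False  # frozen weights

# Channel statistics from CLIP's own preprocessing configuration.
CLIP_MEAN = tf.constant([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = tf.constant([0.26862954, 0.26130258, 0.27577711])

def clip_embed(images):
    # images: a batch of 256x256 RGB images with pixel values in [-1, 1].
    x = (images + 1.0) / 2.0                           # back to [0, 1]
    x = tf.image.resize_with_crop_or_pad(x, 224, 224)  # central 224x224 crop
    x = (x - CLIP_MEAN) / CLIP_STD                     # CLIP channel normalization
    x = tf.transpose(x, [0, 3, 1, 2])                  # transformers expects channels-first
    return clip.get_image_features(pixel_values=x)     # one embedding vector per image
      </preformat>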
      <p>
        To test the effectiveness of the proposed approach, we used the models studied in the Kaggle
competition "I'm Something of a Painter Myself" [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as a basis for the experiments. CycleGAN
showed the best results in the competition, while TraVeLGAN was the most promising one-sided
image translation model. We replaced the Siamese network in TraVeLGAN with a contrastively
pretrained language-image model (CLIP) to create the CLIPTraVeLGAN model. Batch size 128 was
chosen for the experiments, following the example of other CLIP applications with large batch-size
values.
      </p>
      <p>The generators and discriminators of all models were identical. The purpose of the experiments
was to establish the effect of the proposed improvement, that is, replacing the Siamese network
with pretrained CLIP. We used the Adam optimizer to train both the generator and the discriminator.</p>
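      <p>As an illustration, one training step with these choices could look as follows. The Adam hyperparameters below are assumptions rather than the exact values of our runs, and generator_loss, discriminator_loss and clip_embed refer to the sketches above:</p>
      <preformat>
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)   # assumed settings
disc_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)  # assumed settings

@tf.function
def train_step(g_xy, d_y, x, real_y, lam):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_y = g_xy(x, training=True)
        g_loss = generator_loss(d_y, clip_embed, x, fake_y, lam)
        d_loss = discriminator_loss(d_y, real_y, fake_y)
    g_grads = g_tape.gradient(g_loss, g_xy.trainable_variables)
    d_grads = d_tape.gradient(d_loss, d_y.trainable_variables)
    gen_opt.apply_gradients(zip(g_grads, g_xy.trainable_variables))
    disc_opt.apply_gradients(zip(d_grads, d_y.trainable_variables))
    return g_loss, d_loss
      </preformat>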
      <p>To eliminate the irrelevant effects associated with overfitting on the small number of
Monet paintings in the competition, as well as the difficulty of visually evaluating generated Monet paintings,
the models are asked to generate realistic landscape photos from Monet paintings. Some of Monet's
paintings did not participate in the training; this held-out part is used to test the quality of image
translation and to compute the FID metric.</p>
      <p>
        We evaluate CLIPTraVeLGAN on GTA (Grand Theft Auto) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to Cityscapes dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] which
is a benchmark dataset for unpaired image translation because it involves translating images from one
domain to another, where the source and target domains are vastly different.
      </p>
      <p>The GTA and Cityscapes datasets represent two different domains of urban
environments. The GTA dataset consists of images of urban scenes generated by a video game,
while the Cityscapes dataset comprises real-world urban scenes captured by a camera mounted on a
car. The images in these datasets differ in terms of lighting conditions, weather, time of day, and
many other factors. The main problem is that GTA images contain more sky than Cityscapes images, and the
discriminator can easily distinguish fake images by that criterion. Cityscapes images contain more
vegetation instead. Thus, models may hallucinate vegetation in open-sky regions, which is a semantic
mistake.</p>
      <p>We performed an ablation study using a CLIP model without pretrained weights to check whether
semantic CLIP knowledge is necessary for the correct translation of an image. With random weights, the
TraVeL loss turns into the cosine similarity of noise vectors instead of semantic vectors.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Results</title>
      <p>The results of the translation of Monet's paintings into photographs were the following. The basic
CycleGAN model reached FID = 6.9 on the test dataset, and the basic TraVeLGAN reached 7.7. The investigated
CLIPTraVeLGAN showed a result between the two base models, FID = 7.3. An example of the resulting
images is shown in Figure 2.</p>
      <p>We compare the results of CLIPTraVeLGAN with those of TraVeLGAN to evaluate the effect
of using pretrained CLIP instead of the Siamese network on the GTA – Cityscapes benchmark
(Figure 3).</p>
      <p>Yellow lane lines from GTA should be translated into white lane lines. All real Cityscapes images
contain the Mercedes hood ornament of the recording car. GTA images have more sky than Cityscapes
images. An example of semantic flipping is the hallucination of trees in place of the sky.</p>
      <p>We compared results for different values of λ, which controls the relative importance of the TraVeL
loss, and for different numbers of generator updates. The FID metric conflicts with semantic robustness. When
the importance coefficient λ of the TraVeL loss is small, it still prevents mode collapse
but allows semantic flipping; in this case the FID metric is the best and the output images fit the target domain.
Increasing the value of λ results in worse FID values but helps to control semantic flipping. When the λ
value is too high, the model renders the input images unchanged, which is the easiest way to ensure
TraVeL consistency between input and output pairs.</p>
      <p>The results of the GTA-Cityscapes experiment are shown in Table 1 and in Figures 3,4. There is a
trade-off between image quality and semantic robustness. In our experiments, we searched for the
optimal value of λ to achieve satisfactory results according to both criteria.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Results on the GTA to Cityscapes benchmark for different values of λ and numbers of generator updates (FID, lower is better).</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>FID</th>
              <th>Observations</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>5.74</td>
              <td>The model generates trees and artefacts in the sky. Yellow lane lines are translated correctly into white lane lines.</td>
            </tr>
            <tr>
              <td>6.22</td>
              <td>The model hallucinates multiple Mercedes hood ornaments. There are artefacts in regions out of CLIP control. The rest of the sky is translated correctly.</td>
            </tr>
            <tr>
              <td>6.81</td>
              <td>There is no semantic flipping. Yellow lane lines are translated correctly into white lane lines.</td>
            </tr>
            <tr>
              <td>9.09, 9.08, 8.90</td>
              <td>High values of λ prevent translation. Yellow lane lines remain the same. Output images are almost equal to the input except for some artefacts in regions out of CLIP control.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
        <p>
          In Figure 5 we show the results of translation of the same images that were used to compare the
state-of-the-art VSAIT model against other models in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We demonstrate that our method does not
exhibit semantic flipping. CLIPTraVeLGAN results are very close to VSAIT.
        </p>
      <p>In the ablation study (Figure 6) we use the CLIP model without pretrained weights. The results confirm
that semantic CLIP knowledge is necessary for the correct translation of an image.</p>
      <p>We also trained CLIPTraVeLGAN to translate GTA semantic label maps into Cityscapes images. Our method
exhibits semantic flipping when λ is small, while increasing the value of λ leads to the disappearance of the
gradient from the discriminator. This seems to be a limitation of our method for domain images (like
semantic label maps) that do not occur in the CLIP training dataset.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Discussions</title>
      <p>Our biggest impression from the experiments is that CLIPTraVeLGAN is very easy to train
compared to many image-to-image translation models we tried before. With this model, we did not
encounter mode collapse; the TraVeL loss from pretrained CLIP alone is enough to prevent it.</p>
      <p>We evaluated our proposed method on a benchmark dataset for unpaired image translation and
compared it with other methods. The results demonstrated the effectiveness of our approach and the
potential for further research in this area. However, it should be noted that our approach has some
limitations and there is still room for improvement.</p>
      <p>The main problem is that perfectly semantically robust translation and perfect membership in the
target domain are incompatible. The FID metric itself is only an additional indicator of the quality of
the generated images and has nothing to do with the accuracy of the translation. Moreover, the
accuracy of the translation conflicts with the FID metric: the better the image matches the target
domain, the more semantic changes were made during the translation. Thus, an approximately
comparable FID metric across models is enough to consider a model a candidate for
further research, and demonstrating a state-of-the-art FID on a dataset for an image translation
task is not by itself proof of superiority. The evaluation on the GTA to Cityscapes dataset showed that the
proposed CLIPTraVeLGAN approach outperformed TraVeLGAN for unpaired image translation in
terms of both stability and quality of the generated images. The FID score of CLIPTraVeLGAN is
lower than that of TraVeLGAN when λ is small and the translation is not perfect, but CLIPTraVeLGAN
can provide a semantic stability of translation that is not available to TraVeLGAN. The results of
the experiments demonstrate that CLIPTraVeLGAN is a promising approach to image translation and
provides a new direction for further research.</p>
      <p>Additionally, we evaluated our method on the Kaggle competition dataset. Since the task of the
Kaggle competition is not to translate images from one domain to another but to generate images, the
FID metric reflects only the quality and diversity of the generated images in the target domain; the semantic
robustness of the translation is not evaluated in any way.</p>
      <p>One of the key features of CLIP is its internal representation of images and natural language texts
in a shared space learned through contrastive learning. This shared space allows for finding similar
images and relevant text descriptions of images, among other potential applications. Additionally,
CLIP's encoders learn to represent images in a way that reflects the meaning of the image more
effectively than other well-known models.</p>
      <p>Specifically, we tried using CLIP to evaluate the similarity of the semantic content between the
input image and the generated image. During training, the generator network tries to produce an
output image that is not only visually similar to the input image but also semantically similar in the
sense that it is classified similarly by CLIP.</p>
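      <p>As an illustration, this straightforward variant amounts to the following loss. The sketch is illustrative (it reuses the assumed clip_embed helper) and is not the final CLIPTraVeLGAN objective:</p>
      <preformat>
import tensorflow as tf

def naive_clip_identity_loss(clip_embed, x, fake_y):
    # Penalize any change of the CLIP embedding between input and translated image.
    e_in = tf.math.l2_normalize(clip_embed(x), axis=-1)
    e_out = tf.math.l2_normalize(clip_embed(fake_y), axis=-1)
    cos_sim = tf.reduce_sum(e_in * e_out, axis=-1)
    # Maximizing this similarity also punishes legitimate domain changes
    # (e.g. yellow lane lines becoming white), which is why we moved to the TraVeL setup.
    return -tf.reduce_mean(cos_sim)
      </preformat>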
      <p>To achieve this, we used a pre-trained CLIP model as a feature extractor and obtained the CLIP
embeddings of both the input and generated images. We then used the cosine similarity between these
embeddings to evaluate the semantic similarity between the two images. Such a straightforward method
is not effective even for GTA to Cityscapes translation: it prevents translating yellow lane lines into the
white lane lines that are commonly found in Cityscapes images. Translating “men into women” or
“horse to zebra” conflicts with such semantic identity too. To avoid this effect, we developed the idea of
using CLIP in the TraVeLGAN setup. By using CLIP in this way, we aim to improve the semantic
robustness of our image-to-image translation model while still allowing it to translate details that must be
translated. There is one more limitation of our method: when domain images are very specific and
do not occur in the CLIP training dataset, our method is almost useless. For example, we tried to
translate GTA label maps into Cityscapes photos and could not prevent semantic flipping in this
task. Finally, we suggest exploring the use of multiple CLIP models with different pre-training data
and architectures to improve the robustness and accuracy of the generated images. This could lead to
even better performance and expand the range of applications for image translation. A possible option is to
use two or three different CLIP models (for example, one based on ResNet and one on vision transformers)
to increase the reliability of the Siamese role.</p>
    </sec>
    <sec id="sec-10">
      <title>7. Conclusions</title>
      <p>In this paper, we proposed a novel approach for semantically robust unpaired image translation,
CLIPTraVeLGAN. CLIPTraVeLGAN simplifies the Siamese network selection and training process
of TraVeLGAN by using a contrastively pretrained language-image model (CLIP) with frozen
weights. To our knowledge, we were the first to utilize CLIP to enforce that the individual features
and semantic content of the input image are reflected in the generated image during image-to-image
translation. The proposed model CLIPTraVeLGAN proved to be much easier to train than the original
TraVeLGAN. The training is quite stable, and the results are comparable with those of other models built on
the same generators and discriminators. Our results show that CLIPTraVeLGAN outperforms both
CycleGAN and TraVeLGAN in terms of semantic robustness while being easier to train and
producing comparable results in terms of quality.</p>
      <p>
        The methodology for achieving semantic robustness in image translation typically involves
training a model to effectively preserve the high-level semantics between the source and target
domains. This can be done using a variety of techniques, including the use of specialized loss
functions and Vector Symbolic Architectures [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our results are close to state-of-the-art Vector
Symbolic Architectures but our approach is simpler and more straightforward.
      </p>
      <p>There is a trade-off between image quality and semantic robustness. In our experiments, we
manually searched for the optimal value of λ to achieve satisfactory results according to both criteria.
In future works, the automated estimation of the optimal value of λ during training has the potential to
improve the results and stability of training. The proposed model showed promising results in terms
of semantic robustness and ease of training and can be used as a starting point for future research on
efficient image translation models. Possible future work includes exploring the use of multiple CLIP
models and new generator and discriminator designs.</p>
    </sec>
    <sec id="sec-11">
      <title>8. Acknowledgements</title>
      <p>The work was performed as part of the state budget research project “Development of methods and
algorithms for combined learning of deep neuro-neo-fuzzy systems under short training set
conditions” (state registration number 0122U001701) of Artificial Intelligence Department of Kharkiv
National University of Radio Electronics.</p>
    </sec>
    <sec id="sec-12">
      <title>9. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. Foster, Generative Deep Learning: Teaching Machines to Paint, Write, Compose and Play, O'Reilly Media, Inc., 2019.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Langr, V. Bok, GANs in Action: Deep Learning with Generative Adversarial Networks, Manning Publications Co., 2019.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Y. Pang, J. Lin, T. Qin, Z. Chen, Image-to-Image Translation: Methods and Applications, IEEE Transactions on Multimedia 24 (2022): 3859-3881. doi:10.1109/TMM.2021.3109419.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Amodio, S. Krishnaswamy, TraVeLGAN: Image-to-Image Translation by Transformation Vector Learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019, pp. 8983-8992. doi:10.1109/CVPR.2019.00919.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, Learning Transferable Visual Models from Natural Language Supervision, in: Proceedings of the International Conference on Machine Learning, 2021.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks, in: Proceedings of the IEEE International Conference on Computer Vision, ICCV, Venice, Italy, 2017, pp. 2242-2251. doi:10.1109/ICCV.2017.244.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Jia et al., Semantically Robust Unpaired Image Translation for Data with Unmatched Semantics Statistics, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 2021, pp. 14253-14263. doi:10.1109/ICCV48922.2021.01401.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. D. Theiss et al., Unpaired Image Translation via Vector Symbolic Architectures, in: Proceedings of the European Conference on Computer Vision, 2022.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Jang, A. S. Uzsoy, P. Culliton, I'm Something of a Painter Myself, Kaggle, 2020. URL: https://kaggle.com/competitions/gan-getting-started.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. R. Richter et al., Playing for Data: Ground Truth from Computer Games, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 102-118.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Cordts et al., The Cityscapes Dataset for Semantic Urban Scene Understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 2016, pp. 3213-3223. doi:10.1109/CVPR.2016.350.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>