=Paper=
{{Paper
|id=Vol-3344/paper04
|storemode=property
|title=Generating Manga Images from Sketch Via GAN
|pdfUrl=https://ceur-ws.org/Vol-3344/paper04.pdf
|volume=Vol-3344
|authors=Ziyuan Liu,Ye Ding,Qi Wan,Zhiyang He,Qing Zhang
|dblpUrl=https://dblp.org/rec/conf/icceic/LiuDWHZ22
}}
==Generating Manga Images from Sketch Via GAN==
Ziyuan Liu, Ye Ding, Qi Wan, Zhiyang He, Qing Zhang

Dongguan University of Technology, Dongguan, Guangdong, China

ICCEIC2022 @ 3rd International Conference on Computer Engineering and Intelligent Control

EMAIL: *dingye@dgut.edu.cn (Ye Ding)

'''Abstract:''' Image generation has been a popular research topic in computer graphics and computer vision. However, most existing image generation works focus on real-life photographs rather than manga images, and generating manga images directly with photographic image generation methods often results in poor visual performance. We propose a novel manga generation method based on sketch images via GAN. The proposed method does not rely on real-life photographs and does not require explicit tags: the manga images are generated from sketch images painted by the user, and each generated manga image has labels and outlines consistent with the original sketch while being rendered in manga style. In extensive experiments on the public dataset AnimeFace, compared with the state-of-the-art methods Pix2Pix and SofGAN, the sketch detection model reduces FID by 10.9% relative to SofGAN, and the PSNR of the proposed model is higher than Pix2Pix and SofGAN by 1.4% and 3.9%, respectively. These qualitative and quantitative evaluations show that our manga generation method has excellent visual performance and provides controllable, label-free generation of manga images. Statistically, the proposed method outperforms the state of the art.

'''Keywords:''' generative adversarial network; image generation; semantic segmentation; style migration

===1. INTRODUCTION===

Image generation has been a popular research topic in computer graphics and computer vision. Manga-oriented image generation methods usually work in one of two ways: 1) generating manga images randomly through training, such as Pix2Pix [1], Pix2PixHD [2], DCGAN [3], and WGAN [4]; however, the results of a random generation model are hard to control, which is impractical in most application scenarios; and 2) generating manga images based on user-specified tags, such as SofGAN [4] and SIS [5]; however, due to the complexity and limited interpretability of computer-generated labels, it is difficult for actual users to specify the desired labels.

To overcome the above disadvantages, we propose a novel manga generation method based on sketch images via GAN. The proposed manga image generation model is shown in Figure 1. The model consists of two parts: 1) the sketch detection model, which generates a feature matrix consisting of feature tags and their positions from the original sketch image through multiple convolution layers (we visualize the feature matrix as a feature map for comparative analysis); and 2) the manga image generator, which takes the feature matrix as input and, through a texture generator, produces a manga image whose contents are similar to the original sketch image. The two parts are trained separately; we do not need paired data for training and can generate manga based on arbitrary sketches drawn by the user. We performed a quantitative and qualitative comparison of the generated manga images, and the evaluation results show that our system can generate visually pleasing, highly faithful images that express the user's needs.
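To make the two-part design concrete, the following is a minimal, illustrative sketch of how the two separately trained parts could be composed at inference time. The module and function names (SketchDetector, MangaGenerator, generate_manga) and all layer choices are hypothetical placeholders for illustration, not the authors' implementation.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchDetector(nn.Module):
    """Hypothetical stand-in for the sketch detection model: maps a
    1-channel sketch to a per-pixel feature (label) matrix."""
    def __init__(self, num_labels: int = 19):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_labels, 1),
        )

    def forward(self, sketch):
        feats = self.backbone(sketch)
        # Upsample so the feature matrix aligns spatially with the sketch.
        return F.interpolate(feats, size=sketch.shape[-2:], mode="bilinear",
                             align_corners=False)

class MangaGenerator(nn.Module):
    """Hypothetical stand-in for the texture generator that renders a
    manga image from the feature matrix and a 512-d style vector."""
    def __init__(self, num_labels: int = 19, style_dim: int = 512):
        super().__init__()
        self.to_rgb = nn.Conv2d(num_labels + style_dim, 3, 1)

    def forward(self, feature_matrix, style):
        # Broadcast the style vector over the spatial grid and fuse it
        # with the label features before rendering RGB.
        b, _, h, w = feature_matrix.shape
        style_map = style.view(b, -1, 1, 1).expand(b, style.shape[1], h, w)
        return torch.tanh(self.to_rgb(torch.cat([feature_matrix, style_map], dim=1)))

def generate_manga(sketch, detector, generator, style):
    """Compose the two separately trained parts: sketch -> feature matrix -> manga."""
    with torch.no_grad():
        return generator(detector(sketch), style)
</pre>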
Figure 1. The framework of the sketch-to-manga algorithm

===2. METHODS===

The framework of the sketch-to-manga algorithm is formulated in (1), where x represents the sketch, A(x) the feature matrix, w the texture vector for manga, G(·) the transformation of the feature matrix into a manga texture, and S(·) the whole sketch-to-manga algorithm:

S(x) = G(A(x), w).    (1)

In the following, we introduce the sketch detection model and the manga image generator in detail.

====2.1. The Sketch Detection Model====

The function of the sketch detection model is to extract facial features from sketches. Its core task is to recognize and extract the key points of the sketched face, which is the key to label-free manga generation. The module first passes the input image through several convolutional groups to obtain feature matrices at different levels; the higher the level of a feature map, the coarser the detail, and the lower the level, the finer the detail. To let the module learn the overall features of the input image while preserving its details, we take the high-, middle-, and low-level feature maps as the input of the subsequent network. The high-level feature maps are added to and fused with the low-level feature maps and upsampled to obtain new feature maps carrying more information. We finally obtain three layers from the high, middle, and low levels, representing feature maps with coarse, medium, and fine features, respectively. These feature maps are passed into fully convolutional networks with stride 2, with the fewest convolutions applied to the high-level map, more to the middle-level map, and the most to the low-level map, so that the feature size is continuously reduced during convolution and several individual 512-dimensional vectors are obtained and fed into the generator. Eventually, the sketched facial features are presented as a face with a reasonable facial structure.

However, the output obtained at this point is not in the form of a segmentation map and cannot serve as the input of the subsequent modules. To extract the feature matrix from this output, we propose the SegExtract part, based on the real-time semantic segmentation network BiSeNet [6]: the output is fed into the SegExtract part for segmentation map extraction, which finally yields the feature matrix extracted from the sketch. The network structure of this part is given in (2)-(4), where x represents the input, out the output, cp the features containing contextual semantic information, sp the features containing spatial information, f_i(·) the 1/i-size feature map obtained after downsampling, ARM(·) the partially optimized ARM [7] features, and FFM(·) the partial feature fusion:

sp = f_8(x).    (2)

cp = ARM(f_16(x)) ⊕ ARM(f_32(x)) ⊕ f_32(x).    (3)

out = FFM(sp, cp).    (4)

Since the contextual semantic features and the spatial features obtained above have different output levels and cannot be fused directly, this part reweights them so that features representing different levels can be fused. The feature maps obtained from the two paths are therefore passed to the FFM [8] part for feature fusion, and finally the output layer is convolved and upsampled to obtain the final result.
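To illustrate Eqs. (2)-(4), the following is a minimal PyTorch-style sketch of the spatial/context fusion in the SegExtract part, assuming BiSeNet's usual 1/8, 1/16, and 1/32 downsampling ratios. The module names, channel sizes, and upsampling choices are illustrative assumptions, not the exact SegExtract implementation.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRefinement(nn.Module):
    """ARM(.): reweight a feature map with channel attention from global pooling."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.attn(x)

class FeatureFusion(nn.Module):
    """FFM(.): concatenate sp and cp, project, then reweight the fused features."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.project = nn.Sequential(nn.Conv2d(in_channels, out_channels, 1), nn.ReLU())
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, sp, cp):
        fused = self.project(torch.cat([sp, cp], dim=1))
        return fused + fused * self.attn(fused)

class SegExtract(nn.Module):
    """Sketch of Eqs. (2)-(4): sp from the 1/8 spatial path, cp from refined
    1/16 and 1/32 context features, fused by FFM and decoded into labels."""
    def __init__(self, num_labels=19, c8=128, c16=256, c32=512):
        super().__init__()
        self.arm16 = AttentionRefinement(c16)
        self.arm32 = AttentionRefinement(c32)
        self.reduce16 = nn.Conv2d(c16, c8, 1)
        self.reduce32 = nn.Conv2d(c32, c8, 1)
        self.ffm = FeatureFusion(2 * c8, c8)
        self.head = nn.Conv2d(c8, num_labels, 1)

    def forward(self, f8, f16, f32):
        sp = f8                                                   # Eq. (2)
        cp16 = self.reduce16(self.arm16(f16))
        cp32 = self.reduce32(self.arm32(f32) + f32)               # Eq. (3), deepest feature kept as residual
        size = f8.shape[-2:]
        cp = (F.interpolate(cp16, size=size, mode="bilinear", align_corners=False)
              + F.interpolate(cp32, size=size, mode="bilinear", align_corners=False))
        out = self.head(self.ffm(sp, cp))                         # Eq. (4)
        return F.interpolate(out, scale_factor=8, mode="bilinear", align_corners=False)
</pre>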
====2.2. The Manga Image Generator====

To achieve attribute-specific generation, we implement the mapping from the feature matrix to manga images as follows. We adopt the core idea of StyleGAN [9], a generative model that uses a progressive resolution enhancement strategy within an adversarial network: starting from a very low resolution, the model builds up layer by layer to a high resolution in order to control image attributes. StyleGAN obtains the direction vectors of specific attributes of manga images from a large number of manga images, and then reconstructs the face feature vectors based on these direction vectors to generate manga images whose manga effect is consistent with the sketch contours. This process yields the style vector. The framework of our manga image generator is shown in Figure 2.

Figure 2. The framework of manga generation

The manga image generator divides the identified feature maps into two parts weighted by p and 1 − p. We randomly sample the manga texture space to obtain style vectors z_1 and z_2, encode and decode the style vector for each label, and fuse the style vectors of the two parts to obtain the manga image. Since the goal is to encode the spatial constraints of the StyleGAN synthesis process while preserving the generation quality of the pre-trained StyleGAN, we need to map the encoding conditions precisely to the corresponding parts of the original synthesis process. To achieve this, we formulate the objective of the training process as in (5), where z_1 and z_2 represent the two style vectors mixed by the generator, W(·) the decoding/encoding of the feature labels, p (with a value between 0 and 1) the similarity between the two styles, β and γ the mean and variance of the spatially adaptive normalization parameters, F_label the matrix of different labels, and F_out the final generated manga image:

F_out = γ · (F_label ∗ W(z_1) · p + F_label ∗ W(z_2) · (1 − p)) + β.    (5)
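As a minimal illustration of Eq. (5), the snippet below blends two per-label style codes with weight p and then applies the spatially adaptive modulation parameters γ and β. The tensor shapes and the helper name blend_styles are assumptions made for this example, not the generator's actual code.

<pre>
import torch

def blend_styles(f_label: torch.Tensor,   # (B, C, H, W) label feature matrix F_label
                 w_z1: torch.Tensor,      # (B, C) decoded style W(z_1)
                 w_z2: torch.Tensor,      # (B, C) decoded style W(z_2)
                 p: float,                # similarity weight in [0, 1]
                 gamma: torch.Tensor,     # (B, C, H, W) spatially adaptive scale
                 beta: torch.Tensor       # (B, C, H, W) spatially adaptive shift
                 ) -> torch.Tensor:
    """F_out = gamma * (F_label * W(z_1) * p + F_label * W(z_2) * (1 - p)) + beta."""
    s1 = f_label * w_z1.view(*w_z1.shape, 1, 1)   # modulate the labels with style 1
    s2 = f_label * w_z2.view(*w_z2.shape, 1, 1)   # modulate the labels with style 2
    mixed = p * s1 + (1.0 - p) * s2               # style mixing with weight p
    return gamma * mixed + beta                   # spatially adaptive normalization

# Example with two 512-d styles broadcast over an 8x8 label grid.
f = torch.randn(1, 512, 8, 8)
out = blend_styles(f, torch.randn(1, 512), torch.randn(1, 512), p=0.7,
                   gamma=torch.ones(1, 512, 8, 8), beta=torch.zeros(1, 512, 8, 8))
</pre>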
===3. EXPERIMENTS AND RESULTS===

====3.1. Datasets====

Among the publicly available sketch dataset resources on the web, no sketch dataset was found that meets the needs of this project, while manually drawing a sketch dataset would require too much work and time. We therefore tried several ways to obtain sketch datasets, as shown in Figure 3.

Figure 3. Comparison of sketches generated by different methods

In Figure 3, (a) uses Anime2Sketch [10]; since the sketch generated in (a) contains too much detail, we apply a sketch simplification algorithm [11] to obtain (b); (c) is the result of manual processing with Photoshop; and (d) is the result of our algorithm. We can see that (d) is the most similar to a hand-drawn sketch.

Figure 4. The flow of the sketch production

The process from real image to sketch is shown in Figure 4. It begins by extracting the contours of facial features from real images. An existing semantic segmentation model is used to segment the facial features of the real image to obtain pixel-level semantic class labels, and each semantic class is filled with a different, distinguishable colour; Figure 5(b) is generated from Figure 5(a) in this way. A contour filtering operation is then applied to Figure 5(b) to extract the contours of the facial features of the real image, yielding Figure 5(c). The extracted contours are then converted to greyscale and binarized to obtain the binary image in Figure 5(e). However, this image is not coherent after enlargement, and the contour edges are unclear. Therefore, a morphological erosion operation is applied to obtain Figure 5(f), which is the final sketch.

The face dataset used in this paper for sketch production is the publicly available CelebA-HQ, which contains 30,000 high-quality real face images. The manga dataset is the publicly available manga face dataset AnimeFace, downloaded from the face dataset material website (seeprettyface.com).
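The sketch-production flow described above (segmentation map, contour filtering, greyscale conversion and binarization, morphological clean-up) can be approximated with standard OpenCV operations, as in the sketch below. The segmentation map is assumed to come from an existing model, and the thresholds and kernel size are illustrative choices rather than the values used in the paper.

<pre>
import cv2
import numpy as np

def segmentation_to_sketch(seg_map: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Turn a colour-coded facial segmentation map (H, W, 3) into a binary sketch.

    seg_map is assumed to be the output of an existing semantic segmentation
    model, with each semantic class filled with a distinct colour.
    """
    # 1) Contour filtering: detect boundaries between differently coloured regions.
    gray = cv2.cvtColor(seg_map, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, threshold1=50, threshold2=150)

    # 2) Binarize so contour pixels become black strokes on a white background.
    _, binary = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY_INV)

    # 3) Morphological clean-up: eroding the white background thickens the
    #    dark strokes so enlarged contours stay coherent.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.erode(binary, kernel, iterations=1)

# Usage with a hypothetical colour-coded segmentation map:
# seg = cv2.imread("face_segmentation.png")
# cv2.imwrite("sketch.png", segmentation_to_sketch(seg))
</pre>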
====3.2. The Sketch Detection Model====

Figure 5. The effect of the sketch detection model

Figure 5(a)-(f) show six visualization results produced by the sketch detection model, with the sketch on the left of each image and the recognized label image on the right. The three sketches in (a), (c), and (e) are from the sketch dataset, and the three in (b), (d), and (f) are manually drawn sketches. For sketches with the same drawing style as the training dataset, the feature maps have extremely high similarity and are consistent with the sketches. For sketches with a style different from the training dataset, the feature maps are still consistent with the sketch's facial orientation, facial-feature layout, and overall structure. In general, the recognition effect of the module is as expected.

The FID comparison of different sketch detection models is given in Table 1. A smaller FID value means a smaller distance between images. On hand-drawn sketches, the FID between the segmentation maps generated by the SofGAN model and the real segmentation maps tends toward 14, while the FID of our sketch detection model is 11.96. This comparison confirms that the segmentation maps generated by the proposed sketch detection module have a small distance from the real segmentation images, i.e., the similarity is high.

{| class="wikitable"
|+ Table 1. FID values of different models
! Model !! SofGAN !! Ours
|-
! FID
| 13.42 || 11.96
|}

====3.3. The Manga Image Generator====

Figure 6. Rendering of different segmentation maps

We use the trained manga image generator to generate manga avatars from the feature matrix. The experimental results are shown in Figure 6. Figure 6(a)-(f) show six manga avatars generated from feature matrices, with the feature matrix on the left and the generated manga avatar on the right of each image. The generated manga avatars correspond to the feature maps in terms of facial proportion, facial orientation, distribution of facial features, and expression. The model behaves as expected.

====3.4. Overall Results====

As shown in Figure 7, the system without sketch recognition cannot extract labels and fails to generate reasonable manga avatars. The system with sketch recognition performs well on both the sketch dataset and manually hand-drawn sketches: it generates manga avatars that correspond to the sketches in terms of facial orientation, layout of the facial features, and expression, and the generated results are as expected. Figure 8 shows the results of different sketch-to-manga algorithms. The outline and pose of our generated manga image match the hand-drawn sketch better, which better expresses the drawer's intention.

Figure 7. Rendering of different sketches

Figure 8. Rendering of different sketch-to-manga algorithms

To objectively confirm the module's effectiveness, we evaluate the similarity between the sketches and the generated manga avatars using the PSNR index. A larger PSNR value indicates a smaller difference between the two images. From the results in Table 2, we can see that the manga generation system implemented in this paper performs better than the other algorithms.

{| class="wikitable"
|+ Table 2. PSNR values of different models
! Pix2Pix !! SofGAN !! Ours
|-
| 28.83 || 28.14 || 29.24
|}
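For reference, PSNR has a standard definition, 10·log10(MAX²/MSE); the short function below implements that definition and is not the authors' evaluation script.

<pre>
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two same-sized images (higher = more similar)."""
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    mse = np.mean((a - b) ** 2)          # mean squared error over all pixels and channels
    if mse == 0:
        return float("inf")              # identical images
    return 10.0 * np.log10((max_value ** 2) / mse)

# Example: score = psnr(generated_image, reference_image) for two uint8 arrays.
</pre>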
===4. CONCLUSION AND PROSPECT===

Motivated by social and market demands, we propose an algorithm that generates manga images from sketches. After reviewing the relevant literature published in recent years, we divide current manga image generation algorithms into three categories: 1) generating manga images with manga features from real photos, which cannot control the local feature attributes of the images (such as eyes, nose, and mouth) and requires real photos as input; 2) generating manga images randomly from training data, which produces random results and likewise cannot control the local feature attributes; and 3) generating manga images from specified labels, which can control the local feature attributes, but the labels must be selected first, the generated images look strange if the labels are chosen incorrectly, and selecting the labels every time is tedious. To address these shortcomings, we propose an algorithm that explicitly controls the generated manga images through hand-drawn sketches without selecting labels. The experiments show that the proposed network model achieves a good manga generation effect on hand-drawn sketches, and that the over-fitting problem of the network is reduced after data filtering, which further improves the quality of manga image generation. Many problems remain unsolved. For example, the current label recognition cannot separate the left and right eyes, so it cannot render manga for hand-drawn sketches in which the two eyes have different poses. Recognizing sketches with more feature details to produce manga drawings with richer pose contours is an important direction for our future research.

===5. ACKNOWLEDGEMENTS===

This work is supported in part by the National Natural Science Foundation of China under grant nos. 61976051, U19A2067, and U1811463.

===6. REFERENCES===

[1] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1125-1134.

[2] T. C. Wang, M. Y. Liu, J. Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8798-8807.

[3] A. Radford, L. Metz, and S. Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv preprint arXiv:1511.06434, 2015.

[4] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein Generative Adversarial Networks,” International Conference on Machine Learning (ICML), PMLR, 2017, pp. 214-223.

[5] T. Park, M. Y. Liu, T. C. Wang, and J. Y. Zhu, “Semantic Image Synthesis with Spatially-Adaptive Normalization,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2337-2346.

[6] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation,” Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325-341.

[7] T. Shen, T. Zhou, G. Long, J. Jiang, and C. Zhang, “Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling,” arXiv preprint arXiv:1804.00857, 2018.

[8] Z. Wu, C. Shen, and A. van den Hengel, “Real-Time Semantic Image Segmentation via Spatial Sparsity,” arXiv preprint arXiv:1712.00213, 2017.

[9] A. Tewari, M. Elgharib, G. Bharaj, F. Bernard, and C. Theobalt, “StyleRig: Rigging StyleGAN for 3D Control over Portrait Images,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6142-6151.

[10] X. Xiang, D. Liu, X. Yang, Y. Zhu, and J. P. Allebach, “Adversarial Open Domain Adaption for Sketch-to-Photo Synthesis.”

[11] E. Simo-Serra, S. Iizuka, K. Sasaki, and H. Ishikawa, “Learning to Simplify: Fully Convolutional Networks for Rough Sketch Cleanup,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1-11, 2016.