CNN-based System for Low Resolution Face Recognition (Discussion Paper)

Fabio Valerio Massoli[0000-0001-6447-1301], Giuseppe Amato[0000-0003-0171-4315], Fabrizio Falchi[0000-0001-6258-5313], Claudio Gennaro[0000-0002-0967-5050], and Claudio Vairo[0000-0003-2740-4331]

ISTI-CNR, via G. Moruzzi 1, 56124 Pisa, Italy
{fabio.massoli, giuseppe.amato, fabrizio.falchi, claudio.gennaro, claudio.vairo}@isti.cnr.it

Abstract. Since the publication of AlexNet in 2012, deep convolutional neural network models have become the most promising and powerful technique for image representation. In particular, the ability of their inner layers to extract high level abstractions of the input images, called deep feature vectors, has been widely exploited. Such vectors live in a high dimensional space in which an inner product, and thus a metric, is defined, which allows similarity measurements to be carried out among them. This property is particularly useful for tasks such as Face Recognition: to identify a person, deep features used as face descriptors can be compared across identities by means of their similarities. Surveillance systems, among others, rely on this technique: deep features extracted from probe images are matched against a database of descriptors from known identities. A critical point is that the database typically contains features extracted from high resolution images, while the probes, taken by surveillance cameras, can be at a very low resolution. It is therefore mandatory to have a neural network able to extract deep features that are robust with respect to resolution variations. In this paper we discuss a CNN-based pipeline that we built for the task of face recognition among images with different resolutions. The entire system relies on the ability of a CNN to extract deep features that can be used to perform a similarity search in order to fulfill the face recognition task.

Keywords: Convolutional Neural Networks · Face Recognition · Ensemble Methods.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.

1 Introduction

Content based image retrieval (CBIR) is one of the most active research fields in the computer vision community. In this context, commonly faced tasks are instance-level retrieval and class retrieval [16]. In the former, given a query image, the goal is to retrieve images that contain the same object regardless of image distortions such as different illumination, rotation or occlusion. In the latter, the purpose is to retrieve all the available images that belong to the same class.

Before the advent of Convolutional Neural Networks (CNNs), methods based on the scale-invariant feature transform (SIFT) [12] were among the most frequently used to extract global descriptors from images. A breakthrough occurred in 2012, when AlexNet [9] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), improving upon the state of the art by a noticeable margin. Since then, CNN-based methods for image retrieval have received considerably more attention from the scientific community [1], [15]. Under the hood, these methods rely on the ability of deep models to extract the so called deep features from the input images.
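For concreteness, the following minimal sketch (ours, not part of the paper) shows what "extracting deep features" means in practice: a pretrained torchvision ResNet-50 has its classification layer replaced by an identity, so the network outputs the activation of its penultimate layer. The paper itself uses a different, face-specific backbone, described in Section 3.

```python
# Minimal sketch of deep feature extraction (illustrative only; the paper
# uses an SE-ResNet-50 trained on VGGFace2, not this ImageNet model).
import torch
from torchvision import models

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classifier: the output is now the
                                # 2048-d pooled penultimate activation
model.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)  # stand-in for preprocessed images
    features = model(batch)              # deep features, shape (4, 2048)
print(features.shape)
```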
From a theoretical perspective, the inner layers of a CNN realize an abstraction of the input that describes specific concepts contained in the data. Moreover, due to the typical structure of deep architectures, inner layers combine the information available from previous layers, thus achieving a higher level of abstraction that summarizes the overall content of the input. Based on this observation, deep features are usually adopted as global descriptors for input images: the deeper the layer from which deep features are extracted, the more descriptive of the input they are. It is common practice to extract them from the penultimate layer of a CNN. As previously said, deep features are high dimensional vectors living in a space on which an inner product, and thus a metric, is defined. This property is fundamental, since it allows us to evaluate similarities among descriptors extracted from different images, which can be used as indicators of the similarity of the content of the original data. An example of this principle is sketched in Figure 1.

Figure 1. Schematic example of a CNN-based feature extraction method for images that belong to two different classes. Vectors from images of the same class tend to cluster in the arrival space.

This concept is typically applied in the context of surveillance systems. In a scenario where an input face image is acquired by a camera, its global descriptor is extracted and used to perform a similarity search on a database (db) containing a gallery of feature vectors that belong to known identities. For example, the search in the db can be accomplished by evaluating the cosine similarity, or the Euclidean distance, between the probe and gallery vectors. The similarity search becomes even more challenging when gallery and probe come from different resolution domains.

This is the background from which we started the study we present in this paper. Our final goal is to conceive a pipeline for face recognition based on neural networks. To extract deep features we used a pre-trained ResNet-50 [7] architecture with Squeeze-and-Excitation blocks [8]. The extracted descriptors are then used to perform similarity measurements. The performance of the deep model used for feature extraction has been evaluated on the 1:1 verification protocol of the IARPA Janus Benchmark-B (IJB-B) dataset [17].
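As an illustration of the similarity search just described, here is a minimal NumPy sketch (ours, with random stand-in vectors rather than real descriptors) that matches one probe against a gallery of known identities via cosine similarity:

```python
# Sketch of a gallery search with cosine similarity over deep features.
# The vectors are random stand-ins for real 2048-d descriptors.
import numpy as np

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 2048))  # descriptors of known identities
probe = rng.normal(size=2048)            # descriptor of the query face

# L2-normalize so that the dot product equals the cosine similarity
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
probe /= np.linalg.norm(probe)

scores = gallery @ probe                 # cosine similarity to each identity
best = int(np.argmax(scores))
print(f"closest identity: {best}, similarity: {scores[best]:.3f}")
```

Because all vectors are L2-normalized, the cosine similarity reduces to a plain dot product, and the squared Euclidean distance becomes a monotonic function of it, so the two rankings coincide.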
1.1 Paper Organization

The remaining part of the paper is organized as follows. In Section 2 we briefly review some related works. In Section 3 we describe in detail the pipeline that we implemented. In Section 4 we present the experimental results. In Section 5 we conclude the paper with a summary of the main results and future perspectives.

2 Related Works

Deep learning techniques are currently experiencing a huge expansion in their field of application, mainly as a result of the extremely high computational power reached by modern GPUs. Moreover, the availability of big datasets [11], [18], [3], [2], [6] has made it possible to train neural networks that nearly reach human levels of performance on tasks such as image classification [9], object detection [4] and face recognition [14]. Due to its wide range of applications, the task of Face Recognition (FR) is among the hottest topics in the computer vision community. In particular, FR plays a key role in the context of smart surveillance systems [10], [13]. In such systems, a low resolution face image, taken by a surveillance camera, typically has to be matched against a database containing deep features extracted from high resolution images. To this end, several techniques have been developed to train deep models to deal with low resolution images; two examples are Super Resolution and Common Space Projection.

Super Resolution is based on the ability of a neural network to synthesize a high resolution image starting from a low resolution one; the recognition task is then fulfilled in the high resolution domain [5]. One of the weaknesses of this technique is that identity information can be lost. In [20], the authors developed a neural network that, together with the super resolution task, tries to preserve the identity of the initial low resolution image in the high resolution one. Common Space Projection techniques instead train a neural network to minimize the distance, in a common space, between deep features extracted from a low resolution image and from its high resolution counterpart. For example, in [19] the authors train a two-branch CNN to learn a mapping from the high/low resolution domains to a common space. Specifically, given a low and a high resolution image, the model extracts feature vectors of size 2048, whose distance is evaluated and used as loss to drive the training in the desired direction.

3 Pipeline

In this section we briefly describe the main modules of the pipeline we developed. A schematic view is shown in Figure 2.

Figure 2. Sketch of the implemented pipeline: face detector, CNN for feature extraction, probe and gallery.

There are three main components at the heart of the system: a face detector, a features extractor and a classifier. Here we focus on the first two. The face detection task is accomplished by means of a multi-stage architecture that, given an input image, delivers the coordinates of the bounding boxes centred around each face visible in the image. Specifically, we used the Multi-task Cascaded Convolutional Neural Networks (MTCNN) [21]. This step is performed once for each input frame. After all the faces have been identified in the picture, they are cropped, preprocessed and then used as input for the features extractor. The preprocessing step includes rescaling the image so that its shortest side measures 256 pixels, cropping the central square region with a side of 224 pixels, and normalizing the image.

The features extractor module is a ResNet-50 [7] architecture, equipped with Squeeze-and-Excitation blocks [8], that has been pretrained on the VGGFace2 dataset [3]. Feature vectors are extracted before the classification layer; they have dimensionality equal to 2048 and are L2-normalized before any metric is evaluated on them.
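The following sketch illustrates the extraction stage under the preprocessing just described (shortest side to 256 pixels, central 224x224 crop, normalization, L2-normalized 2048-d output). It is only an approximation of our pipeline: `detect_faces` stands in for the MTCNN detector, `model` for the SE-ResNet-50 backbone, and the normalization statistics are the common ImageNet ones, which the actual VGGFace2-trained model may replace with its own.

```python
# Sketch of the crop -> preprocess -> extract -> L2-normalize stage.
# `detect_faces` and `model` are hypothetical placeholders.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),      # shortest side -> 256 px
    transforms.CenterCrop(224),  # central square region of side 224 px
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed statistics;
                         std=[0.229, 0.224, 0.225]),  # model-specific in practice
])

def describe_faces(image: Image.Image, model, detect_faces):
    """Crop each detected face and return its unit-norm 2048-d descriptor."""
    descriptors = []
    for (x1, y1, x2, y2) in detect_faces(image):      # MTCNN bounding boxes
        face = image.crop((x1, y1, x2, y2))
        x = preprocess(face).unsqueeze(0)             # (1, 3, 224, 224)
        with torch.no_grad():
            feat = model(x)                           # (1, 2048), pre-classifier
        descriptors.append(F.normalize(feat, dim=1))  # L2-normalize
    return descriptors
```

In deployment, the descriptors returned here would either populate the gallery database (for known identities) or be matched against it as probes.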
4 Experimental Tests

In order to test the performance of the features extractor we used the IJB-B dataset [17]. In particular, we tested the model on the 1:1 verification protocol, aiming to estimate its ability to extract discriminative features. Given the low resolution requirements of a surveillance system, we conducted the performance evaluation on the dataset at different resolutions. In Figure 3, we show an example of the various resolution versions of the test images. The first column contains the full resolution test images, while from the second to the last column down sampled images in the [8, 256] pixels range are shown.

Figure 3. Example of images down sampled at various resolutions. First column: images at full resolution. Second to last columns: down sampled images in the [8, 256] pixels range.

The resulting Receiver Operating Characteristic (ROC) for the 1:1 verification task is shown in Figure 4. To evaluate the similarity among the different feature vectors we measured their cosine similarity. As is clear from Figure 4, the performance of the features extractor degrades at lower resolutions, especially below 32 pixels. Moreover, in Table 1 we report the True Acceptance Rate (TAR) at a reference False Acceptance Rate (FAR) of 1e-3.

Figure 4. Feature extractor Receiver Operating Characteristic (ROC), TAR vs. FAR, for different values of the image resolution: full resolution and down sampling (DS) at 8, 16, 24, 32, 64 and 128 pixels.

Table 1. True Acceptance Rate (TAR @ FAR = 1e-3) for different values of the image resolution (in pixels).

  Architecture  |   8  |  16  |  24  |  32  |  64  | 128  | 256
  --------------|------|------|------|------|------|------|------
  SE-ResNet-50  |  4.8 | 24.4 | 60.2 | 77.0 | 88.3 | 89.5 | 89.8

Up until now, we have considered the case in which the images have the same resolution. In a real scenario, probe and gallery usually have different resolutions: in surveillance systems, for instance, the gallery database is commonly populated with descriptors of high resolution images, while the probe has a lower resolution. In Table 2 we report the TAR @ FAR = 1e-3 for the cross resolution 1:1 verification task. It is clear from Table 2 that the cross resolution face recognition task is very challenging for a deep neural network model, especially when the images have very low resolutions.

Table 2. True Acceptance Rate (TAR @ FAR = 1e-3) for the 1:1 verification protocol considering images with mixed resolutions (rows and columns: image resolution in pixels).

       |   8  |  16  |  24  |  32  |  64  | 128  | 256
  -----|------|------|------|------|------|------|------
    8  |  4.8 |      |      |      |      |      |
   16  |  0.2 | 24.4 |      |      |      |      |
   24  |  0.2 | 18.6 | 60.2 |      |      |      |
   32  |  0.2 |  9.0 | 65.3 | 77.0 |      |      |
   64  |  0.2 |  2.9 | 60.4 | 80.5 | 88.3 |      |
  128  |  0.2 |  2.4 | 57.9 | 80.1 | 88.9 | 89.5 |
  256  |  0.2 |  2.3 | 57.5 | 80.1 | 90.0 | 89.7 | 89.8
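For reference, the TAR @ FAR figures above correspond to a single point on the ROC curve. A minimal sketch of how such a value can be computed from verification scores (with synthetic scores in place of the real protocol data):

```python
# Sketch of the 1:1 verification metric: TAR at a fixed FAR (here 1e-3),
# computed from similarity scores of verification pairs. The labels and
# scores below are synthetic stand-ins for real protocol data.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)                   # 1 = same identity
scores = rng.normal(loc=labels.astype(float), scale=1.0)   # toy similarities

far, tar, _ = roc_curve(labels, scores)  # FPR plays the role of FAR, TPR of TAR
tar_at_far = np.interp(1e-3, far, tar)   # TAR @ FAR = 1e-3
print(f"TAR @ FAR=1e-3: {tar_at_far:.3f}")
```

The interpolation works because roc_curve returns the FAR values in increasing order.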
5 Conclusion and Future Experiments

Surveillance systems require high performance from CNN-based systems on the face recognition task. Precisely, deep models have to be robust with respect to variations of the input image resolution, since low resolution images from surveillance cameras usually have to be matched against a database containing deep features extracted from high resolution images. The descriptors extracted for each human face have to be robust with respect to resolution variations, otherwise any kind of similarity search across the identities database will fail. We have seen that, even though there are models that perform well on the FR task, their performance drops sharply when low resolution images are used. Although we have shown the feasibility of a pipeline for FR based on deep models, we need to improve its performance, especially in the case of mixed resolutions. We are thus planning a new training campaign focused on the low resolution domain below 32 pixels, paying particular attention to the case of FR in which probe and gallery have different resolutions. What we expect from such a campaign is that, even if we might incur a small drop in performance at high resolution, the improvement at low and mixed resolutions should outweigh that drop.

References

1. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: European Conference on Computer Vision. pp. 584–599. Springer (2014)
2. Bansal, A., Nanduri, A., Castillo, C.D., Ranjan, R., Chellappa, R.: UMDFaces: An annotated face dataset for training deep networks. In: 2017 IEEE International Joint Conference on Biometrics (IJCB). pp. 464–473. IEEE (2017)
3. Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: A dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). pp. 67–74. IEEE (2018)
4. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(1), 142–158 (2016)
5. Grm, K., Pernuš, M., Cluzel, L., Scheirer, W., Dobrišek, S., Štruc, V.: Face hallucination revisited: An exploratory study on dataset bias. arXiv preprint arXiv:1812.09010 (2018)
6. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In: European Conference on Computer Vision. pp. 87–102. Springer (2016)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
8. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint (2017)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
10. Lavi, B., Serj, M.F., Ullah, I.: Survey on deep learning techniques for person re-identification task. arXiv preprint arXiv:1807.05284 (2018)
11. Learned-Miller, E., Huang, G.B., RoyChowdhury, A., Li, H., Hua, G.: Labeled faces in the wild: A survey. In: Advances in Face Detection and Facial Image Analysis, pp. 189–248. Springer (2016)
12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
13. Nikouei, S.Y., Chen, Y., Song, S., Xu, R., Choi, B.Y., Faughnan, T.: Smart surveillance as an edge network service: From Harr-Cascade, SVM to a lightweight CNN. In: 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC). pp. 256–265. IEEE (2018)
14. Parkhi, O.M., Vedaldi, A., Zisserman, A., et al.: Deep face recognition. In: BMVC. vol. 1, p. 6 (2015)
15. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
16. Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using classemes. In: European Conference on Computer Vision. pp. 776–789. Springer (2010)
17. Whitelam, C., Taborsky, E., Blanton, A., Maze, B., Adams, J., Miller, T., Kalka, N., Jain, A.K., Duncan, J.A., Allen, K., et al.: IARPA Janus Benchmark-B face dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 90–98 (2017)
18. Wolf, L., Hassner, T., Maoz, I.: Face recognition in unconstrained videos with matched background similarity. IEEE (2011)
19. Zangeneh, E., Rahmati, M., Mohsenzadeh, Y.: Low resolution face recognition using a two-branch deep convolutional neural network architecture. arXiv preprint arXiv:1706.06247 (2017)
20. Zhang, K., Zhang, Z., Cheng, C.W., Hsu, W.H., Qiao, Y., Liu, W., Zhang, T.: Super-identity convolutional neural network for face hallucination. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 183–198 (2018)
21. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)