=Paper=
{{Paper
|id=Vol-2744/short17
|storemode=property
|title=Neural Network Model for Face Recognition from Dynamic Vision Sensor (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2744/short17.pdf
|volume=Vol-2744
|authors=Fedor Shvetsov,Anton Konushin,Anna Sokolova
}}
==Neural Network Model for Face Recognition from Dynamic Vision Sensor (short paper)==
Neural Network Model for Face Recognition from Dynamic Vision Sensor*

Fedor Shvetsov [0000-0001-5112-0430], Anton Konushin [0000-0002-6152-0021], and Anna Sokolova [0000-0001-8777-2035]

{fedor.shvetsov, anton.konushin, anna.sokolova}@graphics.cs.msu.ru

Lomonosov Moscow State University

Abstract. In this work, we consider the applicability of face recognition algorithms to data obtained from a dynamic vision sensor. A basic method is proposed that solves this problem with a neural network pipeline comprising reconstruction, detection, and recognition. Various modifications of this algorithm and their influence on the quality of the model are considered. A small test dataset recorded with a DVS sensor is collected. The relevance of simulated data for training the model, and of different approaches to creating it, is investigated. The portability of the algorithm trained on synthetic data to data obtained from the sensor, with the help of fine-tuning, is considered. All mentioned variations are compared with one another and with conventional face recognition from RGB images on different datasets. The results show that it is possible to use DVS data to perform face recognition with quality similar to that of RGB data.

Keywords: DVS, Face Recognition, Data Simulation

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
* Publication is supported by RFBR grant 18-08-01484.

1 Introduction

In recent years, a new type of camera has been gaining popularity: the Dynamic Vision Sensor (DVS). While traditional cameras record information at a fixed frequency (usually 25–30 times per second), the dynamic vision sensor records only the fact that the level of illumination in a pixel has changed, provided the change exceeds a certain threshold. Thus, these cameras operate on the principle of the human eye, which responds only to changes. This approach eliminates a large amount of redundant static data and focuses only on dynamic events. Such sensors have several advantages: high speed (which allows very fast events to be captured in fine detail), low power and memory consumption (an important feature for embedded systems, where there is no room for a large battery and hard drive), and high sensitivity (a key property for recording under extreme light conditions).

Such cameras are quite expensive and do not have high resolution; however, given the rapid development of technologies in this area, it is necessary to develop algorithms for solving various applied problems, such as 3D reconstruction and the detection and tracking of objects [1]. One of these tasks is person identification, since these cameras are currently often used in video surveillance systems. There are various ways to recognize a person in a frame, for example, by gait [2] [3] [4] or by face. If the face is distinguishable in the frame and of sufficient size, it makes sense to recognize the person by it.

Nowadays, systems that perform face recognition are highly relevant, since they implement the most effective way of contact-less identification of a person. They are used in security systems, bank card verification, people mark-up, forensics, online payments, etc. The face recognition problem can be decomposed into several sub-tasks: finding the face in the image, normalizing the found face and, finally, identifying the person.
In this paper we propose a new method for face recognition using data from a DVS sensor.

2 Related Work

2.1 Detection

The problem of detecting faces is a special case of the general detection problem, but it has its own specificities. The human face has distinctive features, which were searched for and analyzed in the first approaches in this area. A big breakthrough was [5], which used Haar filters to find faces with cascades of detectors. However, such algorithms were not robust, since faces show great variability under different lighting and viewing angles. Deformable part models [6] were then proposed to address the detection problem, but these methods were computationally costly and required complex markup for training.

As with most computer vision tasks, the detection problem can be solved by deep learning methods, whose popularity has grown significantly after the work [7]. These methods have also been successfully applied to the task of detecting faces, for example, in [8]. Based on the same approach, MTCNN [9] uses three lightweight neural networks to find faces in the image.

2.2 Recognition

The face recognition problem has been of interest to the scientific world for a long time. The first systems for solving it were developed back in 1964 [10]. Since then, the quality of this technology has greatly increased, and modern algorithms are able to distinguish people's faces better than people themselves [11].

Various methods have been used to solve the face recognition problem, and they have changed greatly over time. The first algorithms attempted to distinguish between faces by finding distinctive features such as eye color, face proportions, etc. [12]. The work [13] made a great contribution to the development of these methods by using the similarity of eigenvectors of faces (eigenfaces). In general, however, the majority of modern methods [14] recognize faces by computing embeddings for them. These methods became especially popular after the widespread adoption of convolutional neural networks [15]. The same approach is used in the popular work [16].

2.3 Reconstruction

One of the key components of this work is an algorithm for reconstructing frames from the stream of events of a dynamic vision sensor. The algorithm was proposed in [17]. It also applies artificial neural networks: in this case, recurrent neural networks, whose main feature is the ability to memorize the state obtained while processing each element of a sequence and to use it in further computations. In this algorithm, the neural network receives as input the stream of events from the dynamic vision sensor over a certain period of time, and the model reconstructs an image that visually resembles a grayscale frame.

3 Proposed Method

3.1 Formal Problem

The face recognition problem can be considered in two equivalent forms: identification and verification. In this paper the verification form was chosen.

Data from the dynamic vision sensor comes in the form of a set of events. An event is a tuple (x, y, ts, p), where x, y ∈ Z, x ∈ [0, N], y ∈ [0, M] are the coordinates of the pixel in the N × M matrix, ts ∈ R is the timestamp, and p ∈ {−1, 1} is the polarity of the change (the brightness in the pixel decreased/increased by a given threshold).

Algorithm input: sets of events T1 and T2 received from the dynamic vision sensor.
Algorithm output: A ∈ {0, 1}: A = 1 if the sets T1 and T2 describe the same person, A = 0 if different. A schematic sketch of this interface is given below.
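To make the statement above concrete, here is a minimal sketch of the verification interface in Python, assuming a structured NumPy array for events; embed_events is a hypothetical placeholder for the reconstruction, detection, and embedding pipeline of Section 3.2, and threshold is the decision threshold applied to Eq. (1) below. Names and types are illustrative, not the authors' implementation.

```python
import numpy as np

# One DVS event is a tuple (x, y, ts, p): pixel coordinates, timestamp,
# and polarity of the brightness change, as defined in Section 3.1.
EVENT_DTYPE = np.dtype([("x", np.int32), ("y", np.int32),
                        ("ts", np.float64), ("p", np.int8)])

def cosine_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Cosine similarity between two face embeddings, Eq. (1) below."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def verify(t1: np.ndarray, t2: np.ndarray, embed_events, threshold: float) -> int:
    """Return A = 1 if the event sets t1 and t2 describe the same person.

    `embed_events` stands in for the full pipeline of Section 3.2:
    reconstruct frames from events, detect the face, compute an embedding.
    """
    f1, f2 = embed_events(t1), embed_events(t2)
    return int(cosine_similarity(f1, f2) > threshold)
```

In the experiments the embeddings come from the network of [16]; any callable with the same contract could be substituted for embed_events here.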
3.2 Our Method

Since the main source of information in computer vision tasks is usually an image or a sequence of images rather than the event stream produced by a DVS, it is necessary to convert the stream of events into a visual representation. Such a visualization can be made in different ways. The simplest one is to set time marks, count the events that occur between them, and render the counts in grayscale. However, it turns out that this approach does not provide satisfactory quality, and detectors could not find faces in such images. Thus, the neural network reconstructions [17] were used for visualization, which yields much better results (see Fig. 1).

Fig. 1. Simple visualization (left), neural net reconstruction (right)

The proposed basic method works as follows (see Fig. 2): first, the stream of events from the sensor is reconstructed into frames with the model [17]; then the faces are located in the frames using the detector [9]; then the neural network computes internal representations for them in the form of vectors f1, f2 ∈ R^n [16]; and finally the proximity of these vectors is determined using the cosine similarity (1). If the proximity is higher than a specified threshold, 1 is predicted, otherwise 0.

\[ \cos(f_1, f_2) \;=\; \frac{f_1 \cdot f_2}{\|f_1\|\,\|f_2\|} \;=\; \frac{\sum_{i=1}^{n} f_{1i} f_{2i}}{\sqrt{\sum_{i=1}^{n} f_{1i}^2}\,\sqrt{\sum_{i=1}^{n} f_{2i}^2}} \tag{1} \]

Fig. 2. Overall pipeline of the method

4 Data Simulation

The dynamic vision sensor is a fairly new type of camera, and very few datasets have been recorded with it so far. As far as we know, there are no publicly available datasets for the face recognition task. Therefore, it was proposed to use collections of color videos gathered for the face recognition task and to simulate dynamic vision sensor data from them. This is done in two steps: first, the intermediate intensity values in each pixel are interpolated between two neighboring frames; second, at each point the change in intensity between adjacent interpolated frames is compared with a threshold, and if the change exceeds the threshold, an event is generated. As can be seen in Fig. 3, real and simulated event streams are very similar. This gives us grounds to assume that studies conducted on simulated data will transfer fairly well to real data.

Fig. 3. Data from sensor (left), simulated frame (right)

Since linear interpolation of intermediate frames is not very faithful and leads to blurred frames, it was proposed to improve the simulation method with a better approximation. To do this, it was decided to use the results of [18], which creates a slow-motion effect, and to incorporate them into the simulation process. This approach creates intermediate frames for the simulation of dynamic vision sensor data from color video sequences, thereby smoothing the visualizations. A variant with the creation of one intermediate frame was used. Fig. 4 shows the visual difference between the images, and a sketch of the simulation step appears below.

Fig. 4. Simple neural net reconstruction (left) and with intermediate frame (right)
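As a concrete illustration of the two-step procedure above, here is a minimal sketch assuming grayscale frames with values in [0, 1], a per-pixel intensity threshold, and linear interpolation of a single intermediate frame; the advanced variant would substitute frames produced by Super SloMo [18] for the linear interpolation. The threshold value and function names are illustrative, not the authors' simulator.

```python
import numpy as np

def events_between(prev: np.ndarray, curr: np.ndarray,
                   t0: float, t1: float, threshold: float = 0.15):
    """Generate events (x, y, ts, p) from the intensity change between two
    consecutive grayscale frames, assigning all of them the midpoint
    timestamp. A real simulator fires an event each time the accumulated
    change crosses the threshold; this sketch keeps one event per pixel."""
    diff = curr.astype(np.float32) - prev.astype(np.float32)
    ys, xs = np.nonzero(np.abs(diff) > threshold)
    ts = (t0 + t1) / 2.0
    return [(int(x), int(y), ts, 1 if diff[y, x] > 0 else -1)
            for x, y in zip(xs, ys)]

def simulate_dvs(frames, timestamps, threshold: float = 0.15):
    """Two-step simulation of Section 4: (1) insert one interpolated frame
    between each pair of neighboring frames (linear here; the advanced
    variant uses Super SloMo [18]), then (2) threshold the intensity change
    between adjacent interpolated frames."""
    events = []
    for i in range(len(frames) - 1):
        f0 = frames[i].astype(np.float32)
        f1 = frames[i + 1].astype(np.float32)
        t0, t1 = timestamps[i], timestamps[i + 1]
        mid = 0.5 * (f0 + f1)            # one intermediate frame
        tm = 0.5 * (t0 + t1)
        events += events_between(f0, mid, t0, tm, threshold)
        events += events_between(mid, f1, tm, t1, threshold)
    return events
```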
5 Experimental Evaluation

5.1 Datasets

YouTubeFaces. The main dataset that met the criteria for simulation was the YouTubeFaces dataset [19], which consists of videos collected from YouTube, each containing a specific person. Its main advantage is the large number of subjects: the collection consists of 3425 video sequences covering 1595 people. Due to the large number of subjects, it was possible to fine-tune the neural network. Two-thirds of the collection was held out for the training set and one-third for testing, with all videos of a given person fully included in either the first or the second group. Fine-tuning on the training set was performed on a network originally trained on the VGGFace2 [20] collection.

ChokePoint. In addition, it is proposed to use, as a test set, a dataset obtained under conditions similar to real scenarios of dynamic vision sensor use. For this, the ChokePoint [21] dataset was selected. It contains 48 video sequences recording 40 people passing through the entrance to a room. Along with the video, frame-by-frame mark-up of the person in the frame is provided. The viewing angles of the cameras used to record the dataset are similar to those in video surveillance systems, which reflects the likely placement of a dynamic vision sensor intended for this problem.

GML DVS. In order to check the portability of the created model to real data obtained from a dynamic vision sensor, a small dataset of eight video sequences containing eight people was captured. 80 faces were automatically found by the face detector and manually labelled.

5.2 Results

To evaluate the quality of the proposed method, we run a verification experiment: we select pairs of objects and compare their similarity with some threshold to decide whether they belong to the same person. Sweeping the threshold, we obtain the AUC metric (the area under the ROC curve), which is a good indicator of the general performance of the model; a minimal sketch of this evaluation follows the tables below. The method is tested against verification on RGB images where possible. Variants with advanced reconstruction, and with fine-tuning on those reconstructions, are examined. The results are presented in Tables 1 and 2. We can see that recognition results on DVS reconstructions are quite similar to those on RGB images and that fine-tuning improves the quality of the model. Furthermore, this fine-tuning also enhances performance on data obtained from the real DVS sensor, which was quite good compared to the simulated data, demonstrating the portability of the model.

Table 1. AUC metric for simulated images

Experiment                             | YTF   | ChokePoint
RGB images                             | 0.958 | 0.996
Reconstructions                        | 0.922 | 0.948
Reconstructions + fine-tuning          | 0.928 | 0.961
Advanced Reconstructions               | 0.921 | 0.952
Advanced Reconstructions + fine-tuning | 0.922 | 0.962

Table 2. AUC metric for GML DVS

Experiment                             | AUC
Reconstructions                        | 0.931
Reconstructions + fine-tuned           | 0.936
Advanced Reconstructions + fine-tuned  | 0.954
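For reference, the following is a minimal sketch of how the AUC in Tables 1 and 2 can be computed, assuming precomputed embedding pairs and scikit-learn's roc_auc_score; the variable names are illustrative, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def verification_auc(pairs, labels):
    """AUC of the verification experiment in Section 5.2.

    `pairs`  -- list of (f1, f2) embedding pairs (one np.ndarray each),
    `labels` -- 1 if the pair shows the same person, else 0.
    The AUC integrates over all similarity thresholds at once, so no
    single threshold has to be fixed in advance."""
    scores = [float(np.dot(f1, f2) /
                    (np.linalg.norm(f1) * np.linalg.norm(f2)))
              for f1, f2 in pairs]
    return roc_auc_score(labels, scores)
```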
6 Conclusion

This paper explores the possibility of solving the face recognition problem on dynamic vision sensor data. A basic solution method using a neural network model is implemented. The results show that existing methods can be applied to this task at a level of quality similar to that achieved on RGB images. The use of simulated frames proved a great way to improve the performance of the model, which is very helpful given the scarce amount of real data.

References

1. C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck. Retinomorphic event-based vision sensors: Bioinspired cameras with spiking output. Proceedings of the IEEE, 102(10):1470–1484, 2014.
2. Anna Sokolova and Anton Konushin. Human identification by gait from event-based camera. In 2019 16th International Conference on Machine Vision Applications (MVA), pages 1–6. IEEE, 2019.
3. Anna Sokolova and Anton Konushin. Pose-based deep gait recognition. IET Biometrics, 8(2):134–143, 2018.
4. Yanxiang Wang, Bowen Du, Yiran Shen, Kai Wu, Guangrong Zhao, Jianguo Sun, and Hongkai Wen. EV-Gait: Event-based robust gait recognition using dynamic vision sensors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
5. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I–I, 2001.
6. P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2241–2248, 2010.
7. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
8. Haoxiang Li, Zhe Lin, Xiaohui Shen, and Jonathan Brandt. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, June 2015.
9. Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016.
10. Karl de Leeuw and Jan Bergstra. The History of Information Security: A Comprehensive Handbook, pages 264–265, 2007.
11. P. Jonathon Phillips and Alice J. O'Toole. Comparison of human and computer performance across face recognition experiments. Image and Vision Computing, 32(1):74–85, 2014.
12. Roberto Brunelli and Tomaso Poggio. Face recognition: Features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052, 1993.
13. Matthew Turk and Alex Pentland. Face recognition using eigenfaces. In Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 586–587, 1991.
14. Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
15. Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892–2900, 2015.
16. Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
17. Henri Rebecq, Rene Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
18. Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018.
19. L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR 2011, pages 529–534, 2011.
20. Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018.
21. Yongkang Wong, Shaokang Chen, Sandra Mau, Conrad Sanderson, and Brian C. Lovell. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In IEEE Biometrics Workshop, Computer Vision and Pattern Recognition (CVPR) Workshops, pages 81–88. IEEE, June 2011.