Identity Documents Recognition and Detection using Semantic Segmentation with Convolutional Neural Network

Mykola Kozlenko (a), Volodymyr Sendetskyi (b), Oleksiy Simkiv (b), Nazar Savchenko (b), and Andy Bosyi (b)

(a) Vasyl Stefanyk Precarpathian National University, 57 Shevchenko str., Ivano-Frankivsk, 76018, Ukraine
(b) MindCraft AI LLC, 19 Lisna str., Lviv, 79010, Ukraine

Abstract
Object recognition and detection are well-studied problems with a developed set of almost standard solutions. Identity document recognition, classification, detection, and localization are required in a number of applications, particularly in physical access control security systems at critical infrastructure premises. In this paper, we propose a new, original architecture of a model based on an artificial convolutional neural network and the semantic segmentation approach for the recognition and detection of identity documents in images. The challenge in processing such images is the limited computational performance and the limited amount of memory available when such an application runs on industrial single-board microcomputer hardware. The aim of this research is to prove the feasibility of the proposed technique and to obtain quality metrics. The methodology of the research is to evaluate a deep learning detection model trained on the mobile identity document video dataset. The dataset contains five hundred video clips for fifty different identity document types. The numerical results from simulations are used to evaluate the quality metrics. We present the results as accuracy versus the threshold of the intersection over union value. The paper reports an accuracy above 0.75 for an intersection over union (IoU) threshold value of 0.8. In addition, we assessed the size of the model and proved the feasibility of running it on industrial single-board microcomputer or smartphone hardware.

Keywords
Identity document, object detection, semantic segmentation, document recognition, document classification, deep learning, neural network

1. Introduction

Almost every organization today uses access control security systems. Usually, employees use special access cards, but this poses a problem for guests or people who visit a facility for the first time and do not have an access card. In this case, the person can be identified using the data of any official identity document. Identification can be done by detecting the document in an image from a camera or scanner, followed by extraction of the text information. Object recognition and detection are well-studied problems with a developed set of almost standard solutions. Identity document recognition, classification, detection, and localization are very popular tasks in the computer vision area and are required in many security applications [1]. There are several classical approaches to object detection: the Viola-Jones object detection framework based on Haar-like features [2], the scale-invariant feature transform [3], histograms of oriented gradients [4], etc. Object detection algorithms are also implemented in popular frameworks and libraries such as OpenCV. There are many deep learning-based approaches as well [5]. In this paper, we propose a new neural network (NN) architecture and investigate the performance of a semantic segmentation-based approach to identity document detection.
Cybersecurity Providing in Information and Telecommunication Systems, January 28, 2021, Kyiv, Ukraine
EMAIL: mykola.kozlenko@pnu.edu.ua (A.1); volodymyr.sendetskyi@mindcraft.ai (B.2); alex.simkiv@mindcraft.ai (B.3); nazar.savchenko@mindcraft.ai (B.4); andy.bosyi@mindcraft.ai (B.5)
ORCID: 0000-0002-2502-2447 (A.1); — (B.2); — (B.3); — (B.4); — (B.5)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Work

In recent years, many successful deep learning approaches to object detection have been proposed. The R-CNN solution was proposed first in [6]. Reference [7] presents Fast R-CNN, and Faster R-CNN is reported in [8]. The following approaches are also well known and widely used. The Single Shot MultiBox Detector (SSD) [9] is based on a feed-forward convolutional network that produces a collection of bounding boxes and scores for the presence of object class instances. One of the most popular object detectors is the You Only Look Once (YOLO) detector [10]. YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes, and it is reported to outperform other detection methods, including R-CNN, when generalizing to new domains [10]. Other well-known methods include the Single-Shot Refinement Neural Network for Object Detection (RefineDet) [11], RetinaNet [12], deformable convolutional networks [13], and others.

Reference [14] is devoted to identity document recognition in a video stream. The paper [15] studies the problem of image classification of identity documents composed of few textual information fields and complex backgrounds; the proposed approach simultaneously locates the document and recognizes its class. Paper [16] discusses the problem of simultaneous document type recognition and projective distortion parameter estimation for images of identity documents. The problem of face detection on identity documents in unconstrained environments was studied in detail in [17]. Reference [18] proposes an original neural network architecture for the semantic image segmentation task that contains layers computing the direct and transposed Fast Hough Transform operators.

3. Dataset

In this research, we use the Mobile Identity Document Video dataset (MIDV-500) [19]. It consists of 500 video clips for 50 different identity document types with ground truth. The dataset covers 17 types of ID cards, 14 types of passports, 13 types of driving licenses, and 6 other identity documents of various countries. Each captured frame has the same resolution of 1920 by 1080 pixels. The dataset includes the following cases: the document lies on a table with a homogeneous background, the document lies on various keyboards, the document is held in a hand, the document is partially hidden, and the background is cluttered with unrelated objects. The training and test sets contain 10,500 and 4,500 samples, respectively. Some instances of images are presented in Fig. 1. Fig. 1 also shows the detection results obtained using OpenCV (light green boundaries). There are several images in which this approach works well, such as the bottom-right picture in the figure. For most of the images, however, the conditions are too diverse: a simple image processing algorithm cannot cover all the variety of colors, lighting, shadows, blur, and other differences.

Figure 1: Examples of identity documents and backgrounds from the dataset

We converted the data in our dataset into the following structure (refer to Fig. 2), where 'path' is the path to an image within the dataset; 'x0', 'y0', 'x1', 'y1', 'x2', 'y2', 'x3', 'y3' are the ground truth coordinates of the quadrilateral vertices of the document image; 'part' is a number specifying a part of the dataset; and 'group' is the background used in the image.

Figure 2: The structure of the converted data

The idea of the data import is simple: iterate over all the images, resize them, draw the ground truth in a blank image, and store them in the corresponding variables. Then, we can simply return a batch of a certain size; a minimal sketch of this step is given below.
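To make the import step concrete, the following is a minimal sketch of such a loader. It assumes the converted data of Fig. 2 is stored as a CSV file and that the network input resolution is 256x256; both are illustrative assumptions rather than values taken from the paper, and this is not the authors' prototype code from [22].

```python
# A minimal sketch of the data import described above. The column names
# follow Fig. 2; INPUT_SIZE and the CSV layout are assumptions.
import cv2
import numpy as np
import pandas as pd

INPUT_SIZE = 256  # assumed network input resolution

def load_sample(root, row):
    """Load one image and rasterize its ground-truth quadrilateral."""
    image = cv2.imread(f"{root}/{row['path']}")
    h, w = image.shape[:2]
    image = cv2.resize(image, (INPUT_SIZE, INPUT_SIZE)).astype(np.float32) / 255.0
    # Scale the four ground-truth vertices to the resized image.
    quad = np.array([[row['x0'], row['y0']], [row['x1'], row['y1']],
                     [row['x2'], row['y2']], [row['x3'], row['y3']]], np.float32)
    quad *= np.array([INPUT_SIZE / w, INPUT_SIZE / h], np.float32)
    # Draw the ground truth into a blank single-channel image.
    mask = np.zeros((INPUT_SIZE, INPUT_SIZE), np.float32)
    cv2.fillPoly(mask, [quad.round().astype(np.int32)], 1.0)
    return image, mask[..., np.newaxis]

def batch_generator(csv_path, root, batch_size=32):
    """Yield (images, masks) batches of a certain size."""
    frame = pd.read_csv(csv_path)
    while True:
        rows = frame.sample(batch_size)
        pairs = [load_sample(root, r) for _, r in rows.iterrows()]
        images, masks = zip(*pairs)
        yield np.stack(images), np.stack(masks)
```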
4. Method and Model Design

The proposed architecture of the artificial convolutional neural network (CNN) is presented in Fig. 3. The idea behind this model is as follows: we downsample the input image to an 8x8 feature map while learning features for most of the regions. Then, we pass those features to a few dense layers that decide whether there is an identity document in the image and, if so, where it is located. Finally, we combine that decision with the features calculated in the downsampling part. All the concatenate layers implement a kind of skip connection in the CNN. Despite the decision layers inside the model, it is still a semantic segmentation network that produces a probability map defining whether each pixel belongs to an identity document or not. For architecture details, data dimensionality, hyperparameters, and the number of neurons in the layers, refer to Fig. 3; an illustrative Keras sketch of the scheme is given below. The optimizer is the Keras built-in Adam with a learning rate of 0.001. The number of training epochs is 60. The loss function is binary cross-entropy. The metrics are accuracy, precision, and recall. There are 198,273 trainable model parameters in total. The size of the model is 832 KiB, which appears small enough to run on a smartphone or a single-board microcomputer. We used the TensorFlow [20] and Keras [21] frameworks in our work. TensorBoard was used for the visualization of training scalars and neural network structures.

Figure 3: The architecture of the proposed CNN
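The exact layer configuration is defined in Fig. 3 and is not reproduced in text form here, so the sketch below only illustrates the scheme described above: convolutional downsampling to 8x8, a dense decision block, and concatenate-based skip connections up to a per-pixel probability map. The filter counts and the 256x256 input are illustrative assumptions, and the parameter count of this sketch will not match the reported 198,273; the optimizer, learning rate, loss, and metrics follow the paper.

```python
# A sketch of the architecture scheme, not the exact model of Fig. 3.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_size=256):
    inputs = layers.Input((input_size, input_size, 3))
    x, skips = inputs, []
    # Downsampling path: five stride-2 convolutions take 256x256 down to 8x8
    # while learning features about the regions of the image.
    for filters in (8, 16, 32, 32, 64):
        x = layers.Conv2D(filters, 3, strides=2, padding='same',
                          activation='relu')(x)
        skips.append(x)
    # Dense decision block: is there a document, and where is it located?
    d = layers.Flatten()(x)
    d = layers.Dense(64, activation='relu')(d)
    d = layers.Dense(8 * 8, activation='relu')(d)
    d = layers.Reshape((8, 8, 1))(d)
    # Combine the decision with the downsampled features, then upsample,
    # concatenating the stored feature maps as skip connections.
    x = layers.Concatenate()([x, d])
    for filters, skip in zip((32, 32, 16, 8), skips[-2::-1]):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same',
                                   activation='relu')(x)
        x = layers.Concatenate()([x, skip])
    x = layers.Conv2DTranspose(8, 3, strides=2, padding='same',
                               activation='relu')(x)
    # Probability map: whether each pixel belongs to an identity document.
    outputs = layers.Conv2D(1, 1, activation='sigmoid')(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy', tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])
    return model
```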
5. Training and Evaluation

Training of the model was performed using a conventional server with an Intel(R) Core(TM) i7-9700K CPU at 3.60 GHz and 64 GiB of RAM. The training procedure takes approximately 9 ms per sample, 290 ms per step (batch), and 95 seconds per epoch. The number of samples per gradient update (the batch size) is 32. The training and validation loss, accuracy, precision, and recall versus epoch number are presented in Figs. 4 and 5. Values are taken at the end of each epoch.

Figure 4: Loss and accuracy versus the number of epochs

Figure 5: Precision and recall versus the number of epochs

We used post-predict evaluation to assess the model. The test set went through the prediction method; after that, the predictions were compared to the ground truth and the confusion matrix was derived. The following class-wise metrics were obtained from the confusion matrix: accuracy, true positive rate (TPR, recall), positive predictive value (PPV, precision), etc.

6. Results

The plot of accuracy versus the Intersection over Union (IoU) threshold value is presented in Fig. 6. We achieved an accuracy of 0.77 for an IoU threshold value of 0.8 on the test set. That is much better than the simple OpenCV-based approach (an accuracy of 0.32 on this dataset). An example of the model input, ground truth, and prediction is presented in Fig. 7; a sketch of one plausible way to compute such a curve is given below.

Figure 6: Accuracy versus the IoU threshold value

Figure 7: Model input, ground truth, and the prediction
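The paper does not spell out how the curve in Fig. 6 is computed. One plausible reading, sketched below under that assumption, is that a test image counts as correctly detected when the IoU between the binarized predicted mask and the ground-truth mask exceeds the threshold, and accuracy is the fraction of such images.

```python
# A sketch of one way to produce an accuracy-versus-IoU-threshold curve;
# this is an interpretation, not the authors' evaluation code.
import numpy as np

def mask_iou(pred, truth, binarize_at=0.5):
    """Intersection over union of two binarized masks."""
    p = pred >= binarize_at
    t = truth >= 0.5
    union = np.logical_or(p, t).sum()
    if union == 0:  # both masks empty: treat as a perfect match
        return 1.0
    return np.logical_and(p, t).sum() / union

def accuracy_at_thresholds(preds, truths, thresholds):
    """Fraction of test samples whose mask IoU clears each threshold."""
    ious = np.array([mask_iou(p, t) for p, t in zip(preds, truths)])
    return {thr: float((ious >= thr).mean()) for thr in thresholds}

# Usage example: accuracy_at_thresholds(preds, truths, [0.5, 0.6, 0.7, 0.8, 0.9])
```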
After the NN makes its prediction on the resized version of a given image, we threshold the result at 0.5 and find all the contours in it. After smoothing each contour, we check whether it has four edges and occupies at least the minimum allowed area. If so, we check whether it is the largest among such contours. The selected contour is rescaled to the input image, and the rectangle is extracted using OpenCV tools; a sketch of this step follows Fig. 8. The result is shown in Fig. 8.

Figure 8: The input image and the prediction
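The following is a minimal sketch of this post-processing step using standard OpenCV calls. The approximation tolerance and the minimum-area fraction are assumptions, since the paper does not state the exact values; the full working prototype is provided in [22].

```python
# A sketch of the contour-based post-processing described above; the 0.02
# approximation factor and min_area_frac are assumed, not the paper's values.
import cv2
import numpy as np

def extract_quad(prob_map, input_shape, min_area_frac=0.05):
    """Return the document quadrilateral scaled to the input image, or None."""
    # Threshold the probability map at 0.5 and find all contours in it.
    mask = (prob_map >= 0.5).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    best, best_area = None, mask.size * min_area_frac
    for contour in contours:
        # Smooth the contour into a polygon with few vertices.
        approx = cv2.approxPolyDP(contour,
                                  0.02 * cv2.arcLength(contour, True), True)
        # Keep the biggest contour with four edges and enough area.
        if len(approx) == 4 and cv2.contourArea(approx) > best_area:
            best, best_area = approx, cv2.contourArea(approx)
    if best is None:
        return None
    # Rescale the quadrilateral from mask coordinates to the input image.
    scale = np.array([input_shape[1] / mask.shape[1],
                      input_shape[0] / mask.shape[0]], np.float32)
    return best.reshape(4, 2).astype(np.float32) * scale
```

From the returned quadrilateral, the document region can then be cropped with, for example, cv2.getPerspectiveTransform and cv2.warpPerspective.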
Time complexity is one of the most important issues in real-time data processing. We estimated the run-time cost of the detection by measuring the processing time of one image on the target hardware platform. The average processing time of one image is 8 ms, so real-time object detection is feasible on the hardware platform mentioned above. Detailed Python 3 code of the working prototype is provided in [22].

7. Conclusion and discussion

The overall purpose of the study was to prove the feasibility of efficient identity document detection using a convolutional neural network of the proposed architecture. Our main finding suggests that the proposed CNN achieves an acceptable outcome. CNN layers, as feature extractors, and dense layers are computational structures that are easy to implement on modern hardware platforms such as smartphones, microcontrollers, and industrial single-board microcomputers, and they are readily implemented with modern software frameworks. It is therefore possible to build different applications and services using this approach. As stated above, the accuracy of the method is sufficiently high. An important advantage of the proposed method is the ability to continuously retrain it on new data, which makes it easy to adapt to new conditions and image properties.

8. Limitations and further research

A limitation of the study is the use of only one dataset. Other data might have different properties, so the model needs to be evaluated on other data. In addition, hyperparameter tuning remains to be studied. The limitations of the study are not fatal and will be addressed in our future research. We also plan to apply this semantic segmentation-based deep learning approach to one-dimensional [23] and three-dimensional LiDAR data.

9. Acknowledgment

The authors gratefully acknowledge the contributions of the scientists of MindCraft AI LLC and of the Department of Information Technology of Vasyl Stefanyk Precarpathian National University for scientific guidance in discussions and for technical assistance in the research.

10. Disclosures

The authors declare that there is no conflict of interest.

11. References

[1] S. Dasiopoulou, V. Mezaris, I. Kompatsiaris, V. Papastathis, M. G. Strintzis, Knowledge-assisted semantic video object detection, IEEE Transactions on Circuits and Systems for Video Technology 15.10 (2005). doi:10.1109/TCSVT.2005.854238.
[2] D. Peleshko, K. Soroka, Research of usage of Haar-like features and AdaBoost algorithm in Viola-Jones method of object detection, in: 12th International Conference on the Experience of Designing and Application of CAD Systems in Microelectronics, CADSM, IEEE, 2013, pp. 284–286.
[3] W. Cheung, G. Hamarneh, n-SIFT: n-Dimensional Scale Invariant Feature Transform, IEEE Transactions on Image Processing 18.9 (2009) 2012–2021. doi:10.1109/TIP.2009.2024578.
[4] N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, in: International Conference on Computer Vision & Pattern Recognition, CVPR '05, San Diego, United States, 2005, pp. 886–893.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012.
[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 580–587. doi:10.1109/CVPR.2014.81.
[7] R. Girshick, Fast R-CNN, in: IEEE International Conference on Computer Vision, ICCV, Santiago, 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.
[8] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6 (2017) 1137–1149. doi:10.1109/TPAMI.2016.2577031.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single Shot MultiBox Detector, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV, 2016, pp. 21–37. doi:10.1007/978-3-319-46448-0_2.
[10] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, 2016, pp. 779–788. doi:10.1109/CVPR.2016.91.
[11] S. Zhang, L. Wen, X. Bian, Z. Lei, S. Z. Li, Single-Shot Refinement Neural Network for Object Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 4203–4212. doi:10.1109/CVPR.2018.00442.
[12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal Loss for Dense Object Detection, in: IEEE International Conference on Computer Vision, ICCV, Venice, 2017, pp. 2980–2988. doi:10.1109/ICCV.2017.324.
[13] X. Zhu, H. Hu, S. Lin, J. Dai, Deformable ConvNets V2: More Deformable, Better Results, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, 2019, pp. 9300–9308. doi:10.1109/CVPR.2019.00953.
[14] K. Bulatov, V. V. Arlazarov, T. Chernov, O. Slavin, D. Nikolaev, Smart IDReader: Document Recognition in Video Stream, in: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR, Kyoto, 2017, pp. 39–44. doi:10.1109/ICDAR.2017.347.
[15] A. M. Awal, N. Ghanmi, R. Sicre, T. Furon, Complex Document Classification and Localization Application on Identity Document Images, in: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR, Kyoto, 2017, pp. 426–431. doi:10.1109/ICDAR.2017.77.
[16] N. Skoryukina, V. Arlazarov, D. Nikolaev, Fast Method of ID Documents Location and Type Identification for Mobile and Server Application, in: International Conference on Document Analysis and Recognition, ICDAR, Sydney, Australia, 2019, pp. 850–857. doi:10.1109/ICDAR.2019.00141.
[17] S. Bakkali, M. M. Luqman, Z. Ming, J. Burie, Face Detection in Camera Captured Images of Identity Documents Under Challenging Conditions, in: International Conference on Document Analysis and Recognition Workshops, ICDARW, Sydney, Australia, 2019, pp. 55–60. doi:10.1109/ICDARW.2019.30065.
[18] A. Sheshkus, D. Nikolaev, V. L. Arlazarov, HoughEncoder: Neural Network Architecture for Document Image Semantic Segmentation, in: IEEE International Conference on Image Processing, ICIP, Abu Dhabi, United Arab Emirates, 2020, pp. 1946–1950. doi:10.1109/ICIP40778.2020.9191182.
[19] V. Arlazarov, K. Bulatov, T. Chernov, V. Arlazarov, MIDV-500: a dataset for identity document analysis and recognition on mobile devices in video stream, Computer Optics 43.5 (2019) 818–824. doi:10.18287/2412-6179-2019-43-5-818-824.
[20] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, TensorFlow: a system for large-scale machine learning, in: OSDI 16 (2016) 265–283.
[21] F. Chollet, Keras, 2015. URL: https://keras.io
[22] A. Simkiv, Practical Guide to Semantic Segmentation, 2020. URL: https://towardsdatascience.com/practical-guide-to-semantic-segmentation-7c55b540489c
[23] M. Kozlenko, I. Lazarovych, V. Tkachuk, V. Vialkova, Software Demodulation of Weak Radio Signals using Convolutional Neural Network, in: IEEE 7th International Conference on Energy Smart Systems, ESS, Kyiv, Ukraine, 2020, pp. 339–342. doi:10.1109/ESS50319.2020.9160035.