Photo Privacy Detection based on Text Classification and Face Clustering Lyudmila Kopeykina Andrey Savchenko National Research University Higher School of Economics Laboratory of Algorithms and Technologies for Network Analysis, Nizhny Novgorod, Russia National Research University Higher School of Economics lnkopeykina@mail.ru Nizhny Novgorod, Russia avsavchenko@hse.ru Abstract— Nowadays, the photo privacy detection is classification [9, 10, 11] and text recognition (optical character becoming an acute task due to a wide spread of mobile devices recognition, OCR) [12, 13]. In particular, to detect scanned with photos published on social networks. As a photo might personal documents, it is proposed to sequentially use the contain private or sensitive data, there is an urgent need to EAST text detector [14], the Tesseract OCR library [12] and accurately determine them and impose restrictions on their the neural network classification of recognized text on images. processing. In this paper we focus on the task of personal data To detect personal photos containing faces of the user himself, detection in a photo gallery. A novel two-stage approach is his close friends and relatives, the well-known methods of face proposed. At first, text of scanned documents is processed based clustering [15, 16, 17] are applied to face embeddings extracted on an EAST text detector, and extracted text is recognized using with CNNs (convolutional neural networks) [2, 18]. Tesseract and neural network classifier. At the second stage, face clustering is implemented for the remaining photos to identify The rest of the paper is organized as follows: In Section II large groups of people (friends, relatives) whose photos also refer we describe the proposed approach in detail. Section III to personal data and must be processed directly on a mobile includes experimental study of privacy detection methods. device. The remaining images can be sent to a remote server for Finally, in Section 4 the conclusion and future plans are processing with higher accuracy. The experimental results of text discussed recognition and face clustering methods using various convolutional networks for facial features extraction are II. MATERIALS AND METHODS presented. In this paper we concentrate on the following task. It is Keywords—photo privacy detection, face clustering, text required to assign an image from photo album to one of two detection and classification possible classes: private or public. The proposed approach is shown in Fig. 1. Let us discuss the most important parts of this I. INTRODUCTION pipeline in the rest of this section. The photo gallery of a typical mobile device contains A. Detection of Scanned Documents unique information about its user and reflects his or her preferences [1]. As a result, image-processing methods can be As a part of scanned documents detection, it is proposed to applied to build visual recommender engines [2]. Such deep consider various methods of text recognition. Firstly, image learning-based methods usually require significant computing areas containing textual information are detected using the resources and should be implemented on a remote server with EAST algorithm [14]. Further, Tesseract OCR in GPUs. However, there is an urgent need to restrict the image_to_string mode with LSTM (Long-Short Term processing of photos with some sensitive data in order to avoid Memory) recursive model is used to recognize text in each the potential risk of inappropriate usage of private information. detected area. The given approach is subsequently compared with a simplified text recognition method, in which the step of The privacy detection on photos is a worth considering preliminary text detection by the EAST detector is omitted. problem [3, 4] that has already reached a certain level of Instead, Tesseract is used both in text recognition mode and in maturity [5, 6, 7]. The demand for handling this issue is automatic page segmentation mode. justified by the need to distinguish personal photos that cannot be transferred to the third parties in terms of privacy policy, After that, to classify personal data in the extracted text, it and public information that can be sent to a remote server for is proposed to use a neural network, which is trained based on further deep processing and analysis. Moreover, the separate the input sequence of words recognized in the training set of processing of public and private photos improves the accuracy scanned documents [13]. One-hot encoding is used to and computational efficiency of algorithms. represent the input data as a feature vector. To be more exact, a dictionary of the V most frequently used words in the It is noticeable that the vast majority of private images training set is created, and each text is represented as a V- mainly contain such characteristics like human faces, textual dimensional binary vector, where the v-th component of the data (identification data and credit card numbers) and other vector is 1 only if the v-th word from the dictionary is general objects (private cars and buildings) [3, 8]. Therefore, presented in the input text ( so-called bag-of-words model) this work proposes a unified approach for personal data detection in photo gallery using well-known methods of face [19, 20]. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) Image Processing and Earth Remote Sensing Scanned documents processing Text Feature Text recognition detection extraction Text classification private public Remote server Mobile device Personal images Public images Large clusters Small clusters Face clustering Face detection Facial features extraction Processing of photos with faces Fig. 1. Proposed pipeline for photo privacy detection. To solve the binary classification problem, it is proposed to The procedure for combining selected individuals into use a computationally efficient implementation of a fully clusters supposes the assignment of each i-th facial image (i = connected neural network, which has already shown high 1, ..., N) to one of C ≥ 1 group, where C is usually unknown. performance in a similar problem of sentiment analysis [19]. Hence, one can apply either traditional agglomerative To train the above-mentioned network, we created a balanced clustering algorithms or rank linkage [15, 16] and graph CNNs corpus of 700 images [13]. The positive class is presented by [17]. An image is considered to be private if it contains faces 350 images of driving license and medical insurance cards, from sufficiently large clusters. In other words, a person passports and invoices from extension of the MIDV dataset presents at least Kmin times on different types of photos, where [21], whereas negative class consists of photos from publicly Kmin is a hyper-parameter of our method. That assumption is available datasets for text classification tasks DIQA [22] and based on the idea that the user’s gallery contains his own face Ghega [23]. This approach is sometimes as accurate as more complex methods based on CNNs and LSTMs. Moreover, it and faces of his close friends on the substantial part of photos. outperforms well-known traditional methods for detecting III. EXPERIMENTS AND RESULTS personal data, for example, the keyword spotting method [13]. In this section we present the experimental results of a B. Detection of Personal Photos Based on Face Clustering comparative analysis of the well-known text classification. As scanned documents are not the only option for personal Moreover, the comparison of clustering methods applied to data in the gallery, it is proposed to select images that contain facial features extracted with various CNN is given. Finally, faces of the user himself, his close friends and relatives [1, we analyze the performance of our approach to split user’s 24]. To detect such kind of personal photos, it is proposed to photos into to private and public images. apply the following approach. At first, the facial regions are A. Detection of Scanned Documents detected in all photographs using well-known methods for At first, we compare various approaches for text extraction face detection like cascade classifiers or MTCNN [25]. Since in terms of traditional keyword spotting method, which aims to there are no labels of people in the user's photo gallery, the search specially selected words (“passport”, “card”, etc.) [13] task can be reformulated as a face clustering problem [16, 24]. in recognized text. Namely, we compare simultaneous For doing this, D-dimensional feature vectors are extracted [9, detection of text on images and its recognition using Tesseract 11] for each of N > 0 selected facial images by using a CNN, with the approach when text regions are preliminary detected pre-trained to identify faces from a large (external) datasets by EAST detector and text is recognized by Tesseract OCR like VGGFace-2, MS-Celeb, etc. engine. In addition to traditional keyword spotting, three neural network models are compared: VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 172 Image Processing and Earth Remote Sensing  Recurrent model, which fed a sequence of 400 words the face is not detected, a square region with the size from a dictionary of V = 5000 frequently encountered words as chosen as a 1.5-times distance between eyes is input for the vector representation (embedding) with the size of extracted. the attribute space 256. Next, we use the LSTM layer with 128 hidden components, the dropout layer with a drop rate of 0.5.  Subset of labeled faces in the wild (LFW) dataset [27] used to test face identification algorithms [11]. It  CNN, consisting of one-dimensional convolutional includes photos of those subjects, who has at least two layer (with 32 neurons, core size of 7 and ReLU activation images in the original LFW dataset and at least one function), maxpooling and dropout layers (with a drop rate of video in the YouTube Faces (YTF) collection. 0.5). As the first layer of the model, a vector representation Firstly, hierarchical agglomerative clustering is considered (embedding) of 256 was also used. for the distance L2 between normalized feature vectors with the  Fully connected network with 2 hidden layers of 16 following types of linkage: single linkage, average linkage, neurons with hyperbolic tangent activation. The V-dimensional complete linkage, weighted linkage, centroid linkage and vector encoded as described in Subsection IIA (bag-of-words) median linkage from the SciPy library. Further, the rank-order is considered as input for the model. clustering [15] was examined as it was specially developed for organizing faces in photo albums. It uses special rank linkage, The last fully connected layer of each model used the which is further used to compute distance measure. Then this sigmoid activation. To train classifiers, TensorFlow and Keras approach was compared to the approximate rank-order frameworks were used. All classifiers were trained over 20 algorithm [28], in which only the top-k neighbors are taken epochs using the RMSprop optimizer. into consideration rather than the complete list of neighbors. A quantitative comparison of all methods described above This approach makes the actual rank of neighbors irrelevant is presented in Table I. The results were obtained using a 5- because the importance is shifted towards the presence / fold cross-validation. absence of shared nearest neighbors. Finally, we examined clustering method based on the graph CNN [29, 30]. Each TABLE I. RESULS FOR CLASSIFICATION OF SCANNED DOCUMENTS element of the feature matrix is considered as a separate vertex of the graph. Using the cosine distance, k nearest neighbors are Error Model Precision Recall F-score rate found for each element of the dataset. Thus, by connecting Keyword between neighbors, a similarity graph for the entire dataset is 0.83 0.62 0.70 0.276 obtained. Instead of processing such graph directly, subgraphs- spotting Tesseract LSTM 0.97 0.93 0.94 0.043 proposals are first generated, on the basis of which the CNN 0.88 0.77 0.82 0.161 resulting clusters are subsequently built. Fully- 0.98 0.94 0.95 0.028 connected To extract facial features, traditional pre-trained models Keyword downloaded from the official websites of their developers were 0.90 0.75 0.81 0.161 spotting considered: Proposed LSTM 0.93 0.99 0.95 0.038 (EAST+ Tesseract) CNN 0.89 0.79 0.83 0.144  VGGFace (VGGNet-16) [31] extracts 4096-D vectors; Fully- 1.00 0.97 0.98 0.015 connected  VGGFace2 (ResNet-50) [9] extracts 2048-D vectors; Here the use of EAST text detector to identify areas with  MobileNet [24] extracts 1024-D vectors; text was a reasonable solution. While the error rate attained  InsightFace (ArcFace) [32] extracts 512-D vectors; using only Tesseract is more than 27%, the proposed preliminary detection of text using the EAST detector reduces  FaceNet (Inception ResNet v1) [10] extracts 512-D this error to approximately 16%. In addition, we can conclude vectors. that the proposed implementation with the EAST text detector Table III contains the Rand index (ARI), mutual increases the average accuracy by approximately 2%. A fully- information index (AMI), homogeneity and completeness. In connected network achieves best results with accuracy that addition, the average number K of selected clusters to the exceeds even traditional LSTM. Moreover, such an number of groups C and the b-cubed F-measure, traditional for implementation 15% more accurately determines the image assessing the quality of face clustering, are calculated. class of the document in comparison with the traditional keyword spotting. Considering the results, clustering applied to facial features extracted with ResNet-50 (VGGFace2) and Inception ResNet B. Face Clustering v1 (FaceNet) perform more accurate results according to most We used the publicly available facial datasets: of the metrics compared to other models. Although MobileNet is slightly inferior, it takes twice less time to extract face  Gallagher collection person dataset [26], which contains embeddings compared to VGGFace2 and FaceNet. InsightFace 589 images with 931 labeled faces of 32 various people. features in most cases shows slightly worse capacity to define As only eyes positions are available in this dataset, to clusters. In addition, the weighted linkage demonstrates higher gather faces MTCNN [25] was preliminarily used to F-score for both datasets in comparison with other clustering detect faces and choose the subject with the largest methods (over 92%). intersection of facial region with given eyes region. If VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 173 Image Processing and Earth Remote Sensing TABLE II. CLUSTERING RESULTS FOR GALLAGHER DATASET CNN Time, sec K/C ARI AMI Homogeneity Completeness F-score VGGFace2 32.17 1.25 0.480 0.627 0.794 0.635 0.706 VGGFace 21.72 1.50 0.439 0.569 0.764 0.585 0.671 Rank-order MobileNet 22.71 2.09 0.674 0.678 0.965 0.611 0.725 InsightFace 27.84 1.59 0.502 0.530 0.729 0.716 0.625 FaceNet 24.54 1.53 0.674 0.681 0.906 0.633 0.760 VGGFace2 0.016 3.06 0.267 0.568 0.553 0.752 0.631 VGGFace 0.024 2.75 0.260 0.559 0.531 0.763 0.623 Single MobileNet 0.022 2.72 0.280 0.586 0.562 0.767 0.636 linkage InsightFace 0.025 2.72 0.109 0.294 0.296 0.607 0.503 FaceNet 0.013 3.09 0.286 0.592 0.579 0.762 0.642 VGGFace2 0.021 1.50 0.662 0.763 0.762 0.819 0.892 VGGFace 0.021 2.15 0.648 0.771 0.794 0.808 0.802 Average MobileNet 0.019 2.03 0.882 0.868 0.961 0.822 0.891 linkage InsightFace 0.027 3.12 0.707 0.711 0.891 0.660 0.739 FaceNet 0.018 2.31 0.886 0.868 0.942 0.835 0.895 VGGFace2 0.032 1.09 0.859 0.867 0.911 0.853 0.888 VGGFace 0.023 1.18 0.616 0.743 0.876 0.690 0.711 Complete MobileNet 0.019 0.41 0.863 0.816 0.798 0.861 0.836 linkage InsightFace 0.018 1.75 0.367 0.576 0.819 0.521 0.512 FaceNet 0.013 0.65 0.710 0.813 0.826 0.830 0.821 VGGFace2 0.033 1.50 0.891 0.898 0.946 0.876 0.921 VGGFace 0.019 1.03 0.599 0.737 0.704 0.830 0.762 Weighted MobileNet 0.018 0.75 0.751 0.788 0.792 0.818 0.806 linkage InsightFace 0.018 1.72 0.655 0.697 0.806 0.675 0.734 FaceNet 0.015 1.47 0.884 0.881 0.934 0.857 0.902 VGGFace2 0.785 3.91 0.515 0.535 0.586 0.641 0.704 VGGFace 1.312 3.78 0.446 0.485 0.509 0.681 0.653 Approximate MobileNet 1.414 6.68 0.417 0.516 0.522 0.795 0.635 rank-order InsightFace 1.220 5.78 0.324 0.324 0.471 0.656 0.571 FaceNet 1.092 4.05 0.567 0.621 0.626 0.764 0.724 VGGFace2 5.006 1.67 0.867 0.845 0.954 0.793 0.859 VGGFace 4.741 0.78 0.641 0.536 0.627 0.539 0.578 GCN-D MobileNet 6.290 0.69 0.675 0.748 0.799 0.742 0.728 InsightFace 6.862 0.65 0.409 0.612 0.603 0.682 0.637 FaceNet 6.164 0.91 0.636 0.726 0.751 0.749 0.687 TABLE III. CLUSTERING RESULTS FOR LFW DATASET CNN Time, sec K/C ARI AMI Homogeneity Completeness F-score VGGFace2 416.73 0.96 0.719 0.781 0.980 0.911 0.862 VGGFace 309.44 0.82 0.675 0.748 0.812 0.762 0.746 Rank-order MobileNet 305.03 0.77 0.786 0.816 0.944 0.907 0.806 InsightFace 361.02 1.21 0.673 0.721 0.842 0.912 0.683 FaceNet 359.62 0.91 0.784 0.832 0.924 0.917 0.812 VGGFace2 0.47 1.66 0.969 0.940 0.998 0.951 0.917 VGGFace 0.64 1.86 0.854 0.876 0.962 0.931 0.847 Single MobileNet 0.60 1.52 0.744 0.871 0.930 0.951 0.854 linkage InsightFace 0.68 2.08 0.837 0.838 0.951 0.911 0.804 FaceNet 0.50 1.63 0.967 0.935 0.993 0.952 0.912 VGGFace2 0.69 1.49 0.966 0.945 0.998 0.955 0.926 VGGFace 0.61 1.36 0.946 0.933 0.988 0.953 0.911 Average MobileNet 0.64 1.48 0.968 0.943 0.997 0.954 0.923 linkage InsightFace 0.73 1.37 0.887 0.873 0.972 0.920 0.831 FaceNet 0.67 1.54 0.960 0.937 0.997 0.949 0.918 VGGFace2 0.57 1.13 0.744 0.935 0.992 0.951 0.910 VGGFace 0.62 0.99 0.621 0.873 0.966 0.921 0.821 Complete MobileNet 0.62 1.06 0.852 0.925 0.980 0.953 0.894 linkage InsightFace 0.55 0.90 0.756 0.793 0.926 0.889 0.720 FaceNet 0.53 1.07 0.748 0.929 0.986 0.951 0.900 VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 174 Image Processing and Earth Remote Sensing TABLE III. CLUSTERING RESULTS FOR LFW DATASET (CONT.) VGGFace2 0.63 1.37 0.893 0.941 0.998 0.952 0.923 VGGFace 0.61 1.28 0.925 0.925 0.984 0.950 0.901 Weighted MobileNet 0.59 1.44 0.961 0.940 0.996 0.952 0.919 linkage InsightFace 0.67 1.42 0.879 0.864 0.972 0.913 0.820 FaceNet 0.64 1.44 0.935 0.938 0.997 0.950 0.919 VGGFace2 9.49 1.42 0.803 0.877 0.924 0.952 0.923 VGGFace 7.12 1.30 0.621 0.706 0.893 0.816 0.724 Approximate MobileNet 7.06 1.79 0.610 0.741 0.864 0.912 0.740 rank-order InsightFace 12.32 1.57 0.684 0.711 0.849 0.908 0.685 FaceNet 12.72 1.13 0.782 0.859 0.932 0.937 0.844 VGGFace2 30.33 0.84 0.075 0.395 0.814 0.711 0.512 VGGFace 28.47 0.69 0.044 0.235 0.866 0.669 0.456 GCN-D MobileNet 31.23 0.86 0.332 0.665 0.882 0.825 0.639 InsightFace 30.18 0.74 0.802 0.732 0.874 0.875 0.666 FaceNet 31.79 0.92 0.141 0.543 0.828 0.770 0.588 Agglomerative clustering with average linkage performs are initially private and they are marked as private by the second most accurate results (approximately 90%). algorithm. If Kmin=3, then 5% of private photos will be moved Furthermore, connectivity graph-based method demonstrates to public set. With an increase of Kmin, the trend for type 1 error poor results on the given data. The use of rank distance is is going upwards unstably and ends up with 2%. At the same impractical due to the rather low values for each metric and its time, the probability to assign public images to private quadratic complexity. Even though the approximation of rank- decreases and reaches 0%. order clustering takes less time to split data into groups compared to the original method, the results still do not In the final experiment, we compared the results given by outperform those of traditional agglomerative algorithms. various descriptors on LFW (Table IV). “0” class consists of 3263 private images, whereas public class “1” includes 474. Moreover, we analyzed the dependence between the Here, images containing faces from clusters that include minimum number of faces in cluster to set it private (Kmin) and Kmin=3 or more facial images, were considered personal. Here the type 1 and type 2 error rates for the LFW subset (Fig. 2). all face descriptors lead to a fairly high quality of detection, but Since ground truth labels in terms of private and public photos zero probability of missing personal data was not achieved. In for that dataset were not provided, we determined them as this case, the best results are obtained using VGGFace2 follows. All objects from classes, the number of photos in (ResNet-50) and FaceNet models. which is greater than or equal to Kmin, were considered to be private. The remaining images were assigned to public images. TABLE IV. CLASSIFICATION RESULTS FOR LFW We used agglomerative clustering with weighted linkage and Feature FPR FNR Recall F1- Error VGGFace2 descriptor as it provided best results according to Precision extractor score rate conducted experiments. VGGFace2 0.051 0.019 0.738 0.978 0.842 0.047 VGGFace 0.055 0.276 0.655 0.723 0.688 0.084 MobileNet 0.054 0.168 0.687 0.831 0.752 0.069 InsightFace 0.115 0.281 0.474 0.719 0.571 0.137 FaceNet 0.056 0.044 0.712 0.952 0.816 0.055 IV. CONCLUSION The task of personal photos detection is difficult in terms of finding an effective solution due to its inherent subjectivity. In this paper, it is assumed that personal data contains confidential textual information and images with the user, his close friends and relatives. This assumption allows to highlight personal photos accurately and impose restrictions on their processing. To highlight such data, a novel approach was proposed in the current work (Fig. 1). It is proposed to use the EAST text detector and recognize text in the detected areas with Tesseract OCR library to classify scanned documents. It has been experimentally shown that a simple fully-connected neural network for text encoded using bag-of-words [13] exceeds more complex network architectures, such as CNN, by more than 10% and achieves high accuracy in detecting personal Fig. 2. The dependence between the minimal number Kmin of photos in a documents. In addition, in agglomerative clustering with a personal cluster and type1/type 2 error rates, LFW dataset. weighted linkage performed higher results in extracting groups of user’s faces, friends and relatives (Tables II and III). According to the results, zero rate of missing private photos is achieved with Kmin=2. It means that all photos from dataset VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 175 Image Processing and Earth Remote Sensing ACKNOWLEDGMENT the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5551-5560, 2017. The paper was prepared within the framework of the [15] C. Zhu, F. Wen and J. Sun, “A rank-order distance based clustering Academic Fund Program at the National Research University algorithm for face tagging,” CVPR IEEE, pp. 481-488, 2011. Higher School of Economics (HSE University) in 2019-2020 [16] Y. Shi, C. Otto and A. K. Jain, “Face clustering: representation and (grant No 19-04-004) and by the Russian Academic Excellence pairwise constraints,” IEEE Transactions on Information Forensics and Project «5-100». Security, vol. 13, no. 7, pp. 1626-1640, 2018. [17] L. Yang, X. Zhan, D. Chen, J. Yan, C. C. Loy and D. Lin, “Learning to REFERENCES cluster faces on an affinity graph,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2298-2306, 2019. [1] I. Grechikhin and A.V. Savchenko, “User Modeling on Mobile Device Based on Facial Clustering and Object Detection in Photos and Videos,” [18] A.V. Savchenko, "Probabilistic neural network with complex Iberian Conference on Pattern Recognition and Image exponential activation functions in image recognition," IEEE Analysis, Springer, Cham, pp. 429-440, 2019. Transactions on Neural Networks and Learning Systems, vol. 31, no. 2, pp. 651-660, 2020. [2] I. Goodfellow, Y. Bengio and A. Courville, “Deep learning,” MIT press, 2016. [19] F. Chollet, "Deep learning with Python," Manning Publications, 2017. [3] L. Tran, D. Kong, H. Jin and J. Liu, “Privacy-CNH: A framework to [20] A.V. Savchenko and E.V. Miasnikov, “Event recognition based on detect photo privacy with convolutional neural network using classification of generated image captions,” International Symposium on hierarchical features,” Thirtieth AAAI Conference on Artificial Intelligent Data Analysis (IDA), pp. 418-430, 2020. Intelligence (AAAI), pp. 1317-1323, 2016. [21] V.V. Arlazarov, K. Bulatov, T. Chernov and V.L. Arlazarov, “MIDV- [4] H. Zhong, A.C. Squicciarini, D.J. Miller and C. Caragea, “A Group- 500: a dataset for identity document analysis and recognition on mobile Based Personalized Model for Image Privacy Classification and devices in video stream”, Computer Optics, vol. 43, no. 5, pp. 818-824, Labeling,” International Joint Conferences on Artificial Intelligence 2019. DOI: 10.18287/2412-6179-2019-43-5-818-824. (IJCAI), vol. 17, pp. 3952-3958, 2017. [22] P. Ye and D. Doermann, “Document image quality assessment: A brief [5] A. Tonge and C. Caragea, “Dynamic deep multi-modal fusion for image survey”, 12th International Conference on Document Analysis and privacy prediction,” The World Wide Web Conference (WWW), pp. Recognition, IEEE, pp. 723-727, 2013. 1829-1840, 2019. [23] A. Bartoli, G. Davanzo, E. Medvet and E. Sorio, “Improving features [6] A. Tonge and C. Caragea, “Image privacy prediction using deep neural extraction for supervised invoice classification,” Proceedings of the 10th networks,” ACM Transactions on the Web (TWEB), vol. 14, no. 2, pp. IASTED International Conference, vol. 674, no. 040, p. 401, 2010. 1-32, 2020. [24] A.V. Savchenko, “Efficient facial representations for age, gender and [7] C. Sitaula, Y. Xiang, S. Aryal and X. Lu, “Unsupervised deep features identity recognition in organizing photo albums using multi-output for privacy image classification,” Pacific-Rim Symposium on Image and ConvNet,” PeerJ Computer Science, e197, 2019. Video Technology, pp. 404-415, 2019. [25] K. Zhang, Z. Zhang, Z. Li and Y. Qiao, “Joint face detection and [8] J. He, B. Liu, D. Kong, X. Bao, N. Wang, H. Jin and G. Kesidis, alignment using multitask cascaded convolutional networks,” IEEE “Puppies: Transformation-supported personalized privacy preserving Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, 2016. partial image sharing,” 46th Annual IEEE/IFIP International Conference [26] A.C. Gallagher and T. Chen, “Clothing cosegmentation for recognizing on Dependable Systems and Networks (DSN), IEEE, pp. 359-370, 2016. people” IEEE Conference on Computer Vision and Pattern Recognition, [9] Q. Cao, L. Shen, W. Xie, O.M. Parkhi and A. Zisserman, “Vggface2: A pp. 1-8, 2008. dataset for recognising faces across pose and age,” 3th International [27] G.B. Huang, M. Mattar, T. Berg and E. Learned-Miller, “Labeled faces Conference on Automatic Face & Gesture Recognition (FG), IEEE, pp. in the wild: A database forstudying face recognition in unconstrained 67-74, 2018. environments,” 2018. [10] F. Schroff, D. Kalenichenko and J. Philbin, “FaceNet: A unified [28] C. Otto, D. Wang and A.K. Jain, “Clustering millions of faces by embedding for face recognition and clustering,” Proceedings of the identity,” IEEE transactions on pattern analysis and machine IEEE conference on computer vision and pattern recognition, pp. 815- intelligence, vol. 40, no. 2, pp. 289-303, 2017. 823, 2015. [29] L. Yang, D. Chen, X. Zhan, R. Zhao, C.C. Loy and D. Lin, “Learning to [11] A.V. Savchenko and N.S. Belova, "Unconstrained face identification cluster faces via confidence and connectivity estimation,” arXiv preprint using maximum likelihood of distances between deep off-the-shelf arXiv:2004.00445, 2020. features," Expert Systems with Applications, vol. 108, pp. 170-182, [30] L. Yang, D. Chen, X. Zhan, R. Zhao, C.C. Loy and D. Lin, “Learning to 2018. cluster faces on an affinity graph,” Proceedings of the IEEE Conference [12] R. Smith, “An overview of the Tesseract OCR engine” Ninth on Computer Vision and Pattern Recognition, pp. 2298-2306, 2019. International Conference on Document Analysis and Recognition [31] O.M. Parkhi, A. Vedaldi and A. Zisserman, “Deep face recognition,” (ICDAR), IEEE, vol. 2, pp. 629-633, 2007. Britich Machine Vision Conference (BMVC), 2015. [13] L. Kopeykina, A.V. Savchenko, “Automatic privacy detection in [32] J. Deng, J. Guo, N. Xue and S. Zafeiriou, “Arcface: Additive angular scanned document images based on deep neural networks,” Proceedings margin loss for deep face recognition,” Proceedings of the IEEE of International Russian Automation Conference (RusAutoCon), IEEE, Conference on Computer Vision and Pattern Recognition, pp. 4690- pp. 1-6, 2019. 4699, 2019. [14] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He and J. Liang, "EAST: an efficient and accurate scene text detector," Proceedings of VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 176