Photo Privacy Detection based on Text Classification and Face Clustering

Lyudmila Kopeykina
National Research University Higher School of Economics
Nizhny Novgorod, Russia
lnkopeykina@mail.ru

Andrey Savchenko
Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics
Nizhny Novgorod, Russia
avsavchenko@hse.ru

    Abstract: Photo privacy detection is becoming an acute task due to the wide spread of mobile devices whose photos are published on social networks. As a photo might contain private or sensitive data, there is an urgent need to detect such photos accurately and impose restrictions on their processing. In this paper we focus on the task of personal data detection in a photo gallery. A novel two-stage approach is proposed. At the first stage, scanned documents are processed: text regions are located with the EAST text detector, the text is recognized using Tesseract, and the recognized text is classified by a neural network. At the second stage, face clustering is applied to the remaining photos to identify people who appear in many photos (friends, relatives); such photos also constitute personal data and must be processed directly on the mobile device. The remaining images can be sent to a remote server for processing with higher accuracy. Experimental results are presented for text recognition methods and for face clustering methods that use various convolutional networks for facial feature extraction.

    Keywords: photo privacy detection, face clustering, text detection and classification

I. INTRODUCTION

    The photo gallery of a typical mobile device contains unique information about its user and reflects his or her preferences [1]. As a result, image-processing methods can be applied to build visual recommender engines [2]. Such deep learning-based methods usually require significant computing resources and should be implemented on a remote server with GPUs. However, there is an urgent need to restrict the processing of photos with sensitive data in order to avoid the potential risk of inappropriate usage of private information.

    Privacy detection in photos is a problem worth considering [3, 4] that has already reached a certain level of maturity [5, 6, 7]. The demand for handling this issue is justified by the need to distinguish personal photos, which under the privacy policy cannot be transferred to third parties, from public information, which can be sent to a remote server for further deep processing and analysis. Moreover, the separate processing of public and private photos improves the accuracy and computational efficiency of algorithms.

    Notably, the vast majority of private images contain human faces, textual data (identification data and credit card numbers) or other general objects (private cars and buildings) [3, 8]. Therefore, this work proposes a unified approach for personal data detection in a photo gallery using well-known methods of face classification [9, 10, 11] and text recognition (optical character recognition, OCR) [12, 13]. In particular, to detect scanned personal documents, it is proposed to sequentially use the EAST text detector [14], the Tesseract OCR library [12] and neural network classification of the text recognized in the images. To detect personal photos containing faces of the user, his or her close friends and relatives, the well-known methods of face clustering [15, 16, 17] are applied to face embeddings extracted with CNNs (convolutional neural networks) [2, 18].

    The rest of the paper is organized as follows. In Section II we describe the proposed approach in detail. Section III includes an experimental study of privacy detection methods. Finally, in Section IV the conclusion and future plans are discussed.

II. MATERIALS AND METHODS

    In this paper we concentrate on the following task. It is required to assign an image from a photo album to one of two possible classes: private or public. The proposed approach is shown in Fig. 1. Let us discuss the most important parts of this pipeline in the rest of this section.

A. Detection of Scanned Documents

    As a part of scanned document detection, it is proposed to consider various methods of text recognition. Firstly, image areas containing textual information are detected using the EAST algorithm [14]. Next, Tesseract OCR in image_to_string mode with an LSTM (Long Short-Term Memory) recurrent model is used to recognize the text in each detected area. This approach is subsequently compared with a simplified text recognition method, in which the step of preliminary text detection by the EAST detector is omitted. Instead, Tesseract is used both in text recognition mode and in automatic page segmentation mode.

    After that, to classify personal data in the extracted text, it is proposed to use a neural network trained on the sequences of words recognized in the training set of scanned documents [13]. One-hot encoding is used to represent the input data as a feature vector. To be more exact, a dictionary of the V most frequently used words in the training set is created, and each text is represented as a V-dimensional binary vector, where the v-th component of the vector is 1 only if the v-th word from the dictionary is present in the input text (the so-called bag-of-words model) [19, 20].
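The binary bag-of-words encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_vocabulary` and `encode_binary` are hypothetical, and tokenization by lowercasing and splitting on whitespace is an assumption, since the paper does not specify how the recognized text is split into words.

```python
from collections import Counter


def build_vocabulary(texts, vocab_size):
    """Map the vocab_size most frequent words of the training texts to indices.

    Tokenization (lowercase + whitespace split) is an assumption of this sketch.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(vocab_size))}


def encode_binary(text, vocabulary):
    """Encode a text as a V-dimensional binary vector (bag-of-words).

    Component v is 1 only if the v-th dictionary word occurs in the text.
    """
    vector = [0] * len(vocabulary)
    for word in set(text.lower().split()):
        idx = vocabulary.get(word)
        if idx is not None:
            vector[idx] = 1
    return vector


if __name__ == "__main__":
    training_texts = ["passport passport passport number number card", "hello"]
    vocabulary = build_vocabulary(training_texts, vocab_size=2)
    print(encode_binary("number of the card", vocabulary))
```

Word order and word counts are discarded by this encoding, which is what allows a small fully connected network to consume the vector directly.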


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Image Processing and Earth Remote Sensing


[Figure: two processing branches. Scanned documents processing: text detection, text recognition, feature extraction, text classification (private / public). Processing of photos with faces: face detection, facial features extraction, face clustering (large clusters / small clusters). Personal images (private documents, large clusters) stay on the mobile device; public images (public documents, small clusters) are sent to the remote server.]
  Fig. 1. Proposed pipeline for photo privacy detection.

    To solve the binary classification problem, it is proposed to use a computationally efficient implementation of a fully connected neural network, which has already shown high performance in a similar problem of sentiment analysis [19]. To train this network, we created a balanced corpus of 700 images [13]. The positive class is represented by 350 images of driving licenses, medical insurance cards, passports and invoices from an extension of the MIDV dataset [21], whereas the negative class consists of photos from the publicly available datasets for text classification tasks DIQA [22] and Ghega [23]. This approach is sometimes as accurate as more complex methods based on CNNs and LSTMs. Moreover, it outperforms well-known traditional methods for detecting personal data, for example, the keyword spotting method [13].

B. Detection of Personal Photos Based on Face Clustering

    As scanned documents are not the only kind of personal data in the gallery, it is proposed to select images that contain faces of the user, his or her close friends and relatives [1, 24]. To detect such personal photos, it is proposed to apply the following approach. At first, the facial regions are detected in all photographs using well-known methods for face detection like cascade classifiers or MTCNN [25]. Since there are no labels of people in the user's photo gallery, the task can be reformulated as a face clustering problem [16, 24]. To do this, D-dimensional feature vectors are extracted [9, 11] for each of N > 0 selected facial images by using a CNN pre-trained to identify faces on large (external) datasets like VGGFace-2, MS-Celeb, etc.

    The procedure for combining the selected faces into clusters assigns each i-th facial image (i = 1, ..., N) to one of C ≥ 1 groups, where C is usually unknown. Hence, one can apply either traditional agglomerative clustering algorithms or rank linkage [15, 16] and graph CNNs [17]. An image is considered to be private if it contains faces from sufficiently large clusters. In other words, a person appears at least Kmin times in different photos, where Kmin is a hyper-parameter of our method. That assumption is based on the idea that the user's gallery contains his or her own face and the faces of close friends in a substantial part of the photos.

III. EXPERIMENTS AND RESULTS

    In this section we present the experimental results of a comparative analysis of well-known text classification methods. Moreover, a comparison of clustering methods applied to facial features extracted with various CNNs is given. Finally, we analyze the performance of our approach in splitting the user's photos into private and public images.

A. Detection of Scanned Documents

    At first, we compare various approaches to text extraction using the traditional keyword spotting method, which searches the recognized text for specially selected words ("passport", "card", etc.) [13]. Namely, we compare simultaneous detection and recognition of text in images using Tesseract alone with the approach in which text regions are first detected by the EAST detector and the text is then recognized by the Tesseract OCR engine. In addition to traditional keyword spotting, three neural network models are compared:


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)

    • Recurrent model, which is fed a sequence of 400 words from a dictionary of the V = 5000 most frequent words. The words are mapped to a vector representation (embedding) of size 256, followed by an LSTM layer with 128 hidden units and a dropout layer with a drop rate of 0.5.

    • CNN, consisting of a one-dimensional convolutional layer (with 32 filters, kernel size of 7 and ReLU activation function), max-pooling and dropout layers (with a drop rate of 0.5). An embedding of size 256 is also used as the first layer of the model.

    • Fully connected network with 2 hidden layers of 16 neurons with hyperbolic tangent activation. Its input is the V-dimensional vector encoded as described in Subsection II-A (bag-of-words).

    The last fully connected layer of each model uses the sigmoid activation. The classifiers were implemented with the TensorFlow and Keras frameworks. All classifiers were trained for 20 epochs using the RMSprop optimizer.

    A quantitative comparison of all methods described above is presented in Table I. The results were obtained using 5-fold cross-validation.

    TABLE I. RESULTS FOR CLASSIFICATION OF SCANNED DOCUMENTS

    Text recognition             Model              Precision   Recall   F-score   Error rate
    Tesseract                    Keyword spotting   0.83        0.62     0.70      0.276
    Tesseract                    LSTM               0.97        0.93     0.94      0.043
    Tesseract                    CNN                0.88        0.77     0.82      0.161
    Tesseract                    Fully-connected    0.98        0.94     0.95      0.028
    Proposed (EAST+Tesseract)    Keyword spotting   0.90        0.75     0.81      0.161
    Proposed (EAST+Tesseract)    LSTM               0.93        0.99     0.95      0.038
    Proposed (EAST+Tesseract)    CNN                0.89        0.79     0.83      0.144
    Proposed (EAST+Tesseract)    Fully-connected    1.00        0.97     0.98      0.015

    Here the use of the EAST text detector to identify areas with text was a reasonable solution. For keyword spotting, the error rate attained using only Tesseract is more than 27%, whereas the proposed preliminary detection of text using the EAST detector reduces this error to approximately 16%. In addition, we can conclude that the proposed implementation with the EAST text detector increases the average accuracy by approximately 2%. The fully-connected network achieves the best results, with accuracy that exceeds even that of the traditional LSTM. Moreover, this implementation determines the class of a document image 15% more accurately than traditional keyword spotting.

B. Face Clustering

    We used the following publicly available facial datasets:

    • Gallagher collection person dataset [26], which contains 589 images with 931 labeled faces of 32 different people. As only eye positions are available in this dataset, MTCNN [25] was first used to detect faces, and the detection with the largest intersection of the facial region with the given eye region was chosen. If the face is not detected, a square region whose size is 1.5 times the distance between the eyes is extracted.

    • A subset of the Labeled Faces in the Wild (LFW) dataset [27], used to test face identification algorithms [11]. It includes photos of those subjects who have at least two images in the original LFW dataset and at least one video in the YouTube Faces (YTF) collection.

    Firstly, hierarchical agglomerative clustering is considered for the L2 distance between normalized feature vectors with the following types of linkage from the SciPy library: single linkage, average linkage, complete linkage, weighted linkage, centroid linkage and median linkage. Further, rank-order clustering [15] was examined, as it was specially developed for organizing faces in photo albums. It uses a special rank linkage, which is then used to compute the distance measure. This approach was then compared to the approximate rank-order algorithm [28], in which only the top-k neighbors are taken into consideration rather than the complete list of neighbors. This makes the actual rank of neighbors irrelevant, because the importance is shifted towards the presence or absence of shared nearest neighbors. Finally, we examined a clustering method based on the graph CNN [29, 30]. Each element of the feature matrix is considered as a separate vertex of the graph. Using the cosine distance, k nearest neighbors are found for each element of the dataset. Thus, by connecting neighbors, a similarity graph for the entire dataset is obtained. Instead of processing this graph directly, subgraph proposals are first generated, on the basis of which the resulting clusters are subsequently built.

    To extract facial features, traditional pre-trained models downloaded from the official websites of their developers were considered:

    • VGGFace (VGGNet-16) [31] extracts 4096-D vectors;

    • VGGFace2 (ResNet-50) [9] extracts 2048-D vectors;

    • MobileNet [24] extracts 1024-D vectors;

    • InsightFace (ArcFace) [32] extracts 512-D vectors;

    • FaceNet (Inception ResNet v1) [10] extracts 512-D vectors.

    Tables II and III report the adjusted Rand index (ARI), adjusted mutual information (AMI), homogeneity and completeness. In addition, the ratio K/C of the number of obtained clusters K to the number of groups C and the B-cubed F-measure, traditional for assessing the quality of face clustering, are calculated.

    Considering the results, clustering applied to facial features extracted with ResNet-50 (VGGFace2) and Inception ResNet v1 (FaceNet) yields more accurate results according to most of the metrics compared to other models. Although MobileNet is slightly inferior, it takes about half the time to extract face embeddings compared to VGGFace2 and FaceNet. InsightFace features in most cases show a slightly worse ability to form clusters. In addition, the weighted linkage demonstrates a higher F-score for both datasets in comparison with other clustering methods (over 92%).
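The agglomerative clustering setup described above (L2 distance between normalized feature vectors, linkage from the SciPy library) and the Kmin rule from Section II-B can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `cluster_faces` and `private_image_ids`, the toy 2-D embeddings and the distance threshold of 0.5 are assumptions made only for the example; real embeddings would come from one of the CNNs listed above.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_faces(embeddings, distance_threshold, method="weighted"):
    """Agglomerative clustering of face embeddings.

    Embeddings are L2-normalized first, then merged with the chosen SciPy
    linkage; the dendrogram is cut at distance_threshold (a hypothetical
    value that would have to be tuned on validation data).
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    tree = linkage(x, method=method, metric="euclidean")
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    return [int(label) for label in labels]


def private_image_ids(face_image_ids, labels, k_min):
    """Return images containing a face from a cluster with >= k_min faces."""
    cluster_sizes = Counter(labels)
    return {
        image_id
        for image_id, label in zip(face_image_ids, labels)
        if cluster_sizes[label] >= k_min
    }


if __name__ == "__main__":
    # Toy embeddings: three faces of one person, one face of a stranger.
    embeddings = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1], [0.0, 1.0]]
    image_ids = ["img1", "img2", "img3", "img4"]
    labels = cluster_faces(embeddings, distance_threshold=0.5)
    print(sorted(private_image_ids(image_ids, labels, k_min=3)))
```

Cutting the dendrogram by distance rather than by a fixed number of clusters matches the setting in which the number of people C is unknown in advance.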




TABLE II. CLUSTERING RESULTS FOR GALLAGHER DATASET

Clustering               CNN           Time, sec   K/C    ARI     AMI     Homogeneity   Completeness   F-score
Rank-order               VGGFace2      32.17       1.25   0.480   0.627   0.794         0.635          0.706
Rank-order               VGGFace       21.72       1.50   0.439   0.569   0.764         0.585          0.671
Rank-order               MobileNet     22.71       2.09   0.674   0.678   0.965         0.611          0.725
Rank-order               InsightFace   27.84       1.59   0.502   0.530   0.729         0.716          0.625
Rank-order               FaceNet       24.54       1.53   0.674   0.681   0.906         0.633          0.760
Single linkage           VGGFace2      0.016       3.06   0.267   0.568   0.553         0.752          0.631
Single linkage           VGGFace       0.024       2.75   0.260   0.559   0.531         0.763          0.623
Single linkage           MobileNet     0.022       2.72   0.280   0.586   0.562         0.767          0.636
Single linkage           InsightFace   0.025       2.72   0.109   0.294   0.296         0.607          0.503
Single linkage           FaceNet       0.013       3.09   0.286   0.592   0.579         0.762          0.642
Average linkage          VGGFace2      0.021       1.50   0.662   0.763   0.762         0.819          0.892
Average linkage          VGGFace       0.021       2.15   0.648   0.771   0.794         0.808          0.802
Average linkage          MobileNet     0.019       2.03   0.882   0.868   0.961         0.822          0.891
Average linkage          InsightFace   0.027       3.12   0.707   0.711   0.891         0.660          0.739
Average linkage          FaceNet       0.018       2.31   0.886   0.868   0.942         0.835          0.895
Complete linkage         VGGFace2      0.032       1.09   0.859   0.867   0.911         0.853          0.888
Complete linkage         VGGFace       0.023       1.18   0.616   0.743   0.876         0.690          0.711
Complete linkage         MobileNet     0.019       0.41   0.863   0.816   0.798         0.861          0.836
Complete linkage         InsightFace   0.018       1.75   0.367   0.576   0.819         0.521          0.512
Complete linkage         FaceNet       0.013       0.65   0.710   0.813   0.826         0.830          0.821
Weighted linkage         VGGFace2      0.033       1.50   0.891   0.898   0.946         0.876          0.921
Weighted linkage         VGGFace       0.019       1.03   0.599   0.737   0.704         0.830          0.762
Weighted linkage         MobileNet     0.018       0.75   0.751   0.788   0.792         0.818          0.806
Weighted linkage         InsightFace   0.018       1.72   0.655   0.697   0.806         0.675          0.734
Weighted linkage         FaceNet       0.015       1.47   0.884   0.881   0.934         0.857          0.902
Approximate rank-order   VGGFace2      0.785       3.91   0.515   0.535   0.586         0.641          0.704
Approximate rank-order   VGGFace       1.312       3.78   0.446   0.485   0.509         0.681          0.653
Approximate rank-order   MobileNet     1.414       6.68   0.417   0.516   0.522         0.795          0.635
Approximate rank-order   InsightFace   1.220       5.78   0.324   0.324   0.471         0.656          0.571
Approximate rank-order   FaceNet       1.092       4.05   0.567   0.621   0.626         0.764          0.724
GCN-D                    VGGFace2      5.006       1.67   0.867   0.845   0.954         0.793          0.859
GCN-D                    VGGFace       4.741       0.78   0.641   0.536   0.627         0.539          0.578
GCN-D                    MobileNet     6.290       0.69   0.675   0.748   0.799         0.742          0.728
GCN-D                    InsightFace   6.862       0.65   0.409   0.612   0.603         0.682          0.637
GCN-D                    FaceNet       6.164       0.91   0.636   0.726   0.751         0.749          0.687


                                                         TABLE III.            CLUSTERING RESULTS FOR LFW DATASET


                          CNN               Time, sec         K/C        ARI        AMI      Homogeneity       Completeness        F-score
                       VGGFace2               416.73          0.96      0.719       0.781        0.980              0.911          0.862
                        VGGFace               309.44          0.82      0.675       0.748        0.812              0.762          0.746
  Rank-order
                        MobileNet             305.03          0.77      0.786       0.816        0.944              0.907          0.806
                       InsightFace            361.02          1.21      0.673       0.721        0.842              0.912          0.683
                         FaceNet              359.62          0.91      0.784       0.832        0.924              0.917          0.812
                       VGGFace2                0.47           1.66      0.969       0.940        0.998              0.951          0.917
                        VGGFace                0.64           1.86      0.854       0.876        0.962              0.931          0.847
     Single
                        MobileNet              0.60           1.52      0.744       0.871        0.930              0.951          0.854
    linkage
                       InsightFace             0.68           2.08      0.837       0.838        0.951              0.911          0.804
                         FaceNet               0.50           1.63      0.967       0.935        0.993              0.952          0.912
                       VGGFace2                0.69           1.49      0.966       0.945        0.998              0.955          0.926
                        VGGFace                0.61           1.36      0.946       0.933        0.988              0.953          0.911
    Average
                        MobileNet              0.64           1.48      0.968       0.943        0.997              0.954          0.923
    linkage
                       InsightFace             0.73           1.37      0.887       0.873        0.972              0.920          0.831
                         FaceNet               0.67           1.54      0.960       0.937        0.997              0.949          0.918
                       VGGFace2                0.57           1.13      0.744       0.935        0.992              0.951          0.910
                        VGGFace                0.62           0.99      0.621       0.873        0.966              0.921          0.821
   Complete
                        MobileNet              0.62           1.06      0.852       0.925        0.980              0.953          0.894
    linkage
                       InsightFace             0.55           0.90      0.756       0.793        0.926              0.889          0.720
                         FaceNet               0.53           1.07      0.748       0.929        0.986              0.951          0.900


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                       174
Image Processing and Earth Remote Sensing

                    TABLE III.    CLUSTERING RESULTS FOR LFW DATASET (CONT.)
Weighted linkage          VGGFace2         0.63     1.37     0.893     0.941     0.998     0.952     0.923
                          VGGFace          0.61     1.28     0.925     0.925     0.984     0.950     0.901
                          MobileNet        0.59     1.44     0.961     0.940     0.996     0.952     0.919
                          InsightFace      0.67     1.42     0.879     0.864     0.972     0.913     0.820
                          FaceNet          0.64     1.44     0.935     0.938     0.997     0.950     0.919
Approximate rank-order    VGGFace2         9.49     1.42     0.803     0.877     0.924     0.952     0.923
                          VGGFace          7.12     1.30     0.621     0.706     0.893     0.816     0.724
                          MobileNet        7.06     1.79     0.610     0.741     0.864     0.912     0.740
                          InsightFace     12.32     1.57     0.684     0.711     0.849     0.908     0.685
                          FaceNet         12.72     1.13     0.782     0.859     0.932     0.937     0.844
GCN-D                     VGGFace2        30.33     0.84     0.075     0.395     0.814     0.711     0.512
                          VGGFace         28.47     0.69     0.044     0.235     0.866     0.669     0.456
                          MobileNet       31.23     0.86     0.332     0.665     0.882     0.825     0.639
                          InsightFace     30.18     0.74     0.802     0.732     0.874     0.875     0.666
                          FaceNet         31.79     0.92     0.141     0.543     0.828     0.770     0.588
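
As an illustration of the weighted-linkage agglomerative clustering compared in Table III, the following is a minimal pure-Python sketch, not the implementation used in the experiments. The function name `weighted_linkage_clusters`, the Euclidean distance, the threshold value and the 2-D toy points (standing in for face embeddings) are all assumptions made for this example.

```python
import math


def weighted_linkage_clusters(points, distance_threshold):
    """Agglomerative clustering with weighted (WPGMA) linkage.

    Merging stops once the closest pair of clusters is farther apart
    than distance_threshold. Returns one cluster label per point.
    """
    n = len(points)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    # Pairwise Euclidean distances (an assumption for this sketch).
    dist = {
        (i, j): math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    next_id = n
    while len(clusters) > 1:
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        if d > distance_threshold:
            break
        merged = clusters.pop(a) + clusters.pop(b)
        # Weighted linkage: the distance from any other cluster k to the
        # merged cluster is the plain average of d(k, a) and d(k, b).
        new_dist = {}
        for k in clusters:
            d_ka = dist[(min(k, a), max(k, a))]
            d_kb = dist[(min(k, b), max(k, b))]
            new_dist[(k, next_id)] = (d_ka + d_kb) / 2.0
        dist = {
            pair: value
            for pair, value in dist.items()
            if a not in pair and b not in pair
        }
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1
    labels = [0] * n
    for label, members in enumerate(clusters.values()):
        for m in members:
            labels[m] = label
    return labels


# Toy data: two well-separated groups of hypothetical 2-D "embeddings".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
labels = weighted_linkage_clusters(points, distance_threshold=1.0)
print(labels)
```

The naive loop above is cubic in the number of faces and is meant only to make the linkage rule explicit. In practice, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="weighted"` computes the same linkage far more efficiently.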


    Agglomerative clustering with average linkage gives the second most accurate
results (approximately 90%). Furthermore, the connectivity graph-based method
demonstrates poor results on the given data. The use of rank-order distance is
impractical due to the rather low values for each metric and its quadratic
complexity. Even though the approximation of rank-order clustering takes less time
to split data into groups than the original method, its results still do not
outperform those of traditional agglomerative algorithms.

    Moreover, we analyzed the dependence of the type 1 and type 2 error rates on the
minimum number of faces Kmin that a cluster must contain to be marked private, for
the LFW subset (Fig. 2). Since ground-truth labels in terms of private and public
photos are not provided for this dataset, we determined them as follows. All images
from classes in which the number of photos is greater than or equal to Kmin were
considered private. The remaining images were considered public. We used
agglomerative clustering with weighted linkage and the VGGFace2 descriptor, as this
combination provided the best results in the conducted experiments.

Fig. 2. The dependence between the minimal number Kmin of photos in a personal
cluster and type 1/type 2 error rates, LFW dataset.

    According to the results, a zero rate of missed private photos is achieved with
Kmin=2. This means that all photos in the dataset that are initially private are
marked as private by the algorithm. If Kmin=3, then 5% of private photos are moved
to the public set. As Kmin increases, the type 1 error rate trends upwards, though
not monotonically, and ends at 2%. At the same time, the probability of assigning
public images to the private set decreases and reaches 0%.

    In the final experiment, we compared the results given by various descriptors on
LFW (Table IV). Class “0” consists of 3263 private images, whereas the public class
“1” includes 474 images. Here, images containing faces from clusters that include
Kmin=3 or more facial images were considered personal. All face descriptors lead to
a fairly high quality of detection, but a zero probability of missing personal data
was not achieved. In this case, the best results are obtained using the VGGFace2
(ResNet-50) and FaceNet models.

                TABLE IV.    CLASSIFICATION RESULTS FOR LFW
Feature extractor    FPR      FNR      Precision    Recall    F1-score    Error rate
VGGFace2             0.051    0.019    0.738        0.978     0.842       0.047
VGGFace              0.055    0.276    0.655        0.723     0.688       0.084
MobileNet            0.054    0.168    0.687        0.831     0.752       0.069
InsightFace          0.115    0.281    0.474        0.719     0.571       0.137
FaceNet              0.056    0.044    0.712        0.952     0.816       0.055

                            IV. CONCLUSION
    The task of detecting personal photos is difficult to solve effectively because
of its inherent subjectivity. In this paper, it is assumed that personal data
comprises confidential textual information and images of the user, his close
friends and relatives. This assumption makes it possible to identify personal photos
accurately and impose restrictions on their processing. To identify such data, a
novel approach was proposed in the current work (Fig. 1). To classify scanned
documents, it is proposed to use the EAST text detector and recognize text in the
detected areas with the Tesseract OCR library. It has been experimentally shown that
a simple fully-connected neural network for text encoded using bag-of-words [13]
outperforms more complex network architectures, such as CNN, by more than 10% and
achieves high accuracy in detecting personal documents. In addition, agglomerative
clustering with weighted linkage achieved better results in extracting groups of
faces of the user, friends and relatives (Tables II and III).
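
The cluster-size rule used in the experiments (a photo is treated as private when its face belongs to a group with at least Kmin faces) and the two resulting error rates can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `flag_private` and `privacy_error_rates` are hypothetical, the toy identity and cluster labels are invented for the example, and the sketch does not assume which of the two rates the paper calls type 1 or type 2.

```python
from collections import Counter


def flag_private(group_ids, k_min):
    """Mark each item private if its group has at least k_min members."""
    sizes = Counter(group_ids)
    return [sizes[g] >= k_min for g in group_ids]


def privacy_error_rates(identity_ids, cluster_ids, k_min):
    """Return (missed_private_rate, false_private_rate).

    Ground truth: identities with at least k_min photos are private.
    Prediction: clusters with at least k_min faces are private.
    """
    truth = flag_private(identity_ids, k_min)
    pred = flag_private(cluster_ids, k_min)
    n_private = sum(truth)
    n_public = len(truth) - n_private
    # Private photos that the clustering sends to the public set.
    missed = sum(t and not p for t, p in zip(truth, pred))
    # Public photos that the clustering marks as private.
    false_private = sum(p and not t for t, p in zip(truth, pred))
    missed_rate = missed / n_private if n_private else 0.0
    false_rate = false_private / n_public if n_public else 0.0
    return missed_rate, false_rate


# Hypothetical toy labels: true identities and predicted cluster ids.
identity_ids = ["a", "a", "a", "a", "b", "b", "c", "d"]
cluster_ids = [0, 0, 0, 1, 2, 2, 2, 3]
rates = privacy_error_rates(identity_ids, cluster_ids, k_min=3)
print(rates)
```

In this toy case one of the four private photos ends up in a singleton cluster and is missed, while a cluster that mixes two small identities reaches the size threshold and wrongly marks three public photos as private.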



                         ACKNOWLEDGMENT

    The paper was prepared within the framework of the Academic Fund Program at the
National Research University Higher School of Economics (HSE University) in 2019-2020
(grant No 19-04-004) and by the Russian Academic Excellence Project «5-100».

                             REFERENCES

[1]  I. Grechikhin and A.V. Savchenko, “User Modeling on Mobile Device Based on Facial
     Clustering and Object Detection in Photos and Videos,” Iberian Conference on
     Pattern Recognition and Image Analysis, Springer, Cham, pp. 429-440, 2019.
[2]  I. Goodfellow, Y. Bengio and A. Courville, “Deep learning,” MIT press, 2016.
[3]  L. Tran, D. Kong, H. Jin and J. Liu, “Privacy-CNH: A framework to detect photo
     privacy with convolutional neural network using hierarchical features,” Thirtieth
     AAAI Conference on Artificial Intelligence (AAAI), pp. 1317-1323, 2016.
[4]  H. Zhong, A.C. Squicciarini, D.J. Miller and C. Caragea, “A Group-Based
     Personalized Model for Image Privacy Classification and Labeling,” International
     Joint Conferences on Artificial Intelligence (IJCAI), vol. 17, pp. 3952-3958, 2017.
[5]  A. Tonge and C. Caragea, “Dynamic deep multi-modal fusion for image privacy
     prediction,” The World Wide Web Conference (WWW), pp. 1829-1840, 2019.
[6]  A. Tonge and C. Caragea, “Image privacy prediction using deep neural networks,”
     ACM Transactions on the Web (TWEB), vol. 14, no. 2, pp. 1-32, 2020.
[7]  C. Sitaula, Y. Xiang, S. Aryal and X. Lu, “Unsupervised deep features for privacy
     image classification,” Pacific-Rim Symposium on Image and Video Technology,
     pp. 404-415, 2019.
[8]  J. He, B. Liu, D. Kong, X. Bao, N. Wang, H. Jin and G. Kesidis, “Puppies:
     Transformation-supported personalized privacy preserving partial image sharing,”
     46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
     (DSN), IEEE, pp. 359-370, 2016.
[9]  Q. Cao, L. Shen, W. Xie, O.M. Parkhi and A. Zisserman, “Vggface2: A dataset for
     recognising faces across pose and age,” 13th International Conference on
     Automatic Face & Gesture Recognition (FG), IEEE, pp. 67-74, 2018.
[10] F. Schroff, D. Kalenichenko and J. Philbin, “FaceNet: A unified embedding for face
     recognition and clustering,” Proceedings of the IEEE conference on computer
     vision and pattern recognition, pp. 815-823, 2015.
[11] A.V. Savchenko and N.S. Belova, “Unconstrained face identification using maximum
     likelihood of distances between deep off-the-shelf features,” Expert Systems with
     Applications, vol. 108, pp. 170-182, 2018.
[12] R. Smith, “An overview of the Tesseract OCR engine,” Ninth International
     Conference on Document Analysis and Recognition (ICDAR), IEEE, vol. 2,
     pp. 629-633, 2007.
[13] L. Kopeykina and A.V. Savchenko, “Automatic privacy detection in scanned document
     images based on deep neural networks,” Proceedings of International Russian
     Automation Conference (RusAutoCon), IEEE, pp. 1-6, 2019.
[14] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He and J. Liang, “EAST: an efficient
     and accurate scene text detector,” Proceedings of the IEEE Conference on Computer
     Vision and Pattern Recognition (CVPR), pp. 5551-5560, 2017.
[15] C. Zhu, F. Wen and J. Sun, “A rank-order distance based clustering algorithm for
     face tagging,” CVPR IEEE, pp. 481-488, 2011.
[16] Y. Shi, C. Otto and A. K. Jain, “Face clustering: representation and pairwise
     constraints,” IEEE Transactions on Information Forensics and Security, vol. 13,
     no. 7, pp. 1626-1640, 2018.
[17] L. Yang, X. Zhan, D. Chen, J. Yan, C. C. Loy and D. Lin, “Learning to cluster
     faces on an affinity graph,” Proceedings of the IEEE Conference on Computer
     Vision and Pattern Recognition, pp. 2298-2306, 2019.
[18] A.V. Savchenko, “Probabilistic neural network with complex exponential activation
     functions in image recognition,” IEEE Transactions on Neural Networks and
     Learning Systems, vol. 31, no. 2, pp. 651-660, 2020.
[19] F. Chollet, “Deep learning with Python,” Manning Publications, 2017.
[20] A.V. Savchenko and E.V. Miasnikov, “Event recognition based on classification of
     generated image captions,” International Symposium on Intelligent Data Analysis
     (IDA), pp. 418-430, 2020.
[21] V.V. Arlazarov, K. Bulatov, T. Chernov and V.L. Arlazarov, “MIDV-500: a dataset
     for identity document analysis and recognition on mobile devices in video
     stream,” Computer Optics, vol. 43, no. 5, pp. 818-824, 2019.
     DOI: 10.18287/2412-6179-2019-43-5-818-824.
[22] P. Ye and D. Doermann, “Document image quality assessment: A brief survey,” 12th
     International Conference on Document Analysis and Recognition, IEEE,
     pp. 723-727, 2013.
[23] A. Bartoli, G. Davanzo, E. Medvet and E. Sorio, “Improving features extraction
     for supervised invoice classification,” Proceedings of the 10th IASTED
     International Conference, vol. 674, no. 040, p. 401, 2010.
[24] A.V. Savchenko, “Efficient facial representations for age, gender and identity
     recognition in organizing photo albums using multi-output ConvNet,” PeerJ
     Computer Science, e197, 2019.
[25] K. Zhang, Z. Zhang, Z. Li and Y. Qiao, “Joint face detection and alignment using
     multitask cascaded convolutional networks,” IEEE Signal Processing Letters,
     vol. 23, no. 10, pp. 1499-1503, 2016.
[26] A.C. Gallagher and T. Chen, “Clothing cosegmentation for recognizing people,”
     IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2008.
[27] G.B. Huang, M. Mattar, T. Berg and E. Learned-Miller, “Labeled faces in the wild:
     A database for studying face recognition in unconstrained environments,” 2018.
[28] C. Otto, D. Wang and A.K. Jain, “Clustering millions of faces by identity,” IEEE
     transactions on pattern analysis and machine intelligence, vol. 40, no. 2,
     pp. 289-303, 2017.
[29] L. Yang, D. Chen, X. Zhan, R. Zhao, C.C. Loy and D. Lin, “Learning to cluster
     faces via confidence and connectivity estimation,” arXiv preprint
     arXiv:2004.00445, 2020.
[30] L. Yang, D. Chen, X. Zhan, R. Zhao, C.C. Loy and D. Lin, “Learning to cluster
     faces on an affinity graph,” Proceedings of the IEEE Conference on Computer
     Vision and Pattern Recognition, pp. 2298-2306, 2019.
[31] O.M. Parkhi, A. Vedaldi and A. Zisserman, “Deep face recognition,” British
     Machine Vision Conference (BMVC), 2015.
[32] J. Deng, J. Guo, N. Xue and S. Zafeiriou, “Arcface: Additive angular margin loss
     for deep face recognition,” Proceedings of the IEEE Conference on Computer Vision
     and Pattern Recognition, pp. 4690-4699, 2019.

