<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Photo Privacy Detection based on Text Classification and Face Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lyudmila Kopeykina</string-name>
          <email>lnkopeykina@mail.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Savchenko</string-name>
          <email>avsavchenko@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics</institution>
          ,
          <addr-line>Nizhny Novgorod</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Nizhny Novgorod</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>171</fpage>
      <lpage>176</lpage>
      <abstract>
        <p>Nowadays, photo privacy detection is becoming an acute task due to the wide spread of mobile devices and the publication of photos on social networks. As a photo might contain private or sensitive data, there is an urgent need to accurately detect such data and impose restrictions on its processing. In this paper we focus on the task of personal data detection in a photo gallery. A novel two-stage approach is proposed. At first, the text of scanned documents is detected with the EAST text detector, and the extracted text is recognized using Tesseract and classified with a neural network. At the second stage, face clustering is applied to the remaining photos to identify large groups of people (friends, relatives) whose photos also constitute personal data and must be processed directly on the mobile device. The remaining images can be sent to a remote server for processing with higher accuracy. Experimental results of the text recognition and face clustering methods using various convolutional networks for facial feature extraction are presented.</p>
      </abstract>
      <kwd-group>
        <kwd>photo privacy detection</kwd>
        <kwd>face clustering</kwd>
        <kwd>text detection and classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        The photo gallery of a typical mobile device contains
unique information about its user and reflects his or her
preferences [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As a result, image-processing methods can be
applied to build visual recommender engines [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Such deep
learning-based methods usually require significant computing
resources and should be implemented on a remote server with
GPUs. However, there is an urgent need to restrict the
processing of photos with some sensitive data in order to avoid
the potential risk of inappropriate usage of private information.
      </p>
      <p>
        Privacy detection in photos is a problem worth considering
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] that has already reached a certain level of
maturity [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. The demand for handling this issue is
justified by the need to distinguish personal photos, which cannot
be transferred to third parties under a privacy policy, from
public information, which can be sent to a remote server for
further deep processing and analysis. Moreover, the separate
processing of public and private photos improves the accuracy
and computational efficiency of the algorithms.
      </p>
      <p>
        It is noticeable that the vast majority of private images
mainly contain such characteristics as human faces, textual
data (identification data and credit card numbers) and other
general objects (private cars and buildings) [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ]. Therefore,
this work proposes a unified approach to personal data
detection in a photo gallery using well-known methods of face
classification [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ] and text recognition (optical character
recognition, OCR) [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. In particular, to detect scanned
personal documents, it is proposed to sequentially apply the
EAST text detector [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the Tesseract OCR library [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
neural network classification of the recognized text.
To detect personal photos containing the faces of the user himself,
his close friends and relatives, the well-known methods of face
clustering [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ] are applied to face embeddings extracted
with CNNs (convolutional neural networks) [
        <xref ref-type="bibr" rid="ref18 ref2">2, 18</xref>
        ].
      </p>
      <p>The rest of the paper is organized as follows. In Section II
we describe the proposed approach in detail. Section III
includes an experimental study of privacy detection methods.
Finally, in Section IV the conclusion and future plans are
discussed.</p>
    </sec>
    <sec id="sec-2">
      <title>II. MATERIALS AND METHODS</title>
      <p>In this paper we concentrate on the following task: it is
required to assign an image from a photo album to one of two
possible classes, private or public. The proposed approach is
shown in Fig. 1. Let us discuss the most important parts of this
pipeline in the rest of this section.</p>
      <sec id="sec-2-1">
        <title>A. Scanned Documents Processing</title>
        <p>
          As a part of scanned document detection, it is proposed to
consider various methods of text recognition. Firstly, image
areas containing textual information are detected using the
EAST algorithm [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Further, Tesseract OCR in
image_to_string mode with an LSTM (Long Short-Term
Memory) recurrent model is used to recognize the text in each
detected area. This approach is subsequently compared
with a simplified text recognition method, in which the step of
preliminary text detection by the EAST detector is omitted.
Instead, Tesseract is used both in text recognition mode and in
automatic page segmentation mode.
        </p>
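        <p>For illustration, the following sketch shows one possible implementation of this pipeline with OpenCV and pytesseract. The model file name, the confidence threshold and the simplified axis-aligned decoding of the EAST output (rotation is ignored and no non-maximum suppression is applied) are our assumptions rather than details fixed in this paper.</p>
        <preformat>
# A minimal sketch of the EAST + Tesseract pipeline; the model path,
# threshold and simplified box decoding are assumptions.
import cv2
import pytesseract

net = cv2.dnn.readNet("frozen_east_text_detection.pb")  # assumed local path
image = cv2.imread("document.jpg")
H = W = 320  # EAST input dimensions must be divisible by 32
resized = cv2.resize(image, (W, H))
blob = cv2.dnn.blobFromImage(resized, 1.0, (W, H),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

# Simplified decoding: keep axis-aligned boxes and ignore the rotation
# channel; a full implementation would also apply non-maximum suppression.
boxes = []
for y in range(scores.shape[2]):
    for x in range(scores.shape[3]):
        if scores[0, 0, y, x] >= 0.5:  # assumed confidence threshold
            d_top, d_right, d_bottom, d_left = geometry[0, 0:4, y, x]
            cx, cy = x * 4.0, y * 4.0  # the output stride of EAST is 4
            boxes.append((int(cx - d_left), int(cy - d_top),
                          int(cx + d_right), int(cy + d_bottom)))

# Recognize each detected region with the LSTM engine of Tesseract
# (--oem 1) in single-line mode (--psm 7).
texts = []
for (x1, y1, x2, y2) in boxes:
    crop = resized[max(y1, 0):y2, max(x1, 0):x2]
    if crop.size:
        texts.append(pytesseract.image_to_string(crop, config="--oem 1 --psm 7"))
recognized = " ".join(texts)
        </preformat>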
      <p>
        After that, to classify personal data in the extracted text, it
is proposed to use a neural network, which is trained on
the input sequence of words recognized in the training set of
scanned documents [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. One-hot encoding is used to
represent the input data as a feature vector. To be more exact,
a dictionary of the V most frequently used words in the
training set is created, and each text is represented as a
V-dimensional binary vector, where the v-th component of the
vector is 1 only if the v-th word from the dictionary is
present in the input text (the so-called bag-of-words model)
[
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ].
      </p>
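        <p>A minimal sketch of this encoding is given below; the whitespace tokenization and the toy texts are our assumptions.</p>
        <preformat>
# A minimal bag-of-words sketch; tokenization and sample texts are assumptions.
from collections import Counter

V = 5000  # dictionary size used in this work

def build_vocabulary(train_texts, size=V):
    counts = Counter(word for text in train_texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(size))}

def encode(text, vocab):
    vec = [0] * len(vocab)
    for word in text.lower().split():
        idx = vocab.get(word)
        if idx is not None:
            vec[idx] = 1  # the v-th component is 1 if the v-th word is present
    return vec

vocab = build_vocabulary(["passport number 1234", "holiday photo at the beach"])
print(encode("scanned passport", vocab))
        </preformat>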
      <sec id="sec-2-1">
        <title>Scanned documents processing</title>
        <p>Text recognition</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Text classification private public</title>
    </sec>
    <sec id="sec-4">
      <title>Feature extraction</title>
    </sec>
    <sec id="sec-5">
      <title>Personal images</title>
    </sec>
    <sec id="sec-6">
      <title>Public images</title>
    </sec>
    <sec id="sec-7">
      <title>Large clusters</title>
    </sec>
    <sec id="sec-8">
      <title>Small clusters</title>
    </sec>
    <sec id="sec-9">
      <title>Face clustering</title>
      <sec id="sec-9-1">
        <title>Processing of photos with faces</title>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Facial features extraction</title>
      <p>Text
detection</p>
    </sec>
    <sec id="sec-11">
      <title>Face detection</title>
      <p>
          To solve the binary classification problem, it is proposed to
use a computationally efficient implementation of a fully
connected neural network, which has already shown high
performance in the similar problem of sentiment analysis [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
To train the above-mentioned network, we created a balanced
corpus of 700 images [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The positive class is represented by
350 images of driving licenses, medical insurance cards,
passports and invoices from an extension of the MIDV dataset
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], whereas the negative class consists of photos from the publicly
available datasets for text classification tasks DIQA [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and
Ghega [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. This approach is sometimes as accurate as more
complex methods based on CNNs and LSTMs. Moreover, it
outperforms well-known traditional methods for detecting
personal data, for example, the keyword spotting method [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>B. Detection of Personal Photos Based on Face Clustering</title>
      <p>
          As scanned documents are not the only kind of personal
data in the gallery, it is proposed to select images that contain
the faces of the user himself, his close friends and relatives [
          <xref ref-type="bibr" rid="ref1 ref24">1,
24</xref>
          ]. To detect such personal photos, it is proposed to
apply the following approach. At first, the facial regions are
detected in all photographs using well-known methods for
face detection like cascade classifiers or MTCNN [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Since
there are no labels of people in the user's photo gallery, the
task can be reformulated as a face clustering problem [
          <xref ref-type="bibr" rid="ref16 ref24">16, 24</xref>
          ].
To do this, D-dimensional feature vectors are extracted [
          <xref ref-type="bibr" rid="ref11 ref9">9,
11</xref>
          ] for each of the N &gt; 0 selected facial images by using a CNN
pre-trained to identify faces on a large (external) dataset
like VGGFace2, MS-Celeb, etc.
      </p>
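        <p>One possible implementation of these two steps relies on the facenet-pytorch package; the choice of this library and of the VGGFace2 pre-trained weights is our assumption, since several face detectors and CNN descriptors are compared in this work.</p>
        <preformat>
# A sketch of face detection and embedding extraction with facenet-pytorch;
# the library choice and the VGGFace2 weights are assumptions.
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

detector = MTCNN(keep_all=True)  # MTCNN face detector [25]
encoder = InceptionResnetV1(pretrained="vggface2").eval()

def face_embeddings(path):
    # Returns one 512-D embedding per face detected in the image.
    faces = detector(Image.open(path).convert("RGB"))
    if faces is None:
        return torch.empty(0, 512)
    with torch.no_grad():
        return encoder(faces)

embeddings = face_embeddings("gallery_photo.jpg")
        </preformat>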
      <p>
          The procedure for combining the selected individuals into
clusters supposes the assignment of each i-th facial image (i =
1, ..., N) to one of C ≥ 1 groups, where C is usually unknown.
Hence, one can apply either traditional agglomerative
clustering algorithms or rank linkage [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ] and graph CNNs
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. An image is considered to be private if it contains faces
from sufficiently large clusters, i.e., if a person
appears at least Kmin times in different photos, where
Kmin is a hyper-parameter of our method. This assumption is
based on the idea that the user's gallery contains his own face
and the faces of his close friends in a substantial part of the photos.
      </p>
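        <p>A minimal sketch of this decision rule on top of SciPy agglomerative clustering is shown below; the weighted linkage and the distance threshold are our assumptions (different linkage types are compared in Section III).</p>
        <preformat>
# A sketch of the privacy decision rule; the weighted linkage and the
# distance threshold 1.0 are assumptions, not values fixed by the paper.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def private_faces(embeddings, k_min=3, threshold=1.0):
    # L2-normalize the embeddings so that the Euclidean distance between
    # them behaves consistently for all descriptors.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    z = linkage(x, method="weighted")
    labels = fcluster(z, t=threshold, criterion="distance")
    sizes = np.bincount(labels)
    # A face is private if its cluster contains at least k_min faces.
    return sizes[labels] >= k_min

embeddings = np.random.rand(20, 512)  # stand-in for real CNN face embeddings
print(private_faces(embeddings))
        </preformat>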
      </sec>
    </sec>
    <sec id="sec-12">
      <title>III. EXPERIMENTS AND RESULTS</title>
      <p>In this section we present the experimental results of a
comparative analysis of well-known text classification methods.
Moreover, a comparison of clustering methods applied to
facial features extracted with various CNNs is given. Finally,
we analyze the performance of our approach to splitting the user's
photos into private and public images.</p>
      <sec id="sec-12-1">
        <title>A. Detection of Scanned Documents</title>
        <p>
          At first, we compare various approaches for text extraction
in combination with the traditional keyword spotting method, which
searches for specially selected words (“passport”, “card”, etc.) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
in the recognized text. Namely, we compare simultaneous
detection and recognition of text on images using only Tesseract
with the approach in which text regions are preliminarily detected
by the EAST detector and the text is recognized by the Tesseract OCR
engine. In addition to traditional keyword spotting, three neural
network models are compared (the results are summarized in Table I):
        </p>
        <table-wrap id="table1">
          <label>TABLE I</label>
          <caption>
            <p>Quantitative comparison of text extraction and classification methods (5-fold cross-validation)</p>
          </caption>
          <table>
            <thead>
              <tr><th>Text extraction</th><th>Model</th><th>Precision</th><th>Recall</th><th>F1-measure</th></tr>
            </thead>
            <tbody>
              <tr><td rowspan="4">Tesseract</td><td>Keyword spotting</td><td>0.83</td><td>0.62</td><td>0.70</td></tr>
              <tr><td>LSTM</td><td>0.97</td><td>0.93</td><td>0.94</td></tr>
              <tr><td>CNN</td><td>0.88</td><td>0.77</td><td>0.82</td></tr>
              <tr><td>Fully-connected</td><td>0.98</td><td>0.94</td><td>0.95</td></tr>
              <tr><td rowspan="4">Proposed (EAST + Tesseract)</td><td>Keyword spotting</td><td>0.90</td><td>0.75</td><td>0.81</td></tr>
              <tr><td>LSTM</td><td>0.93</td><td>0.99</td><td>0.95</td></tr>
              <tr><td>CNN</td><td>0.89</td><td>0.79</td><td>0.83</td></tr>
              <tr><td>Fully-connected</td><td>1.00</td><td>0.97</td><td>0.98</td></tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <p>Text recognition error rate with Tesseract only: 0.276.</p>
          </table-wrap-foot>
        </table-wrap>
        <list list-type="bullet">
          <list-item>
            <p>Recurrent model, which is fed a sequence of 400 words
from a dictionary of V = 5000 frequently encountered words, mapped to
a vector representation (embedding) with an attribute space of size 256.
Next, we use an LSTM layer with 128 hidden units and a dropout layer
with a drop rate of 0.5.</p>
          </list-item>
          <list-item>
            <p>CNN, consisting of a one-dimensional convolutional
layer (with 32 neurons, a kernel size of 7 and the ReLU activation
function), max-pooling and dropout layers (with a drop rate of
0.5). A vector representation (embedding) of size 256 was also used
as the first layer of the model.</p>
          </list-item>
          <list-item>
            <p>Fully connected network with 2 hidden layers of 16
neurons with hyperbolic tangent activation. The V-dimensional
vector encoded as described in Subsection IIA (bag-of-words)
is considered as input for the model.</p>
          </list-item>
        </list>
        <p>The last fully connected layer of each model used the
sigmoid activation. To train the classifiers, the TensorFlow and Keras
frameworks were used. All classifiers were trained over 20
epochs using the RMSprop optimizer.</p>
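        <p>For illustration, a minimal Keras sketch of the fully connected classifier is shown below. The layer sizes, activations, optimizer and number of epochs follow the description above, while the loss function, the batch size and the random stand-in data are our assumptions.</p>
        <preformat>
# A minimal Keras sketch of the fully connected text classifier; the loss,
# batch size and stand-in data are assumptions.
import numpy as np
from tensorflow import keras

V = 5000  # bag-of-words dimensionality (Subsection IIA)

model = keras.Sequential([
    keras.layers.Dense(16, activation="tanh", input_shape=(V,)),
    keras.layers.Dense(16, activation="tanh"),
    keras.layers.Dense(1, activation="sigmoid"),  # private vs. public
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])

# Stand-in data: rows are binary bag-of-words vectors of the 700 training
# texts, labels are 1 for the private class.
x_train = np.random.randint(0, 2, size=(700, V)).astype("float32")
y_train = np.random.randint(0, 2, size=(700, 1)).astype("float32")
model.fit(x_train, y_train, epochs=20, batch_size=32)
        </preformat>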
        <p>A quantitative comparison of all the methods described above
is presented in Table I. The results were obtained using
5-fold cross-validation.</p>
        <p>Here the use of the EAST text detector to identify areas with
text proved to be a reasonable solution. While the error rate attained
using only Tesseract is more than 27%, the proposed
preliminary detection of text using the EAST detector reduces
this error to approximately 16%. In addition, we can conclude
that the proposed implementation with the EAST text detector
increases the average accuracy by approximately 2%. The
fully-connected network achieves the best results, with accuracy
that exceeds even the traditional LSTM. Moreover, such an
implementation determines the class of the document image
15% more accurately in comparison with traditional
keyword spotting.</p>
      </sec>
      <sec id="sec-12-2">
        <title>B. Face Clustering</title>
        <p>We used the following publicly available facial datasets:</p>
        <list list-type="bullet">
          <list-item>
            <p>
              Gallagher collection person dataset [
              <xref ref-type="bibr" rid="ref26">26</xref>
              ], which contains 589 images with 931 labeled faces of 32
various people. As only eye positions are available in this dataset,
MTCNN [
              <xref ref-type="bibr" rid="ref25">25</xref>
              ] was preliminarily used to detect faces, and the subject with
the largest intersection of the facial region with the given eye
region was chosen. If a face is not detected, a square region with a
side equal to 1.5 times the distance between the eyes is extracted.
            </p>
          </list-item>
          <list-item>
            <p>
              Subset of the Labeled Faces in the Wild (LFW) dataset [
              <xref ref-type="bibr" rid="ref27">27</xref>
              ] used to test face identification algorithms [
              <xref ref-type="bibr" rid="ref11">11</xref>
              ]. It includes photos of those subjects who have at least two
images in the original LFW dataset and at least one video in the
YouTube Faces (YTF) collection.
            </p>
          </list-item>
        </list>
        <p>
          Firstly, hierarchical agglomerative clustering of the L2
distances between normalized feature vectors is considered with the
following types of linkage from the SciPy library: single,
average, complete, weighted, centroid and median linkage.
Further, rank-order clustering [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] was examined, as it was specially developed for
organizing faces in photo albums. It uses a special rank linkage,
which is further used to compute the distance measure. Then this
approach was compared to the approximate rank-order
algorithm [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], in which only the top-k neighbors are taken
into consideration rather than the complete list of neighbors.
This approach makes the actual rank of neighbors irrelevant
because the importance is shifted towards the presence or
absence of shared nearest neighbors. Finally, we examined a
clustering method based on the graph CNN [
          <xref ref-type="bibr" rid="ref29 ref30">29, 30</xref>
          ]. Each
element of the feature matrix is considered as a separate vertex
of the graph. Using the cosine distance, the k nearest neighbors are
found for each element of the dataset. Thus, by connecting
neighbors, a similarity graph for the entire dataset is
obtained. Instead of processing such a graph directly, subgraph
proposals are first generated, on the basis of which the
resulting clusters are subsequently built.
        </p>
        <p>To extract facial features, traditional pre-trained models
downloaded from the official websites of their developers were
considered:</p>
        <list list-type="bullet">
          <list-item>
            <p>VGGFace (VGGNet-16) [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] extracts 4096-D vectors;</p>
          </list-item>
          <list-item>
            <p>VGGFace2 (ResNet-50) [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] extracts 2048-D vectors;</p>
          </list-item>
          <list-item>
            <p>MobileNet [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ] extracts 1024-D vectors;</p>
          </list-item>
          <list-item>
            <p>InsightFace (ArcFace) [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] extracts 512-D vectors;</p>
          </list-item>
          <list-item>
            <p>FaceNet (Inception ResNet v1) [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] extracts 512-D vectors.</p>
          </list-item>
        </list>
      <p>Table III contains the adjusted Rand index (ARI), the adjusted
mutual information (AMI), homogeneity and completeness. In
addition, the ratio of the average number K of selected clusters to the
number of groups C and the B-cubed F-measure, traditional for
assessing the quality of face clustering, are calculated.</p>
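      <p>These metrics can be computed as in the following sketch: scikit-learn provides all of them except the B-cubed F-measure, for which a straightforward implementation is included as our assumption.</p>
      <preformat>
# A sketch of the clustering quality metrics; the B-cubed implementation
# below is our own straightforward version.
import numpy as np
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             completeness_score, homogeneity_score)

def b_cubed_f1(true_labels, pred_labels):
    # Average per-face B-cubed precision and recall combined into an F-measure.
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    prec, rec = [], []
    for i in range(len(t)):
        same_cluster = p == p[i]
        same_class = t == t[i]
        both = np.sum(np.logical_and(same_cluster, same_class))
        prec.append(both / np.sum(same_cluster))
        rec.append(both / np.sum(same_class))
    pm, rm = np.mean(prec), np.mean(rec)
    return 2 * pm * rm / (pm + rm)

y_true = [0, 0, 1, 1, 2]  # ground-truth person identities
y_pred = [1, 1, 2, 2, 2]  # cluster labels produced by an algorithm
print(adjusted_rand_score(y_true, y_pred),
      adjusted_mutual_info_score(y_true, y_pred),
      homogeneity_score(y_true, y_pred),
      completeness_score(y_true, y_pred),
      b_cubed_f1(y_true, y_pred))
      </preformat>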
      <p>Considering the results, clustering applied to facial features
extracted with ResNet-50 (VGGFace2) and Inception ResNet
v1 (FaceNet) produces more accurate results according to most
of the metrics compared to the other models. Although MobileNet
is slightly inferior, it takes half as much time to extract face
embeddings compared to VGGFace2 and FaceNet. InsightFace
features in most cases show a slightly worse capacity to define
clusters. In addition, the weighted linkage demonstrates a higher
F-score for both datasets in comparison with the other clustering
methods (over 92%).</p>
      <table-wrap id="table2">
        <label>TABLE II</label>
        <caption>
          <p>Clustering results for the Gallagher dataset (including the GCN-D and rank-order methods)</p>
        </caption>
      </table-wrap>
      <p>Agglomerative clustering with average linkage yields
the second most accurate results (approximately 90%).
Furthermore, the connectivity graph-based method demonstrates
poor results on the given data. The use of the rank distance is
impractical due to the rather low values of each metric and its
quadratic complexity. Even though the approximation of
rank-order clustering takes less time to split the data into groups
compared to the original method, the results still do not
outperform those of traditional agglomerative algorithms.</p>
      <p>Moreover, we analyzed the dependence between the
minimum number of faces in a cluster required to mark it as
private (Kmin) and the type 1 and type 2 error rates for the LFW
subset (Fig. 2). Since ground-truth labels in terms of private and
public photos were not provided for that dataset, we determined
them as follows: all objects from classes in which the number of
photos is greater than or equal to Kmin were considered to be
private, and the remaining images were assigned to the public class.
We used agglomerative clustering with weighted linkage and the
VGGFace2 descriptor, as it provided the best results in the
conducted experiments. According to the results, a zero rate of
missing private photos is achieved with Kmin=2: all photos from the
dataset that are initially private are marked as private by the
algorithm. If Kmin=3, then 5% of the private photos are moved
to the public set. With an increase of Kmin, the type 1 error rate
grows unstably and ends up at 2%. At the same time, the
probability of assigning public images to the private class
decreases and reaches 0%.</p>
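      <p>The following sketch illustrates how such ground-truth labels and both error rates can be derived for a given Kmin; the toy identity and cluster labels are our assumptions.</p>
      <preformat>
# A sketch of the type 1 / type 2 error computation for a given Kmin;
# the identity and cluster labels here are toy data.
import numpy as np

def privacy_error_rates(identities, clusters, k_min):
    identities, clusters = np.asarray(identities), np.asarray(clusters)
    # Ground truth: a photo is private if its person occurs at least k_min times.
    true_private = np.bincount(identities)[identities] >= k_min
    # Prediction: a photo is private if its cluster has at least k_min faces.
    pred_private = np.bincount(clusters)[clusters] >= k_min
    type1 = np.mean(~pred_private[true_private])   # private marked as public
    type2 = np.mean(pred_private[~true_private])   # public marked as private
    return type1, type2

identities = [0, 0, 0, 1, 1, 2]
clusters = [0, 0, 1, 2, 2, 3]
print(privacy_error_rates(identities, clusters, k_min=2))
      </preformat>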
      <p>In the final experiment, we compared the results given by
various descriptors on LFW (Table IV). The private class “0”
consists of 3263 images, whereas the public class “1” includes 474.
Here, images containing faces from clusters that include
Kmin=3 or more facial images were considered personal. All face
descriptors lead to a fairly high quality of detection, but a
zero probability of missing personal data was not achieved. In
this case, the best results are obtained using the VGGFace2
(ResNet-50) and FaceNet models.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>IV. CONCLUSION</title>
      <p>
        The task of personal photo detection is difficult in terms of
finding an effective solution due to its inherent subjectivity. In
this paper, it is assumed that personal data includes confidential
textual information and images with the user, his close friends
and relatives. This assumption allows us to accurately highlight
personal photos and impose restrictions on their processing.
To highlight such data, a novel approach was proposed in the
current work (Fig. 1). It is proposed to use the EAST text
detector and recognize the text in the detected areas with the
Tesseract OCR library in order to classify scanned documents.
It has been experimentally shown that a simple fully-connected
neural network applied to text encoded using bag-of-words [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] exceeds
more complex network architectures, such as CNNs, by more
than 10% and achieves high accuracy in detecting personal
documents. In addition, agglomerative clustering with a
weighted linkage achieved the best results in extracting groups
of the user's faces, friends and relatives (Tables II and III).
      </p>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGMENT</title>
      <p>The paper was prepared within the framework of the
Academic Fund Program at the National Research University
Higher School of Economics (HSE University) in 2019-2020
(grant No 19-04-004) and by the Russian Academic Excellence
Project «5-100».</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Grechikhin</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Savchenko</surname>
          </string-name>
          , “
          <source>User Modeling on Mobile Device Based on Facial Clustering and Object Detection in Photos and Videos,” Iberian Conference on Pattern Recognition and Image Analysis</source>
          , Springer, Cham, pp.
          <fpage>429</fpage>
          -
          <lpage>440</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          , “Deep learning,” MIT press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          and J. Liu, “
          <article-title>Privacy-CNH: A framework to detect photo privacy with convolutional neural network using hierarchical features,”</article-title>
          <source>Thirtieth AAAI Conference on Artificial Intelligence (AAAI)</source>
          , pp.
          <fpage>1317</fpage>
          -
          <lpage>1323</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            <surname>Squicciarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Miller</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Caragea</surname>
          </string-name>
          , “
          <article-title>A GroupBased Personalized Model for Image Privacy Classification</article-title>
          and Labeling,”
          <source>International Joint Conferences on Artificial Intelligence (IJCAI)</source>
          , vol.
          <volume>17</volume>
          , pp.
          <fpage>3952</fpage>
          -
          <lpage>3958</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tonge</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Caragea</surname>
          </string-name>
          , “
          <article-title>Dynamic deep multi-modal fusion for image privacy prediction,”</article-title>
          <source>The World Wide Web Conference (WWW)</source>
          , pp.
          <fpage>1829</fpage>
          -
          <lpage>1840</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tonge</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Caragea</surname>
          </string-name>
          , “
          <article-title>Image privacy prediction using deep neural networks</article-title>
          ,
          <source>” ACM Transactions on the Web (TWEB)</source>
          , vol.
          <volume>14</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sitaula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aryal</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          , “
          <article-title>Unsupervised deep features for privacy image classification,”</article-title>
          <source>Pacific-Rim Symposium on Image and Video Technology</source>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>415</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          and G. Kesidis, “Puppies:
          <article-title>Transformation-supported personalized privacy preserving partial image sharing</article-title>
          ,
          <source>” 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)</source>
          , IEEE, pp.
          <fpage>359</fpage>
          -
          <lpage>370</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , “
          <article-title>Vggface2: A dataset for recognising faces across pose and age</article-title>
          ,
          <source>” 13th International Conference on Automatic Face &amp; Gesture Recognition (FG)</source>
          , IEEE, pp.
          <fpage>67</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          , “
          <article-title>FaceNet: A unified embedding for face recognition and clustering</article-title>
          ,
          <source>” Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pp.
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Savchenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.S.</given-names>
            <surname>Belova</surname>
          </string-name>
          ,
          <article-title>"Unconstrained face identification using maximum likelihood of distances between deep off-the-shelf features," Expert Systems with Applications</article-title>
          , vol.
          <volume>108</volume>
          , pp.
          <fpage>170</fpage>
          -
          <lpage>182</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Smith</surname>
          </string-name>
          , “
          <article-title>An overview of the Tesseract OCR engine” Ninth International Conference on Document Analysis and Recognition (ICDAR), IEEE</article-title>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>629</fpage>
          -
          <lpage>633</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kopeykina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Savchenko</surname>
          </string-name>
          , “
          <article-title>Automatic privacy detection in scanned document images based on deep neural networks</article-title>
          ,
          <source>” Proceedings of International Russian Automation Conference (RusAutoCon)</source>
          , IEEE, pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>"EAST: an efficient and accurate scene text detector,"</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pp.
          <fpage>5551</fpage>
          -
          <lpage>5560</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wen</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , “
          <article-title>A rank-order distance based clustering algorithm for face tagging</article-title>
          ,
          <source>” CVPR IEEE</source>
          , pp.
          <fpage>481</fpage>
          -
          <lpage>488</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Otto</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          , “
          <article-title>Face clustering: representation and pairwise constraints</article-title>
          ,
          <source>” IEEE Transactions on Information Forensics and Security</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>1626</fpage>
          -
          <lpage>1640</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Loy</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          , “
          <article-title>Learning to cluster faces on an affinity graph</article-title>
          ,
          <source>” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>2298</fpage>
          -
          <lpage>2306</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Savchenko</surname>
          </string-name>
          ,
          <article-title>"Probabilistic neural network with complex exponential activation functions in image recognition,"</article-title>
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          , vol.
          <volume>31</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>651</fpage>
          -
          <lpage>660</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <article-title>"Deep learning with Python,"</article-title>
          <source>Manning Publications</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Savchenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.V.</given-names>
            <surname>Miasnikov</surname>
          </string-name>
          , “
          <article-title>Event recognition based on classification of generated image captions</article-title>
          ,”
          <source>International Symposium on Intelligent Data Analysis (IDA)</source>
          , pp.
          <fpage>418</fpage>
          -
          <lpage>430</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>V.V.</given-names>
            <surname>Arlazarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bulatov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chernov</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.L.</given-names>
            <surname>Arlazarov</surname>
          </string-name>
          , “
          <article-title>MIDV500: a dataset for identity document analysis and recognition on mobile devices in video stream”</article-title>
          ,
          <source>Computer Optics</source>
          , vol.
          <volume>43</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>818</fpage>
          -
          <lpage>824</lpage>
          ,
          <year>2019</year>
          . DOI: 10.18287/2412-6179-2019-43-5-818-824.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ye</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Doermann</surname>
          </string-name>
          , “
          <article-title>Document image quality assessment: A brief survey”</article-title>
          ,
          <source>12th International Conference on Document Analysis and Recognition</source>
          , IEEE, pp.
          <fpage>723</fpage>
          -
          <lpage>727</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bartoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Davanzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Medvet</surname>
          </string-name>
          and E. Sorio, “
          <article-title>Improving features extraction for supervised invoice classification</article-title>
          ,
          <source>” Proceedings of the 10th IASTED International Conference</source>
          , vol.
          <volume>674</volume>
          , no.
          <issue>040</issue>
          , p.
          <fpage>401</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Savchenko</surname>
          </string-name>
          , “
          <article-title>Efficient facial representations for age, gender and identity recognition in organizing photo albums using multi-output ConvNet,” PeerJ Computer Science</article-title>
          ,
          e197
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          , “
          <article-title>Joint face detection and alignment using multitask cascaded convolutional networks</article-title>
          ,
          <source>” IEEE Signal Processing Letters</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1499</fpage>
          -
          <lpage>1503</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.C.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          , “
          <article-title>Clothing cosegmentation for recognizing people”</article-title>
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G.B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mattar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berg</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Learned-Miller</surname>
          </string-name>
          , “
          <article-title>Labeled faces in the wild: A database forstudying face recognition in unconstrained environments</article-title>
          ,”
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Otto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.K.</given-names>
            <surname>Jain</surname>
          </string-name>
          , “
          <article-title>Clustering millions of faces by identity</article-title>
          ,
          <source>” IEEE transactions on pattern analysis and machine intelligence</source>
          , vol.
          <volume>40</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>289</fpage>
          -
          <lpage>303</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.C.</given-names>
            <surname>Loy</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          , “
          <article-title>Learning to cluster faces via confidence and connectivity estimation</article-title>
          ,” arXiv preprint arXiv:2004.00445,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.C.</given-names>
            <surname>Loy</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          , “
          <article-title>Learning to cluster faces on an affinity graph</article-title>
          ,
          <source>” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>2298</fpage>
          -
          <lpage>2306</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>O.M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , “
          <article-title>Deep face recognition</article-title>
          ,
          <source>” British Machine Vision Conference (BMVC)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xue</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          , “Arcface:
          <article-title>Additive angular margin loss for deep face recognition</article-title>
          ,
          <source>” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>4690</fpage>
          -
          <lpage>4699</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>