<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Video-based age and gender recognition in mobile applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A S Kharchevnikova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A V Savchenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Myasnitskaya str. 20, Moscow, Russia, 101000</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>227</fpage>
      <lpage>235</lpage>
      <abstract>
<p>In this paper we develop an age and gender recognition system for mobile applications using deep convolutional neural networks. A brief literature survey on the age/gender recognition problem in retail applications is presented. A comparative analysis of classifier fusion algorithms that aggregate decisions for individual frames is provided. In order to improve the age and gender identification accuracy, we implement a video-based recognition system with several aggregation methods. We provide an experimental comparison on the IJB-A, Indian Movies, Kinect and EmotiW2018 datasets. It is demonstrated that the most accurate decisions are obtained using the geometric mean and the mathematical expectation of the outputs of the softmax layers of the convolutional neural networks for gender recognition and age prediction, respectively. As a result, an off-line application of the proposed system is implemented on the Android platform.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Due to the rapid growth of interest in video processing, the modern face analysis technologies are
oriented to identify various properties of an observed person. In particular, age and gender
characteristics can be applied in retail for contextual advertising for particular group of customers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Often, people ignore advertisements because the information is irrelevant or uninteresting for them at
the current moment. Consequently, companies incur huge losses from investing in contextual
advertising that turns out to be inefficient. Therefore, one of the key tasks of video analytics in retail is
to provide relevant information that meets the interests of a specific target audience; for instance, the
application could select content depending on automatically detected customer attributes.
Consequently, video-based age and gender recognition would improve the efficiency of contextual
advertising and increase sales. The necessary video frames for recognition can be obtained from digital
screens or interactive panels in shops. Such applications, running in real time, should perform the
recognition task at the required speed on platforms that are limited in power and memory resources.
Therefore, an efficient solution for mobile platforms is required. Despite the fact that over the past
few years a large number of different algorithms for age and gender recognition have appeared [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ],
the reliability of existing solutions remains insufficient for practical application [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
Unlike traditional single-image processing systems, video analysis lets us use additional
information. For rather fast recognition algorithms, one can obtain more than 100 frames of the
classified object within a few seconds from the video stream [
        <xref ref-type="bibr" rid="ref1 ref5">1,5</xref>
        ]. It is sufficient to
guarantee that at least several frames belong to the same class from the reference base. The intuition is
that if each classifier makes different errors, then the total errors can be reduced by an appropriate
combination of these classifiers. Thus, this research work is intended to consider the video-based age
and gender recognition task as the problem of choosing the most reliable solution using the classifier
fusion (or ensemble) methods [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6,7,8</xref>
        ]. After a set of solutions for each classifier is obtained, it is
necessary to implement a combining function to make a single decision. The most obvious strategy is
a simple vote, in which the decision is made in favor of the class with the maximum number of
predictions. This paper compares the classifier fusion obtained by traditional averaging of individual
decision rules [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] with solutions based on the principle of maximum a posteriori probability [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10,11,12</xref>
        ].
      </p>
      <p>
        The rest of the paper is organized as follows. In Section 2 the brief literature survey on age and
gender image recognition is presented. In Section 3 we provide classifier fusion solutions and describe
the proposed recognition scheme. Experimental results and concluding comments are presented in
Section 4 and Section 5 respectively.
      </p>
      <p>
        Each video frame X(t) = (x_uv(t)), t = 1, ..., T, is assigned to one of the L classes by feeding the RGB matrix of
pixels of the facial image X(t) to the CNN [
        <xref ref-type="bibr" rid="ref13 ref4">4,13</xref>
        ]. This deep neural network should be preliminarily
trained using the very large dataset of facial images with known age and gender labels. For simplicity,
we assume that the video contains only one person to be classified, so the face area is detected and
cropped on each frame X(t). On this basis, the task of recognizing gender is a typical example of
binary classification [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. Although age prediction is essentially a
regression problem, in practice the highest accuracy is achieved when it is treated as a classification
problem over several predefined age categories (L = 8 in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]).
      </p>
      <p>
The challenge of automatically extracting age- and gender-related attributes from facial images has
received increasing attention in recent years, and a huge number of recognition algorithms have been
proposed. Early age estimation methods are based on computing the relationships between different
dimensions of facial features. A detailed survey of such algorithms is presented by Kwon
[
        <xref ref-type="bibr" rid="ref15">15</xref>
]. Since such solutions require accurate localization of facial features, which is itself a fairly
complex problem, they are unsuitable for raw images and video frames. Geng [
        <xref ref-type="bibr" rid="ref16">16</xref>
] proposes a method for
automatic age recognition, the AGing pattErn Subspace (AGES), whose key idea is the construction of
an aging pattern. However, the requirement of frontally aligned images imposes significant restrictions
on the input data. The frequency-based approach is also known among age estimation algorithms; for
instance, a combination of biologically inspired image features is studied by
Guo et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] (BIF - Biologically Inspired Features).
      </p>
      <p>
Gender recognition for a facial image is a much simpler task, because it includes only L = 2
classes. Hence, traditional binary classifiers can be applied. Among them, such methods as SVM
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], boosting-based algorithms, and neural networks are widely used.
      </p>
      <p>
Unfortunately, the accuracy of traditional computer vision and pattern recognition methods does
not meet the requirements of practical applications. Demonstrating the effectiveness of convolutional
neural networks (CNNs) for such classification challenges, Levi [
        <xref ref-type="bibr" rid="ref14">14</xref>
] provides
new insights into solving age and gender recognition problems with this approach. Since then, several
other papers have confirmed the efficiency of deep CNNs in these tasks [
        <xref ref-type="bibr" rid="ref21 ref24 ref25">21,24,25</xref>
        ].
Specifically, deep VGG-16 [
        <xref ref-type="bibr" rid="ref22">22</xref>
], trained to recognize gender and age from a single image, is described in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
Hence, we will use this deep learning approach in order to recognize age and gender for video data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed Algorithm</title>
      <p>
The output of the CNN is usually obtained from the softmax layer, which provides the estimates of
the posterior probabilities P_l(X(t)) of the t-th frame belonging to the l-th class label from the reference
base [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]:
      </p>
      <p>Pl X (t)  softmax zl (t) </p>
      <p>, l  1,2,..., L
exp( zl (t))
L
 exp( z j (t))
j1
(1)
where zl(t) is the output of the l-th neuron in the last (usually fully connected) layer of the neural
network. The decision is made in favor of a class with a maximum a posteriori probability (MAP) (1).</p>
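      <p>
        As a minimal plain-Python sketch (the function names are ours, not from the paper's implementation), the softmax transform (1) and the per-frame MAP decision can be written as:
      </p>

```python
import math

def softmax(z):
    # Posterior probability estimates (1) from the outputs z_l of the last CNN layer.
    # Shifting by the maximum improves numerical stability without changing the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def map_decision(z):
    # MAP rule: the class label with the maximum posterior probability (1).
    p = softmax(z)
    return max(range(len(p)), key=lambda l: p[l])
```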
      <p>
        Due to the influence of diverse factors such as unknown illumination, quick change of camera
angle, low resolution of video camera, etc., making a decision based on the MAP approach for every
frame is usually inaccurate. Therefore, we will use the fusion of decisions for individual frames to
increase recognition accuracy. The review and analysis of publications in the field of data processing
shows that the synthesis of classifier fusion is one of the most effective approaches to increasing the
accuracy and stability of classification [
        <xref ref-type="bibr" rid="ref24 ref26 ref27">24,26,27</xref>
]. In aggregation algorithms, several
criteria are used, each of which assigns its own class label, after which the general classification result
is formed on the basis of some combination principle [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In the task of video recognition, firstly the traditional
problem of automatic image recognition with CNN is solved for each incoming X(t) frame and then all
individual solutions are combined into one common decision for the whole video recording. One
possible approach is to use more complex algorithms for constructing classifier fusion based on
algebraic methods [
        <xref ref-type="bibr" rid="ref26 ref8">8, 26</xref>
        ]. Most of these algorithms (such as weighted majority committee, bagging
and boosting [
        <xref ref-type="bibr" rid="ref10 ref28">10,28</xref>
]) require a sufficiently representative training sample. Unfortunately, in many
image recognition cases the existing database contains too few reference samples per class. In the
present paper it is proposed to use known statistical methods of decision fusion [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
that do not require such a training sample. Thus, we examine the following criteria [
        <xref ref-type="bibr" rid="ref11 ref29">29, 11</xref>
        ]:
1. Simple voting, in which each classifier votes for the class it predicts, and the final decision is
made in favor of the class receiving the largest number of votes [
        <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
        ]:
      </p>
      <p>
l* = argmax_{l = 1, ..., L} Σ_{t=1}^{T} δ(l*(t) = l), (2)
where l*(t) is the class chosen by the MAP rule (1) for the t-th frame and δ(·) is the indicator function.
2. Arithmetical mean of posterior probability estimates (1), i.e. the sum rule [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
      </p>
      <p>
l* = argmax_{l = 1, ..., L} (1/T) Σ_{t=1}^{T} P_l(X(t)), (3)
3. If we follow the "naive" assumption about the independence of all frames [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], then the decision
should be taken according to the geometric mean of posterior probabilities, or the product rule
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
      </p>
<p>l* = argmax_{l = 1, ..., L} Π_{t=1}^{T} P_l(X(t)) = argmax_{l = 1, ..., L} Σ_{t=1}^{T} log P_l(X(t)). (4)</p>
      <p>In addition, we recall that the age prediction task is an essential regression problem. Hence, in this
case it is possible to compute an expected value (mathematical expectation):</p>
<p>l* = Σ_{l=1}^{L} P_l(X(t)) · l. (5)</p>
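      <p>
        The aggregation rules (2)-(5) can be sketched in plain Python as follows, where per_frame_probs is a T-by-L list of the softmax outputs (1); all names are ours, not from the paper's implementation:
      </p>

```python
import math

def simple_voting(per_frame_probs):
    # Rule (2): each frame votes for its MAP class; the most voted class wins.
    L = len(per_frame_probs[0])
    votes = [0] * L
    for p in per_frame_probs:
        votes[max(range(L), key=lambda l: p[l])] += 1
    return max(range(L), key=lambda l: votes[l])

def sum_rule(per_frame_probs):
    # Rule (3): arithmetic mean of the posterior estimates over all frames
    # (dividing by T does not change the argmax).
    L = len(per_frame_probs[0])
    return max(range(L), key=lambda l: sum(p[l] for p in per_frame_probs))

def product_rule(per_frame_probs):
    # Rule (4): geometric mean, computed as a sum of logarithms for stability.
    L = len(per_frame_probs[0])
    return max(range(L), key=lambda l: sum(math.log(p[l] + 1e-12) for p in per_frame_probs))

def expected_value(per_frame_probs, class_values):
    # Rule (5): mathematical expectation of the class value (used for age),
    # averaged here over all T frames.
    T = len(per_frame_probs)
    L = len(class_values)
    return sum(p[l] * class_values[l] for p in per_frame_probs for l in range(L)) / T
```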
      <p>The general data flow in the proposed video-based age and gender recognition system is presented
in Fig. 1.</p>
      <p>
The first step implies supplying images from the video camera to the input of the system. Individual
frames are selected from the video stream with a fixed frequency (about 10-20 times per second) in the
frame selection block. Then only the face area must be located and kept, which is performed in the
corresponding block. Face detection is conducted using the cascade method of Viola-Jones and the
Haar features from the OpenCV library [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. To speed up the work, known procedures for tracking a
person identified in previous frames can be used [
        <xref ref-type="bibr" rid="ref30">30,35</xref>
        ].
      </p>
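      <p>
        For illustration, the fixed-frequency frame selection described above can be sketched as follows (the function name and the sampling scheme are our assumptions):
      </p>

```python
def select_frames(fps, target_rate, num_frames):
    # Pick frame indices so that roughly target_rate frames per second
    # (10-20 in the paper) are passed on to the face detection block.
    step = max(1, round(fps / target_rate))
    return list(range(0, num_frames, step))
```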
      <p>
        At the next stage all the received images of persons on one frame are reduced to a single scale. In
addition, often the subtraction of the mean image (Image mean subtraction) from each face image is
applied [
        <xref ref-type="bibr" rid="ref14">14</xref>
]. The next step provides the recognition of each frame using the CNN. As a result,
the estimates of the posterior probabilities are obtained from the softmax layer (1). Recognition is
performed using the TensorFlow library functionality. Based on the output of the classifier fusion
block (2)-(4), a final recognition decision is made in favor of the corresponding class.
      </p>
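      <p>
        The overall data flow of Fig. 1 can be sketched as a plain-Python skeleton in which the detection, preprocessing, CNN scoring and fusion blocks are injected as callables (the paper uses OpenCV detection and a TensorFlow CNN; all names here are ours):
      </p>

```python
def classify_video(frames, detect_face, preprocess, cnn_posteriors, fuse):
    # Per-frame detection and CNN scoring, then fusion of all per-frame
    # posterior estimates (1) into a single decision for the video.
    per_frame = []
    for frame in frames:
        face = detect_face(frame)
        if face is None:
            continue  # no face on this frame
        per_frame.append(cnn_posteriors(preprocess(face)))
    if not per_frame:
        return None  # corresponds to the "No face detected" state
    return fuse(per_frame)
```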
    </sec>
    <sec id="sec-3">
      <title>4. Experimental results</title>
<p>The experimental study of the proposed age/gender recognition scheme (Fig. 1) is carried out in
the PyCharm IDE using Python 3.6. The characteristics of the test machine are: Intel Core i5-2400
CPU, 64-bit Windows 7, NVIDIA GeForce GT 440 video card. Taking the experimental results into
account, the described approach is also implemented on the Android platform. The user interface of
this application is presented in Fig. 2.</p>
<p>According to the provided GUI, the main part of the screen is designed to display the video. There
is also a text field showing the recognition and aggregation results in real time. By default it shows
"No face detected", indicating that no face has been detected; otherwise, the gender and age of the
person in the frame are displayed.</p>
      <p>
The choice of datasets for testing recognition accuracy and performance is an inherently
challenging problem, because only a limited number of databases provide such personal information
as the age and/or gender of the person in an image. Furthermore, this work considers the video-based
approach, so the still-image databases used to train and test the described CNN architectures cannot be
applied. Hence, recognition accuracy is tested using the facial video datasets IARPA Janus
Benchmark A (IJB-A) [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], Indian Movie [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ],
EURECOM Kinect [
        <xref ref-type="bibr" rid="ref33">33</xref>
] and EmotiW2018 [
        <xref ref-type="bibr" rid="ref30">30</xref>
], for which gender and age information is available and
the video frames of each track are stored. The first dataset consists of 2,043 videos, where only the
gender information is available. The data distribution for this dataset is presented in Fig. 3a.
      </p>
      <p>The next database is a collection of video frames assembled from Indian films. In total there are
about 332 different videos and 34,512 frames of one hundred Indian actors, whose age is divided into
four categories: “Child”, “Young”, “Middle” and “Old”. In this work, these verbal descriptions of age
are replaced by specific age intervals: 1-12, 13-30, 31-50 and 50+, respectively. In the following
experiments the intersections of the recognition results at the given intervals will be estimated. The
data distribution for the dataset can be found in Fig. 3b. The Kinect dataset contains 104 videos with
52 people (14 women and 38 men). The database provides information about the gender and the year
of birth, which simplifies the estimation of age (Fig. 3c). The EmotiW2018 database is a collection of
videos taken from various films and series. When evaluating the algorithm, the true age is considered
as a range of plus or minus 5 years, and the accuracy is measured by the intersection of this range with
the recognized age interval. The gender and age of the actor in each video are provided. In total, the
database consists of 1,165 videos. Since this dataset contains video files rather than extracted frames,
it was necessary to split each video into frames and subsequently
detect the face area. Information about data in EmotiW2018 is presented in Fig. 3d.</p>
      <p>Figure 3. Data distribution in the experimental datasets: (a) IJB-A, (b) Indian Movie, (c) Kinect, (d) EmotiW2018.</p>
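      <p>
        The interval-intersection check used for age evaluation (a tolerance of plus or minus 5 years around the true age) can be sketched as follows (the function name is ours):
      </p>

```python
def intervals_intersect(pred_low, pred_high, true_age, tolerance=5):
    # A predicted age interval [pred_low, pred_high] is counted as correct
    # when it intersects [true_age - tolerance, true_age + tolerance].
    return min(pred_high, true_age + tolerance) >= max(pred_low, true_age - tolerance)
```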
      <p>
We compare two publicly available CNN architectures: the Age net and Gender net models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and
deep VGG-16 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] neural network trained for age/gender prediction [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Moreover, we implement
image normalization techniques, namely, mean image subtraction to cope with illumination effects,
low camera resolution, etc.
      </p>
<p>First of all, the average inference time of the individual CNNs on the Android testing device is
presented in Table 1. The best results are shown in bold.</p>
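      <p>
        Average inference time per frame (as reported in Table 1) can be measured with a simple stopwatch loop; model here is any callable, and the names are ours:
      </p>

```python
import time

def average_inference_time(model, inputs):
    # Mean wall-clock time of one forward pass, averaged over all inputs.
    start = time.perf_counter()
    for x in inputs:
        model(x)
    return (time.perf_counter() - start) / len(inputs)
```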
      <sec id="sec-3-1">
<title>Gender/Age net vs. VGG-16</title>
<p>As expected, the most resource-intensive model is the deep VGG-16, which occupies almost 10
times more memory than the Gender/Age nets. This fact imposes significant restrictions on the storage
of such an architecture, as well as on the recognition speed: it took about four minutes to make a
decision for one frame using the VGG-16 model.</p>
<p>Evaluation of the CNN quality and the aggregation algorithms is carried out using the accuracy
metric, since gender determination is a classic example of binary classification, and age recognition is
considered over several intervals, i.e. multiclass classification. Accuracy is the proportion of correct
algorithm responses.</p>
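      <p>
        The accuracy metric used below is simply the proportion of correct responses; a minimal sketch (names are ours):
      </p>

```python
def accuracy(predicted, actual):
    # Proportion of test videos for which the predicted label equals the true one.
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```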
<p>In this paper, all proposed classifier fusion methods are compared with the traditional
recognition solution for each separate frame (1). Next, to visualize the results, we introduce the
following abbreviations for the discussed aggregation techniques:</p>
        <p>FBF – frame by frame (1).</p>
        <p>SV – simple voting (2).</p>
        <p>SR – sum rule (3).</p>
        <p>PR – product rule (4).</p>
        <p>ME – mathematical expectation (5).</p>
        <p>The comparison of CNN architectures and classifier fusion algorithms for gender task is presented
in Table 2. The best results are in bold.</p>
<p>Thus, based on the conducted experiments, it can be concluded that classifier fusion increases the
accuracy of gender recognition by 3-10% in comparison with the traditional per-frame approach. The
product rule is the most efficient aggregation algorithm in almost all cases. The deep VGG-16 is more
accurate than Gender_net; for instance, the difference between the models is about 9% for the Kinect
dataset.</p>
<p>Age recognition results are provided in Table 3.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Age recognition accuracy</title>
        <p>[Tables 2 and 3: per-dataset accuracy of the Gender_net/Age_net and VGG-16 models under the SR, PR and ME aggregation rules on the IJB-A, Kinect, Indian Movie and EmotiW2018 datasets; the numeric columns could not be recovered from the extraction.]</p>
<p>The estimation of the mathematical expectation (5) has shown the best effectiveness in
determining the age in most cases. It can also be noticed that the VGG-16 architecture is ahead of the
Gender net and Age net models in age accuracy. Here we have a general trade-off between
performance and accuracy. The low accuracy of age recognition can be explained by the complexity
of the problem as a whole, since this biometric characteristic depends on many factors and cannot
always be uniquely determined.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>
A video-based age and gender recognition algorithm with classifier committees is proposed in this
work. The experimental results have demonstrated an increase in the recognition accuracy of the
proposed algorithm compared to the traditional frame-by-frame decision. The geometric mean
(product rule) with normalization of the input video images is the most accurate in the gender
classification task. At the same time, the most accurate age prediction is
achieved with the computation of the expected value. We have presented the results of comparing the
following CNN architectures: Age net and Gender net [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and VGG-16 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] trained for age and
gender prediction. Eventually, the accuracy of the VGG-16 architecture is about 10-20% higher for the
gender recognition and age prediction than for the Age and Gender net models. However, the
inference of the VGG-16 is 4-9 times slower. This limiting factor for the practical usage of VGG-16
has been overcome with optimization techniques [
        <xref ref-type="bibr" rid="ref1 ref26 ref33 ref34">1,26,33,34</xref>
        ]. As a result, a prototype of the age and gender recognition
system (Fig. 1) for retail needs has been implemented in the Android application (Fig. 2). The intuition
is that this application can improve the efficiency of contextual advertising.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
<p>The work was conducted at the Laboratory of Algorithms and Technologies for Network
Analysis, National Research University Higher School of Economics, and supported by Russian
Science Foundation (RSF) grant 14-41-00039.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Savchenko</surname>
            <given-names>A 2016</given-names>
          </string-name>
          <article-title>Search techniques in intelligent classification systems</article-title>
          (Springer International Publishing)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chao</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ding</surname>
            <given-names>J 2013</given-names>
          </string-name>
          <article-title>Facial age estimation based on label-sensitive learning</article-title>
          and
          <source>ageoriented regression Pattern Recognition</source>
          <volume>46</volume>
          <fpage>628</fpage>
          -
          <lpage>641</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rybintsev</surname>
            <given-names>A V</given-names>
          </string-name>
          ,
          <article-title>Konushin V S and Konushin A S 2015 Consecutive gender and age classification from facial images based on ranked local binary patterns</article-title>
          <source>Computer Optics</source>
          <volume>39</volume>
          (
          <issue>5</issue>
          )
          <fpage>762</fpage>
          -
          <lpage>769</lpage>
DOI: 10.18287/0134-2452-2015-39-5-762-769
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Wang</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>Y</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cao</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <year>2009</year>
          <article-title>Video-based face recognition: A survey World Academy of Science, Engineering</article-title>
          and Technology,
          <source>International Journal of Computer</source>
          , Electrical, Automation,
          <source>Control and Information Engineering</source>
          <volume>3</volume>
          (
          <issue>12</issue>
          )
          <fpage>2809</fpage>
          -
          <lpage>2818</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kalinovskii</surname>
            <given-names>I A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Spitsyn</surname>
            <given-names>V G</given-names>
          </string-name>
          <year>2016</year>
          <article-title>Review and testing of frontal face detectors</article-title>
          <source>Computer Optics</source>
          <volume>40</volume>
          (
          <issue>1</issue>
          )
          <fpage>99</fpage>
          -
          <lpage>111</lpage>
DOI: 10.18287/2412-6179-2016-40-1-99-111
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Kittler</surname>
            <given-names>J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Alkoot</surname>
            <given-names>F 2003</given-names>
          </string-name>
          <article-title>Sum versus vote fusion in multiple classifier systems</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>25</volume>
          <fpage>110</fpage>
          -
          <lpage>115</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Tresp</surname>
            <given-names>V 2001</given-names>
          </string-name>
          <article-title>Committee machines</article-title>
          .
          <source>Handbook for Neural Network Signal Processing 135-151</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Rudakov</surname>
            <given-names>K 1999</given-names>
          </string-name>
          <article-title>On methods of optimization and monotonic correction in the algebraic approach to the problem of recognition RAS Papers 314-317</article-title>
          (in Russia)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mazurov</surname>
            <given-names>V 1990</given-names>
          </string-name>
          <article-title>Method of committees in problems of optimization and classification</article-title>
          (Moscow: Science) p
          <fpage>248</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Theodoridis</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutroumbas</surname>
            <given-names>C 2009</given-names>
          </string-name>
          <article-title>Pattern Recognition</article-title>
          (Elsevier Inc.) p
          <fpage>840</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Savchenko</surname>
            <given-names>A 2012</given-names>
          </string-name>
          <article-title>The choice of the parameters of the image recognition algorithm on the basis of the collective of decision rules and the principle of maximum a posteriori probability</article-title>
          <source>Computer Optics</source>
          <volume>36</volume>
          (
          <issue>1</issue>
          )
          <fpage>117</fpage>
          -
          <lpage>124</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Savchenko</surname>
            <given-names>A 2012</given-names>
          </string-name>
          <article-title>Adaptive Video Image Recognition System Using a Committee Machine</article-title>
          ,
          <source>Optical Memory and Neural Networks (Information Optics)</source>
          21
          <fpage>219</fpage>
          -
          <lpage>226</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Savchenko</surname>
            <given-names>A V</given-names>
          </string-name>
          <year>2018</year>
          <article-title>Trigonometric series in orthogonal expansions for density estimates of deep image features</article-title>
          <source>Computer Optics</source>
          <volume>42</volume>
          (
          <issue>1</issue>
          )
          <fpage>149</fpage>
          -
          <lpage>158</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Levi</surname>
            <given-names>G</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hassner</surname>
            <given-names>T</given-names>
          </string-name>
          <year>2015</year>
          <article-title>Age and gender classification using convolutional neural networks</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 34-42</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Kwon</surname>
            <given-names>Y</given-names>
          </string-name>
          and
          <string-name>
            <surname>da Vitoria Lobo</surname>
            <given-names>N</given-names>
          </string-name>
          <year>1994</year>
          <article-title>Age classification from facial images</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 762-767</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Geng</surname>
            <given-names>X</given-names>
          </string-name>
          <year>2006</year>
          <article-title>Learning from facial aging patterns for automatic age estimation</article-title>
          <source>Proceedings of the 14th ACM International Conference on Multimedia 307-316</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Guo</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mu</surname>
            <given-names>G</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fu</surname>
            <given-names>Y</given-names>
          </string-name>
          <year>2009</year>
          <article-title>Human age estimation using bio-inspired features</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 112-119</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Choi</surname>
            <given-names>S</given-names>
          </string-name>
          <year>2011</year>
          <article-title>Age estimation using a hierarchical classifier based on global and local facial features</article-title>
          <source>Pattern Recognition</source>
          <volume>44</volume>
          (
          <issue>6</issue>
          )
          <fpage>1262</fpage>
          -
          <lpage>1281</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Makinen</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raisamo</surname>
            <given-names>R</given-names>
          </string-name>
          <year>2008</year>
          <article-title>Evaluation of gender classification methods with automatically detected and aligned faces</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>30</volume>
          (
          <issue>3</issue>
          )
          <fpage>541</fpage>
          -
          <lpage>547</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Shan</surname>
            <given-names>C</given-names>
          </string-name>
          <year>2012</year>
          <article-title>Learning local binary patterns for gender classification on real-world face images</article-title>
          <source>Pattern Recognition Letters</source>
          <volume>33</volume>
          (
          <issue>4</issue>
          )
          <fpage>431</fpage>
          -
          <lpage>437</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Rothe</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timofte</surname>
            <given-names>R</given-names>
          </string-name>
          and
          <string-name>
            <surname>Van Gool</surname>
            <given-names>L</given-names>
          </string-name>
          <year>2015</year>
          <article-title>Deep expectation of apparent age from a single image</article-title>
          <source>Proceedings of the IEEE International Conference on Computer Vision Workshops 10-15</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Simonyan</surname>
            <given-names>K</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zisserman</surname>
            <given-names>A</given-names>
          </string-name>
          <year>2014</year>
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          <source>Preprint arXiv:1409.1556</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Szegedy</surname>
            <given-names>C</given-names>
          </string-name>
          <year>2015</year>
          <article-title>Going deeper with convolutions</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1-9</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Krizhevsky</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hinton</surname>
            <given-names>G</given-names>
          </string-name>
          <year>2012</year>
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          <source>Advances in neural information processing systems 1097-1105</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Esmaeili</surname>
            <given-names>M</given-names>
          </string-name>
          <year>2007</year>
          <article-title>Creating of Multiple Classifier Systems by Fuzzy Decision Making in Human-Computer Interface Systems</article-title>
          <source>Proceedings of the IEEE Conference on Fuzzy Systems 1-7</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Savchenko</surname>
            <given-names>A</given-names>
          </string-name>
          <year>2012</year>
          <article-title>Adaptive Video Image Recognition System Using a Committee Machine</article-title>
          <source>Optical Memory and Neural Networks (Information Optics)</source>
          <volume>21</volume>
          (
          <issue>4</issue>
          )
          <fpage>219</fpage>
          -
          <lpage>226</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Shan</surname>
            <given-names>C</given-names>
          </string-name>
          <year>2010</year>
          <article-title>Face recognition and retrieval in video</article-title>
          <source>Video Search and Mining 235-260</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Lienhart</surname>
            <given-names>R</given-names>
          </string-name>
          and
          <string-name>
            <surname>Maydt</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2002</year>
          <article-title>An extended set of Haar-like features for rapid object detection</article-title>
          <source>Proceedings of the IEEE Conference on Image Processing 1</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Dhall</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sikka</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goecke</surname>
            <given-names>R</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sebe</surname>
            <given-names>N</given-names>
          </string-name>
          <year>2015</year>
          <article-title>The more the merrier: Analysing the affect of a group of people in images</article-title>
          <source>Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Klare</surname>
            <given-names>B</given-names>
          </string-name>
          <year>2015</year>
          <article-title>Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <source>IMFDB dataset</source>
          (Access mode: http://cvit.iiit.ac.in/projects/IMFDB/)
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <source>Eurecom Kinect dataset</source>
          (Access mode: http://rgb-d.eurecom.fr/)
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Savchenko</surname>
            <given-names>A</given-names>
          </string-name>
          <year>2017</year>
          <article-title>Maximum-likelihood dissimilarities in image recognition with deep neural networks</article-title>
          <source>Computer Optics</source>
          <volume>41</volume>
          (
          <issue>3</issue>
          )
          <fpage>422</fpage>
          -
          <lpage>430</lpage>
          DOI: 10.18287/2412-6179-2017-41-3-422-430
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Rassadin</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Savchenko</surname>
            <given-names>A</given-names>
          </string-name>
          <year>2017</year>
          <article-title>Compressing deep convolutional neural networks in visual emotion recognition</article-title>
          <source>CEUR Workshop Proceedings</source>
          <volume>1901</volume>
          <fpage>207</fpage>
          -
          <lpage>213</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Nikitin</surname>
            <given-names>M Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konushin</surname>
            <given-names>V S</given-names>
          </string-name>
          and
          <string-name>
            <surname>Konushin</surname>
            <given-names>A S</given-names>
          </string-name>
          <year>2017</year>
          <article-title>Neural network model for video-based face recognition with frames quality assessment</article-title>
          <source>Computer Optics</source>
          <volume>41</volume>
          (
          <issue>5</issue>
          )
          <fpage>732</fpage>
          -
          <lpage>742</lpage>
          DOI: 10.18287/2412-6179-2017-41-5-732-742
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>