=Paper= {{Paper |id=Vol-2403/paper3 |storemode=property |title=Automatic Image Annotation with Ensemble of Convolutional Neural Networks |pdfUrl=https://ceur-ws.org/Vol-2403/paper3.pdf |volume=Vol-2403 |authors=Anastasia Timofeeva,Oleksii Kudin |dblpUrl=https://dblp.org/rec/conf/icteri/TimofeevaK19 }} ==Automatic Image Annotation with Ensemble of Convolutional Neural Networks== https://ceur-ws.org/Vol-2403/paper3.pdf
        Automatic Image Annotation with Ensemble of
              Convolutional Neural Networks

       Anastasia Timofeeva [0000-0001-5813-582X], Oleksii Kudin [0000-0002-5917-9127]
                  Zaporizhzhya National University, Zaporizhzhya, Ukraine
                 nastyatima1@gmail.com, avk256@gmail.com

       Abstract. This paper discusses the models and methods of machine learning
       that are employed to solve the problem of automatic image annotation. Today,
       systems that can extract meaning from visual data are increasingly developed
       and used in both academia and industry. One of the practically important
       directions within this problem area is the development of automatic systems
       for understanding visual scenes. In this paper, we present a brief survey of
       state-of-the-art machine learning approaches and
       methods that have been suggested for automatic image annotation. We study
       the mathematical foundations of the overviewed methods and analyze their
       strengths and limitations. Further, we develop a proof-of-concept system for the
       image annotation using convolutional neural networks and construct a neural
       network ensemble using the snapshot approach. In the image processing stage,
       we resize images to accelerate computation and apply image augmentation.
       In addition, we outline a direction for further development of image
       annotating systems based on both theoretical and experimental models.

       Keywords: Automatic Image Annotation, Convolutional Neural Networks, En-
       semble Methods, Resizing

1      Introduction

Automatic image annotation (AIA) is the process in which a computer automatically
assigns metadata to a digital image. Typically, this metadata takes the form of
titles or keywords. Automatic image annotation has a growing range of applications
in computer vision, data mining and web search.
   Image processing differs from text data processing. Therefore, the development of
effective methods for navigating and searching in large image databases is a complex
task.
   The objective of the paper is to develop a concept of an automatic annotation
system that uses transfer learning and snapshot ensembling as the approach to
combining neural networks.
   The novelty of the article lies in adapting the pre-trained neural network
VGG-16 [1] to image annotation via transfer learning and in using the snapshot
ensemble method to improve the generalization ability of the system.

2      Related Works

Text-based image annotation remains an important practical and fundamental
problem in computer vision and information retrieval. A number of approaches
and surveys have been proposed to address the annotation task.
    The survey in [2] classifies AIA methods into five categories: generative
model-based, nearest neighbor-based, discriminative model-based, tag
completion-based, and deep learning-based image annotation.
    To improve the annotation performance of existing AIA approaches, a hybrid
approach based on a visual attention mechanism (VAM) and a conditional random
field (CRF) is developed in [3]. VAM is employed to determine the salient regions
of the image, and CRF is applied to optimize the initial label set. According to
the experimental results reported in [3], the hybrid approach achieves the highest
annotation performance among the compared methods.
    A common approach combines a support vector machine (SVM) with k-nearest
neighbors (KNN), as in [3, 4]. The objective of the SVM is to build a model that
predicts the target values of test-set instances given only their attributes.
    Ameesh Makadia et al. [5, 6] described the Multiple Bernoulli Relevance Model
(MBRM), which is based on the Continuous-Space Relevance Model (CRM). The key idea
is to transform the image into a set of attribute vectors, which are then used to
model the possible annotation words. Venkatesh N. et al. [7] created the hybrid
model SVM-DMBRM, which combines a support vector machine (SVM) acting as the
discriminative model with the discrete multiple Bernoulli relevance model (DMBRM)
acting as the generative model. The authors of [8] used Probabilistic Latent
Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA): images are
segmented into superpixels, and visual features are extracted from each superpixel
region. Boosted classifiers are then trained for each class, and their outputs are
quantized as boosted visual words.
    In [9, 10, 11], convolutional neural networks (including the AlexNet model) are
employed for image annotation. A very deep convolutional neural network extracts
visual features hierarchically, starting from very basic ones, such as edges, and
gradually building more complex ones, such as shapes.
    Convolutional neural networks (CNNs) can also be combined with recurrent neural
networks (RNNs). In [12, 13], a long short-term memory (LSTM) model is used as the
RNN that generates the annotation for the image. This model provides accurate
annotations that cover the scene of the image as well as the objects in it. Here
the image is described not just by a set of keywords; an associated sentence is
generated.
    Jiwei Hu et al. [14] designed a hierarchical model for image annotation based
on constructing a hierarchical tree: the relationships between the labels and the
features are explored, and the labels are divided into several hierarchies for
efficient and accurate labeling.

3      Segmented and Annotated IAPR-TC12 Dataset

The IAPR-TC12 collection is an established image retrieval benchmark composed of
about 20,000 images manually annotated with free-text descriptions in three
languages. The segmented collection comprises 96,234 regions, annotated with 256
labels that are organized hierarchically. The hierarchy was manually defined by
the dataset authors after carefully analyzing the images, the annotation
vocabulary and the vocabulary of manual annotations. The vocabulary plays a key
role in the annotation process because it must cover most of the concepts found
in the collection of images. At the same time, the vocabulary should not be too
large because AIA performance depends on the number of considered labels. The
annotation vocabulary was organized mostly using is-a relations between labels.
However, relations like part-of and kind-of were also included.
   According to the suggested hierarchy, an object can be in one of six main
branches: ‘animal’, ‘landscape’, ‘man-made’, ‘human’, ‘food’, or ‘other’. This is
the top level of the hierarchy. The ten most common labels are ‘sky-blue’ (5,176),
‘man’ (3,634), ‘group-persons’ (3,548), ‘ground’ (3,284), ‘cloud’ (2,770), ‘rock’
(2,740), ‘grass’ (2,609), ‘vegetation’ (2,455), ‘woman’ (2,339), and ‘trees’
(2,291) [15].
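
   As an illustration of how region labels could be reduced to the six top-level
branches, the sketch below follows is-a links upward and encodes one image as a
six-dimensional 0/1 vector. The `PARENT` fragment, the fallback to ‘other’ for
unknown labels, and the function names are assumptions made for this example;
they are not taken from the dataset definition.

```python
# Top-level branches named in the text.
BRANCHES = ["animal", "landscape", "man-made", "human", "food", "other"]

# Hypothetical fragment of the is-a hierarchy (child -> parent). The real
# IAPR-TC12 hierarchy is larger and may place these labels differently.
PARENT = {
    "man": "person",
    "woman": "person",
    "person": "human",
    "trees": "vegetation",
    "vegetation": "landscape",
    "rock": "landscape",
}


def top_branch(label):
    """Follow is-a links upward until a top-level branch is reached."""
    seen = set()
    while label not in BRANCHES:
        if label in seen or label not in PARENT:
            return "other"  # assumption: unknown labels or cycles map to 'other'
        seen.add(label)
        label = PARENT[label]
    return label


def multi_hot(region_labels):
    """Encode the region labels of one image as a 6-dimensional 0/1 vector."""
    present = {top_branch(label) for label in region_labels}
    return [1 if branch in present else 0 for branch in BRANCHES]
```

   For instance, an image whose regions are labeled ‘man’ and ‘trees’ would be
encoded with ones in the ‘landscape’ and ‘human’ positions under this toy mapping.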

4      System Concept

The automatic image annotation system is based on two key ideas: transfer learning
and snapshot ensembling. The pre-trained convolutional neural network VGG-16 is
used to tag a raw image with six classes: ‘animal’, ‘landscape’, ‘man-made’,
‘human’, ‘food’, or ‘other’. The main purpose of these labels is to facilitate and
improve the annotation process by reducing the number of labels. Transfer learning
is used to train VGG-16 on the IAPR-TC12 dataset. The snapshot ensemble approach
[16] is used to combine the predictions of multiple neural networks and hence to
improve accuracy. The system architecture that we develop is as follows: the VGG-16
Keras model as one layer, followed by two fully connected Dense layers. Feature
vectors are computed by VGG-16 and fed to the Dense layers for final classification.
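
   The architecture described above might be assembled in Keras roughly as
follows. This is a minimal sketch, not the configuration used in the experiments:
the frozen VGG-16 base, the global average pooling, the 256 hidden units, the ReLU
and sigmoid activations, the Adam optimizer and the 224x224 input size are all
assumptions. Only the overall layout (VGG-16 followed by two Dense layers) and the
loss and metric named in Section 5 come from the text.

```python
BRANCHES = ["animal", "landscape", "man-made", "human", "food", "other"]


def build_annotator(input_shape=(224, 224, 3), hidden_units=256, weights="imagenet"):
    """VGG-16 feature extractor followed by two Dense layers (multi-label head)."""
    # Imported lazily so the constants above can be used without TensorFlow.
    from tensorflow import keras

    base = keras.applications.VGG16(
        include_top=False,      # drop the original ImageNet classifier
        weights=weights,        # "imagenet" for transfer learning, None for random init
        input_shape=input_shape,
        pooling="avg",          # assumption: one feature vector per image
    )
    base.trainable = False      # assumption: keep the pre-trained filters fixed

    model = keras.Sequential([
        base,                                                     # VGG-16 as one layer
        keras.layers.Dense(hidden_units, activation="relu"),      # first Dense layer
        keras.layers.Dense(len(BRANCHES), activation="sigmoid"),  # one score per class
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["binary_accuracy"],
    )
    return model
```

   With `weights="imagenet"`, Keras downloads the pre-trained weights on first
use. A sigmoid output is chosen here because several of the six classes can apply
to the same image.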
   Snapshot Ensembling produces an ensemble of accurate and diverse models from a
single training process. At the heart of Snapshot Ensembling is an optimization pro-
cess which visits several local minima before converging to a final solution. We take
model snapshots at these various minima and average their predictions at test time.
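
   The two ingredients of this procedure can be sketched in a few lines. The
cyclic cosine learning-rate schedule below is the one commonly associated with
snapshot ensembling; the text does not state which schedule or how many cycles
were used here, so the schedule, the function names and the parameters are
assumptions for illustration.

```python
import math


def cyclic_cosine_lr(iteration, total_iterations, cycles, lr_max):
    """Learning rate that restarts at lr_max at the beginning of every cycle."""
    cycle_len = math.ceil(total_iterations / cycles)
    position = (iteration % cycle_len) / cycle_len  # in [0, 1) within the cycle
    return lr_max / 2 * (math.cos(math.pi * position) + 1)


def snapshot_iterations(total_iterations, cycles):
    """0-based iterations at the end of each cycle, where a snapshot is saved."""
    cycle_len = math.ceil(total_iterations / cycles)
    return [min((k + 1) * cycle_len, total_iterations) - 1 for k in range(cycles)]


def average_predictions(snapshot_predictions):
    """Average per-class scores of one sample over all snapshots."""
    n = len(snapshot_predictions)
    return [sum(scores) / n for scores in zip(*snapshot_predictions)]
```

   Each cycle lets the learning rate decay towards zero, so the model settles
near a minimum before the rate is raised again; the weights saved at the end of
each cycle form the ensemble members whose scores are averaged.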

5      Classification Experiments

The choice of suitable metrics and loss functions is a significant step in
multi-label classification. At this stage, we use binary cross-entropy as the loss
function and binary accuracy as the accuracy metric.
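
   For reference, both quantities can be written out for a single sample with
six labels. This is a plain-Python sketch of the usual definitions; the 0.5
threshold and the clipping constant `eps` are assumptions, not values reported in
the text.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of labels whose thresholded prediction matches the target."""
    hits = sum(int(p > threshold) == t for t, p in zip(y_true, y_prob))
    return hits / len(y_true)
```

   Note that binary accuracy counts every label position: for an image with a
single positive label out of six, predicting no label at all still scores 5/6,
so this metric can look favorable on unbalanced data.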
   Initial classification experiments (see Fig. 1) show adequate accuracy and loss rate.

                                 Fig. 1. Accuracy and loss rate

   In [15], the authors present classification experiments and results using several
classifiers on the IAPR-TC12 dataset. The highest accuracy reported there is
approximately 85%, obtained with the random forest classifier on the dataset with
5 classes. Compared with this result, we consider the accuracy of our system to be
sufficient.
   The multi-label confusion matrix makes it possible to compute the F1 score for
each label (see Table 1).

                 Table 1. F1 score (%) on the IAPR-TC12 dataset for each label

                        Method                                    IAPR-TC12

    Single neural net                                     [44%, 89%, 20%, 5%, 0%, 15%]
    Snapshot Ensemble                                    [52%, 89%, 23%, 17%, 0%, 20%]

   The low scores in Table 1 are caused by the unbalanced data set. Further research
may therefore involve building a balanced data set and developing a metric suited
to the multi-label classification problem.
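
   A per-label F1 score of the kind reported in Table 1 can be derived from the
true-positive, false-positive and false-negative counts of each label. The sketch
below assumes 0/1 target and prediction rows and returns 0 when a label has no
positives and no predictions; it is an illustration of the standard definition,
not the evaluation code used for the experiments.

```python
def per_label_f1(y_true, y_pred):
    """F1 score for each label from 0/1 target and prediction rows."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Convention (assumption): F1 is 0 when it would otherwise be undefined.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores
```

   A rare label that the model never predicts has no true positives and therefore
an F1 score of 0, which is one way an entry of 0% can arise on unbalanced data.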

6       Conclusion

This paper has suggested a concept for the development of an automatic image
annotation system. Transfer learning has been used to train the VGG-16 model for
the image annotation problem. The snapshot ensemble has been employed as the
approach to building the neural network ensemble. Initial classification
experiments show appropriate accuracy and loss rate, and a multi-label confusion
matrix has been evaluated.
   A prospect for future research is the use of a genetic algorithm for
hyperparameter optimization, e.g. the number of layers, types of activations,
learning rate, etc.

