Labeling Images by Interpretation from Natural Viewing

Karen Guo, Department of Computer Science, University of Minnesota, guoxx431@umn.edu
Danielle N. Pratt, Department of Psychology, University of Minnesota, pratt308@umn.edu
Angus MacDonald III, Department of Psychology, University of Minnesota, angus@umn.edu
Paul R. Schrater, Department of Computer Science, University of Minnesota, schrater@umn.edu

ABSTRACT
In this paper, we discuss the connection between visual processing and the understanding of an image. While information about how an image is viewed can be obtained from subjects' eye fixations, their understanding of the image can be obtained from the descriptions they give of it. Building on this connection between eye fixations and human image descriptions, we propose a new image labeling method. With it, we can construct an image dataset whose labels are closer to how humans understand an incoming image. We also discuss evidence that the proposed labels describe images better than other types of labeling systems.

Research on the relationship between images and human descriptions can serve several different applications. For instance, by analyzing the pairwise similarity of user descriptions, we can obtain a measurement of the complexity of image content. Another possible application is to use this dataset as a criterion for finding differences in visual processing between individuals with and without a certain psychological characteristic.

Author Keywords
Image Representation; Scene Analysis; Computer Vision; Vision and Scene Understanding; Visual Attention; Eye Fixation; Image Annotation

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ExSS '18, March 11, Tokyo, Japan.

Figure 1. Overview of the data collection concept. Compared to previous image labeling methods, our method considers both the objects and the interactions or relations between them. Moreover, our annotation results are more helpful for understanding the whole image, since we take human eye movement into account while locating these regions.

INTRODUCTION
Understanding an image is straightforward for a human: humans view an image and describe both its content and what is happening in it. Teaching a computer to understand an image the way a human does is an interesting problem, with many potential applications in artificial intelligence. One of the most well-known image understanding approaches is to recognize the objects present in the image. For example, ImageNet [1] is an image dataset that contains thousands of object classes and is used to train computers to detect and recognize these objects, and AlexNet [6] is one of the best-known deep neural networks trained on this dataset, performing well on object detection and recognition tasks. This approach exploits only one aspect of human visual processing: object recognition.

However, human understanding of an image is not limited to recognizing the objects it contains. Humans consider not only the objects but also their details or distortions, and they may also focus on the interactions or relations between objects or smaller entities. Taking Figure 1 as an example, applying an object recognition method such as [10, 9] to this image yields several "person" objects and their positions. Ideally, we would want descriptions that go beyond the objects, such as "the crowd in the convention room" or "the shaking hands of two people at the front."
In this paper, we introduce a procedure for collecting image information from a more natural perspective on human visual processing. In contrast with object-oriented datasets, we asked subjects to describe the whole image before labeling partial regions in it, which lets us simulate the order of human visual processing when a new scene arrives. Moreover, we use eye fixation data as a visual attention prior for the labeling process: both feature-extraction information and the eye fixation traces of the image are used to compute its important regions, so these regions are more critical for understanding the image. After generating the regions to annotate, we construct an annotation interface for crowdworkers that balances efficiency and fatigue while simulating human vision. Our annotation results are closer to how we view an image than previous datasets: the descriptions we collect provide not only the names of entities but also the relations between them, giving an overall and natural understanding of the image.

In the following sections, we discuss the details of our annotation method and the potential applications of our dataset.

DATA ANNOTATION
In this section, we describe how we combine subjects' eye fixations with their descriptions of an image to generate our new image labeling. Our stimulus images for annotation are drawn from the MS-COCO dataset [7] as a reasonable subset covering different scenes and situations.

Figure 2. The procedure of generating descriptive regions from human visual attention for image annotation.

Visual Attention Clusters
To incorporate human eye movement into our annotation method, we first recorded 100 subjects' mouse traces on the given images with SALICON [4] to simulate their eye movements. SALICON is a tool that approximates visual attention via mouse traces: a Gaussian blur filter is applied to every image, the blurred images are uploaded to Amazon Mechanical Turk, and large-scale mouse-tracking data are collected. The mouse traces on the blurred images can then be transformed into simulated eye movement maps (Figure 2(b)). In this way, the visual attention map of an image can be approximated from a large number of subjects' mouse traces instead of an eye-tracking machine (Figure 2(c)). According to the analysis in [4], these maps are closer to real human visual attention than the attention maps generated by image-oriented saliency detection methods such as [3] and [11]. To further emphasize eye fixations, we extract "fixation points" from the mouse traces; these points are defined and filtered by the length of time the mouse stays at a given position.

After obtaining the fixation points in an image, we assume that they belong to regions that humans focus on during viewing. We approximate these regions with a Gaussian mixture model and cluster the fixation points accordingly. This yields a set of regions, or descriptive regions, that carry information about human visual attention during the viewing and understanding of an image.
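To make this step concrete, the sketch below shows one way the dwell-time filtering and the Gaussian-mixture clustering could be implemented. The paper does not publish code, so the scikit-learn implementation, the dwell threshold, the spatial radius, and the number of mixture components here are our assumptions, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' code): dwell-time filtering of a
# mouse/eye trace followed by Gaussian-mixture clustering of the resulting
# fixation points into candidate descriptive regions.
import numpy as np
from sklearn.mixture import GaussianMixture


def extract_fixations(trace, min_dwell_ms=100.0, radius_px=20.0):
    """trace: array of (x, y, t_ms) rows; returns centroids of sample runs
    that stay within `radius_px` of a start point for at least `min_dwell_ms`."""
    fixations = []
    i = 0
    while i < len(trace):
        j = i
        while j + 1 < len(trace) and np.hypot(*(trace[j + 1, :2] - trace[i, :2])) < radius_px:
            j += 1
        if trace[j, 2] - trace[i, 2] >= min_dwell_ms:
            fixations.append(trace[i:j + 1, :2].mean(axis=0))  # fixation centroid
        i = j + 1
    return np.array(fixations)


def descriptive_regions(fixations, n_regions, seed=0):
    """Fit a Gaussian mixture to pooled fixation points; each component's
    mean/covariance defines one elliptical descriptive region."""
    gmm = GaussianMixture(n_components=n_regions, covariance_type="full",
                          random_state=seed).fit(fixations)
    return gmm.means_, gmm.covariances_, gmm.predict(fixations)


# Toy usage: one synthetic trace that dwells near three screen locations
# (a stand-in for SALICON-style mouse-tracking data).
rng = np.random.default_rng(0)
centers = np.array([[100.0, 120.0], [400.0, 300.0], [520.0, 80.0]])
samples = np.vstack([c + rng.normal(0, 4, size=(80, 2)) for c in centers])
trace = np.column_stack([samples, np.arange(len(samples)) * 30.0])  # ~30 ms/sample
fix = extract_fixations(trace)
means, covs, labels = descriptive_regions(fix, n_regions=min(3, len(fix)))
print(len(fix), "fixations ->", len(means), "descriptive regions")
```

In practice, the fixation points would be pooled across all subjects who viewed an image before fitting the mixture, and the number of components could be chosen with a model-selection criterion such as BIC rather than fixed in advance.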
Crowdworkers' Annotation and Postprocessing
After obtaining the descriptive regions by aggregating fixation points, we designed an annotation interface for these regions. The instructions in the interface first lead users to describe the whole image; the interface then presents the descriptive regions of that image for labeling. This question order ensures that the descriptions resemble the natural human viewing process. We deployed the annotation interface on Amazon Mechanical Turk (MTurk), a crowdworking platform, and collected users' descriptions there.

Figure 3. The procedure of generating descriptive regions from human visual attention for image annotation.

To refine the descriptions collected from MTurk, we postprocess the descriptions of each image with natural language processing (NLP) tools. One of the tools we currently use is WordNet [2], an English lexical database with a tree-like structure for every word. By applying this dictionary to the collected descriptions, we remove incomprehensible descriptions and merge nouns with similar meanings, where similarity is defined by the nouns themselves and the hypernyms that relate them in WordNet. Figure 3 shows an example of 10 subjects' descriptions from MTurk and one refined description for the red box in the image. With more descriptions collected and more NLP tools involved in the future, we could generate more detailed and informative annotations for these descriptive regions.
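As an illustration of this kind of WordNet-based cleanup, the snippet below uses NLTK's WordNet interface to drop words that WordNet does not recognize as nouns and to greedily group nouns whose synsets or nearby hypernyms overlap. The similarity criterion, hypernym depth, and grouping strategy are illustrative assumptions, not the paper's exact postprocessing.

```python
# Illustrative sketch (not the authors' code): filter region descriptions
# against WordNet and merge nouns whose senses or nearby hypernyms overlap.
# Requires NLTK with the WordNet corpus installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn


def known_noun(word):
    """Keep a word only if WordNet lists at least one noun sense for it."""
    return bool(wn.synsets(word, pos=wn.NOUN))


def related_nouns(a, b, depth=1):
    """True if the two nouns share a synset, or one noun's synsets appear
    among the other's hypernyms within `depth` levels (an assumed criterion)."""
    syn_a = set(wn.synsets(a, pos=wn.NOUN))
    syn_b = set(wn.synsets(b, pos=wn.NOUN))
    if syn_a & syn_b:
        return True

    def expand(syns):
        out, frontier = set(syns), set(syns)
        for _ in range(depth):
            frontier = {h for s in frontier for h in s.hypernyms()}
            out |= frontier
        return out

    return bool(expand(syn_a) & syn_b) or bool(expand(syn_b) & syn_a)


def merge_labels(nouns):
    """Greedily group nouns from different subjects into merged clusters."""
    clusters = []
    for noun in filter(known_noun, nouns):
        for cluster in clusters:
            if any(related_nouns(noun, member) for member in cluster):
                cluster.append(noun)
                break
        else:
            clusters.append([noun])
    return clusters


# e.g. drops the gibberish token and may group 'crowd' with 'gathering'
# (its direct hypernym), while other nouns stay in separate clusters.
print(merge_labels(["crowd", "gathering", "person", "golfer", "asdkj"]))
```

A merged cluster can then be summarized by its most frequent member (or a shared hypernym) to produce one refined description per region.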
DISCUSSION AND NUMERICAL ANALYSIS
Our annotation method represents a different way for computers to learn about and understand an image. Currently, we have collected 10 subjects' annotation results on 113 images as a pilot dataset. As more images are annotated with this method, computers can learn to generate more human-like descriptions for new images by applying neural network methods with structures such as [8] and [5]. Furthermore, this dataset can support multiple applications such as object detection, foreground-background separation, scene recognition, and image caption generation. Deep learning methods can also exploit these new annotations with both directed and generative models. Given the interest-defined regions R, the labels L, and a set of images I, a computer can learn the mappings between them and make complex predictions among R, L, and I. For example, we can learn to predict what is interesting in an image, or generate novel images from labels and/or regions. Once the relation between R and I is learned, we could also retrieve related images from a single input region. This connection can further aid in recognizing objects from a more contextual view.

In addition to direct applications in computer vision, we also want to identify human interest in an image. People may naturally gravitate toward only a few regions of interest in some objectively complex images. We want to quantify this Interest Complexity, as we believe it is a better measure of how much information people actually extract from an image, independent of the image's objective complexity. To compute the Interest Complexity (Ω), we consider the users' descriptions (the labels L) collected for the regions R on the MTurk platform. Different subjects' descriptions of the same image can vary widely, which provides considerable information for the annotation procedure; comparing subjects' descriptions of the same image pairwise is one way to retrieve it. We learn not only whether a subject answers correctly, but also how many subjects give similar descriptions. In our analysis, we use spectral clustering to find groups of subjects with similar descriptions. If an image yields sparser groups, it contains descriptive regions or content that is hard to describe with the same words. Following this criterion, we generate Ω for each image from the descriptions collected from the crowdworkers. Figure 4 shows Ω generated from our current pilot dataset with 10 subjects per image. In Figure 4(b), the image contains only one woman with a golf club against a clean background, which results in a higher magnitude of the simplicity weight. On the other hand, the image in Figure 4(c) carries an enormous amount of information and has a cluttered background; a sample description of one of its descriptive regions can be found in Figure 3. Since subjects describe this region in many different ways, the resulting Ω has a lower magnitude. Figure 4(a) shows example Ω values for 20 images. Generally speaking, an image may yield two sets of ambiguous descriptions, but if most user descriptions fall into these two groups, this suggests an objectively complex image with concentrated interest (higher Ω). There may also be 10 sets of different descriptions, each containing only one user, which would be a case of lower Ω. As more user descriptions are collected, this interest complexity measurement Ω can be generated more and more stably.
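A minimal sketch of this grouping step is given below. The word-overlap (Jaccard) similarity, the scikit-learn spectral clustering call, and the concentration score that stands in for Ω are our assumptions; the paper does not specify its similarity measure or how Ω is scaled (the values reported in Figure 4 are negative).

```python
# Illustrative sketch (not the authors' code): cluster the subjects of one
# image by the pairwise similarity of their free-text descriptions and
# summarize how concentrated the resulting groups are.
import numpy as np
from sklearn.cluster import SpectralClustering


def jaccard(a, b):
    """Word-overlap similarity between two free-text descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


def description_groups(descriptions, n_groups=2, seed=0):
    """Spectrally cluster subjects using the pairwise similarity matrix."""
    n = len(descriptions)
    sim = np.array([[jaccard(descriptions[i], descriptions[j])
                     for j in range(n)] for i in range(n)])
    return SpectralClustering(n_clusters=n_groups, affinity="precomputed",
                              random_state=seed).fit_predict(sim)


def interest_concentration(labels):
    """Higher when most subjects fall into a few groups (concentrated
    interest), lower when every subject forms its own group; a simple
    stand-in for the Omega statistic, not the paper's formula."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p ** 2))


# Toy usage: six subjects' descriptions of the same image.
descs = [
    "a crowd in a convention room",
    "people gathered in a convention room",
    "a crowd of people in a room",
    "two people shaking hands",
    "two men shaking hands at the front",
    "handshake between two people",
]
labels = description_groups(descs, n_groups=2)
print(labels, interest_concentration(labels))
```

The number of clusters could itself be chosen from the affinity spectrum (for example, by the eigengap), so that images with a single dominant interpretation collapse into one group.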
Figure 4. Simplicity weight for an image: an image with a lower weight magnitude contains more complicated content and shows less similarity among the subjects' descriptions. (a) Interest Complexity Ω generated from subjects' descriptions; (b) Ω = −0.46 (simple); (c) Ω = −0.12 (complex).

After measuring the interest complexity Ω of each image, a variety of applications in different fields can be developed or improved. For example, we could explore design choices for a new user interface based on the testing groups, such as using a simpler image set when users have limited time to understand the image. Ω can also serve as an objective measurement for detecting a certain psychological characteristic. For instance, we could collect eye movement data from individuals with and without the characteristic; by comparing and analyzing these data against our annotated dataset, we could find a more numerical way to distinguish whether a new individual has it. This could also aid clinical tests that currently require clients to take a subjective test whose results must be scored by experienced professionals. Through the process described above, we could run an objective measurement, facilitate the professionals' testing, and possibly simplify the testing pipeline, since image viewing is accessible and more comfortable for clients.

CONCLUSION
In this paper, we present an image annotation method that incorporates information from human visual processing. It offers a computer a different aspect from which to understand an image than current image processing methods. By using human interest to guide annotations, we obtain annotations that focus on the naturally interesting aspects of an image and the relations between them. Our annotations are designed to be closer to the way people generate explanations for the content of images: we elicit fixations and annotations while users explicitly look at an image to explain what is happening. We also discuss applications and effects in a variety of fields to show that this distinctive annotation concept is useful and needed for future developments in artificial intelligence, user interfaces, and psychology. As more data are collected with our method and analyzed, computers can learn to achieve a more human-like perspective on the surrounding entities and how they interact or relate to each other.

ACKNOWLEDGMENTS
First of all, we would like to express our appreciation to Professor Angus MacDonald III and his student Danielle Pratt of the Clinical Psychology Department for the discussions and the eye tracking data collection. We also give our big thanks to Sewon Oh and all our colleagues in the CoMoCo Lab at the University of Minnesota for their contributions to pre-testing the annotation interface. Last but not least, we would like to thank the Minnesota Supercomputing Institute (MSI) and the College of Liberal Arts (CLA) server at the University of Minnesota for providing computation and storage.

REFERENCES
1. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (2009), 248–255.
2. Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, 1998.
3. Goferman, S., Zelnik-Manor, L., and Tal, A. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 10 (2012), 1915–1926.
4. Jiang, M., Huang, S., Duan, J., and Zhao, Q. SALICON: Saliency in Context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 1072–1080.
5. Kendall, A., Badrinarayanan, V., and Cipolla, R. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding.
6. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
7. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 8693 (2014), 740–755.
8. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You Only Look Once: Unified, real-time object detection. arXiv preprint (2015).
9. Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (2015).
10. Simonyan, K., and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. ImageNet Challenge (2014), 1–10.
11. Yang, J., and Yang, M.-H. Top-Down Visual Saliency via Joint CRF and Dictionary Learning. In CVPR (2012).