Labeling Images by Interpretation from Natural Viewing

Karen Guo, Department of Computer Science, University of Minnesota, guoxx431@umn.edu
Danielle N. Pratt, Department of Psychology, University of Minnesota, pratt308@umn.edu
Angus MacDonald III, Department of Psychology, University of Minnesota, angus@umn.edu
Paul R. Schrater, Department of Computer Science, University of Minnesota, schrater@umn.edu

ABSTRACT
In this paper, we discuss the connection between visual processing and the understanding of an image. While information about how an image is viewed can be obtained from subjects' eye fixations, their understanding of the image can be obtained from the descriptions they give of it. Building on this connection between eye fixations and human image descriptions, we propose a new image labeling method. With it, we can construct an image dataset whose labels are closer to how humans understand an incoming image. We also discuss evidence that the proposed labels describe images better than other types of labeling systems.

Research on the relationship between images and human descriptions can serve several different applications. For instance, by analyzing the pairwise similarity of user descriptions, we can obtain a measurement of the complexity of image content. Another possible application is to use this dataset as a criterion for finding differences in visual processing between individuals with and without a certain psychological characteristic.

Author Keywords
Image Representation; Scene Analysis; Computer Vision; Vision and Scene Understanding; Visual Attention; Eye Fixation; Image Annotation

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ExSS '18, March 11, Tokyo, Japan.

Figure 1. Overview of the data collection concept. Compared to previous image labeling methods, our method considers both the objects and the interactions or relations between them. Moreover, our annotation results are more helpful for understanding the whole image, since we take human eye movement into account while locating these regions.

INTRODUCTION
Understanding an image is straightforward for a human: humans view an image and describe both its content and what is happening in it. Teaching a computer to understand an image the way a human does is an interesting problem, with many potential applications in artificial intelligence. One of the most well-known image understanding approaches is to recognize the objects present in the image. For example, ImageNet [1] is an image dataset that contains thousands of object classes and is used to train computers to detect and recognize these objects, and AlexNet [6] is one of the best-known deep neural networks trained on this dataset, performing well on object detection and recognition tasks. This approach exploits only one aspect of human visual processing: object recognition.

However, human understanding of an image is not limited to recognizing the objects it contains. Humans consider not only the objects but also their details or distortions, and they may also focus on the interactions or relations between objects or smaller entities. Taking Figure 1 as an example, applying an object recognition method such as [10, 9] to this image yields several "person" objects and their positions. Ideally, we would want descriptions that go beyond the objects, such as "the crowd in the convention room" or "the shaking hands of two people at the front."
In this paper, we introduce a procedure for collecting image information from a more natural perspective on human visual processing. In contrast with object-oriented datasets, we asked subjects to describe the whole image before labeling partial regions in it, which lets us simulate the order of human visual processing when a new scene arrives. Moreover, we use eye fixation data as a visual attention prior for the labeling process: both feature-extraction information and the eye fixation traces of the image are used to compute its important regions, so these regions are more critical for understanding the image. After generating the regions to annotate, we construct an annotation interface for crowdworkers that balances efficiency and fatigue while simulating human vision. Our annotation results are closer to how we view an image than previous datasets: the descriptions we collect provide not only the names of entities but also the relations between them, giving an overall and natural understanding of the image.

In the following sections, we discuss the details of our annotation method and the potential applications of our dataset.

DATA ANNOTATION
In this section, we describe how we combine subjects' eye fixations with their descriptions of an image to generate our new image labeling. Our stimulus images for annotation are drawn from the MS-COCO dataset [7] as a reasonable subset covering different scenes and situations.

Figure 2. The procedure of generating descriptive regions from human visual attention for image annotation.

Visual Attention Clusters
To incorporate human eye movement into our annotation method, we first recorded 100 subjects' mouse traces on the given images with SALICON [4] to simulate their eye movements. SALICON is a tool that approximates visual attention via mouse traces: a Gaussian blur filter is applied to every image, the blurred images are uploaded to Amazon Mechanical Turk, and large-scale mouse-tracking data are collected. The mouse traces on the blurred images can then be transformed into simulated eye movement maps (Figure 2(b)). In this way, the visual attention map of an image can be approximated from a large number of subjects' mouse traces instead of an eye-tracking machine (Figure 2(c)). According to the analysis in [4], these maps are closer to real human visual attention than the attention maps generated by image-oriented saliency detection methods such as [3] and [11]. To further emphasize eye fixations, we extract "fixation points" from the mouse traces; these points are defined and filtered by the length of time the mouse stays at a given position.

After obtaining the fixation points in an image, we assume that they belong to regions that humans focus on during viewing. We approximate these regions with a Gaussian mixture model and cluster the fixation points accordingly. This yields a set of regions, or descriptive regions, that carry information about human visual attention during the viewing and understanding of an image.
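To make this step concrete, the sketch below shows one way the dwell-time filtering and the Gaussian-mixture clustering could be implemented. The paper does not publish code, so the scikit-learn implementation, the dwell threshold, the spatial radius, and the number of mixture components here are our assumptions, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' code): dwell-time filtering of a
# mouse/eye trace followed by Gaussian-mixture clustering of the resulting
# fixation points into candidate descriptive regions.
import numpy as np
from sklearn.mixture import GaussianMixture


def extract_fixations(trace, min_dwell_ms=100.0, radius_px=20.0):
    """trace: array of (x, y, t_ms) rows; returns centroids of sample runs
    that stay within `radius_px` of a start point for at least `min_dwell_ms`."""
    fixations = []
    i = 0
    while i < len(trace):
        j = i
        while j + 1 < len(trace) and np.hypot(*(trace[j + 1, :2] - trace[i, :2])) < radius_px:
            j += 1
        if trace[j, 2] - trace[i, 2] >= min_dwell_ms:
            fixations.append(trace[i:j + 1, :2].mean(axis=0))  # fixation centroid
        i = j + 1
    return np.array(fixations)


def descriptive_regions(fixations, n_regions, seed=0):
    """Fit a Gaussian mixture to pooled fixation points; each component's
    mean/covariance defines one elliptical descriptive region."""
    gmm = GaussianMixture(n_components=n_regions, covariance_type="full",
                          random_state=seed).fit(fixations)
    return gmm.means_, gmm.covariances_, gmm.predict(fixations)


# Toy usage: one synthetic trace that dwells near three screen locations
# (a stand-in for SALICON-style mouse-tracking data).
rng = np.random.default_rng(0)
centers = np.array([[100.0, 120.0], [400.0, 300.0], [520.0, 80.0]])
samples = np.vstack([c + rng.normal(0, 4, size=(80, 2)) for c in centers])
trace = np.column_stack([samples, np.arange(len(samples)) * 30.0])  # ~30 ms/sample
fix = extract_fixations(trace)
means, covs, labels = descriptive_regions(fix, n_regions=min(3, len(fix)))
print(len(fix), "fixations ->", len(means), "descriptive regions")
```

In practice, the fixation points would be pooled across all subjects who viewed an image before fitting the mixture, and the number of components could be chosen with a model-selection criterion such as BIC rather than fixed in advance.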
Crowdworkers' Annotation and Postprocessing
After obtaining the descriptive regions by aggregating fixation points, we designed an annotation interface for these regions. The instructions in the interface first lead users to describe the whole image; the interface then presents the descriptive regions of that image for labeling. This question order ensures that the descriptions resemble the natural human viewing process. We deployed the annotation interface on Amazon Mechanical Turk (MTurk), a crowdworking platform, and collected users' descriptions there.

Figure 3. The procedure of generating descriptive regions from human visual attention for image annotation.

To refine the descriptions collected from MTurk, we postprocess the descriptions of each image with natural language processing (NLP) tools. One of the tools we currently use is WordNet [2], an English lexical database with a tree-like structure for every word. By applying this dictionary to the collected descriptions, we remove incomprehensible descriptions and merge nouns with similar meanings, where similarity is defined by the nouns themselves and the hypernyms that relate them in WordNet. Figure 3 shows an example of 10 subjects' descriptions from MTurk and one refined description for the red box in the image. With more descriptions collected and more NLP tools involved in the future, we could generate more detailed and informative annotations for these descriptive regions.
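As an illustration of this kind of WordNet-based cleanup, the snippet below uses NLTK's WordNet interface to drop words that WordNet does not recognize as nouns and to greedily group nouns whose synsets or nearby hypernyms overlap. The similarity criterion, hypernym depth, and grouping strategy are illustrative assumptions, not the paper's exact postprocessing.

```python
# Illustrative sketch (not the authors' code): filter region descriptions
# against WordNet and merge nouns whose senses or nearby hypernyms overlap.
# Requires NLTK with the WordNet corpus installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn


def known_noun(word):
    """Keep a word only if WordNet lists at least one noun sense for it."""
    return bool(wn.synsets(word, pos=wn.NOUN))


def related_nouns(a, b, depth=1):
    """True if the two nouns share a synset, or one noun's synsets appear
    among the other's hypernyms within `depth` levels (an assumed criterion)."""
    syn_a = set(wn.synsets(a, pos=wn.NOUN))
    syn_b = set(wn.synsets(b, pos=wn.NOUN))
    if syn_a & syn_b:
        return True

    def expand(syns):
        out, frontier = set(syns), set(syns)
        for _ in range(depth):
            frontier = {h for s in frontier for h in s.hypernyms()}
            out |= frontier
        return out

    return bool(expand(syn_a) & syn_b) or bool(expand(syn_b) & syn_a)


def merge_labels(nouns):
    """Greedily group nouns from different subjects into merged clusters."""
    clusters = []
    for noun in filter(known_noun, nouns):
        for cluster in clusters:
            if any(related_nouns(noun, member) for member in cluster):
                cluster.append(noun)
                break
        else:
            clusters.append([noun])
    return clusters


# e.g. drops the gibberish token and may group 'crowd' with 'gathering'
# (its direct hypernym), while other nouns stay in separate clusters.
print(merge_labels(["crowd", "gathering", "person", "golfer", "asdkj"]))
```

A merged cluster can then be summarized by its most frequent member (or a shared hypernym) to produce one refined description per region.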
DISCUSSION AND NUMERICAL ANALYSIS
Our annotation method represents a different way for computers to learn about and understand an image. Currently, we have collected 10 subjects' annotation results on 113 images as a pilot dataset. As more images are annotated with this method, computers can learn to generate more human-like descriptions for new images by applying neural network methods with structures such as [8] and [5]. Furthermore, this dataset can support multiple applications such as object detection, foreground-background separation, scene recognition, and image caption generation. Deep learning methods can also exploit these new annotations with both directed and generative models. Given the interest-defined regions R, the labels L, and a set of images I, a computer can learn the mappings between them and make complex predictions among R, L, and I. For example, we can learn to predict what is interesting in an image, or generate novel images from labels and/or regions. Once the relation between R and I is learned, we could also retrieve related images from a single input region. This connection can further aid in recognizing objects from a more contextual view.

In addition to direct applications in computer vision, we also want to identify human interest in an image. People may naturally gravitate toward only a few regions of interest in some objectively complex images. We want to quantify this Interest Complexity, as we believe it is a better measure of how much information people actually extract from an image, independent of the image's objective complexity. To compute the Interest Complexity (Ω), we consider the users' descriptions (the labels L) collected for the regions R on the MTurk platform. Different subjects' descriptions of the same image can vary widely, which provides considerable information for the annotation procedure; comparing subjects' descriptions of the same image pairwise is one way to retrieve it. We learn not only whether a subject answers correctly, but also how many subjects give similar descriptions. In our analysis, we use spectral clustering to find groups of subjects with similar descriptions. If an image yields sparser groups, it contains descriptive regions or content that is hard to describe with the same words. Following this criterion, we generate Ω for each image from the descriptions collected from the crowdworkers. Figure 4 shows Ω generated from our current pilot dataset with 10 subjects per image. In Figure 4(b), the image contains only one woman with a golf club against a clean background, which results in a higher magnitude of the simplicity weight. On the other hand, the image in Figure 4(c) carries an enormous amount of information and has a cluttered background; a sample description of one of its descriptive regions can be found in Figure 3. Since subjects describe this region in many different ways, the resulting Ω has a lower magnitude. Figure 4(a) shows example Ω values for 20 images. Generally speaking, an image may yield two sets of ambiguous descriptions, but if most user descriptions fall into these two groups, this suggests an objectively complex image with concentrated interest (higher Ω). There may also be 10 sets of different descriptions, each containing only one user, which would be a case of lower Ω. As more user descriptions are collected, this interest complexity measurement Ω can be generated more and more stably.
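A minimal sketch of this grouping step is given below. The word-overlap (Jaccard) similarity, the scikit-learn spectral clustering call, and the concentration score that stands in for Ω are our assumptions; the paper does not specify its similarity measure or how Ω is scaled (the values reported in Figure 4 are negative).

```python
# Illustrative sketch (not the authors' code): cluster the subjects of one
# image by the pairwise similarity of their free-text descriptions and
# summarize how concentrated the resulting groups are.
import numpy as np
from sklearn.cluster import SpectralClustering


def jaccard(a, b):
    """Word-overlap similarity between two free-text descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


def description_groups(descriptions, n_groups=2, seed=0):
    """Spectrally cluster subjects using the pairwise similarity matrix."""
    n = len(descriptions)
    sim = np.array([[jaccard(descriptions[i], descriptions[j])
                     for j in range(n)] for i in range(n)])
    return SpectralClustering(n_clusters=n_groups, affinity="precomputed",
                              random_state=seed).fit_predict(sim)


def interest_concentration(labels):
    """Higher when most subjects fall into a few groups (concentrated
    interest), lower when every subject forms its own group; a simple
    stand-in for the Omega statistic, not the paper's formula."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p ** 2))


# Toy usage: six subjects' descriptions of the same image.
descs = [
    "a crowd in a convention room",
    "people gathered in a convention room",
    "a crowd of people in a room",
    "two people shaking hands",
    "two men shaking hands at the front",
    "handshake between two people",
]
labels = description_groups(descs, n_groups=2)
print(labels, interest_concentration(labels))
```

The number of clusters could itself be chosen from the affinity spectrum (for example, by the eigengap), so that images with a single dominant interpretation collapse into one group.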
Figure 4. Simplicity weight for an image: an image with a lower weight magnitude contains more complicated content and shows less similarity among the subjects' descriptions. (a) Interest Complexity Ω generated from subjects' descriptions; (b) Ω = −0.46 (simple); (c) Ω = −0.12 (complex).

After measuring the interest complexity Ω of each image, a variety of applications in different fields can be developed or improved. For example, we could explore design choices for a new user interface based on the testing groups, such as using a simpler image set when users have limited time to understand the image. Ω can also serve as an objective measurement for detecting a certain psychological characteristic. For instance, we could collect eye movement data from individuals with and without the characteristic; by comparing and analyzing these data against our annotated dataset, we could find a more numerical way to distinguish whether a new individual has it. This could also aid clinical tests that currently require clients to take a subjective test whose results must be scored by experienced professionals. Through the process described above, we could run an objective measurement, facilitate the professionals' testing, and possibly simplify the testing pipeline, since image viewing is accessible and more comfortable for clients.

CONCLUSION
In this paper, we present an image annotation method that incorporates information from human visual processing. It offers a computer a different aspect from which to understand an image than current image processing methods. By using human interest to guide annotations, we obtain annotations that focus on the naturally interesting aspects of an image and the relations between them. Our annotations are designed to be closer to the way people generate explanations for the content of images: we elicit fixations and annotations while users explicitly look at an image to explain what is happening. We also discuss applications and effects in a variety of fields to show that this distinctive annotation concept is useful and needed for future developments in artificial intelligence, user interfaces, and psychology. As more data are collected with our method and analyzed, computers can learn to achieve a more human-like perspective on the surrounding entities and how they interact or relate to each other.

ACKNOWLEDGMENTS
First of all, we would like to express our appreciation to Professor Angus MacDonald III and his student Danielle Pratt of the Clinical Psychology Department for the discussions and the eye tracking data collection. We also give our big thanks to Sewon Oh and all our colleagues in the CoMoCo Lab at the University of Minnesota for their contributions to pre-testing the annotation interface. Last but not least, we would like to thank the Minnesota Supercomputing Institute (MSI) and the College of Liberal Arts (CLA) server at the University of Minnesota for providing computation and storage.

REFERENCES
1. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (2009), 248–255.
2. Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, 1998.
3. Goferman, S., Zelnik-Manor, L., and Tal, A. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 10 (2012), 1915–1926.
4. Jiang, M., Huang, S., Duan, J., and Zhao, Q. SALICON: Saliency in Context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 1072–1080.
5. Kendall, A., Badrinarayanan, V., and Cipolla, R. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding.
6. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
7. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 8693 (2014), 740–755.
8. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You Only Look Once: Unified, real-time object detection. arXiv preprint (2015).
9. Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (2015).
10. Simonyan, K., and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. ImageNet Challenge (2014), 1–10.
11. Yang, J., and Yang, M.-H. Top-Down Visual Saliency via Joint CRF and Dictionary Learning. In CVPR (2012).