Image and Video Tag Aggregation

Olga Kanishcheva[0000-0002-4589-092X] and Natalia Sharonova[0000-0002-8161-552X]

National Technical University "Kharkiv Polytechnic Institute", Kharkiv, Ukraine
kanichshevaolga@gmail.com, nvsharonova@ukr.net

Abstract. In this paper, we explore the task of tag aggregation for video and image files. We describe recent achievements in image description generation and discuss tags and tagging. We present the results of our experiments, which are based on lexical resources and natural language processing approaches. In our work, we use the auto-tagging program of the Imagga company as the tag-generating program. As data, we use five videos which were split into shots for further processing. We identified two subtasks for tag aggregation: 1) creation of a general set of tags for the whole video file and 2) creation of separate sets of tags for each video shot. Our detailed tag analysis and experimental results showed that the pipeline with NLP methods achieves good results for the whole video, but for individual shots, in future work, we need to use such resources as Nasari vectors, SensEmbed vectors or word2vec. We present all our experiments with graphs, tables etc. Our future work will be related to the aggregation of video descriptions.

Keywords: Image Description · Video Description · Natural Language Processing · Aggregation of Video Tags · Evaluation Measures · Automatic Image Description Generation · Keywords Aggregation · Tag Aggregation

1 Introduction

Online social networks are providing more and more convenient services to their users. Today, social networks have grown to be one of the most important information sources for people and are involved in all aspects of their lives. Meanwhile, every online social network user is a contributor to these large amounts of information. Online users like to share their experiences and to express their opinions on virtually all events and issues. Among the large amount of online user-generated data, we are particularly interested in people's opinions or sentiments towards specific topics and events.

The social web has opened many opportunities for the exploration of large multimedia datasets that were previously unavailable. Popular social websites such as Flickr, Photobucket and Picasa contain a massive amount of photographs, which have been collectively tagged and annotated by members of the respective community. Other sources of images are professional stock image marketplaces. Stock photos are made by professional or semi-professional photographers and are usually contained in search databases; they can be purchased and delivered online. Each of these photos, independent of its source, should have relevant tags; sometimes these tags come from humans, sometimes they are auto-tags. The problems in the tagging area include tag generation, tag translation and disambiguation, tag-based image classification and clustering, etc.

However, along with the growth in the number of images on the Internet, the amount of video content grows as well. Advances in computer and network infrastructure, together with the fast evolution of multimedia data, have led to growing attention to digital video. The scientific community has increased the amount of research into new technologies with a view to improving digital video utilization: its archiving, indexing, accessibility, acquisition, storage and even its processing and usability.
Image and video processing are very close to each other and should be explored in parallel, because a video can be divided into shots, where each shot is represented by an image.

A separate trend related to images and video is the automatic generation of image descriptions (full sentences). However, the problem of video description generation has several properties that make it especially difficult. The authors of [1] wrote that "Besides the significant amount of image information to analyze, videos may have a variable number of images and can be described with sentences of different length. Furthermore, the descriptions of videos use to be high-level summaries that not necessarily are expressed in terms of the objects, actions, and scenes observed in the images. There are many open research questions in this field requiring deep video understanding. Some of them are how to efficiently extract important elements from the images (e.g. objects, scenes, actions), to define the local (e.g. fine-grained motion) and global spatio-temporal information, determine the salient content worth to describe, and generate the final video description. All these specific questions need the attention of computer vision, machine translation and natural language understanding communities in order to be solved".

One of our goals is the construction of a system that optimizes the number of tags describing video resources without any loss of sense. Using this textual information, a user can, on the one hand, locate a specific video and, on the other hand, rapidly comprehend its basic points. The specificity of this problem lies in the fact that we obtain many tags from the auto-tagging program, and keywords are needed both for the whole video and for the shots of that video, because a user may need to find a fragment of some video. Therefore, this task is divided into two subtasks: 1) the optimization of keywords for the whole video file; 2) the aggregation of tag sets for separate shots.

The paper is organized as follows. In Section 2, we review the most recent related work on tag aggregation, and in Section 3 we describe Imagga's auto-tagging program that we use. The general approach, our methods for tag aggregation and the results are analyzed in Section 4. Section 5 concludes the paper.

2 Background and Related Work

The last three years have seen increased interest in the field of generating descriptions and keywords/tags for images. This task involves both large companies like Google and Microsoft, and smaller ones profiled on certain domain areas, such as Clarifai (clarifai.com), Imagga (imagga.com), etc. Preliminary studies and prerequisites for such intensive development have also appeared in this area, for example, special collections of images such as ImageNet and Microsoft COCO. All this made it possible that in 2014 research scientists on the Google Brain team trained a machine learning system to automatically produce captions that accurately describe images. Further development of that system led to its success in the Microsoft COCO 2015 image captioning challenge, a competition to compare the best algorithms for computing accurate image captions, where it tied for first place. In the paper "Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge" [2], published in IEEE Transactions on Pattern Analysis and Machine Intelligence, the authors showed some successful examples.
Today the system works with 94% accuracy, which is a very good result.

Microsoft also has excellent results in this field. The company presents some results of image auto-tagging and image captioning. As can be seen in Fig. 1, each image has a Description field. This field consists of a tag set and a caption with a confidence value. The Tags field also contains important tags with their corresponding confidence values.

In work [3], the authors present a survey of models, datasets and evaluation measures for automatic description generation from images. The paper presents approaches both for auto-tagging and for description generation, for example, the system of Kulkarni et al. for labeling images and generating sentences. They wrote that all models in this category achieve this using the following general pipeline architecture:

1. Computer vision techniques are applied to classify the scene type, to detect the objects present in the image, to predict their attributes and the relationships that hold between them, and to recognize the actions taking place.
2. This is followed by a generation phase that turns the detector outputs into words or phrases. These are then combined to produce a natural language description of the image, using techniques from natural language generation (e.g., templates, n-grams, grammar rules).

In paper [4], the authors propose a tag-based framework that simulates human abstractors' ability to select significant sentences based on key concepts in a sentence, as well as the semantic relations between key concepts, to create generic summaries of transcribed lecture videos. Their extractive summarization method uses tags (viewer- and author-assigned terms) as key concepts. They use Flickr tag clusters and WordNet synonyms to expand tags and to detect the semantic relations between tags. This method can select sentences that have a greater number of semantically related key concepts.

Fig. 1. Examples from Microsoft (https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/).

3 Auto-Tagging of Images

The company IMAGGA (imagga.com) has developed an original technology for image auto-tagging with English keywords. The technology is based on machine learning and assigns to each image a set of keywords depending on the shapes that are recognized in the image. For each learned item the system "sees" in an image, appropriate tags are suggested. In addition, the system proposes more tags based on multiple models that it has learned. They relate the visual characteristics of each image to the associated tags of similar images in ImageNet or big external manually created data sets (e.g. Flickr). The intuition and motivation are that more tags serve better in searching, because users may express their requests with different wordforms. The platform developers believe they have found the right practical way to offer the best possible image annotation solution for many use cases.

Fig. 2. Imagga's auto-tagging platform with automatically generated tags and their relevance scores: lake 100%, glacier 81.05%, mountain 68.47%, landscape 52.97%, forest 41.49%, water 36.14%, snow 34.98%.

Quite often, when the image contains a close-up object, Imagga's platform correctly assigns the most relevant tags to the central object (Fig. 2). In the right part of Fig. 2, the keywords are ordered according to their relevance scores. Associating external tags imports additional keywords into the annotation of Imagga's images.
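For reference, the following minimal sketch shows how such tags with relevance scores can be retrieved programmatically. It assumes Imagga's v2 REST tagging endpoint with HTTP Basic authentication (API key and secret) and a particular response layout; these details are assumptions and should be checked against the current Imagga documentation, not a description of the exact setup used in our experiments.

```python
import requests

# Assumed endpoint of Imagga's tagging API (check the current documentation).
IMAGGA_TAGS_URL = "https://api.imagga.com/v2/tags"

def get_tags(image_url, api_key, api_secret, min_confidence=0.0):
    """Request auto-tags for an image and return (tag, confidence) pairs."""
    response = requests.get(
        IMAGGA_TAGS_URL,
        params={"image_url": image_url},
        auth=(api_key, api_secret),  # HTTP Basic auth with key/secret
    )
    response.raise_for_status()
    data = response.json()
    # Assumed response shape: {"result": {"tags": [{"confidence": ..., "tag": {"en": ...}}]}}
    return [
        (item["tag"]["en"], item["confidence"])
        for item in data["result"]["tags"]
        if item["confidence"] >= min_confidence
    ]

# Example usage (hypothetical credentials and image URL):
# print(get_tags("https://example.com/shot_001.jpg", "API_KEY", "API_SECRET"))
```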
4 Experiments and Results

We used five fragments of films for our experiments: Batmobile, FC Barcelona, Hunger Games, Meghan Trainor and Remi Gaillard. All these films were divided into shots. The structure of these files is shown in Table 1. We obtained sets of tags for all videos using the auto-tagging program of the Imagga company.

Table 1. Information about the test data sets.

Name of film       Number of shots   Number of tags
"Batmobile"        24                1,524
"FC Barcelona"     57                1,570
"Hunger Games"     60                1,555
"Meghan Trainor"   154               6,161
"Remi Gaillard"    58                1,936

4.1 Preprocessing Stage

The authors of paper [5] presented an approach for the refinement of image tags. Such cleaning of tags is very necessary for images from social photo services. We used only part of these methods [5] for tag aggregation. First, we need to delete the duplicate tags; the next step is to tackle plural and inflected forms.

At the first stage, we remove duplicates. There are quite a lot of them: from 68% to 92% of all video tags. Then we process phrases which are found among the keywords. For example, if we have tags such as jelly, fish and jelly fish, then we keep only the jelly and fish tags and remove the jelly fish tag from the tag set. This choice is based on the fact that single tags have a higher score and are therefore more relevant to the image. At the next step, we remove inflected forms, such as playing, played etc. For English, this can be done using the popular Porter stemmer, which makes it possible to remove repeated words. At the end of this process, we delete the keywords which characterize color (e.g., blue, red etc.). They occur quite often, but they are not needed for video tagging. The effect of our refinement of image tags is described below; the whole pipeline is summarized in the sketch that follows.
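The sketch below illustrates the preprocessing steps just described: duplicate removal, dropping multi-word tags whose component words are already present as single tags, merging inflected forms via Porter stemming, and removing color words. It is a minimal illustration rather than the exact implementation used in our experiments; the tag-list format, the color stop-list and the use of the NLTK Porter stemmer are assumptions.

```python
from nltk.stem import PorterStemmer

# Assumed color stop-list; the real list may differ.
COLOR_WORDS = {"red", "blue", "green", "yellow", "black", "white", "orange"}

def refine_tags(tags):
    """Refine a list of (tag, score) pairs along the lines of Section 4.1."""
    stemmer = PorterStemmer()

    # 1. Remove exact duplicates, keeping the highest score for each tag.
    best = {}
    for tag, score in tags:
        if tag not in best or score > best[tag]:
            best[tag] = score

    # 2. Drop multi-word tags whose individual words are already present
    #    as single tags (e.g. "jelly fish" when "jelly" and "fish" exist).
    single = {t for t in best if " " not in t}
    for tag in list(best):
        words = tag.split()
        if len(words) > 1 and all(w in single for w in words):
            del best[tag]

    # 3. Merge inflected forms via Porter stemming ("playing", "played" -> "play"),
    #    keeping the highest-scoring surface form for each stem.
    by_stem = {}
    for tag, score in best.items():
        stem = " ".join(stemmer.stem(w) for w in tag.split())
        if stem not in by_stem or score > by_stem[stem][1]:
            by_stem[stem] = (tag, score)

    # 4. Remove color words, which are not needed for video tagging.
    return [(t, s) for t, s in by_stem.values() if t not in COLOR_WORDS]

# Example:
# refine_tags([("car", 100.0), ("cars", 40.0), ("jelly fish", 30.0),
#              ("jelly", 35.0), ("fish", 33.0), ("blue", 25.0)])
```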
Fig. 3 shows the results for the files Batmobile, FC Barcelona, Hunger Games, Meghan Trainor and Remi Gaillard. The graphs show that the preprocessing stage alone is sufficient to significantly reduce the number of tags that describe the whole video. In addition, each tag has a relevance score, which allowed us to evaluate its effect on the number of tags. In [6] it was found that tags with a score greater than or equal to 20 are the most significant for the image. We therefore selected the keywords whose score ranges from 20 to 100. Thus, we have 44 tags for the file "Batmobile", 32 for "FC Barcelona", 70 for "Hunger Games", 57 for "Meghan Trainor" and 68 for "Remi Gaillard". These tags are presented in Tables 2, 3, 4, 5 and 6.

We think that these results allow us to consider the first task to be solved at a fairly good level. But while the first task, concerning the total number of tags for the whole video file, is clear, the task of tag aggregation for separate shots cannot be solved so simply, because there are no repetitions.

Consider the results for the movie "Batmobile" (Fig. 4). The preprocessing stage showed that for the video "Batmobile" we have 1,524 tags over all shots, and after all filters we obtain 1,399 tags, i.e. a reduction of only 8% of the total. Each color represents one shot, and for the file "Batmobile" we have 24 shots. We analyzed these results and decided to use only tags with a relevance score > 20%, but we obtained the results shown in Fig. 5. This figure shows that there are 6 shots that have no tags with a score > 20% at all. This is not good, since all fragments must have keywords. Of the remaining 18 shots, 11 have fewer than 10 tags, which is also too few.

All this showed that in this case we cannot use the score value as a defining characteristic. It also affects the generation of a common set of tags for the whole video, since some shots would not be represented in the final set. The low scores are primarily due to the way the algorithm of the Imagga auto-tagging program works. However, it is getting better every year, so we think that in the future all shots will have tags with a high score.

Fig. 3. Quantity of tags that are considered redundant after the preprocessing stage.

Table 2. Set of tags with score >= 20% for the "Batmobile" file.

Tag Score | Tag Score | Tag Score
car 100.00 | businessman 32.00 | team 23.53
man 40.94 | convertible 31.88 | wheel 23.39
container 39.63 | automobile 31.82 | sitting 23.17
petri dish 39.46 | happy 29.57 | meeting 22.93
vehicle 38.84 | professional 28.92 | job 22.57
dish 38.46 | adult 28.75 | success 22.22
people 37.02 | smiling 28.14 | attractive 21.86
business 35.38 | person 27.89 | women 21.66
auto 34.56 | corporate 27.45 | speed 21.64
businesswoman 34.22 | work 26.39 | tow truck 21.10
wheeled vehicle 33.65 | transportation 26.21 | drive 20.82
male 33.47 | limousine 26.18 | beaker 20.64
truck 33.40 | crockery 24.66 | suit 20.32
office 33.19 | portrait 24.45 | teamwork 20.06
caucasian 32.57 | businesspeople 24.02 |

Table 3. Set of tags with score >= 20% for the "FC Barcelona" file.

Tag Score | Tag Score | Tag Score
swimsuit 70.54 | caucasian 29.42 | model 22.69
king 49.68 | structure 28.36 | summer 22.50
grass 44.63 | people 27.00 | happy 22.19
building 43.42 | player 26.51 | golfer 21.56
bikini 42.30 | male 25.56 | architecture 21.30
field 34.65 | adult 24.84 | torch 21.14
beach 34.32 | person 24.79 | body 21.13
man 33.45 | attractive 24.60 | ocean 20.79
greenhouse 32.91 | smiling 24.00 | silhouette 20.16
rival 32.71 | sea 23.33 | maillot 20.14
sexy 29.98 | art 23.23 |

Table 4. Set of tags with score >= 20% for the "Hunger Games" file.

Tag Score | Tag Score | Tag Score
wheat 100.00 | rural 29.18 | hair 23.51
curtain 100.00 | summer 29.00 | adult 23.50
blind 95.77 | water 28.72 | landscape 23.43
furnishing 88.92 | farm 28.06 | icon 23.33
shower curtain 82.75 | plant 27.66 | straw 22.96
cereal 71.41 | minaret 27.61 | structure 22.85
protective covering 67.64 | portrait 27.30 | obstruction 22.53
fence 58.38 | face 26.71 | container 22.35
picket fence 56.33 | blond 26.46 | sign 22.28
field 45.64 | seed 26.25 | design 22.24
crystal 41.96 | fountain 26.20 | male 22.20
ice 40.13 | building 26.10 | river 21.71
covering 39.68 | attractive 25.98 | natural 21.53
menorah 39.44 | source of illumination 25.31 | solid 21.45
candle 38.91 | candlestick 24.47 | glass 21.43
dam 37.31 | crop 24.45 | sky 21.40
grain 37.13 | man 24.25 | symbol 21.25
barrier 34.24 | person 24.20 | model 21.15
clock 34.16 | corn 24.18 | coral fungus 21.14
photograph 32.18 | people 24.11 | caucasian 21.04
candelabrum 31.56 | timepiece 23.97 | bread 20.59
agriculture 30.35 | pretty 23.74 | looking 20.21
harvest 29.87 | cleaning implement 23.70 |
mosquito net 29.86 | suit 23.64 |
Table 5. Set of tags with score >= 20% for the "Meghan Trainor" file.

Tag Score | Tag Score | Tag Score
art 63.63 | plaything 28.47 | blind 23.68
blond 56.92 | happy 28.39 | cute 23.47
candle 50.18 | smile 28.37 | body 22.93
dress 47.26 | makeup 28.35 | quill 22.58
toiletry 46.45 | shower curtain 27.80 | lady 22.44
nipple 41.64 | gymnastics 27.71 | eyes 22.31
attractive 36.84 | pajama 27.70 | nightwear 22.15
hair 34.83 | source of illumination 27.41 | clothing 21.98
portrait 33.98 | gown 27.03 | letter opener 21.68
face 32.89 | silhouette 26.91 | light 21.42
cap 32.28 | sexy 26.32 | net 20.74
model 32.22 | hair spray 25.90 | expression 20.52
pretty 31.89 | swing 25.19 | man 20.51
adult 31.36 | complexion 24.88 | cheerful 20.45
curtain 30.82 | fashion 24.79 | mechanical device 20.40
person 30.61 | lipstick 24.13 | sunset 20.30
people 29.73 | top 24.05 | human 20.19
caucasian 29.06 | swimsuit 23.99 | lips 20.05
texture 28.54 | furnishing 23.71 | pen 20.03

Table 6. Set of tags with score >= 20% for the "Remi Gaillard" file.

Tag Score | Tag Score | Tag Score
curtain 87.58 | maillot 34.72 | sexy 23.96
shower curtain 75.19 | seaside 33.90 | coastline 23.79
swimsuit 73.50 | truck 33.24 | van 23.76
beach 68.46 | coast 31.06 | outdoors 23.75
furnishing 68.17 | shepherd dog 31.03 | grass 23.47
blind 67.44 | sky 30.84 | structure 23.18
shore 58.56 | pool 30.50 | minibus 22.33
swimming 47.97 | fountain 29.09 | adult 22.30
sea 47.63 | swimming trunks 29.03 | sport 22.27
protective covering 47.01 | walk 29.00 | road 21.93
vessel 46.64 | landscape 28.85 | vehicle 21.68
ship 45.68 | summer 28.41 | jump 21.49
car 44.25 | vacation 28.33 | art 21.45
boat 44.13 | fireboat 28.28 | ball 21.30
sand 42.51 | german shepherd 28.23 | swing 20.98
dune 42.10 | travel 27.98 | bus 20.92
bikini 41.88 | dog 27.50 | person 20.92
screen 40.04 | male 27.09 | minivan 20.92
ocean 39.66 | man 26.95 | fun 20.70
golf 39.63 | covering 26.03 | holiday 20.50
garment 37.09 | surfing 25.51 | people 20.36
water 36.47 | sun 24.26 | seashore 20.10
door 35.92 | clothing 24.08 |

Fig. 4. Quantity of tags that are considered redundant after the preprocessing stage for each shot of the Batmobile film.

Fig. 5. Quantity of tags for each shot of the Batmobile film.

In the future, we plan to experiment with Nasari vectors, SensEmbed vectors or word2vec for the aggregation of keywords for a single shot, in the direction sketched below.
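As an illustration of this future direction (not part of the experiments reported here), the sketch below shows how pre-trained word2vec embeddings, loaded for example through gensim, could be used to merge semantically similar tags within a single shot. The model name, the similarity threshold and the greedy merging strategy are assumptions for illustration only.

```python
import gensim.downloader as api

# Load pre-trained word2vec vectors (an assumed, publicly available model).
model = api.load("word2vec-google-news-300")

def aggregate_shot_tags(tags, similarity_threshold=0.6):
    """Greedily merge semantically similar tags of one shot, keeping the
    highest-scoring representative of each group of similar tags."""
    # Sort tags by score so that the strongest tag represents its group.
    remaining = sorted(tags, key=lambda ts: ts[1], reverse=True)
    representatives = []
    for tag, score in remaining:
        if tag not in model:  # keep out-of-vocabulary tags as they are
            representatives.append((tag, score))
            continue
        is_similar = any(
            rep in model and model.similarity(tag, rep) >= similarity_threshold
            for rep, _ in representatives
        )
        if not is_similar:
            representatives.append((tag, score))
    return representatives

# Example (hypothetical shot tags):
# print(aggregate_shot_tags([("car", 100.0), ("auto", 34.6), ("vehicle", 38.8),
#                            ("people", 37.0), ("person", 27.9)]))
```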
5 Conclusion

In this work we i) analyzed systems which generate tags/descriptions for video and images; ii) compared the results of related works; iii) explored methods for generating natural language descriptions for images/video; iv) created a short overview of the generation of video and image descriptions and explored the main problems of this task. We concentrated on the problem of aggregating tags (keywords) into a single description of the object. Multimedia collections integrate electronic text, graphics, images, sound, and video. Their objects are usually annotated by keywords which characterize, describe or refer to categories in certain classifications. These tags help to distinguish the objects and often form folksonomies: user-generated categories for organizing digital content. In this work, we showed how the preprocessing stage works for the optimization of keyword sets for video fragments, using NLP techniques and lexical resources for tag aggregation. We presented statistical information about our experiments and results.

References

1. Peris, A., Bolaños, M., Radeva, P., Casacuberta, F.: Video Description Using Bidirectional Recurrent Neural Networks. In: Proceedings of the International Conference on Artificial Neural Networks ICANN 2016: Artificial Neural Networks and Machine Learning, pp. 3-11, Barcelona, Spain (2016). https://doi.org/10.1007/978-3-319-44781-0_1
2. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 652-663 (2017)
3. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby talk: Understanding and generating simple image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2891-2903 (2011)
4. Kim, H.H., Kim, Y.H.: Generic Speech Summarization of Transcribed Lecture Videos: Using Tags and Their Semantic Relations. Journal of the Association for Information Science and Technology 67(4), 366-379 (2016)
5. Kanishcheva, O., Angelova, G.: A Pipeline Approach to Image Auto-Tagging Refinement. In: Proceedings of the 7th Balkan Conference on Informatics, Craiova, Romania, ACM, New York, NY, USA (2015). https://doi.org/10.1145/2801081.2801108
6. Kanishcheva, O., Angelova, G.: About Emotion Identification in Visual Sentiment Analysis. In: Proceedings of the 10th International Conference "Recent Advances in Natural Language Processing" RANLP 2015, pp. 258-265, Hissar, Bulgaria (2015)