UDC 004.855

DEVELOPING A SEMANTIC IMAGE MODEL USING MACHINE LEARNING BASED ON CONVOLUTIONAL NEURAL NETWORKS

Philip Andon a[0000-0001-6546-0826], Andrii Hlybovets b[0000-0003-4282-481X], Volodymyr Kuryliak b

a Institute of Software Systems of the National Academy of Sciences of Ukraine, 03187, Kyiv, 40 Akademika Glushkova Avenue
b National University "Kyiv-Mohyla Academy", 2 Skovorody vul., Kyiv 04070, Ukraine

This paper describes the main directions of research in the field of building models for automated computer recognition of the content of a digital image. The concept of a semantic image model is introduced, and the implementation of a machine learning model that automatically builds such a model for an input image is described. The semantic model consists of a list of the objects shown in the image and their relationships. The developed model was compared with other solutions to the same problem and showed better results in all but one case. The efficiency of the model rests on the use of recent advances in machine learning, in particular CNNs, TL, and the Faster R-CNN and VGG16 models. A significant part of the relationships represented in an image are spatial relationships, so for the model to perform better this fact needs to be exploited in its design, which was done.
Key words: semantic image model, machine learning, computer vision, convolutional neural networks, relationships in an image.

Understanding (interpretation) of images is one of the most pressing tasks of artificial intelligence. Image understanding has a wide range of possible applications. Visual information is one of the most popular and widespread types of information. As its volume increases, automating its processing becomes extremely relevant. Simple tasks, such as automatic annotation of images or search for similar images, can be considered solved, but in terms of a truly deep understanding of an image, modern methods are far from ideal. In recent years, much research has been done to bring the level of image understanding by automated systems closer to that of humans, in particular, to solve problems such as creating a text description or finding objects in an image. However, in order for a computer to fully interpret ("understand") the content of an image (what is represented in it), it is necessary to obtain a formal structured representation of all the information contained in the image.
The format and structure of such a representation require research. We call this representation a semantic image model. This paper describes the main areas of research in the field of developing computer models for computer vision and introduces the concept of a semantic image model. In developing the semantic image model, recent results in computer vision, such as CNNs, TL, Faster R-CNN, and VGG16, were used. The semantic model consists of a list of objects represented in the image and their relationships. These relations are represented as triplets consisting of an object, a relation, and a subject. Our model receives an image as input and produces a list of such triplets as output. The neural network was trained on the Visual Genome collection. As a quality evaluation metric, we used "recall at K". The performance of the model rests on the use of the latest work in this area. A large proportion of the relations between objects in an image are spatial; this fact was used during model development.
Key words: semantic image model, machine learning, computer vision, convolutional neural networks, image links.

Introduction

Understanding (interpretation) of images is nowadays one of the most important tasks of artificial intelligence [1]. The interest of many researchers in this topic is driven by the wide range of its possible applications. Visual information is one of the most popular and common types of information, on the Internet in particular. As the amount of visual information increases, the issue of its automatic processing becomes extremely relevant. Such simple tasks as automatic annotation of images or search for similar images can be considered solved [2, 3, 4, 5]; however, in terms of a truly deep understanding of images, modern methods are far from perfect. In recent years, much research has been done to bring the level of image understanding by automated systems closer to that of humans, in particular, to solve such problems as creating a text description or finding objects in an image [1, 5]. However, in order for a computer to fully interpret ("understand") the content of an image (i.e. the objects represented in it), it is crucial for it to receive a formal structured representation of all the information the image contains. The format and structure of such a representation require research. We name the result of this task a semantic image model. A semantic image model includes a list of the objects represented in the image and their relationships. Therefore, the aim of this paper is to present our vision of the automated generation of image descriptions. We introduce the term "semantic model" to formalize the representation of the content of an image. The word "semantic" means that such a representation should help to reveal the essence of the image, based on an approach similar to the mechanism of human interpretation of images. The word "model" sets requirements for the clarity and formalization of the structure of the representation. These remarks are important because our goal is to automate computer image interpretation. There are several approaches aimed at solving the problem of "understanding images" or partially related to it.
Among the main lines of research in the field of developing automated models for computer recognition of the content of digital images are the related tasks of image classification [6], textual description of images [7], and determining relationships in an image [8]. Recent studies have focused on promising approaches to determining relationships in an image: "Recognition Using Visual Phrases" [9], "Visual Relationship Detection with Language Priors" [10], and the "Deep Relational Network" [8]. The latter is designed specifically to exploit, in machine learning based on convolutional neural networks, the statistical interdependence between objects and their relationships in an image. In this paper, we articulate our approach to solving the problem of building a semantic image model. The model is machine learning (ML) oriented: it must receive an image as input and return a set of subjects, objects, and their relationships as output.

Development of a semantic model

In ML, expert knowledge of the subject area allows one to better understand the problem and arrive at the best possible solution. That is why model design begins with data analysis and the search for suitable data collections. We, too, start by finding the right data collection.
Training set. In order to solve our task, we used the Visual Genome collection [11]. It was created and has been successfully applied to various computer vision challenges related to improving the "understanding" of images. On the official website [12] one can find training collections, in particular, for such tasks as finding objects, defining relationships between them, and creating a text description. The collection contains about 108,000 images and provides, for each image, a list of the objects present in it, their positions, and the relationships between them. Therefore, this data is exactly what is necessary to solve our task. The Visual Genome data format is rather redundant; it contains a lot of noise and, by and large, is not easy to work with. Therefore, many researchers use modifications of this collection or convert it to another format. We used the format proposed in [8], in which all the data is divided into training and test sets. Lists of object classes and of relationships between objects are provided separately. For each image, there is its location (img_path); the classes of the objects depicted in it (classes); the indices of the objects that act as subjects (ix1) and as objects (ix2); bounding rectangles for each localized object (boxes); and a relationship class for each subject-object pair (rel_classes). This information can be visualized by displaying an image with all its localized objects, their classes, and all the relationships between them.
Convolutional neural networks. In this work, we use convolutional neural networks (CNNs) with transfer learning (TL). CNNs were developed to find optimal solutions to image processing challenges using neural network methods. The model we used in this work is VGG16 [3]. It is a CNN designed to classify the ImageNet collection into 1000 classes, with an error rate of only 7.5%. This result was largely ensured by the use of a deep architecture: VGG16 contains 16 weight layers, which is a lot, even by today's standards.
Over time, VGG16 has proven easy to adapt to new tasks and has become the de facto standard for TL in computer vision. After VGG16, several other networks were proposed that showed even better results on the same task, but they are less suitable for TL, in part because the interpretation of their results can sometimes be quite difficult. VGG16 is included in many machine learning libraries and frameworks, which makes it significantly easier to apply. PyTorch provides a special mechanism for loading data when working with machine learning models, namely the Dataset class [13]. In the course of our work, we used this mechanism for data loading as well as for image normalization, which is a required element for working with the model [14]. To test the model, one can feed it an image, e.g. from the Visual Genome training set. For this, the model needs to be set to evaluation mode. The result of the model's work is an estimate of the probability that the image belongs to each class. Accordingly, it is necessary to find the position of the largest element and determine which class corresponds to that index. As the model was designed to classify whole images, we receive one value, not several, as is the case in the Visual Genome training set. It is also worth mentioning that VGG16 uses a different set of classes (the ImageNet classes); therefore, in order to obtain the classes of objects used in the training set, it has to be retrained.
Adding a RoI Pool layer. The first modification we made to the VGG16 model was the addition of a RoI Pool layer. RoI Pool (a region-of-interest pooling layer) is a pooling layer of a CNN with a fixed output size (a parameter of the layer) applied to a given region of the previous layer [5]. It was proposed in the Faster R-CNN model [15]. This layer makes it possible to extract features from a specific region of a layer and thus to localize objects precisely. In Faster R-CNN, it was used for object detection. We, however, need to predict the classes of relationships, so the regions in which these relationships are found can be potentially useful. The question is how to determine the region in which a relationship is located. To answer it, one can first analyze a simpler problem. Knowing the positions of the objects (from the training set), we can apply the RoI Pool layer to them. If we use a fixed number of objects per image, say k, then applying this layer results in k sets of features, one per region. So, if earlier the model returned the classes of one object at its output, now it can return the classes of k objects. The VGG16 model returns the result as a matrix of size (1, 1000), as there are 1000 classes for which it was trained. For each of these classes, we obtain the probability that this class is the class of the object contained in the image. Now that the model handles k objects, the result has the dimension (k, 1000). However, we need to predict the classes from our training set, not the ImageNet classes. Therefore, in the model, we need to replace the last fully connected layer with a layer that returns the correct number of classes and retrain the model.
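To make this modification concrete, the following minimal sketch (assuming PyTorch and torchvision; the class name RoiVGG16 is illustrative, and this is not the paper's Listing 1) keeps the pretrained VGG16 feature layers frozen, pools a 7x7 template from the feature map for each object box, and replaces the last classifier layer so that it outputs scores for the obj_num object classes of the training set.

import torch.nn as nn
from torchvision import models, ops


class RoiVGG16(nn.Module):
    """VGG16 adapted for per-object classification via RoI pooling (sketch)."""

    def __init__(self, obj_num):
        super().__init__()
        vgg = models.vgg16(pretrained=True)          # ImageNet weights
        for p in vgg.parameters():                   # freeze all pretrained weights
            p.requires_grad = False

        self.features = vgg.features                 # convolutional layers, unchanged
        self.classifier = vgg.classifier             # the three FC layers
        # the new last layer is the only trainable one (requires_grad=True by default)
        self.classifier[-1] = nn.Linear(self.classifier[-1].in_features, obj_num)

    def forward(self, image, rois):
        # image: (1, 3, H, W); rois: (k, 5) rows of (batch_index, x1, y1, x2, y2)
        fmap = self.features(image)                  # (1, 512, H/32, W/32)
        pooled = ops.roi_pool(fmap, rois,
                              output_size=(7, 7),    # same template as VGG16's pool
                              spatial_scale=1.0 / 32)  # VGG16 downsamples by 32
        return self.classifier(pooled.flatten(1))    # (k, obj_num) class scores

A forward pass with one image and k boxes then yields a (k, obj_num) matrix of class scores, matching the dimensions discussed above; note that each box row starts with a batch index, which is why a 0 is prepended to every bounding box in the data-loading code described below.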
It is worth noting that we leave all weights of the model, except for the weights of this last layer, as they were in the trained VGG16 model and do not optimize them (do not back-propagate the error through them), since these weights correspond to layers that are not directly related to classification but are responsible for feature extraction. We replaced the existing Adaptive Average Pool layer with a RoI Pooling layer, so that most elements of the model remain unchanged. In order not to have to change the FC layers that follow the pooling layer, we also used the same template size as in that layer (7x7). All this allowed us to obtain a new model for our problem with little effort. Since we are changing the forward pass through the network, we had to create a new model class and redefine its forward method. The code for this model is shown in Listing 1.

Listing 1. Network using VGG16 with the help of TL

For the model to work, one first needs to create all the layers so that they can be used during the forward pass. Therefore, in the constructor, we first get the VGG16 model with trained weights (line 5) and turn off gradient propagation for all its parameters (lines 6-7). We leave the set of feature layers unchanged (line 9) and add a new RoI Pool layer with the required template size (line 10). We change only the last layer in the classifier set (lines 12-14). The changed layer has a number of outputs equal to the number of object classes in the training set (obj_num), which is passed to the constructor. It is worth mentioning that this last layer of the classifier set is the only one whose requires_grad parameter equals true (this is the default value), so it is the only layer whose weights will be optimized.

Listing 2. Updated class for data loading after changes to VGG16

In addition, it is necessary to modify the class that loads the training set, since it now also has to return the positions of the objects (bounding rectangles). The modified form is shown in Listing 2. It was necessary to prepend a 0 to each list of bounding box coordinates (lines 31-33); this is required by the way the RoI Pool layer works with object positions. The training set was divided into two parts: a training one and a test one. For this purpose, the class defined in Listing 2 and the DataLoader class were used. The code is shown in Listing 3.

Listing 3. Dividing the training set into a training one and a test one

With all the necessary elements in place, it is possible to start the learning process. To do this, we defined a function that can both train and evaluate, depending on which part of the training set it receives. This approach was adapted from the official recommendations [16]. The function computes the values produced by the forward pass through the network as well as the losses, performs the backward pass of the loss, and optimizes the parameters. The rest of the code is responsible for checking the training phase, tracking execution time, accumulating losses, and saving the model that showed the best result.
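Since the original listings are not reproduced in this text, the following is only a minimal sketch of such a combined train/test function, modeled on the PyTorch transfer learning tutorial [16]; it is not the authors' code. The loaders dictionary is assumed to wrap the training and test DataLoader objects described above and to yield an image, its object boxes, and the target labels; criterion, optimizer, and scheduler are the parameters discussed next.

import copy
import torch


def train_model(model, loaders, criterion, optimizer, scheduler, device, num_epochs=10):
    """Train on loaders['train'], evaluate on loaders['test'], keep the best weights."""
    best_wts = copy.deepcopy(model.state_dict())
    best_loss = float("inf")

    for epoch in range(num_epochs):
        for phase in ("train", "test"):
            if phase == "train":
                model.train()
            else:
                model.eval()
            running_loss = 0.0

            for image, rois, targets in loaders[phase]:
                image, rois, targets = image.to(device), rois.to(device), targets.to(device)
                optimizer.zero_grad()

                # gradients are only needed in the training phase
                with torch.set_grad_enabled(phase == "train"):
                    outputs = model(image, rois)       # forward pass
                    loss = criterion(outputs, targets)
                    if phase == "train":
                        loss.backward()                # backward pass of the loss
                        optimizer.step()               # update the trainable weights

                running_loss += loss.item()

            if phase == "train":
                scheduler.step()                       # adjust the learning rate
            elif running_loss < best_loss:             # keep the best model so far
                best_loss = running_loss
                best_wts = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_wts)
    return model

In this sketch the training/test split itself could be produced, for example, with torch.utils.data.random_split and wrapped in DataLoader objects, in the spirit of Listing 3.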
The function has several parameters that have not yet been mentioned. The first one is the loss function, criterion. Since we needed to predict multiple classes, we used MultiLabelMarginLoss, an SVM-style margin loss for multiple labels. The optimizer and the scheduler are the objects that optimize the network weights after the back-propagation of the error: the optimizer adds the gradient to the weights with a certain step, and the scheduler changes the learning rate to increase the efficiency of learning. The code to run the training process is shown in Listing 4.

Listing 4. Running the model's training

Adding the remaining layers. Returning to the question of obtaining relationship classes, it is necessary to summarize what we have at the input, what we have during the operation of the network now, and what should be at the output. Since the model receives objects at its input, and we want to get the classes of the relationships between them at the output, the target data can be represented as a relationship for each pair of objects. In order to be able to conduct optimization, we need to encode the relationship classes as one-hot vectors. This means that if we have k objects and n possible classes of relationships, the result is a matrix of dimension k x (k-1) x n. The advantage of this approach is that we again arrive at a multiclass classification task and, consequently, can reuse the experience of the previous section. However, there are some difficulties. We used the RoI Pool layer to obtain visual features for each object, but now we need visual features for each possible relationship between objects. To solve this problem, we use the combined features of a subject and an object. To obtain them, we take the region jointly covered by the bounding boxes of the subject and the object, i.e. the smallest box enclosing both of them. This is a simple operation; it must be performed for each pair of distinct objects (an object in the training sample cannot have a relationship with itself), and in this way we get a bounding box for each relationship we need to predict. Following the VGG16 architecture, there should be 3 FC layers after the pooling layer so that the model can be trained to classify relationships. After that, a final FC layer can be added whose output dimension equals the number of relationship classes, and we again obtain a multiclass classifier. The performance of the proposed model can be improved if we use spatial features in addition to visual ones. We borrowed this idea from the previously mentioned work [8] and adapted it to our solution. Adding spatial information makes the network's job easier, because a large part of the relationships are spatial ("under", "on", "near", etc.), and by distinguishing them explicitly we simplify their correct identification. It is difficult to identify these types of relationships using only the visual features of the combined region of a subject and an object. Therefore, by using a new type of information, we improve the performance of the model. There are several ways to represent spatial relationships between objects, and for us the most suitable one was their representation in the form of masks. Image masks for the subject and the object are created so that only the pixels inside the bounding box of each of them have a value of 1, and the rest equal 0. The two masks obtained in this way serve as a representation of the spatial relationship.
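As a small illustration of these two ingredients, the sketch below shows how the enclosing box of a subject-object pair and the corresponding two-channel mask could be computed, assuming (x1, y1, x2, y2) box coordinates; the helper names and the 32x32 mask size are illustrative, not taken from the paper.

import torch


def union_box(subj_box, obj_box):
    """Smallest box enclosing both the subject and the object box."""
    return (min(subj_box[0], obj_box[0]), min(subj_box[1], obj_box[1]),
            max(subj_box[2], obj_box[2]), max(subj_box[3], obj_box[3]))


def spatial_masks(subj_box, obj_box, img_w, img_h, size=32):
    """Two stacked size x size masks: 1 inside the (rescaled) box, 0 elsewhere."""
    masks = torch.zeros(2, size, size)
    for channel, (x1, y1, x2, y2) in enumerate((subj_box, obj_box)):
        # rescale the box from image coordinates to the mask grid
        c1, r1 = int(x1 / img_w * size), int(y1 / img_h * size)
        c2, r2 = int(x2 / img_w * size), int(y2 / img_h * size)
        masks[channel, r1:max(r2, r1 + 1), c1:max(c2, c1 + 1)] = 1.0
    return masks  # (2, size, size): the two-channel input for the spatial layers

For k localized objects, iterating over all ordered pairs of distinct objects gives the k·(k-1) candidate relationships: their enclosing boxes feed the RoI Pool branch for visual features, and their mask pairs feed the convolutional branch for spatial features, which the additional FC layer described below then combines.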
The advantage of this method is that the masks can easily be presented as an input layer to a CNN, so that the strengths of CNNs can be used to train the model to determine spatial relationships. Since the input layer contains both masks, its depth is 2 and all convolution operations act on both channels, so the fact that the two boxes share a common area is taken into account. To design the convolutional layers used for the spatial features, we again took the VGG16 convolutional layers as a basis; there were 3 of them, followed by one FC layer. Now that we have two kinds of features, we need to combine them. The standard solution is to add another fully connected layer that receives both sets of features and combines them. This means that before the last FC layer mentioned earlier there is another FC layer, which combines the outputs of the layers for visual and spatial features. To train the final model, the same code that was used to train the intermediate model was applied.

Results

To assess the quality of the developed model, we used the "recall at K" metric. As already mentioned, it was proposed in [10] and has become the standard for this class of tasks. It determines the proportion of correctly predicted values among the best K predictions. Typically, the values 50 and 100 are used for K; thus, we get two metrics: recall at 50 and recall at 100. The developed model receives objects and their positions at the input and returns the relationships between them at the output. To assess the quality of its work according to these metrics, we use the test data set from the Visual Genome collection. However, the task of obtaining both the objects and the relationships between them from an input image alone is much more interesting; moreover, it is one of the conditions we set before developing the model. To solve this problem, we used an existing model for finding objects and their positions and fed its results as input data to our model. Thus, the work of the model consists of two stages. A similar approach was used in a number of other works [8, 9]. We used the Faster R-CNN, mentioned earlier, as the model for finding objects; at present, it achieves state-of-the-art results on such tasks [15]. So, instead of one task, there are two. The first is to predict the relationships between predefined objects; the second is to predict subject-object-relationship triplets. These tasks are called Predicate Detection and Relationship Prediction, respectively. Other works have also suggested solutions to these problems, and we compare our results with them. Figure 1 shows graphs of the dependence of the selected metrics on the number of iterations for each of the tasks. It can be seen that in both cases the highest value was reached at the 5th iteration, after which the model started to overfit.

Figure 1. Dependence of recall on iterations of the Predicate Detection task

Table 1 compares the best results of our model on the test set with the results of research conducted by other authors.
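For clarity, a minimal sketch of how the "recall at K" value described above could be computed for a single image (a hypothetical helper, not the authors' evaluation code; predictions are assumed to be triplets with scores):

def recall_at_k(predictions, ground_truth, k):
    """predictions: list of (score, (subject, relation, object)) pairs;
    ground_truth: list of (subject, relation, object) triplets."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    hits = sum(1 for t in ground_truth if t in top_k)
    return hits / len(ground_truth) if ground_truth else 0.0

Averaging this value over the test images with K = 50 and K = 100 gives figures of the kind reported in Table 1.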
In order to better understand what kind of results the developed model returns and to analyze their features, one can consider some results generated by the model and shown in Table 2. For each image, we give both the result generated with known objects (predicate detection) and without them (relationship prediction). In the second case, the input image first had to be passed to the Faster R-CNN detector, and its results were then sent to the model. For each image, we selected the 20 predicted triplets with the highest value of the evaluation function.

Table 1. Comparing the results of the developed model

            | Predicate Detection          | Relationship Prediction
Model       | Recall at 50 | Recall at 100 | Recall at 50 | Recall at 100
[16] 2015   | 0.97         | 1.91          | -            | -
[4] 2016    | 47.87        | 47.87         | 13.86        | 14.70
[13] 2018   | 80.78        | 81.90         | 17.73        | 20.88
Our model   | 78.12        | 87.53         | 18.56        | 22.05

Table 2. Some examples of the model's work (for each image, the 20 top-scored triplets per task)

Image 1
Predicate Detection: sunglasses - on - person; watch - on - person; jeans - on - person; building - behind - dog; person - wear - jeans; person - wear - sunglasses; person - wear - watch; building - behind - person; glasses - on - person; dog - in the front of - building; person - wear - glasses; person - in the front of - building; bag1 - behind - glasses; bag - behind - glasses; glasses - next to - bag; glasses - next to - bag1; person - hold - bag; watch - near - glasses; person - hold - bag1; bag1 - next to - dog
Relationship Prediction: shirt - on - person; shirt - on - person2; shirt - on - person1; jeans - on - person2; glasses - on - person2; jeans - on - person; glasses - on - person; jeans - on - person1; glasses - on - person1; jacket - on - person2; jacket - on - person1; person1 - wear - jacket; person2 - wear - jacket; jacket - on - person; person - wear - jacket; person1 - wear - jeans; person1 - wear - shirt; person - wear - jeans; person2 - wear - shirt; person2 - wear - jeans

Image 2
Predicate Detection: shirt - on - person; jeans - on - person; shorts - on - person; person - wear - jeans; person - wear - shorts; person - wear - shirt; shirt - above - shorts; cabinet - above - bag; shirt - above - jeans; cabinet - behind - person; shorts - below - shirt; shirt - behind - bag; bag - above - shorts; person - hold - bag; jeans - below - shirt; bag - above - jeans; shorts - next to - bag; person - in the front of - cabinet; bag - below - cabinet; bag - near - shirt
Relationship Prediction: shirt - on - person1; shirt1 - on - person; shirt - on - person; shirt1 - on - person1; pants - on - person1; pants - on - person; person - wear - shirt; person1 - wear - shirt; person - wear - shirt1; person1 - wear - shirt1; person1 - wear - pants; person - wear - pants; shirt - above - pants; shirt1 - above - pants; pants - below - shirt; pants - below - shirt1; person - next to - person1; person1 - next to - person; shirt - behind - shirt1; shirt1 - behind - shirt

Image 3
Predicate Detection: jeans - on - chair2; jeans - on - chair1; jeans - on - chair; chair1 - next to - chair; chair1 - next to - chair2; chair2 - next to - chair; chair - next to - chair2; chair - next to - chair1; chair2 - next to - chair1; bed - near - chair; bed - near - chair2; bed - near - chair1; chair1 - near - bed; chair - near - bed; chair2 - near - bed; bed - has - jeans; jeans - on - bed; chair2 - has - jeans; chair - has - jeans; chair1 - has - jeans
Relationship Prediction: plant - on - table1; plant - on - table; lamp - on - table1; table - on - street; table1 - on - street; chair - on - street; table1 - has - plant; lamp - on - table; chair - next to - table1; chair - next to - table; table - has - plant; lamp - behind - plant; lamp - next to - chair; street - below - lamp; street - under - table; street - under - table1; plant - next to - lamp; table1 - has - lamp; street - under - chair; table - has - lamp

Image 4
Predicate Detection: shoes - on - person; person - wear - jeans; person - wear - hat; person - wear - shoes; jeans - on - person; hat - above - shoes; hat - on - person; jeans - above - shoes; roof - above - person; hat - over - jeans; jeans - near - hat; shoes - beneath - jeans; person - on - elephant; shoes - on the right of - hat; roof - above - shoes; person - under - roof; elephant - next to - person; roof - above - jeans; shoes - on - elephant; roof - above - elephant
Relationship Prediction: shirt1 - on - person1; shirt1 - on - person3; shirt - on - person2; shirt - on - person3; shirt - on - person1; shirt1 - on - person2; shirt - on - person; trees - behind - elephant; trees - behind - person; trees - behind - person2; trees - behind - person3; person2 - wear - shirt1; person - wear - shirt; building - behind - person2; shirt1 - on - person; building - behind - person; person3 - wear - shirt1; person2 - wear - shirt; person - wear - shirt1; building - behind - person

As we can see, the predictions received for the first image in the first task are better: they describe the image in more detail and contain various relationships, such as "hold", "wear", "in front of", "next to", while the results for the second task contain only "on" and "wear". It becomes obvious that using the detector, instead of the data from the training set directly, negatively affects the results. At the same time, such simple predictions can sometimes still indicate good quality, as in the case of images 2 and 3. Here, both tasks were solved properly, although, in general, they too use mostly spatial relationships and the predicate "wear". The kinds of relationships one obtains depend on the type of image. This is also a problem of the training set itself: the "on" and "wear" relationships were the most frequent in it, so the model tends to predict them. Another factor is that we explicitly used spatial features in the model, so it is not surprising that most predictions are related to the positions of the objects. Regarding the third image, it should be added that the developed model is able to distinguish several objects of the same class in one image; in particular, we can see that it successfully found relationships for three chairs. In the last image, it is important to pay attention to the difference between the predictions for the two tasks. In the case of predicate detection, the focus is on the person sitting on the elephant and everything around it, while the results of relationship prediction largely ignore the elephant and instead describe several people in the background. This once again shows that for the second task the quality strongly depends on the quality of the detector's work.

Conclusions

This work introduces the concept of a semantic image model and describes the implementation of a machine learning model for the task of automatically building such a model for an input image. A semantic model consists of a list of the objects represented in an image and their relationships. In addition, this work considered other possible and existing approaches to representing a semantic model. The analysis of these approaches has shown that the relationship-based approach is the most detailed and the most suitable for solving various kinds of tasks related to the understanding of images.
The developed model was compared to other solutions to the same problem and showed better results in all but one case. The efficiency of the model is based on the use of the latest achievements of machine learning, in particular CNNs, TL, and the Faster R-CNN and VGG16 models. On the basis of the model's results, we can conclude that performance on this type of task greatly depends on the quality of the training set: the more varied and complete it is, the better the results. Moreover, a very significant part of the relationships shown in an image are spatial relationships, so, to ensure better results, one needs to take this fact into account during development, which was done in this work. The use of a multi-stage architecture means that the quality of the later stages depends heavily on the quality of the earlier ones; this is what happened when the model used the Faster R-CNN detector. Hence, we can conclude that one way to improve the results is to improve the quality of solving the object detection task.

List of References

1. Karpathy A., Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions [Electronic resource]. Available at: https://cs.stanford.edu/people/karpathy/deepimagesent/
2. A visual proof that neural nets can compute any function [Electronic resource]. Available at: http://neuralnetworksanddeeplearning.com/chap4.html
3. Simonyan K., Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition [Electronic resource]. Available at: https://arxiv.org/pdf/1409.1556.pdf
4. Image Captioning [Electronic resource]. Available at: http://shikib.com/captioning.html
5. Dai J. R-FCN: Object Detection via Region-based Fully Convolutional Networks [Electronic resource]. Available at: https://arxiv.org/pdf/1605.06409.pdf
6. VGG16 – Convolutional Network for Classification and Detection [Electronic resource]. Available at: https://neurohive.io/en/popular-networks/vgg16/
7. Vinyals O. Show and Tell: A Neural Image Caption Generator [Electronic resource]. Available at: https://arxiv.org/pdf/1411.4555.pdf
8. Dai B. Detecting Visual Relationships with Deep Relational Networks [Electronic resource]. Available at: https://arxiv.org/pdf/1704.03114.pdf
9. Sadeghi M. Recognition Using Visual Phrases [Electronic resource]. Available at: http://vision.cs.uiuc.edu/phrasal/recognition_using_visual_phrases.pdf
10. Lu C. Visual Relationship Detection with Language Priors [Electronic resource]. Available at: https://arxiv.org/pdf/1608.00187.pdf
11. Krishna R. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations [Electronic resource]. Available at: https://arxiv.org/pdf/1602.07332.pdf
12. Visual Genome [Electronic resource]. Available at: https://visualgenome.org
13. Data Loading and Processing Tutorial [Electronic resource]. Available at: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
14. TorchVision Models [Electronic resource]. Available at: https://pytorch.org/docs/stable/torchvision/models.html
15. Ren S. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks [Electronic resource]. Available at: https://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
16. Chilamkurthy S. Transfer Learning Tutorial [Electronic resource]. Available at: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
About the authors:

Andon Philip I., Academician of the National Academy of Sciences (NAS) of Ukraine, Director of the Institute of Software Systems of the NAS of Ukraine, 03187, Kyiv, 40 Akademika Glushkova Avenue. Number of scientific publications in Ukrainian editions - 400. Number of scientific publications in foreign indexed editions - 10. http://orcid.org/0000-0001-6546-0826

Hlybovets Andriy M., Doctor of Technical Sciences, Dean of the Faculty of Computer Science, National University of Kyiv-Mohyla Academy, 04070, Kyiv, 2 Skovorody street. Number of scientific publications in Ukrainian editions - 40. Number of scientific publications in foreign indexed editions - 5. https://orcid.org/0000-0003-4282-481X

Kurylyak Volodymyr V., Master of the Faculty of Computer Science, National University of Kyiv-Mohyla Academy, 04070, Kyiv, 2 Skovorody street. Number of scientific publications in Ukrainian editions - 0. Number of scientific publications in foreign indexed editions - 0.

Contact person: Hlybovets Andriy M., +380674094355, a.glybovets@ukma.edu.ua