Dataset for NLP-enhanced image classification

Dmytro Dashenkov1, Kirill Smelyakov1 and Nataliia Sharonova2

1 Kharkiv National University of Radio Electronics, 14 Nauky Ave., Kharkiv, 61166, Ukraine
2 National Technical University "KhPI", Kyrpychova str. 2, Kharkiv, 61002, Ukraine

COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Kharkiv, Ukraine
EMAIL: dmytro.dashenkov@nure.ua (D. Dashenkov); kyrylo.smelyakov@nure.ua (K. Smelyakov); nvsharonova@ukr.net (N. Sharonova)
ORCID: 0000-0001-9797-1863 (D. Dashenkov); 0000-0001-9938-5489 (K. Smelyakov); 0000-0002-8161-552X (N. Sharonova)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
In this paper we present a multi-modal image and text dataset. The dataset is based on images from the Open Images dataset and text descriptions of the class names obtained from Wikipedia. We provide an exemplary model for labeling images trained on top of the dataset. Lastly, we explore the applicability of this or similarly compiled datasets for various computer vision tasks, in particular for image classification with the aid of a natural language processing model. With the help of the compiled dataset, we construct an image tagging model. The model represents a typical example of a multi-class, multi-label classification task. Using a pretrained model, we fine-tune a neural network classifier for adding one-word tags to images based on the objects depicted in them. We explore the performance of the classifier and argue for the benefits of multi-modal datasets for this task as well as other vision tasks.

Keywords
Computer vision, image classification, natural language processing, multimodal learning

1. Introduction

Image classification is one of the typical tasks for computer vision algorithms. As such, many different approaches to the task have been developed. In recent years, neural network-based approaches have come to dominate the area. In particular, models using convolutions and attention mechanisms are popular and show strong results.
In this work, we suggest a novel approach to solving the image classification task by using the latest findings in the field of natural language processing and combining them with conventional models for image classification. For this, we have assembled a multi-modal dataset. The structure and further details of the dataset are presented later in this paper. The dataset is available publicly on GitHub. We also theorize as to what kinds of models might be built on top of this dataset.
Many multimodal datasets with text and images are built for the task of image description or for generating images from text. Such tasks require distinct and precise annotations for each image. Unlike those tasks, image classification works with a predefined set of ground truth labels. This gives us the ability to use general class descriptions as labels rather than individual image descriptions. Because of this simplification, we are able to assemble the dataset with less effort. Practitioners who apply the results of this work will be able to amend the dataset just as easily, without spending resources on human annotators.
The end goal of this research is to come up with an approach to image classification that is scalable, i.e., one that, given a certain pretrained state, can accept new image classes with a small amount of extra training. By virtue of extracting additional data from the class labels, rather than merely treating them as non-informative flags, we hope to achieve better performance for the classes added after the main training stage (meta-training). Also, the smaller training stages for the added classes (fine-tuning) should benefit from the knowledge extracted from the class descriptions.
Such a technique may improve the rate at which new classes can be added to a trained model. Such an achievement will benefit the many practitioners who rely on pretrained models to solve narrower problems.
This paper presents a preparatory yet important stage of the research: collecting and cleaning the data for the models to learn from. Our goal with this paper is to demonstrate the approach to data collection, present a complete, usable, and useful dataset, and illustrate the use of said dataset on a concrete problem. The demonstrative problem we have chosen is the image tagging task: generating multiple one-word tags that describe an image. Such an algorithm may prove useful in many practical scenarios, such as image search.

2. Related works

Significant progress is currently being made in image classification using neural networks [1-3], including the analysis of images with high spatial resolution [1], the application of transfer learning and ensemble learning techniques [2], and the use of multispectral data analysis [3]. In addition to these important areas of development, a number of specific results have been obtained that can significantly improve classification efficiency. Thus, in [4], a new combined transfer learning technique for image segmentation, based on the combination of image weighting and kernel learning, is proposed to improve performance on heterogeneous data. In [5], an effective model of voice-assisted image labeling using neural networks is proposed, the application of which can significantly improve the accuracy of neural network training on non-trivial data. In [6], a multiscale graph sample-and-aggregate network with context-aware learning for hyperspectral image classification is proposed, which can effectively reduce the influence of initial graph error on the classification result. In [7], a model for semi-supervised classification of hyperspectral images using spatial-spectral information is proposed to improve classification efficiency under conditions of limited data sampling. In [8], features, models, and algorithms for volumetric image classification using multi-instance learning and the extreme value theorem are described. In [9], a deep learning framework for transforming image reconstruction into pixel classification for efficient local processing of a digital image is presented. In [10], models and algorithms for kernel-based constrained energy minimization for hyperspectral classification of mixed pixels are described. In [11], an algorithm for iteratively augmenting the training sample to improve the accuracy of image classification is proposed.
At the same time, solutions for the classification task in NLP are being actively developed. Thus, work [12] presents a unified understanding of deep NLP models for text classification at different levels of perception and detail.
Work [13] surveys transformer-based deep learning architectures and the specific conditions of their application. In [14], a practically oriented model for the automatic classification of sexism in social networks (Twitter) is proposed. In [15], a mechanism for injecting user identity into pretrained language models for document-level sentiment classification is proposed. In [16], a new method of MBTI classification based on the impact of class components is proposed; the method is used for subsequent prediction of personality type. In [17], the authors propose a multi-task learning model based on a multi-scale CNN and LSTM for sentiment classification. In [18], the authors combine supervised machine learning and NLP algorithms into one method called SECRET (Semantically Enhanced Classification of REal-world Tasks), which performs classification by combining the semantic information of the labels with the available data. In [19], the state of the art in models and algorithms for classifying user-generated content from social networks in real time is described. Article [20] presents the results of an analysis of text augmentation methods combined with the latest data classification algorithms. Article [21] proposes an innovative method for a recommendation system for breast cancer diagnosis based on patients' medical histories, applying machine learning and word embeddings to the classification of the diagnosis. In [22], the limitations of transformers for the classification of clinical documents are presented and analyzed.
An analysis of state-of-the-art image classification suggests potential for improving classification performance with the methodology proposed in this work. Combining convolutional neural networks and attention-based neural networks with NLP models makes it possible to compensate for the drawbacks of the generally accepted approach of image enhancement [23, 24] followed by the application of a neural network classifier [25, 26].
Simple approaches to natural language processing, such as statistical models (for example, Markov chains), can lead to significant results when applied to texts of limited scope, as shown in [27]. For more complex texts, the text styles can provide a useful heuristic for selecting an appropriate lightweight algorithm [28]. This ability to choose simpler NLP models gives us more flexibility to provide greater performance in many specific cases. Additionally, methods presented in [29, 30] for processing images from various sources provide a basis for a framework combining heuristic-based and neural network-based approaches to the visual component of classification. Overall, by using a heuristic approach to both NLP and vision, many specific cases can be handled without the more performant yet more resource-intensive neural network-based approaches. However, in the general case, as well as in situations where determining the correct heuristics-based solution is impossible, neural network-based solutions prevail.
Existing multimodal datasets involving text and image data typically consist of images annotated with a text description of what is depicted in the image. Such datasets include:
• COCO (Microsoft Common Objects in Context) [31]
• Flickr30k [32]
• Conceptual Captions [33].
COCO (Microsoft Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset [31].
The dataset consists of images, polygon annotations for selected objects, and a few textual statements about the objects in each image.
Flickr30k is a dataset of over 30 thousand images from Flickr. Each image is annotated with five sentences written by human annotators. The images are limited to educational use only [32]. Unlike in the COCO dataset, the five sentences are alternative descriptions of the image, not just class names. Since different people may perceive the details of an image differently, this feature of the dataset may infuse the data with more variability.
Conceptual Captions is, similarly to Flickr30k, a dataset of annotated images. However, this dataset provides captions generated automatically by correlating images with text at the data source [33]. The dataset includes over 3 million captioned images.
When considering the image classification task, however, a typical dataset consists of images annotated with one or several labels. Such datasets include ImageNet [34], MNIST [35], and CIFAR-10 and CIFAR-100 [36]. Each label represents an object present in the image or, sometimes, an action performed in the image by humans. At the core of these datasets are the images. Even without the text labels, classes, or action descriptions, such datasets may provide great value to researchers, e.g., in an unsupervised learning setting.
With these and other datasets, the list of models built for image classification is vast. At the time of writing, some of the more efficient models include transformer-based models, such as CoCa [37] and ViT-G/14 [38], residual neural network-based models, such as FixResNeXt-101 [39], and EffNet-L2 [40], which is based on the EfficientNet [41] scaling mechanism and the approach of minimizing training loss sharpness along with the loss itself.
CoCa and ViT models use approaches derived from the original Transformer model [42]. The attention mechanism is applied to convolutions derived from the input image. Both models perform well in few-shot scenarios and are suitable for fine-tuning. The FixResNeXt-101 model derives from the ResNet model [43]. It achieves high results on the classification task while having fewer parameters than the transformer-based models. EffNet-L2 is a modification of the other EfficientNet [41] models that utilizes sharpness-aware minimization. Like all EfficientNet models, it is capable of scaling and thus may use fewer parameters than other model types.
Some of the most used benchmarks for image classification are based on the ImageNet [34] and CIFAR-100 [36] datasets. It is impractical to compare specific models to one another if they are fine-tuned for different benchmarks. Thus, we choose the top performers in three categories of models: transformers, ResNets, and EfficientNets. Table 1 aggregates the ImageNet and CIFAR-100 benchmark values per model type, with specific model names indicated [42, 43]. For transformer-based models, consider CoCa [37] and ViT-B-16 [38]. For ResNet-based models, consider FixResNeXt-101 [39] and BiT-L ResNet [44]. Finally, for EfficientNet-based models, consider EfficientNet-L2 [45] and EffNet-L2 SAM [40].

Table 1
Performance of some classification models (accuracy, %)

Model type      ImageNet                   CIFAR-100
Transformer     91.0  (CoCa)               94.2  (ViT-B-16)
ResNet          86.4  (FixResNeXt-101)     93.51 (BiT-L ResNet)
EfficientNet    90.2  (EfficientNet-L2)    96.08 (EffNet-L2 SAM)

3. Methods and Materials

In order to assemble such a dataset, we compiled several data sources.
The choice of data sources is based on several factors:
• Images have to be of relatively high resolution. Many datasets use low-resolution images; such datasets work well for education, proof-of-concept algorithms, basic demonstrations, etc. For the purpose of solving real-world problems, we require high-resolution images. In the end, we settled on images of at least 360 by 480 pixels. This allows fine details to be present in the images, as compared to low-resolution datasets, such as ImageNet [34], MNIST [35], and CIFAR-100 [36].
• Images have to be clearly labeled. Labels designate classes of objects in the image. There may be multiple object classes per image. There should not be any action labels, i.e., descriptions of actions, situations, etc. taking place in the image.
• Text descriptions of the image classes have to be as informative as possible.
• Text descriptions are tokenized in order to simplify the preparation for NLP algorithms.
• Text descriptions must contain from 400 to 512 tokens. The upper limit comes from the common input size limit of many NLP models, such as BERT [47].
• All data collected for the dataset has to be distributed under permissive open-source licenses.
Accounting for the listed requirements, we turned to the Open Images dataset [48]. The dataset provides 1.9 million images labeled with over 600 "boxable" classes, i.e., classes of objects present in an image that can be outlined with a bounding box. However, the data we are interested in for the purposes of our dataset is not the bounding boxes but the presence of a given class in an image. The class distribution, while not uniform, is even enough to ensure that, given some thoughtful data sampling, vision models will be able to learn all classes equally well. Figure 1 shows the histogram of class occurrences in the dataset. As seen in the graph, most classes tend to have between one hundred and ten thousand images. A few classes have fewer than ten images. In the training process, those classes could be excluded to later serve as few-shot examples. The boxable classes can also be used for the bounding box labeling task.
With that in mind, we borrow the labeled images from the Open Images dataset. To obtain class descriptions in text form, we fetch Wikipedia articles by the name of the class. If an exact match exists, we use that article. If the name redirects to another article, we use the redirect target. In case of ambiguity, we manually select the article that best fits the context of the images. For example, the label "Stool" has more than one article matching the name; we manually select the one that describes a piece of furniture and proceed with it. Once the article is obtained, we fetch its first few paragraphs. The goal is to at least capture the definition of the word. The table of contents, citations, links, and other markup elements are ignored. Lastly, the definitions are tokenized, so that instead of working with whole texts, we are able to work with sequences of words that represent the text.
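As an illustration, the following sketch shows how a class description could be fetched and tokenized. It assumes the public MediaWiki API and the nltk tokenizer; the helper names are ours and are not part of the published dataset scripts.

import re
import requests
from nltk.tokenize import word_tokenize  # requires the nltk "punkt" tokenizer data

WIKI_API = "https://en.wikipedia.org/w/api.php"

def fetch_class_description(class_name):
    """Fetch the introductory paragraphs of the Wikipedia article for a class name."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,      # only the lead section, before the table of contents
        "explaintext": 1,  # plain text, without links and other markup
        "redirects": 1,    # follow redirects to the target article
        "titles": class_name,
        "format": "json",
    }
    pages = requests.get(WIKI_API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def tokenize_description(text, max_tokens=512):
    """Keep only word tokens, dropping punctuation and numbers, capped at the model limit."""
    tokens = [t.lower() for t in word_tokenize(text) if re.fullmatch(r"[A-Za-z-]+", t)]
    return tokens[:max_tokens]

# Example for the "Stool" class discussed below:
# tokens = tokenize_description(fetch_class_description("Stool"))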
The resulting definition for "Stool" reads: "A stool is a raised seat commonly supported by three or four legs but with neither armrests nor a backrest in early stools and typically built to accommodate one occupant As some of the earliest forms of seat stools are sometimes called backless chairs despite how some modern stools have backrests Folding stools can be collapsed into a flat compact form typically by rotating the seat in parallel with fold-up legs".
Note the absence of any punctuation in the example above. Some language models work with simple punctuation, such as commas, periods, etc. [39], while others do not. For our dataset, we go with the simpler option of removing punctuation, partly because we target the dataset at simpler language models that may not need punctuation, as they operate on words and word combinations rather than whole sentences and texts in general.

Figure 1: Histogram of the number of images per class in the dataset. The x-axis represents the count of images; the y-axis represents the number of classes with roughly that number of images.

The resulting dataset is published publicly on GitHub [49]. Instructions on accessing the dataset are available on GitHub under the name "ImageD Dataset". The repository contains all the text data mentioned in this paper. The image data can be accessed by downloading the images from the Open Images dataset. For convenience, the repository also contains scripts for downloading the images, which can be copied or used as Python libraries. We do not redistribute the images from the Open Images dataset, but merely access them.
Models trained for vision tasks can be evaluated with several different metrics. In our research, we have developed a model for labeling images with several tags. This demonstration model is evaluated with the typical metrics: precision, recall, and the F-score. The metrics are calculated with the following formulas:

$\mathit{precision} = \dfrac{tp}{p}$, (1)

$\mathit{recall} = \dfrac{tp}{ap}$, (2)

$F = 2 \cdot \dfrac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}$, (3)

where tp is the number of true positive answers, p is the number of all positive answers (predictions), and ap is the number of actual positive examples.
The metrics do not mean anything without the appropriate context. Only attached to a certain dataset, in our case ImageD, do the metrics receive meaning. But, as a rule of thumb, greater is better. In binary classification problems, precision and recall must be adequately balanced: a trivial classifier that marks every example as positive achieves a recall of nearly 1 and, on a balanced dataset, a precision of about 0.5, resulting in an F-score of about 0.67. In our case of multi-class classification, this issue does not manifest itself.
For the tagging model, used to demonstrate the capabilities of the dataset, we use a pretrained ResNeXt-101 32x4d model [50]. The model is capped with a head which accepts the model's output embedding as its input and generates a vector of probabilities. Each probability value corresponds to a single tag. Figure 2 illustrates the general architecture of the model.

Figure 2: Structure of the tagging model.

As seen in the diagram, the head consists of a single fully-connected layer with a sigmoid activation. The resulting vector is then treated as a probability vector for the tags.
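A minimal sketch of this architecture in PyTorch is given below. It assumes torchvision's ResNeXt-101 32x8d weights as a close, publicly available stand-in for the 32x4d variant used here, and the number of tags N computed in Section 4; the class name is illustrative and not taken from the repository.

import torch
import torch.nn as nn
from torchvision import models

NUM_TAGS = 10718  # N, the number of one-word tags computed in Section 4

class TaggingModel(nn.Module):
    """A frozen ResNeXt backbone capped with a single fully-connected layer and a sigmoid."""

    def __init__(self, num_tags=NUM_TAGS):
        super().__init__()
        # torchvision ships the 32x8d variant; we use it here as a stand-in for 32x4d.
        backbone = models.resnext101_32x8d(weights="DEFAULT")
        for param in backbone.parameters():
            param.requires_grad = False  # freeze all pretrained layers
        # Replace the final fully-connected layer with a fresh head sized for the tags.
        backbone.fc = nn.Linear(backbone.fc.in_features, num_tags)
        self.backbone = backbone
        self.activation = nn.Sigmoid()

    def forward(self, images):
        # Input: a batch of 224x224 RGB images; output: per-tag probabilities.
        return self.activation(self.backbone(images))

model = TaggingModel()
probabilities = model(torch.randn(1, 3, 224, 224))  # shape: (1, NUM_TAGS)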
4. Experiment

As an example learning experiment, we convert the text annotations into tags and train a model to tag images. Given an integrated dataset of labeled images with text descriptions, we may build a variety of machine learning models. For the purpose of demonstrating the applicability of this dataset, we have developed a model that, given an image, tags it with several one-word tags.
The experiment preparation goes as follows:
1. Prepare one-word tags by tokenizing the class descriptions and selecting the rarely occurring words.
2. Digitize the tags by transforming them into binary unit vectors.
3. Assemble tag vectors for each image in the dataset.
The preparation effectively converts the sets of labels for each image into sets of tags encoded in such a way that a neural network can be trained on them. Such a model, when used, would take an image and mark it with a set of tags. This can be useful, e.g., for building a primitive search engine that looks up images by keywords.
In order to tokenize the class descriptions, we use the widely used "nltk" Python library. For each description, we receive a tokenized version that consists of word stems, endings, other grammatical parts, and punctuation. First, we remove the punctuation, as well as any non-word tokens, such as numbers. Then, we filter the tokens to remove those that occur too often. This is done by gathering statistics of word occurrences across the class descriptions and only retaining the top P percentile. P is a hyperparameter of the model. For the purposes of demonstration, P was selected to be 0.2. This means that, if a word occurs in over 0.2% of the class definitions, we remove it from the list. All other words become the tags for the model to train upon. This gives us 10718 tags to train on, i.e., the output layer will have 10718 neurons.
Next, the tags are digitized, i.e., turned from words into a digital representation. For this, we sort the tags alphabetically and build a vector for each tag. The vector has length N, where N is the total number of tags. The vector is filled with zeros, except for one value, which is set to one. The position of the unit value corresponds to the order of the tag and uniquely represents the tag.
With the tags digitized, we may now label the images with them. To do so, we look up all the classes associated with a given image, map their description tags to the corresponding digital representations, and combine the tag vectors via the bitwise OR operation. This yields vectors filled with ones and zeros, representing the tags associated with a given image.
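A minimal sketch of this preparation, under the percentage interpretation of P stated above and with illustrative helper names that are not part of the published repository, could look as follows.

import re
from collections import Counter

import numpy as np
from nltk.tokenize import word_tokenize

def build_tag_vocabulary(descriptions, p=0.2):
    """Keep only rarely occurring words as tags.

    descriptions maps class names to their text; a word is dropped if it
    occurs in more than p percent of the class descriptions.
    """
    tokens_per_class = {
        name: {t.lower() for t in word_tokenize(text) if re.fullmatch(r"[A-Za-z-]+", t)}
        for name, text in descriptions.items()
    }
    doc_freq = Counter(t for tokens in tokens_per_class.values() for t in tokens)
    limit = (p / 100.0) * len(descriptions)  # p is a percentage, as in the text above
    return sorted(t for t, freq in doc_freq.items() if freq <= limit)

def image_tag_vector(image_classes, descriptions, tags):
    """Combine per-class one-hot tag vectors into one multi-hot vector (bitwise OR)."""
    index = {tag: i for i, tag in enumerate(tags)}  # tags are sorted alphabetically
    vector = np.zeros(len(tags), dtype=np.uint8)
    for cls in image_classes:
        for token in word_tokenize(descriptions[cls].lower()):
            if token in index:
                vector[index[token]] = 1
    return vector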
Now, we proceed to training the model. For this, we selected a pretrained core model that provides the needed level of accuracy at an acceptable level of performance. The choice of such a model has a significant impact on the final accuracy of our model. For the needs of this research, we have selected a small yet quite powerful model, ResNeXt-101 32x4d [50]. The model accepts images of 224 by 224 pixels as input.
The process of enabling the model to solve our tagging task involves fine-tuning. In order to shorten training time and avoid degrading the pretrained weights, we substitute the final fully-connected layer of the model and freeze the rest of the layers. This means that the error propagation process will not alter any parameters except those of the last layer. The fresh fully-connected layer is shaped in such a way that it receives the output of the other layers of the network as its input and produces a vector of size N as its output, where N is the total number of tags. As calculated previously, N is equal to 10718.
The outputs of the final layer are passed through a sigmoid function to determine the probability of a given class being selected. If the output is high enough, the image is labeled with the tag associated with the given output. To determine whether the model considers the image to belong to a given class, we choose a threshold value for the output layer. All outputs that are less than the threshold value are discarded, and all values that are equal to or greater than the threshold are considered confidence levels for the given class. Since the negative outcome (a zero value) is much more likely than a positive outcome (a one value) for any given output neuron, we set the threshold quite low, at 0.3. This allows us to pick up on weak signals from the model when no confidence level is high enough. However, to avoid overselection, i.e., selecting too many classes per example, we also ignore all values that are not among the top four classes per example, regardless of their confidence level. This is a tradeoff between a more extensive search for objects in the image and accuracy: the more tags the model can assign, the more in-depth the search for objects becomes, but also the more false-positive errors the model can make.
Then, the model was trained to output the tag vectors on the data we assembled. The change of the loss during the training process is depicted in figure 3.

Figure 3: Change of the loss function value during model training. The X axis represents the epoch number; the Y axis represents the MSE loss. For this graph, the loss was sampled once per epoch, on the last batch of each epoch.

For training, we used the SGD optimizer with the mean squared error (MSE) loss function. SGD is also the optimizer used during the training of the original backbone model. MSE loss fits best because of its simplicity when working with multi-class, multi-label classification. The fine-tuned model performs well enough on the test dataset, with a precision of 71.11, a recall of 74.2, and an F-score of 72.62. Note that, for calculating precision and recall, we used the following algorithm:
• an example is considered true positive if the four or fewer classes produced by the model are all present among the ground truth labels;
• an example is considered false positive if the model produced at least one label that is not found among the ground truth labels;
• an example is considered false negative if the model did not produce a class that was present among the ground truth; for calculating this value, we omit the four-classes-per-example rule.
This approach is known as micro-averaged precision and recall calculation, as opposed to macro-averaged and example-based calculation, both of which consider precision and recall for each class individually.
Note that in the process of training, there was a noticeable growth of the loss after a certain number of iterations, around epoch 64. This can be explained by weight decay after a certain number of repetitions. For better performance, data augmentation could have been used.
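The selection rule and the example-level counting described above can be summarized by the following sketch; the function names are illustrative and not taken from the repository, and the reported precision of 71.11 and recall of 74.2 were produced by the repository's own evaluation code, not by this sketch.

import numpy as np

THRESHOLD = 0.3  # minimum confidence for a tag to be considered at all
TOP_K = 4        # at most four tags per image, to avoid overselection

def select_tags(probabilities):
    """Return indices of at most TOP_K tags whose sigmoid output reaches THRESHOLD."""
    top = np.argsort(probabilities)[::-1][:TOP_K]
    return [int(i) for i in top if probabilities[i] >= THRESHOLD]

def example_level_scores(predictions, ground_truth):
    """Count whole examples as tp/fp/fn, as described above, and derive precision and recall.

    Note: the text above relaxes the top-four rule when counting false negatives;
    this sketch keeps the capped predictions everywhere for simplicity.
    """
    tp = fp = fn = 0
    for predicted, truth in zip(predictions, ground_truth):
        if predicted and all(tag in truth for tag in predicted):
            tp += 1  # every predicted tag is among the ground truth labels
        if any(tag not in truth for tag in predicted):
            fp += 1  # at least one predicted tag is not in the ground truth
        if any(tag not in predicted for tag in truth):
            fn += 1  # a ground truth tag was not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

Here, predictions is a list of tag-index lists produced by select_tags, and ground_truth is a list of tag-index sets built from the multi-hot vectors described in Section 4.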
5. Results

The trained model is able to tag images with some degree of accuracy. The performance of the model is defined by several factors, such as:
• learning hyperparameters;
• data augmentation;
• data shuffling;
• the number of tags to be learnt.
For this demonstrative experiment, most learning hyperparameters were chosen empirically, with no prior cross validation. Exploring the parameter hyperspace via simple validation or cross validation could improve the training speed and the resulting accuracy.
The lack of data augmentation, as mentioned previously, limits the effective number of training epochs that can be run without unintentionally degrading the parameters. Data augmentation is the next best thing after sourcing more data, which is the whole point of this research. The same can be said about data shuffling.
The number of tags to be learnt is an important hyperparameter. As we took the P value to be 0.2, the number of tags became 10718. This is quite a large number of output values for a neural network. It means that there are many more negative cases for each output than positive ones. Thus, learning is complicated by the imbalance in the dataset.
With this in mind, here are some examples of the model output. The examples are split into positive and negative not by comparing the model output to the recorded tags, but by human evaluation of the tags selected by the model. Figure 4 demonstrates some positive examples where the model tagged the images in the expected way.

Figure 4: Examples of well tagged images [48] (panels a-d)

Image a) in fig. 4 shows a bird standing in an artificial enclosure by a pond. The model produced the following tags: "animal", "bird", "water", "street". Here and later, tags are listed in order of descending confidence. The confidence levels themselves do not have any special meaning beyond the model, so we omit them. Image b) in fig. 4 shows a man making a speech in what appears to be a conference hall. The model produced the following tags: "person", "gesture", "hand", "curtain". Image c) shows a sportsman in a competition. The model produced the following tags: "key", "sport", "crowd", "street". Finally, image d) shows a group of people standing in an open area. The model produced the following tags: "person", "theater", "town", "street".
Figure 5 demonstrates negative examples of tagging.

Figure 5: Examples of poorly tagged images [48] (panels a-d)

Image a) in fig. 5 shows a tennis player. The model tagged it with "soccer", "ball", "cup", "open". Note that "open" is probably related to tennis, but the other tags are not correct. Image b) in fig. 5 shows an aquarium. The model tagged it with "sea", "ocean", "fish", "mammal". Here, the model ignored subtler clues, such as the fragments of hands of people standing around the aquarium. Image c) shows a cat birthday card, which is an example of bad data. The model tagged it with "cat", "animal", "box", "fur". Image d) shows a woman drinking beer. The model tagged it with "person", "glasses", "barrel", "street". This is probably due to bad data in the definition of "beer".
After analyzing the results, we conclude that the dataset is generally acceptable, though some records still contain unwanted noise. To improve the quality of the data, some human supervision would be beneficial.

6. Discussions

The obtained results show the clear efficiency of the approach of enhancing image data with text information. For the simple tagging task, we managed to obtain adequate performance by simply fine-tuning an existing model on data generated by scraping Wikipedia. For the presented task, as well as for any other task, it would be beneficial to use human-generated annotations for class names instead of snippets from Wikipedia. However, the Wikipedia-based descriptions can be enough for many tasks.
However, the presented experiment in training a labeling model only begins to explore the capabilities of the approach. Multi-modal data has many applications which, unlike the tagging task, cannot be solved by other means. Such data, however, is not always readily available, especially in more specialized domains. The approach of enhancing an image dataset with text data sourced from simple descriptions of the classes, rather than descriptions of each individual image, can bridge the gap between the required and the available datasets.
We presume some advances can be made in the task of image classification with the use of multi-modal datasets, such as the one presented in this paper. The approaches may include improving few-shot learning based on similarities between the text descriptions of new and existing classes, generating text from images, which can also help with classification, using image labeling for the purposes of text classification, etc. The idea behind this approach is to gain as much usable information from the given data as possible with little effort. The approach targets situations where it is not feasible to obtain more image data for one reason or another. Along with few-shot learning situations, where data for some classes is not as abundant as for others, this is useful for domain-specific tasks, where any data may be scarce.
Finally, the approach may be useful for achieving or beating state-of-the-art performance with fewer trainable parameters. The need for fewer parameters is dictated by the growing computational complexity of modern machine learning models, which, in turn, leads to longer training times, slower inference, and more expensive hardware requirements. Thus, achieving, or even approaching, state-of-the-art performance on vision tasks with significantly fewer parameters is a necessity for all the researchers and commercial users who do not have prolonged access to high-cost computational facilities. When the two models, one for vision and another for NLP, have fewer parameters combined than typical modern vision models, applying the multi-modal approach can turn out beneficial for the performance of the combined system.
The potential benefits of the approach with image class descriptions lie in several factors:
• Outputs of the NLP model can be cached and reused during training. If the NLP model is completely frozen and only performs inference with no error backpropagation, such outputs may be generated beforehand for all the possible classes and accessed as a simple read operation (a minimal sketch of such caching is given below). This, in fact, transfers some of the load of learning to the preparation stage and thus speeds up both learning and inference.
• If the NLP model is being trained along with the vision model, its outputs can still be cached if the training is organized in such a manner that images of the same class are fed to the models in the same batch. In such a case, the batch size for the NLP model is effectively reduced to one. This speeds up the training process as well.
• The two models can be trained in parallel on different devices. The overhead of combining the two outputs of the models could be less than the overhead of training a single model on a distributed cluster of devices.
All the proposed approaches could be topics of further research in this sphere. For instance, providing a model ensemble that can run one of its key parts with several times fewer resources, by a factor of the batch size, allows more flexibility for the researchers.
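A minimal sketch of the first point, caching the outputs of a frozen NLP model per class, is shown below. The text encoder is a placeholder for whichever language model is chosen, and the class is illustrative rather than part of the repository.

import torch

class CachedClassEmbeddings:
    """Precompute text embeddings of all class descriptions and reuse them during training."""

    def __init__(self, text_encoder, descriptions):
        # text_encoder is any frozen callable mapping a description string to a tensor.
        self.cache = {}
        with torch.no_grad():  # inference only, no error backpropagation
            for class_name, text in descriptions.items():
                self.cache[class_name] = text_encoder(text).detach()

    def __getitem__(self, class_name):
        # During training this is a simple read operation, not a model invocation.
        return self.cache[class_name]

During each training step, the cached embedding of an image's class can then be combined with the vision model's features instead of re-running the language model.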
Making the NLP model leaner can allow us to offload it to a CPU, freeing costly GPU resources for the vision model. If the NLP model does require a GPU, we can easily move it to a different machine and only combine the ensemble outputs after a pass is done.

7. Conclusions

In this paper we presented the ImageD dataset. The dataset combines the existing labeled images from the Open Images dataset with text descriptions for each of the classes of objects found in the images. The dataset can be used for a variety of research purposes related to image classification, description, labeling, etc.
We also presented a simple experiment in generating labels for images with a model built on top of the said dataset. The trained model shows adequate results in labeling never-before-seen images. This leads us to believe that, given enough effort, such an approach could be scaled for greater efficiency and performance.
It is particularly interesting how the presented dataset may be used for image classification purposes to extract more data about the images at the training stage. Some techniques for combining text information with image data for the purpose of higher classification performance already exist. One such technique, NLP-supervised learning, seems to yield great results and deserves more attention.
As well as providing more information for the neural networks to train on, multimodal datasets also enable researchers to construct more complex and more scalable models. Combining the two factors of more data and more scalable models, this approach to gathering data has the power to optimize machine learning algorithms on several levels. Multimodal datasets assembled from different existing sources of information are a viable first step towards harvesting the listed benefits. With the application of human moderation and manual data cleaning, the tool becomes even more effective.

8. References

[1] J. Wang, Y. Zheng, M. Wang, Q. Shen and J. Huang, "Object-Scale Adaptive Convolutional Neural Networks for High-Spatial Resolution Remote Sensing Image Classification," in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 283-299, 2021, doi: 10.1109/JSTARS.2020.3041859.
[2] D. Xue et al., "An Application of Transfer Learning and Ensemble Learning Techniques for Cervical Histopathology Image Classification," in IEEE Access, vol. 8, pp. 104603-104618, 2020, doi: 10.1109/ACCESS.2020.2999816.
[3] T. Mao, H. Tang and W. Huang, "Unsupervised Classification of Multispectral Images Embedded With a Segmentation of Panchromatic Images Using Localized Clusters," in IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 11, pp. 8732-8744, Nov. 2019, doi: 10.1109/TGRS.2019.2922672.
[4] A. Van Opbroek, H. C. Achterberg, M. W. Vernooij and M. De Bruijne, "Transfer Learning for Image Segmentation by Combining Image Weighting and Kernel Learning," in IEEE Transactions on Medical Imaging, vol. 38, no. 1, pp. 213-224, Jan. 2019, doi: 10.1109/TMI.2018.2859478.
[5] E. Bonmati et al., "Voice-Assisted Image Labeling for Endoscopic Ultrasound Classification Using Neural Networks," in IEEE Transactions on Medical Imaging, vol. 41, no. 6, pp. 1311-1319, June 2022, doi: 10.1109/TMI.2021.3139023.
[6] Y. Ding, X. Zhao, Z. Zhang, W. Cai and N. Yang, "Multiscale Graph Sample and Aggregate Network With Context-Aware Learning for Hyperspectral Image Classification," in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 4561-4572, 2021, doi: 10.1109/JSTARS.2021.3074469.
[7] X. Ji, Y. Cui, H. Wang, L. Teng, L. Wang and L. Wang, "Semisupervised Hyperspectral Image Classification Using Spatial-Spectral Information and Landscape Features," in IEEE Access, vol. 7, pp. 146675-146692, 2019, doi: 10.1109/ACCESS.2019.2946220.
[8] R. Tennakoon et al., "Classification of Volumetric Images Using Multi-Instance Learning and Extreme Value Theorem," in IEEE Transactions on Medical Imaging, vol. 39, no. 4, pp. 854-865, April 2020, doi: 10.1109/TMI.2019.2936244.
[9] K. Pawar, Z. Chen, N. J. Shah and G. F. Egan, "A Deep Learning Framework for Transforming Image Reconstruction Into Pixel Classification," in IEEE Access, vol. 7, pp. 177690-177702, 2019, doi: 10.1109/ACCESS.2019.2959037.
[10] K. Y. Ma and C.-I. Chang, "Kernel-Based Constrained Energy Minimization for Hyperspectral Mixed Pixel Classification," in IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-23, 2022, Art no. 5510723, doi: 10.1109/TGRS.2021.3085801.
[11] X. Shang, S. Han and M. Song, "Iterative Spatial-Spectral Training Sample Augmentation for Effective Hyperspectral Image Classification," in IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1-5, 2022, Art no. 6005305, doi: 10.1109/LGRS.2021.3131373.
[12] Z. Li et al., "A Unified Understanding of Deep NLP Models for Text Classification," in IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 12, pp. 4980-4994, Dec. 2022, doi: 10.1109/TVCG.2022.3184186.
[13] S. Singh and A. Mahmood, "The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures," in IEEE Access, vol. 9, pp. 68675-68702, 2021, doi: 10.1109/ACCESS.2021.3077350.
[14] F. Rodríguez-Sánchez, J. Carrillo-de-Albornoz and L. Plaza, "Automatic Classification of Sexism in Social Networks: An Empirical Study on Twitter Data," in IEEE Access, vol. 8, pp. 219563-219576, 2020, doi: 10.1109/ACCESS.2020.3042604.
[15] X. Cao, J. Yu and Y. Zhuang, "Injecting User Identity Into Pretrained Language Models for Document-Level Sentiment Classification," in IEEE Access, vol. 10, pp. 30157-30167, 2022, doi: 10.1109/ACCESS.2022.3158975.
[16] N. Cerkez, B. Vrdoljak and S. Skansi, "A Method for MBTI Classification Based on Impact of Class Components," in IEEE Access, vol. 9, pp. 146550-146567, 2021, doi: 10.1109/ACCESS.2021.3121137.
[17] N. Jin, J. Wu, X. Ma, K. Yan and Y. Mo, "Multi-Task Learning Model Based on Multi-Scale CNN and LSTM for Sentiment Classification," in IEEE Access, vol. 8, pp. 77060-77072, 2020, doi: 10.1109/ACCESS.2020.2989428.
[18] A. O. Akmandor, J. Ortiz, I. Manotas, B. Ko and N. K. Jha, "SECRET: Semantically Enhanced Classification of Real-World Tasks," in IEEE Transactions on Computers, vol. 70, no. 3, pp. 440-456, March 2021, doi: 10.1109/TC.2020.2989642.
[19] D. Rogers, A. Preece, M. Innes and I. Spasić, "Real-Time Text Classification of User-Generated Content on Social Media: Systematic Review," in IEEE Transactions on Computational Social Systems, vol. 9, no. 4, pp. 1154-1166, Aug. 2022, doi: 10.1109/TCSS.2021.3120138.
[20] H. Q. Abonizio, E. C. Paraiso and S. Barbon, "Toward Text Data Augmentation for Sentiment Analysis," in IEEE Transactions on Artificial Intelligence, vol. 3, no. 5, pp. 657-668, Oct. 2022, doi: 10.1109/TAI.2021.3114390.
[21] A. A. R. Magna, H. Allende-Cid, C. Taramasco, C. Becerra and R. L. Figueroa, "Application of Machine Learning and Word Embeddings in the Classification of Cancer Diagnosis Using Patient Anamnesis," in IEEE Access, vol. 8, pp. 106198-106213, 2020, doi: 10.1109/ACCESS.2020.3000075.
[22] S. Gao et al., "Limitations of Transformers on Clinical Text Classification," in IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 9, pp. 3596-3607, Sept. 2021, doi: 10.1109/JBHI.2021.3062322.
[23] K. Smelyakov, M. Hvozdiev, A. Chupryna, D. Sandrkin and V. Martovytskyi, "Comparative Efficiency Analysis of Gradational Correction Models of Highly Lighted Image," 2019 IEEE International Scientific-Practical Conference Problems of Infocommunications, Science and Technology (PIC S&T), 2019, pp. 703-708, doi: 10.1109/PICST47496.2019.9061356.
[24] Y. Wang, W. Song, G. Fortino, L.-Z. Qi, W. Zhang and A. Liotta, "An Experimental-Based Review of Image Enhancement and Image Restoration Methods for Underwater Imaging," in IEEE Access, vol. 7, pp. 140233-140251, 2019, doi: 10.1109/ACCESS.2019.2932130.
[25] K. Smelyakov, A. Chupryna, O. Bohomolov and N. Hunko, "The Neural Network Models Effectiveness for Face Detection and Face Recognition," 2021 IEEE Open Conference of Electrical, Electronic and Information Sciences (eStream), 2021, pp. 1-7, doi: 10.1109/eStream53087.2021.9431476.
[26] K. Smelyakov, M. Shupyliuk, V. Martovytskyi, D. Tovchyrechko and O. Ponomarenko, "Efficiency of image convolution," 2019 IEEE 8th International Conference on Advanced Optoelectronics and Lasers (CAOL), 2019, pp. 578-583, doi: 10.1109/CAOL46282.2019.9019450.
[27] G. Krivoulya, I. Ilina, V. Tokariev and V. Shcherbak, "Mathematical Model for Finding Probability of Detecting Victims of Man-Made Disasters Using Distributed Computer System with Reconfigurable Structure and Programmable Logic," 2020 IEEE International Conference on Problems of Infocommunications. Science and Technology (PIC S&T), Kharkiv, Ukraine, 2020, pp. 573-576, doi: 10.1109/PICST51311.2020.9467976.
[28] N. Sharonova, I. Kyrychenko and G. Tereshchenko, "Application of big data methods in E-learning systems," 2021 5th International Conference on Computational Linguistics and Intelligent Systems (COLINS-2021), CEUR-WS, vol. 2870, pp. 1302-1311, 2021, ISSN 1613-0073.
[29] I. Gruzdo, I. Kyrychenko, G. Tereshchenko and N. Shanidze, "Metrics applicable for evaluating software at the design stage," 2021 5th International Conference on Computational Linguistics and Intelligent Systems (COLINS-2021), CEUR-WS, vol. 2870, pp. 916-936, 2021, ISSN 1613-0073.
[30] K. T. Chitty-Venkata, M. Emani, V. Vishwanath and A. K. Somani, "Neural Architecture Search for Transformers: A Survey," in IEEE Access, vol. 10, pp. 108374-108412, 2022, doi: 10.1109/ACCESS.2022.3212767.
[31] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick and P. Dollár, "Microsoft COCO: Common Objects in Context," arXiv, 2014.
[32] P. Young, A. Lai, M. Hodosh and J. Hockenmaier, "From image descriptions to visual denotations." URL: https://shannon.cs.illinois.edu/DenotationGraph/.
[33] P. Sharma, N. Ding, S. Goodman and R. Soricut, "Conceptual Captions," Proceedings of ACL, 2018.
[34] "ImageNet benchmark (Image Classification) | Papers with Code." URL: https://paperswithcode.com/sota/image-classification-on-imagenet.
[35] "MNIST Dataset | Papers with Code." URL: https://paperswithcode.com/dataset/mnist.
[36] "CIFAR-100 (Image Classification) | Papers with Code." URL: https://paperswithcode.com/sota/image-classification-on-cifar-100.
[37] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini and Y. Wu, "CoCa: Contrastive Captioners are Image-Text Foundation Models," arXiv, 2022.
[38] X. Zhai, A. Kolesnikov, N. Houlsby and L. Beyer, "Scaling Vision Transformers," arXiv, 2021.
[39] H. Touvron, A. Vedaldi, M. Douze and H. Jegou, "Fixing the train-test resolution discrepancy," Advances in Neural Information Processing Systems 32, Vancouver, Canada, 2019.
[40] P. Foret, A. Kleiner, H. Mobahi and B. Neyshabur, "Sharpness-Aware Minimization for Efficiently Improving Generalization," arXiv, 2020, doi: 10.48550/ARXIV.2010.01412.
[41] M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," arXiv, 2019, doi: 10.48550/ARXIV.1905.11946.
[42] A. Vaswani et al., "Attention Is All You Need," arXiv, 2017, doi: 10.48550/ARXIV.1706.03762.
[43] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," arXiv, 2015, doi: 10.48550/ARXIV.1512.03385.
[44] M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," arXiv, 2019, doi: 10.48550/ARXIV.1905.11946.
[45] T. Ridnik, E. Ben-Baruch, A. Noy and L. Zelnik-Manor, "ImageNet-21K Pretraining for the Masses," arXiv, 2021, doi: 10.48550/ARXIV.2104.10972.
[46] H. Pham, Z. Dai, Q. Xie and Q. V. Le, "Meta Pseudo Labels," 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 11552-11563.
[47] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv, 2018, doi: 10.48550/ARXIV.1810.04805.
[48] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig and V. Ferrari, "The Open Images Dataset V4," IJCV, 2020.
[49] "GitHub: ddashenkov/ImageD." URL: https://github.com/ddashenkov/ImageD.
[50] S. Xie, R. Girshick, P. Dollár, Z. Tu and K. He, "Aggregated Residual Transformations for Deep Neural Networks," arXiv, 2016, doi: 10.48550/ARXIV.1611.05431.