Object-Based Image Comparison Algorithm Development for Data Storage Management Systems

Kirill Smelyakov 1, Oleksandr Prokopenko 1 and Anastasiya Chupryna 1
1 Kharkiv National University of Radio Electronics, 14 Nauky Ave., Kharkiv, 61166, Ukraine

Abstract
Continuous development of large image storage management systems requires the development and introduction of efficient image search and comparison algorithms. In this article we analyze modern object detection models and image search algorithms based on image metadata and on comparison of the object localization areas of processed images. The main advantages and disadvantages of such algorithms are identified. The article proposes an algorithm for extracting object data from an image using a CNN, a model for storing this information as metadata in the service fields of image files, and a cascade object-oriented algorithm for image comparison and fast search that relies on modern solutions in the field of parallel computing. A series of experiments is set up to apply the proposed algorithms to comparing and searching images in large data storages. The results of the experiments, an analysis of the effectiveness of their application, conclusions and recommendations for the practical application of the proposed algorithms are given.

Keywords
Image, Metadata, Image Storage, Machine Learning, Image Detecting and Classification, Image Comparison and Search Algorithms, Convolutional Neural Network

1. Introduction
Currently, one can see an exponential growth trend in the number and volume of personal, corporate and commercial image storages. Many such repositories, such as photobanks, contain millions of images. The efficiency of the management system for such storages directly depends on the efficiency of comparing and searching for images during search queries.
Every day, users are more and more interested in the content of an image rather than its formal parameters, and they try to include this information in their search queries. In such a case, the search algorithms must efficiently search for images given the specified information about their content, and such image processing should be carried out efficiently (according to the criteria of laboriousness and accuracy) in the presence of millions of images in storage.
In this regard, the main problem is related to ensuring the efficiency of comparing and searching images by their content in the presence of a large number of objects of the same type in the image, the localization areas of which repeatedly overlap. Therefore, the purpose of the work is to ensure the efficiency of comparison and search for images by their content (according to the criteria of labor intensity and accuracy) in big data storages under such conditions. The objectives of the work are to develop methods for extracting information about objects in an image using a CNN, a model for storing this information in the service fields of image files, and methods for comparing and quickly searching for images using modern parallel computing solutions, especially under the condition that an image contains a large number of objects of the same type whose localization regions repeatedly overlap.

COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12–13, 2022, Gliwice, Poland
EMAIL: kyrylo.smelyakov@nure.ua (K. Smelyakov); oleksandr.prokopenko1@nure.ua (O. Prokopenko); anastasiya.chupryna@nure.ua (A. Chupryna)
ORCID: 0000-0001-9938-5489 (K. Smelyakov); 0000-0003-0489-6820 (O. Prokopenko); 0000-0003-0394-9900 (A. Chupryna)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Works
Modern trends in image search by content [1] require the development of appropriate models for describing images, as well as algorithms for extracting information, comparison and search based on content [2-4].
Modern applications often operate with large amounts of data, which need to be processed in a manner that differs from classical applications. The storage structure must be adapted to handle a huge amount of incoming data; an example of such an architecture is the data lake, which stores all the data in raw format, so that unstructured files, semi-structured files such as JSON or XML, and SQL tables are kept in one repository representing all the information used by the application [5]. Data lakes need appropriate tools to manage stored files. While changing a file itself is forbidden, because valuable user data may be damaged, one can use the file's metadata to store additional information about each individual file, which can later be used for search and for comparison of two files [6].
In this regard, the works [7, 8] show how the use of clarifying metadata in solving problems of medical diagnostics can improve the quality of classification, especially when such metadata is used in the course of applying deep learning algorithms. The use of informative features and additional data in the subject area can contribute to a significant increase in classification accuracy through the use of deep learning methods [9, 10].
In general, modern studies [10, 11] show that the effective management of image storages is associated with the solution of several key tasks. First, with pre-processing, which is used to improve the quality of images, clean them, filter extreme observations, normalize gray scales, and present images in the required format [12, 13]. Second, it is associated with the preliminary detection of regions of interest, as described in [14, 15], and the extraction of information about the content of the image, as a rule, based on the use of CNNs [9, 10, 14]. Such CNNs are able to detect areas of object localization and save their coordinates and class [16-18]. The advantage of most modern CNNs is the high speed and accuracy of detection in automatic mode. Their main disadvantage is that a CNN can be trained to work only with images of a certain list of classes; images of other classes cannot be processed by the CNN.
Afterwards, during the lifetime of the processed information inside an application, it can be used to easily manage files, search for certain kinds of stored information, filter out the chunks of data required by the user for a specific scenario, etc. [19, 20]. Image data is harder to work with, since it is not a trivial task for a computer to find out whether an image is already stored in a database or to find all images that have a person on them. Mostly, these tasks are accomplished by machine learning models that process an image to extract specific information from it, such as depicted objects and their locations [21-23]. At the same time, as practice shows, currently widely used models and algorithms are effective for comparing a small number of objects in the image, the localization areas of which do not intersect with each other.
In most cases, these algorithms are based on the consideration of box coordinates obtained after applying a convolutional neural network and, in some cases, on the use of local feature detectors (FAST, ORB, etc.) in order to compare the corresponding descriptors of image feature points [24, 25]. In the case when an image contains a large number of objects of the same type with localization areas that repeatedly overlap with each other, it is often impossible to effectively use classical models and algorithms for comparing images in the data storage during a search query or image filtering.
To effectively satisfy search queries in such conditions, it is necessary to develop a new model and method that will ensure efficient processing of images and of the object localization areas on them, including the introduction of parallel data processing schemes [26, 27]. Considering modern developments and achievements, it is advisable to consider a cascade model for comparing and searching for images:
• first (once, before searching), for each image in the storage it is advisable to extract information about its content and save it as metadata;
• at the first stage of the search, it is advisable to perform a quick search / filtering of images by metadata; the point is that with a large amount of metadata used in a search query, the probability of a match decreases non-linearly, so one can quickly, without resorting to more complex algorithms, make an adequate sample of images; otherwise, such a search can be considered preliminary and used to reduce the scope of the search;
• at the second stage of the search, one should analyze the degree of overlap of the object localization areas and apply the appropriate algorithm for comparing the degree of overlap of these areas.
Such a cascade method makes it possible to optimally organize the search for images in large data storages. In addition, the development of search algorithms will make it possible to efficiently compare images in related services, for example, for the purpose of real-time traffic analysis in IoT [28] and many other applications [29].

3. Methods and Materials
This section describes the selected machine learning models, dataset and data storage model, and the reasons why they were chosen in the scope of this article to conduct experiments on the offered image search and image comparison algorithms.

3.1. Data Description
As mentioned in the introduction, in the scope of this article we use image data to show the possibility and expediency of smart metadata utilization for big data management in data lake architectures. First and foremost, we had to decide on the image formats used for the research. As it turned out, most of the popular image formats support the Exif standard; thus the use of Exif metadata tags makes the solution presented in this article compatible with a vast number of file formats such as JPEG, TIFF, etc.
Generally, there are two different types of tags specified by Exif which can be used to write arbitrary information. The first one is a tag under the Image IFD called "ImageDescription"; its description states that the stored data is: "A character string giving the title of the image. It may be a comment such as "1988 company picnic" or the like. Two-byte character codes cannot be used."
The limitations of this tag are obvious: there may be a situation where we need to store some data in the form of a string describing, for example, the name of a depicted object's class, and if it is used with non-English characters the value would not be saved properly, so it is better to use the second tag. The other tag used for saving arbitrary information is located under the Photo IFD and is called "UserComment". Its description says that this is: "A tag for Exif users to write keywords or comments on the image besides those in ImageDescription, and without the character code limitations of the ImageDescription tag."
For the experiments we use images in JPEG format, because this format is the most common one and a significant part of existing datasets consists of images in this file format. It is worth mentioning that metadata has restrictions on the stored data size: Exif metadata is limited to 64 kB in JPEG images because, according to the specification, this information must be contained within a single JPEG APP1 segment.
The dataset for the project should contain general images of the kind that can be found on a common person's phone, including photos of people, pets, animals, cars, laptops, phones, etc. Specialized datasets do not fit our goal, because we want to check the ability to run a similar-image search for real-life photos, which can contain various depicted objects, against a massive amount of other photos. The COCO (Common Objects in COntext) dataset is a large-scale object detection, segmentation, and captioning dataset. It is sponsored by the CVDF, Microsoft and Facebook and now contains more than 200,000 labeled images. This dataset is constantly growing and improving in quantity and quality. It has 80 classes of objects depicted in photos, which are fairly generic, such as person, dog or cat; it fits the goal of our research perfectly, since if we want to compare different images by depicted objects, we would rather know that an image contains a dog than a specific dog's breed. For the sake of simplicity, we take a subset of 5000 images that is used for training purposes in machine learning to run our experiments.

3.2. Machine Learning Model
In order to run the experiments we need a machine learning model for object detection on images. YOLO, an acronym for "You Only Look Once", is an object detection algorithm that divides images into a grid system, where each cell in the grid is responsible for detecting objects within itself. YOLO is one of the most famous object detection algorithms due to its speed and accuracy. The YOLOv5 algorithm can be used for our goals, since it uses the PyTorch framework, which allows it to run on the vast majority of modern operating systems, using a graphics card for the hardest part of the calculations. Also, YOLOv5 provides several pretrained models of different size and accuracy, which allows us to adapt to certain systems and hardware limitations.
YOLOv5 has a one-stage detector architecture, an approach that predicts the coordinates of a significant number of bounding boxes together with the results of the analysis and the probability of finding the object, and adjusts their positions afterwards. In general, such an architecture can be represented as follows (Figure 1).
Figure 1: One-Stage Detector's general architecture [30]
The network scales the original image into multiple feature maps using a pass-through connection and other architectural tricks.
The resulting feature maps are reduced to a single resolution using upsampling and concatenation. The classes and bounding boxes for the features are then predicted, and the most likely bounding box for each feature is selected using Non-Maximum Suppression. Each bounding box is represented by five values:
• class number;
• X center coordinate;
• Y center coordinate;
• width;
• height.
Coordinates, width and height are relative values between 0 and 1, which allows the image to be scaled without losing object positions. YOLOv5 has a confidence threshold, which can improve the quality of image comparison, since we can filter out irrelevant objects. Since bounding boxes are rectangles, it is quite easy to calculate the intersection area of different objects, and this can be done fast enough. This object detection model is one of the fastest and most accurate; for example, it is 2-2.5 times faster than other popular models such as Faster R-CNN, SSD and RetinaNet. This speed is crucial for our use cases, where detection should happen as fast as possible.

3.3. Methods
As mentioned in the introduction, we solve the image comparison problem using image metadata and three different comparison algorithms. First of all, it is worth mentioning that once images are labeled and marked with data about the depicted objects, it becomes relatively easy to filter large amounts of images, leaving only a subset that contains, for example, a dog and a person. All image comparison algorithms use a cascade approach and give a similarity assessment as a coefficient between 0 and 1 inclusive, where 0 corresponds to a comparison of images whose sets of depicted objects do not intersect and 1 is the result of comparing an image with itself.
Let us start with the simplest, trivial algorithm for comparing two photos. We have photo A with a set of detected objects A = {O1,1, O1,2, ..., O1,N} and photo B with a set of detected objects B = {O2,1, O2,2, ..., O2,M}. The first step is to group the objects by their class label and count the number of objects of each class that the photos contain. Generally speaking, we obtain two dictionaries where keys are class labels and values are the numbers of objects on the image. We want the result of comparing A to B to be the same as that of comparing B to A. To achieve this, we take the list of distinct class labels found on the two images. After this we can calculate the similarity between the two images for each class label. The calculation is as simple as possible: one image will contain a smaller or equal number of objects with some class label than the other, so we divide this value by the number of objects with this class label on the other image. We assume that all class labels have the same weight, so to calculate the final similarity coefficient we sum all the per-label similarities and divide this sum by the number of distinct class labels.
This algorithm is also used as a preliminary filter for the two other algorithms, because it does not contain any heavy calculations and can remove all images that have no intersection with our target image in terms of objects, so that it makes no sense to compare object locations on those images. When we are talking about hundreds of thousands of images in a database, it shortens the set of images that should be compared to a few percent of the initial amount, depending on the threshold used for filtering.
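As an illustration, a minimal sketch of this class-count comparison follows. It assumes that the class labels have already been parsed from each image's metadata; the function and variable names are illustrative only and are not taken from the authors' implementation.

```python
from collections import Counter

def simple_similarity(labels_a, labels_b):
    """Class-count similarity between two images (sketch of the first algorithm).
    labels_a / labels_b are lists of class labels read from each image's metadata."""
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    classes = set(count_a) | set(count_b)      # distinct class labels found on the two images
    if not classes:
        return 0.0
    total = 0.0
    for cls in classes:
        n_a, n_b = count_a.get(cls, 0), count_b.get(cls, 0)
        # smaller count divided by larger count; 0 if the class is missing on one image
        total += min(n_a, n_b) / max(n_a, n_b)
    # every class label gets the same weight
    return total / len(classes)

# e.g. simple_similarity(['person', 'person', 'dog'], ['person', 'dog']) == (0.5 + 1.0) / 2 == 0.75
```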
The second algorithm focuses on comparing two images based on object locations. As with the first algorithm, we compare objects of the same type, but now we compare not only their numbers on both images but also the intersection areas of pairs of objects. To calculate the coefficient for two objects we need their intersection area, which is a trivial task given that the bounding boxes are rectangles and the resulting intersection is also a rectangle. We can calculate each object's area because their heights and widths are saved in the image's metadata. Then we can get the coefficient with the formula Si / (S1 + S2 − Si), where S1 is the first object's area, S2 is the second object's area and Si is the intersection area (Figure 2).
Figure 2: Intersection area of two rectangles
Once the coefficient is calculated for each pair of objects of a given type, we must pick the pairs of objects that maximize the sum of coefficients. This procedure is repeated for each class label, and the overall sum of coefficients is divided by the number of objects on the image. This algorithm gives more precise results than the first one, since it takes object locations into account in addition to their classes and numbers on the image. However, it has some significant flaws that can lead to imprecise results and slow execution. The algorithm works best when there are not many objects of the same type on the image. If we have two group photos, the intersection areas of individual pairs of people in the two photos may be small, even though the overall area occupied by people in both images is similar. The second concern is execution time: if we have tens of objects of the same type on both images, the search for the maximum sum of coefficients can take significant time, since this is an assignment problem whose solution has time complexity O(n³).
The third algorithm also takes object locations into account and addresses the flaws of the bounding box comparison algorithm. Instead of comparing specific objects, we compare the areas covered by objects of a certain type on both images. To calculate the intersecting areas of objects on two images, we use a matrix of the size of an image, where the value of each cell is equal to the number of images that contain an object of this type at the cell's location. To calculate the similarity coefficient for a class type using this matrix, we count the number of 2's in the matrix, which represents the intersecting area, and the number of non-zero values, which represents the common area occupied by objects on both images; the coefficient is obtained by dividing the intersecting area by the common area. This gives the coefficient of similarity by class type; an example can be seen in Figure 3. After calculating the coefficients for all object types detected on the images, we aggregate these values into a general similarity coefficient between the two images. In the current article we treat different object types as equal, so to get the similarity coefficient we sum up the coefficients over all object types and divide this sum by the number of distinct object types on these images. Basically, we use the idea of a scanner that goes through the matrix value by value and finds out whether a location belongs to both images, only one of the images, or none of the compared images. So we first remember the locations of objects of a certain type on one image and then go through the objects on the second image, calculating the final values of the areas and coefficients.
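A minimal sketch of this per-class matrix computation is given below. It assumes NumPy, boxes stored as relative (cx, cy, w, h) values as described above, and an illustrative grid size that plays the role of the compressed matrix resolution discussed just below; the names are illustrative and not taken from the authors' code.

```python
import numpy as np

def class_mask(boxes, grid=128):
    """Occupancy mask for one class: a cell is 1 if any box of this class covers it.
    Boxes are (cx, cy, w, h) in relative [0, 1] units, as stored in the image metadata."""
    mask = np.zeros((grid, grid), dtype=np.uint8)
    for cx, cy, w, h in boxes:
        x0, x1 = int((cx - w / 2) * grid), int(np.ceil((cx + w / 2) * grid))
        y0, y1 = int((cy - h / 2) * grid), int(np.ceil((cy + h / 2) * grid))
        mask[max(y0, 0):min(y1, grid), max(x0, 0):min(x1, grid)] = 1
    return mask

def matrix_class_similarity(boxes_a, boxes_b, grid=128):
    """Intersection / common area of the regions covered by one class on two images.
    Summing the two masks gives 2 where both images contain the class and 1 where only one does."""
    summed = class_mask(boxes_a, grid) + class_mask(boxes_b, grid)
    common = np.count_nonzero(summed)          # cells covered on at least one image
    if common == 0:
        return 0.0
    intersection = np.count_nonzero(summed == 2)
    return intersection / common
```

The per-class coefficients are independent of each other, so in practice they could be computed concurrently (for example with concurrent.futures), which corresponds to the parallelization strategy discussed below; the final similarity is the average of the per-class coefficients, as with the other algorithms.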
Figure 3: Matrix similarity coefficient calculation example
The main challenge for this algorithm is an implementation that allows calculating the coefficients in acceptable time. The first problem is the comparison of images with high resolution: when the matrix contains a few thousand or even millions of values, scanning through it takes a lot of time. We can fix this issue with scaling. Images contain information about detected objects in their metadata as relative values, so we can compress the matrix to a certain small size, which is fast enough to process. Compression comes with a loss of precision, but as practice shows, the difference between the results calculated on compressed and full-sized matrices is only a few percent. In the experimental part of this article we show that the change in results is negligible and the precision is high enough even when the compression coefficient is as low as 0.2 of the original size.
This algorithm's execution time can be improved even further: matrix comparison can be parallelized with a great execution time improvement. Concurrent execution can be implemented on multiple levels of the calculation. The highest level is the object type level: the comparison of most images involves multiple detected object types, and all similarity coefficients by object type can be computed independently, which means that in perfect conditions the calculation time shortens from the sum of the times needed to calculate the similarity for all object types to the time of the single longest per-type calculation. The next possible improvement is concurrent scanning of the matrix: since the matrix has a rectangular shape, we can split it into smaller segments and delegate the intersecting and common area calculations to separate processes. One of the most interesting ideas is to execute these calculations on a graphics card, minimizing execution time; the only concern is the time spent on data exchange between graphics card memory and RAM, which can cost more time than concurrent execution saves. In the scope of this article the graphics card improvement is not implemented, while concurrent calculation of coefficients for different object types is actively used, since it allows this algorithm to run with an execution time acceptable for applications that operate with big data amounts.

3.4. Technologies
As mentioned, the main technology used is the object detection model YOLOv5, which is essentially the YOLOv4 model moved to PyTorch, an open source machine learning framework that accelerates the path from research prototyping to production deployment. YOLOv5 is an open-source project, so we use it as a backbone [30]. To run object detection, metadata extraction and image similarity comparison, we modify the Python script from the YOLOv5 repository called detect.py to meet our needs and use it for two purposes: writing object information to image metadata and running image similarity comparison. PyTorch runs object detection on a graphics card for the best performance.
To work with Exif metadata we use ExifTool, a free and open-source software program for reading, writing, and manipulating image, audio, video, and PDF metadata. It is platform independent and available as both a Perl library (Image::ExifTool) and a command-line application. ExifTool is commonly incorporated into different types of digital workflows and supports many types of metadata including Exif, IPTC, XMP, JFIF, GeoTIFF, ICC Profile, Photoshop IRB, FlashPix, AFCP and ID3, as well as the manufacturer-specific metadata formats of many digital cameras. The script modifications include the obligatory saving of image metadata after object detection and the running of image similarity comparison depending on a command-line argument.
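As an illustration of the storage model, the sketch below writes and reads a detected-object list through the Exif UserComment tag by invoking the ExifTool command-line application. The JSON layout, function names and file handling are our own assumptions and are not taken from the authors' modified detect.py.

```python
import json
import subprocess

def write_objects(path, objects):
    """Store detected objects (class label + relative box) as JSON in the Exif UserComment tag.
    The JSON layout here is only an illustration, e.g.
    [{"cls": "person", "box": [0.4, 0.5, 0.2, 0.6]}, ...]."""
    payload = json.dumps(objects)
    subprocess.run(
        ["exiftool", "-overwrite_original", f"-UserComment={payload}", path],
        check=True,
    )

def read_objects(path):
    """Read the stored object list back; -s3 makes ExifTool print the bare tag value."""
    out = subprocess.run(
        ["exiftool", "-s3", "-UserComment", path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return json.loads(out) if out else []
```

Note that the 64 kB APP1 limit mentioned in Section 3.1 constrains the size of such a payload in JPEG files.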
4. Experiment
The experimental section of this article is used to reveal the main advantages and disadvantages of the developed algorithms and to assess their effectiveness and usability in real-world scenarios.

4.1. Initial Data
We run performance experiments and compare the algorithms on a part of the COCO 2017 dataset. This part includes five thousand images and depicts 80 types of objects [31, 32]. We select five out of the five thousand images to run our tests and compare the algorithms' results by execution time and selected results (Figure 4). These images depict different types of scenes: a bear, a table with food, a crowd, a family and a tennis player.
Figure 4: Images to compare [31, 32]

4.2. Experiments Plan
First of all, we run image comparison for all five selected images with the simple comparison, box comparison and matrix comparison. Then we compare their:
• execution time;
• best result;
• similarity;
• top-5 picks coefficient spread.
We compare the received results to the values obtained by comparing the histograms of two images using the correlation metric provided by the OpenCV library. This comparison should give us an idea of how efficient and accurate our results are in comparison to a classic, widely used algorithm. The tests are run with the confidence threshold at level 0.5 to filter out all objects that YOLOv5 is not confident about, since many untrustworthy objects can decrease the similarity coefficients drastically while the images look alike. Then we focus on the matrix image comparison algorithm, trying to figure out the optimal compression ratio for the calculations which leaves us with precise results that are obtained quickly. As a preliminary step, we run object detection on all tested images to have a visual hint of the found objects; they are shown in Figure 5.
Figure 5: Detected objects on images

4.3. Hardware and Software of the Testing System
The working station used to execute the experiments is an HP Pavilion Gaming Laptop 17-cd0xxx. This laptop features an Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz as the central processing unit, 24 gigabytes of RAM, a 128 GB SSD and a 1 TB HDD. The graphics card installed on the working station is Nvidia's GeForce GTX 1650, built on the Turing architecture with 4 GB of GDDR5. The graphics card contains 896 CUDA cores that are used for machine learning. CUDA primitives power data science on GPUs: NVIDIA provides a suite of machine learning and analytics software libraries to accelerate end-to-end data science pipelines entirely on GPUs, enabled by over 15 years of CUDA development. GPU-accelerated libraries abstract the strengths of low-level CUDA primitives, and numerous libraries for linear algebra, advanced math and parallelization algorithms lay the foundation for an ecosystem of compute-intensive applications. The operating system is Windows 10 Professional, which allows us to run our experiments with all the latest updates and the newest drivers for the hardware. As mentioned earlier, to run our scripts we use Python 3.7 with the PyTorch framework.
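Before turning to the results, a minimal sketch of the histogram baseline described in Section 4.2 is given below. It assumes OpenCV and grayscale histograms with 256 bins; the exact histogram configuration used in the experiments is not specified in the text, so this setup and the function name are our own assumptions.

```python
import cv2

def histogram_correlation(path_a, path_b, bins=256):
    """Baseline similarity: correlation between normalized grayscale histograms of two images."""
    hists = []
    for path in (path_a, path_b):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        hist = cv2.calcHist([img], [0], None, [bins], [0, 256])
        cv2.normalize(hist, hist, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
        hists.append(hist)
    # 1.0 for identical histograms, lower values for less similar images
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
```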
5. Results
Tables 1, 2 and 3 show the general results for average execution time, selected images and the highest found similarity coefficient of the developed algorithms on the tested images. Table 4 shows the results obtained for histogram comparison with the correlation metric. The execution time in the tables does not include the time to read and parse metadata for the three offered algorithms; also, the box and matrix algorithms show execution time without preliminary filtering. The execution time for histogram comparison takes into account the time to build, normalize and compare histograms for the 5000 images in the dataset. The time to read image data from storage on the machine is not accounted for, nor is the time to read metadata for the first algorithms, because this time highly depends on the hardware and can be improved a lot by caching techniques, and we want to compare the algorithms' execution times.

Table 1
Simple algorithm execution results
                                  Bear         Family with frisbees  Tennis player   Crowd           Table with food
                                  (285.jpg)    (100238.jpg)          (170474.jpg)    (250137.jpg)    (496954.jpg)
Execution time (ms)               718          843                   812             796             796
Highest similarity coefficient    1            0.75                  1               0.916667        0.75
Top-5 picks coefficients spread   1.0 - 1.0    0.75 - 0.70833        1.0 - 1.0       0.916666 - 0.9  0.75 - 0.5

Table 2
Box comparison algorithm execution results
                                  Bear          Family with frisbees  Tennis player   Crowd          Table with food
                                  (285.jpg)     (100238.jpg)          (170474.jpg)    (250137.jpg)   (496954.jpg)
Execution time (ms)               0             135                   109             126            31
Highest similarity coefficient    0.907722      0.293879              0.30171         0.318123       0.145482
Top-5 picks coefficients spread   0.907 - 0.45  0.293 - 0.264         0.301 - 0.268   0.318 - 0.243  0.145 - 0.123

Table 3
Matrix comparison algorithm execution results
                                  Bear          Family with frisbees  Tennis player   Crowd          Table with food
                                  (285.jpg)     (100238.jpg)          (170474.jpg)    (250137.jpg)   (496954.jpg)
Execution time (ms)               141           303                   309             366            455
Highest similarity coefficient    0.910818      0.42502               0.262802        0.588056       0.159144
Top-5 picks coefficients spread   0.91 - 0.491  0.425 - 0.331         0.262 - 0.23    0.588 - 0.426  0.159 - 0.089

Table 4
Histogram comparison algorithm execution results
                                  Bear           Family with frisbees  Tennis player   Crowd          Table with food
                                  (285.jpg)      (100238.jpg)          (170474.jpg)    (250137.jpg)   (496954.jpg)
Execution time (ms)               1786           1754                  1609            1617           1724
Highest correlation coefficient   0.834534       0.962971              0.937198        0.980637       0.799085
Top-5 picks coefficients spread   0.834 - 0.772  0.963 - 0.959         0.937 - 0.328   0.980 - 0.971  0.799 - 0.773

As expected, the simple algorithm produces a lot of similar results for images that depict the exact or a very close configuration of objects. The bear image comparison produced a dozen images with one detected bear on them; one of the top results was not very close because two bears were falsely detected as one bear. For the picture of a family with frisbees we can see an image of a frisbee team as the best match, with a coefficient less than one because the numbers of detected people and frisbees are different. Another tennis player's image is selected as the best match for the depicted tennis player with a tennis racket and ball. The picture of the crowd with a woman with an umbrella in front has a similar picture as its best match; the only differences are the object locations and the overall framing, because in the match we can see only legs, while in the tested image we can see people at full height, so this image will probably not be taken as the best match by the location-dependent algorithms.
Let us note that the execution time for the introduced image comparison algorithms was measured in isolation, which means that the time to read metadata or to read image data and build a histogram was not included in the results. This is an important point, because reading information from image metadata is much faster than building and normalizing an image's histogram. Reading and parsing metadata takes 1.4 seconds on average for 5000 images, while reading the images and building their histograms for the same dataset takes 22.5 seconds on average if the images are downscaled first; however, this time can be improved if the histograms are pre-calculated and stored in metadata or a cache.
The next part of the results presents figures with the top picks, showing the best match found by each algorithm. Also, the top 5 picks of each algorithm are shown for the image with the tennis player, because the received results illustrate the difference between the three offered algorithms in the best way (Figure 6 - Figure 9).
Figure 6: Top matches to images selected by each algorithm (origin – dataset [32])
Figure 7: Top 5 matches to tennis player image by simple algorithm [32]
Figure 8: Top 5 matches to tennis player image by box algorithm [32]
Figure 9: Top 5 matches to tennis player image by matrix algorithm [32]

6. Discussions
The first thing to note in the results is execution time. The execution time of the simple algorithm averages 793 ms for 5000 images. Both the box and the matrix algorithms work with images already filtered by the simple algorithm. The box algorithm takes all images that passed the 0.1 threshold of the simple algorithm, since it is fast for small numbers of objects on an image; its execution time averaged 80 ms in our tests for up to 500 images. The matrix algorithm is significantly slower due to scanning the images, so we limit the number of images processed to the top 100 after the simple algorithm is done. It averages 314 ms, with the number of processed images ranging from 50 to 100 samples. In a real application we can use this approach to find similar images; it showed itself efficient enough in terms of request execution time. For real scenarios we can introduce further optimizations, such as caching the metadata of recently accessed files, modifications of the file storage, etc.
In general, the simple algorithm showed the highest coefficients, as expected, because it does not take object locations into account in addition to their types and numbers; of course, it does not rank images with the same numbers of objects by their similarity, as they are all equal. We have seen that in 4 out of 5 tests the matrix algorithm showed higher similarity coefficients than the box algorithm; different filtering options also showed that the results of the matrix algorithm are closer to the target image context, while the box algorithm sometimes returned images that have objects in the same locations but do not have any objects of other types. During the experiments, all three algorithms showed that, depending on the image's content and the number and variety of detected objects, each approach may be better at finding images that look alike from a human point of view. For example, the image of a table with food, where a lot of different objects such as fork, cup, cake and orange were detected, is matched best by the simple algorithm: from our own point of view, it returned results closer to the original than the location-dependent algorithms for that image. At the same time, the box and matrix algorithms showed their dominance for images with a single detected object: their top picks are much closer to the original than those of the simple algorithm.
It is worth noting that the box algorithm is useful for searching images with the same configuration. As an example, take the picture of a family with frisbees: the box algorithm's top-5 picks have no frisbees on them, but instead we get images that show a group of three people standing as in the original picture. In some sense this algorithm is more straightforward than the matrix algorithm: it often does not capture context, but returns images with the biggest overlap between some of the objects; it also depends a lot on each single object, unlike the other algorithms, which depend more on object types than on each individual object.
The similarity between two images is subjective, but in our opinion, if we want to choose an algorithm that returns the closest top picks, then the matrix algorithm is the one, at least for the tested images. From our perspective it is a compromise between the simple algorithm, which only cares about context but not configuration, and the box algorithm, which focuses mostly on the configuration of objects on the image with little attention to context. The matrix algorithm is a tool that allows assessing the locations of all objects of some type as a whole rather than as individual entities; it is the slowest of the three algorithms, but with a few optimizations and tricks it can execute its task in acceptable time.
We introduced the histogram comparison algorithm as a benchmark, to compare against a classic algorithm used to obtain an image similarity assessment. The results show that reading an image's metadata is 16 times faster than reading the image's data and building its histogram. When we compare the time spent on algorithm execution, we see that the simple algorithm is 2 times faster than histogram comparison, and the box and matrix algorithms are 1.5-1.7 times faster.
The experiments with the algorithms gave us some of the desired results while leaving a lot of space for improvement. The first point that can be an objective for future research is giving each object type a weight, e.g. people are more important than frisbees or tennis rackets; thus we can balance the impact of less important objects on a comparison. Coefficients can also be assigned to the objects of each image individually, depending on the area covered by them. As a possible improvement to comparison accuracy, we can compare not only the locations of objects on an image, but also the model's confidence that a sector contains objects of some type. As mentioned earlier, the matrix algorithm's performance is one of the key points of improvement: implementing the comparison on a graphics card can increase the speed of similarity coefficient computation, which will allow it to run on a bigger volume of input images.
The results of this article's experiments showed the possibility of using machine learning models together with file metadata to manage big amounts of data in an efficient way. Correct use of metadata and machine learning models gives us a tool to implement non-trivial functionality for applications that operate with a lot of data, such as applications with a data lake architecture where all the data is stored in a raw format.

7. Conclusions
In this work we did research on the use of machine learning models and metadata in efficient data management and image comparison algorithms. This work was done to check the possibility and efficiency of this approach for modern big data application architectures such as the data lake.
The image comparison algorithms utilize the image's metadata service fields as storage for information about the depicted objects. To accomplish this task we offered a solution that uses the object detection machine learning model YOLOv5 during image input processing to find all objects on the image and save them in a compact way in the image's metadata. We developed three different image comparison algorithms that depend on object types, their numbers and their locations on the image. The simple algorithm computes similarity coefficients based only on the types and numbers of objects on both images. The box algorithm computes similarity coefficients based on individual objects' locations and their intersections on the two images. The matrix algorithm computes similarity coefficients based on the intersection of the areas occupied by objects of a certain type and handles images with a huge number of objects on them. We conducted a series of experiments to compare their execution time and similarity results among themselves and against the well-known histogram correlation to find out how efficient the offered algorithms are in comparison.
As a result of the experiments we confirmed that image comparison algorithms based on the use of object detection machine learning models and the metadata service fields of image files are efficient and accurate enough. In comparison to the well-known algorithm, the suggested approach showed 1.5-2x faster execution time, which means that our algorithms can be executed in systems with large image data storages in acceptable time. The experiments also confirmed the high efficiency of parallel computing for boosting the performance of image comparison algorithms, as these algorithms can for the most part easily be split for parallel execution. One of the most promising directions for research is using the GPU for significant acceleration of image comparison.
The offered algorithms showed that each algorithm can be used to obtain the best results in search and comparison depending on the objective and image configuration. In our opinion, the matrix algorithm provides results closest to human recognition, since it gives a compromise solution between searching for images with the same context and images with similar object positioning.

8. References
[1] X. Liu, G. Cheung, C.-W. Lin, D. Zhao and W. Gao, "Prior-Based Quantization Bin Matching for Cloud Storage of JPEG Images," in IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3222-3235, July 2018, doi: 10.1109/TIP.2018.2799704.
[2] Y. Zheng et al., "Size-Scalable Content-Based Histopathological Image Retrieval From Database That Consists of WSIs," in IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 4, pp. 1278-1287, July 2018, doi: 10.1109/JBHI.2017.2723014.
[3] A. Preethy Byju, B. Demir and L. Bruzzone, "A Progressive Content-Based Image Retrieval in JPEG 2000 Compressed Remote Sensing Archives," in IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 8, pp. 5739-5751, Aug. 2020, doi: 10.1109/TGRS.2020.2969374.
[4] X. Wang et al., "A Storage Method for Remote Sensing Images Based on Google S2," in IEEE Access, vol. 8, pp. 74943-74956, 2020, doi: 10.1109/ACCESS.2020.2988631.
[5] H. Dhayne, R. Haque, R. Kilany and Y. Taher, "In Search of Big Medical Data Integration Solutions - A Comprehensive Survey," in IEEE Access, vol. 7, pp. 91265-91290, 2019, doi: 10.1109/ACCESS.2019.2927491.
[6] P. Kathiravelu, A. Sharma and P. Sharma, "Understanding Scanner Utilization With Real-Time DICOM Metadata Extraction," in IEEE Access, vol. 9, pp. 10621-10633, 2021, doi: 10.1109/ACCESS.2021.3050467.
[7] A. G. C. Pacheco and R. A. Krohling, "An Attention-Based Mechanism to Combine Images and Metadata in Deep Learning Models Applied to Skin Cancer Classification," in IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 9, pp. 3554-3563, Sept. 2021, doi: 10.1109/JBHI.2021.3062002.
[8] A. Ebrahimian, H. Mohammadi, M. Babaie, N. Maftoon and H. R. Tizhoosh, "Class-Aware Image Search for Interpretable Cancer Identification," in IEEE Access, vol. 8, pp. 197352-197362, 2020, doi: 10.1109/ACCESS.2020.3033492.
[9] O. Vynokurova, D. Peleshko and M. Peleshko, "Hybrid Deep Convolutional Neural Network with Multimodal Fusion," in: S. Babichev, D. Peleshko, O. Vynokurova (eds), Data Stream Mining & Processing, DSMP 2020, Communications in Computer and Information Science, vol. 1158, Springer, Cham, 2020. https://doi.org/10.1007/978-3-030-61656-4_4
[10] K. Smelyakov, A. Chupryna, M. Hvozdiev and D. Sandrkin, "Gradational Correction Models Efficiency Analysis of Low-Light Digital Image," 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), 2019, pp. 1-6, doi: 10.1109/eStream.2019.8732174.
[11] P. K. Deb, A. Mukherjee and S. Misra, "Fido: A String-Based Fuzzy Logic Mechanism for Content Extraction from UAV Data Lakes," in IEEE Internet of Things Magazine, vol. 4, no. 4, pp. 24-29, December 2021, doi: 10.1109/IOTM.001.2100084.
[12] K. Smelyakov, M. Shupyliuk, V. Martovytskyi, D. Tovchyrechko and O. Ponomarenko, "Efficiency of image convolution," 2019 IEEE 8th International Conference on Advanced Optoelectronics and Lasers (CAOL), 2019, pp. 578-583, doi: 10.1109/CAOL46282.2019.9019450.
[13] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 4th ed., Pearson/Prentice Hall, 2018, 1168 p. ISBN: 9780133356724.
[14] K. Smelyakov, D. Tovchyrechko, I. Ruban, A. Chupryna and O. Ponomarenko, "Local Feature Detectors Performance Analysis on Digital Image," 2019 IEEE International Scientific-Practical Conference Problems of Infocommunications, Science and Technology (PIC S&T), 2019, pp. 644-648, doi: 10.1109/PICST47496.2019.9061331.
[15] I. Ruban, H. Khudov, O. Makoveichuk, I. Khizhnyak, V. Khudov, V. Podlipaiev, V. Shumeiko, O. Atrasevych, A. Nikitin and R. Khudov, "Segmentation of optical-electronic images from on-board systems of remote sensing of the Earth by the artificial bee colony method," Eastern-European Journal of Enterprise Technologies, no. 2/9 (98), 2019, pp. 37-45. doi: https://doi.org/10.15587/1729-4061.2019.161860.
[16] Z. Zhou, Q. M. J. Wu, S. Wan, W. Sun and X. Sun, "Integrating SIFT and CNN Feature Matching for Partial-Duplicate Image Detection," in IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 5, pp. 593-604, Oct. 2020, doi: 10.1109/TETCI.2019.2909936.
[17] F. Fang, L. Li, H. Zhu and J.-H. Lim, "Combining Faster R-CNN and Model-Driven Clustering for Elongated Object Detection," in IEEE Transactions on Image Processing, vol. 29, pp. 2052-2065, 2020, doi: 10.1109/TIP.2019.2947792.
[18] C.-Y. Sun, X.-J. Hong, S. Shi, Z.-Y. Shen, H.-D. Zhang and L.-X. Zhou, "Cascade Faster R-CNN Detection for Vulnerable Plaques in OCT Images," in IEEE Access, vol. 9, pp. 24697-24704, 2021, doi: 10.1109/ACCESS.2021.3056448.
[19] E. Zagan and M. Danubianu, "Cloud DATA LAKE: The new trend of data storage," 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 2021, pp. 1-4, doi: 10.1109/HORA52670.2021.9461293.
[20] C. Giebler, C. Gröger, E. Hoos, H. Schwarz and B. Mitschang, "A Zone Reference Model for Enterprise-Grade Data Lake Management," 2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC), 2020, pp. 57-66, doi: 10.1109/EDOC49727.2020.00017.
[21] Q. Bai, S. Li, J. Yang, Q. Song, Z. Li and X. Zhang, "Object Detection Recognition and Robot Grasping Based on Machine Learning: A Survey," in IEEE Access, vol. 8, pp. 181855-181879, 2020, doi: 10.1109/ACCESS.2020.3028740.
[22] G. Cheng, C. Yang, X. Yao, L. Guo and J. Han, "When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs," in IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 5, pp. 2811-2821, May 2018, doi: 10.1109/TGRS.2017.2783902.
[23] D. Kollias and S. Zafeiriou, "Exploiting Multi-CNN Features in CNN-RNN Based Dimensional Emotion Recognition on the OMG in-the-Wild Dataset," in IEEE Transactions on Affective Computing, vol. 12, no. 3, pp. 595-606, July-Sept. 2021, doi: 10.1109/TAFFC.2020.3014171.
[24] V. Appana, T. M. Guttikonda, D. Shree, S. Bano and H. Kurra, "Similarity Score of Two Images using Different Measures," 2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021, pp. 741-746, doi: 10.1109/ICICT50816.2021.9358789.
[25] J. Yu, Z. Hu and Y. Zhang, "An Image Comparison Algorithm Based on Contour Similarity," 2020 International Conference on Computer Network, Electronic and Automation (ICCNEA), 2020, pp. 111-116, doi: 10.1109/ICCNEA50255.2020.00032.
[26] J. M. H. Noothout et al., "Deep Learning-Based Regression and Classification for Automatic Landmark Localization in Medical Images," in IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 4011-4022, Dec. 2020, doi: 10.1109/TMI.2020.3009002.
[27] C.-Y. Sun, X.-J. Hong, S. Shi, Z.-Y. Shen, H.-D. Zhang and L.-X. Zhou, "Cascade Faster R-CNN Detection for Vulnerable Plaques in OCT Images," in IEEE Access, vol. 9, pp. 24697-24704, 2021, doi: 10.1109/ACCESS.2021.3056448.
[28] D. Ageyev, T. Radivilova, O. Bondarenko and O. Mohammed, "Traffic Monitoring and Abnormality Detection Methods for IoT," 2021 IEEE 4th International Conference on Advanced Information and Communication Technologies (AICT), 2021, pp. 250-254, doi: 10.1109/AICT52120.2021.9628954.
[29] G. Proniuk, N. Geseleva, I. Kyrychenko and G. Tereshchenko, "Spatial interpretation of the notion of relation and its application in the system of artificial intelligence," 2019 3rd International Conference on Computational Linguistics and Intelligent Systems (COLINS 2019), CEUR-WS, 2019, pp. 266-276.
[30] YOLOv4 architecture. URL: https://www.researchgate.net/figure/YOLOv4-architecture-34_fig6_349929458.
[31] Docs.ultralytics.com. YOLOv5. URL: https://docs.ultralytics.com.
[32] Cocodataset.org. URL: https://cocodataset.org/#home.