<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CNN-based Classification of Car Images for Android Devices</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iveta Mrázová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgi Georgiev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Physics, Charles University</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The design of efficient yet robust methods for real-time image classification belongs to the hottest topics in contemporary AI, particularly in the case of mobile and edge devices. Various types of convolutional neural networks can contribute to solving this task. Especially the architectures proposed explicitly for mobile devices, e.g., MobileNet and EfficientNet, rank among the least time-consuming ones. This paper thoroughly reviews the structure, performance, and main characteristics of the considered network types. Based on the obtained results, we introduce a mobile-phone application to classify cars we might see on the street and to search for nearby car dealerships, e.g., to buy a car similar to the one of interest. The developed application involves the TensorFlow EfficientNet Lite model. Finally, we provide an outlook for a possible enhancement of the application with federated learning.</p>
      </abstract>
      <kwd-group>
        <kwd>image classification</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>EfficientNet</kwd>
        <kwd>TensorFlow Lite models for mobile applications</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Modern convolutional neural networks (CNNs) are known to beat human performance in many tasks. However, their state-of-the-art architectures require substantial computational resources. A natural question thus arises whether we can also benefit from the image-processing capabilities of CNNs when they are implemented on mobile devices.</p>
        <p>Recent Android products range from the Google Pixel 6 mobile phone equipped with the newest Google Tensor processor to mobile devices that use the Edge TPU (Tensor Processing Unit) chip. Two examples of Edge TPU devices are the Coral Dev Board and the Coral USB Accelerator.</p>
        <p>Although CNNs comprise a considerable number of neurons at different layers, the model benefits from weight sharing that keeps down the number of trainable parameters. With local receptive fields (i.e., rectangular filters), the CNNs scan the presented images to look for significant visual pattern features. This information is combined in subsequent layers to detect more complex higher-order features. The neurons' activities form the so-called feature maps representing the extracted knowledge in each layer. Alternating pooling layers blur the exact position of the features and allow for down-sampling of the feature maps.</p>
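        <p>To make the described building blocks concrete, the following minimal Keras sketch (an illustration of ours, not the paper's actual architecture) stacks shared-weight convolutions with alternating pooling layers:</p>
        <preformat><![CDATA[
from tensorflow.keras import layers, models

# A toy stack of the building blocks described above: shared-weight
# convolutions (local receptive fields) followed by pooling layers
# that down-sample the resulting feature maps.
toy_cnn = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu",
                  input_shape=(224, 224, 3)),     # 16 learned 3x3 filters
    layers.MaxPooling2D((2, 2)),                  # blurs exact feature positions
    layers.Conv2D(32, (3, 3), activation="relu"), # higher-order features
    layers.MaxPooling2D((2, 2)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(8, activation="softmax"),        # e.g., 8 body-style classes
])
]]></preformat>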
        <p>Our ultimate objective is to develop a mobile-phone application to classify cars we might see and to search for dealerships to rent or buy a similar car. Fig. 1 presents a snapshot illustrating the function of the application.</p>
        <fig id="fig1"><caption><p>Figure 1: A snapshot of the application running on the Pixel 3 virtual mobile device: the classification of a presented sports car is followed by searching for the closest dealerships offering similar cars in Poprad.</p></caption></fig>
        <p>Considering the limited hardware means of mobile devices, a crucial steppingstone in the application design represents the choice of an accurate, robust, and memory/time efficient network model for the CNN-based car classifier.</p>
        <p>As a part of our research, we tested the classification and robustness performance of 10 selected CNN models. The results indicate that the EfficientNet models are superior in all cases. More precisely, EfficientNetB5 turned out to offer the best compromise between accuracy, robustness, and size (see Section 4).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>To find the best model satisfying the above-specified requirements, we have selected 10 candidate network models. The data comes from the Keras Applications page of the official Keras documentation [11].</p>
        <p>The InceptionV3 model [15] is characterized by a depth of 189 and 23.9 million parameters. InceptionV3 achieved the best results for its time, but today it gets easily outperformed even by smaller models.</p>
        <p>The InceptionResNet model replaces the concatenations of InceptionV3 with residual connections skipping the layers [16]. Inserting such shortcuts improves the network's ability to back-propagate errors across multiple layers. InceptionResNetV2 has a depth of 449 layers and 55.9 million parameters, which makes it one of the biggest models we dealt with. In this case, the size of the model and its residual Inception-like structure result in very good accuracy and robustness results.</p>
        <p>The Xception network splits full convolutional operators into depthwise and pointwise convolutions. The depthwise separable convolutions reduce the necessary computational costs almost ten times while only slightly reducing the accuracy compared to standard convolutions [17]. This led to a considerable drop in depth to 81. The number of parameters was, however, reduced just by 1 million (to 22.9 million). Still, despite the reduced number of layers and parameters, Xception usually performs slightly better than InceptionV3.</p>
        <p>To support feature reuse, the DenseNet model embraces an architecture connecting each convolutional layer to all its successors [18]. DenseNet121 belongs to the rather smaller models. It has just 8.1 million parameters and a depth of 242. We can clearly see the effect of the added connections from each layer to all its successors: the number of trainable parameters remains low even though the depth of the model is above average. In addition to reducing the number of network parameters, this approach further improves the efficiency of the network.</p>
        <p>The austere model of MobileNetV2 [19] exploits the so-called linear bottleneck layers to capture the function of the entire layer. The model also takes advantage of the so-called inverted residuals. In this case, several bottlenecks follow the input within a residual block and are enhanced by an expansion afterward. Utilizing the much smaller input and output dimensions for the shortcuts improves the efficiency of the inverted design considerably. MobileNetV2 has one of the smallest depths (105) and also the smallest number of parameters (3.5 million) of all the models in our selection. Considering its small size, MobileNetV2 is able to outperform bigger models in some of the tests.</p>
        <p>The NASNet approach automatically searches for the best network architecture considering the data at hand [20]. However, the learned image features can be transferred to other computer vision problems. NASNetMobile and NASNetLarge are two variants of the same model. They share the same structure and differ only in their size. They preceded the EfficientNet model family and also achieve worse results. Due to the depth of 533 and 88.9 million parameters, NASNetLarge may be too large for the small dataset we have and could easily get overfitted.</p>
        <p>The state-of-the-art family of the so-called EfficientNets [21] exploits the NASNet strategy. The baseline model of EfficientNetB0 and its variants B1 to B7, upscaled uniformly in all the network parameters (i.e., width, depth, or resolution), belong to the most accurate and memory-efficient CNNs now. EfficientNetB0 is bigger than MobileNetV2 but still has a very small size compared to the remaining models. The automated construction of EfficientNetB0 aims at finding the best possible network given the predefined operations. The bigger size of EfficientNetB5 and B7 leads to improved accuracy and noise robustness. B5 is characterized by a depth of 312 and 30.6 million parameters. B7 is, with its depth of 438 and 66.7 million parameters, the second largest model in our collection (behind NASNetLarge). EfficientNetB7 also performs the best.</p>
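        <p>All ten candidate checkpoints are available directly in Keras. Purely as an illustration (not part of the original study's code), the following sketch instantiates them and prints their parameter counts; weights=None skips downloading the ImageNet weights:</p>
        <preformat><![CDATA[
from tensorflow.keras import applications

# The ten considered architectures, as shipped with keras.applications.
candidates = {
    "MobileNetV2": applications.MobileNetV2,
    "DenseNet121": applications.DenseNet121,
    "InceptionV3": applications.InceptionV3,
    "Xception": applications.Xception,
    "InceptionResNetV2": applications.InceptionResNetV2,
    "NASNetMobile": applications.NASNetMobile,
    "NASNetLarge": applications.NASNetLarge,
    "EfficientNetB0": applications.EfficientNetB0,
    "EfficientNetB5": applications.EfficientNetB5,
    "EfficientNetB7": applications.EfficientNetB7,
}

for name, ctor in candidates.items():
    model = ctor(weights=None)  # architecture only, no ImageNet download
    print(f"{name}: {model.count_params() / 1e6:.1f}M parameters")
]]></preformat>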
      </sec>
      <sec id="sec-2-2">
        <title>Coral Dev Board is a single-board computer equipped</title>
        <p>with the Edge TPU coprocessor. Edge TPU is a chip
crafted specifically to accelerate machine learning
inference (MLI) for mobile CNN models [5]. Another example
of a device that uses the Edge TPU coprocessor is Coral
USB Accelerator. Its purpose is to enable or accelerate
MLI on other external devices. Both Coral Dev Board and
Coral USB Accelerator support TensorFlow Lite.
Further, the accelerator can cooperate with devices that run
Debian Linux, macOS, or Windows 10, even with another
single-board computer such as Raspberry Pi.
Unfortunately, most reviewed mobile devices are not powerful
enough to train deep neural networks.</p>
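        <p>Both devices share the TensorFlow Lite workflow. Purely as an illustration, a converted Lite model can be evaluated from Python as follows; on Coral hardware, the lightweight tflite_runtime package with an Edge TPU delegate would be used instead of tf.lite, and the file name is a hypothetical example:</p>
        <preformat><![CDATA[
import numpy as np
import tensorflow as tf

# Load a converted Lite model (hypothetical file name) and run inference.
interpreter = tf.lite.Interpreter(model_path="car_classifier.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

image = np.zeros((1, 224, 224, 3), dtype=np.float32)  # placeholder input
interpreter.set_tensor(inp["index"], image)
interpreter.invoke()
probabilities = interpreter.get_tensor(out["index"])  # one score per class
]]></preformat>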
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Application Design</title>
      <sec id="sec-3-1">
        <p>This section highlights the main design principles for the planned Android classifier of cars' body styles. A viable implementation option would be to gather the input images and send them to a distant server via the Internet. The server would then evaluate the acquired images with a CNN and return the classification results to the Android application, which presents them to the user. This approach requires a stable Internet connection; without it, the application is out of order.</p>
        <p>To overcome this limit, we decided to classify the car images directly within the Android application by a built-in TensorFlow Lite CNN model. A working Internet connection is thus needed only to search for the best-scoring cars of the resulting type or the closest car dealerships offering these cars. On the other hand, the chosen neural network model has to be small enough to fit into the Android application. At the same time, the selected CNN must be as accurate and robust as possible (EfficientNetB5, in our case). Figure 3 outlines the application flow diagram.</p>
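        <p>For illustration, converting the trained network into the bundled Lite format could look as follows; this is a sketch assuming the trained Keras classifier in the variable model, and the optimization flag is our optional choice:</p>
        <preformat><![CDATA[
import tensorflow as tf

# Convert the trained Keras classifier to a TensorFlow Lite flatbuffer
# that can be bundled with the Android application as an asset.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional weight quantization
tflite_bytes = converter.convert()

with open("car_classifier.tflite", "wb") as f:
    f.write(tflite_bytes)
]]></preformat>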
        <sec id="sec-3-1-1">
          <title>3.1. The Form of the Employed Data</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>We used a variant of the Stanford Cars dataset [6] to</title>
        <p>train and test the respective CNN models. The modified
dataset consists of 2560 images of the size 224x224 that
belong to 8 diferent classes (Fig. 2) - BigCoupeOrSedan,
Hatchback, MuscleCar, PickUp, Van, SportsCar, SUV, and
Unknown. ’Unknown’ contains images with no
identifiable cars. Table 2 summarizes the class distribution for
the involved class labels. For training, we split the dataset
into batches of size 64 (i.e., 40 batches in total). 80% of
the images form the training set, 20% make up the test
set.</p>
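        <p>Such a split and batching can be reproduced, e.g., with the Keras dataset utility. The following sketch assumes a hypothetical directory cars/ with one sub-folder per class:</p>
        <preformat><![CDATA[
import tensorflow as tf

# 80/20 split with batches of 64, matching the setup described above.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "cars/", image_size=(224, 224), batch_size=64,
    validation_split=0.2, subset="training", seed=42)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "cars/", image_size=(224, 224), batch_size=64,
    validation_split=0.2, subset="validation", seed=42)
]]></preformat>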
        <p>For most of the classes, their pattern distributions are
comparable. The only exception is the class ’Unknown’.</p>
        <p>The results we have obtained for the performed tests, however, do not indicate significant overtraining of the CNN models. A reason for this effect could be the extensive data augmentation applied during training. In addition, we can consider so-called stratified sampling when generating the training, testing, or validation datasets in the future.</p>
        <p>A critical issue in machine learning consists in adequate data preprocessing. Preliminary experiments indicate, for example, that even the color of the vehicles might strongly affect the classification result. If the data contains specific cars only in one color, the trained model can pick that color as the distinguishing feature. Sometimes, this choice might correspond to particular brand colors, e.g., a red Ferrari or a blue Subaru.</p>
        <p>Other factors can also significantly affect the classification results, e.g., the car's angle in the photo. To limit the considerable probability of misclassification in such cases, we decided to augment the training data with car images enhanced by various transformations (e.g., corrupted by noise or taken from different perspectives).</p>
        <p>During training, CNNs extract features characteristic of the given class and then attempt to detect these features in the images provided for recall. Poorly trained networks can, however, fail to identify representative features in the data. Manufacturers often use, e.g., appealing body parts like headlights or the grille's shape for different types of vehicles they produce. Misguided networks sometimes prefer to choose such familiar design elements as vital for classification. We shall thus prepare the training data carefully to encourage an improved classifier performance. In the forthcoming section on supporting experiments, we will describe the employed data set in more detail.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Supporting Experiments</title>
      <p>We will use the above-specified dataset to test the performance of the considered CNN models: MobileNetV2, EfficientNetB0, EfficientNetB5, EfficientNetB7, NASNetMobile, NASNetLarge, InceptionV3, Xception, InceptionResNetV2, and DenseNet121. While we used Python to write the project for evaluating the experiments, we have implemented the example Android application in Java using the Android API and Android Studio version 4.1.3.</p>
      <p>To train and test the models, we resorted to the libraries TensorFlow 2.5.0-rc1 [9], TensorFlow Lite [10], and Keras 2.5.0 [11]. Keras can work directly with the ImageNet [8] checkpoints of the selected models. Further, we applied NumPy 1.19.5 [12] and Pandas [13, 14] to process the gathered data (to compute means, standard deviations, confidence intervals, etc.).</p>
      <sec id="sec-4-1">
        <title>4.1. The Accuracy Test</title>
        <p>To test the architectures for the achievable top-1 accuracy, we used 5-fold cross-validation (CV) over the modified Stanford Cars dataset (see Section 3.1): 20% of the original dataset patterns were randomly chosen for testing during each CV step, and the rest was used for training. To enhance the recall capabilities of the trained networks, we added an image augmentation layer to the considered models. This layer automatically adds random noise to the images and is active only during training. Further, the considered augmentations comprise a horizontal flip, rotation of up to 54 degrees, contrast with a factor set to 0.5, zoom with the height factor set to 0.15 (upper and lower zooming limits), and translation (height and width factors set to 0.15).</p>
        <p>During training, the image modifications were performed in place by means of a set of sequential image augmentation layers from the Keras library. Every training image has thus been randomly modified by all augmentation layers; there also exists a very small probability that an image is left without any modification. We shall, however, highlight that the augmentation layers can modify the same image differently in different training epochs. This effectively enlarges the involved training dataset several times: each epoch employs the same number of training patterns (of the same nature), yet the patterns themselves differ.</p>
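        <p>A minimal sketch of such a sequential augmentation stack with the factors quoted above follows; in TensorFlow 2.5 the layers live under tf.keras.layers.experimental.preprocessing, while newer releases expose them directly under tf.keras.layers:</p>
        <preformat><![CDATA[
from tensorflow import keras
from tensorflow.keras.layers.experimental import preprocessing

# Augmentation stack mirroring the factors quoted above; the layers are
# active only during training and act as the identity at inference time.
augmentation = keras.Sequential([
    preprocessing.RandomFlip("horizontal"),
    preprocessing.RandomRotation(0.15),          # 0.15 * 360 = 54 degrees
    preprocessing.RandomContrast(0.5),
    preprocessing.RandomZoom(0.15),              # height factor 0.15
    preprocessing.RandomTranslation(0.15, 0.15),
], name="augmentation")

inputs = keras.Input(shape=(224, 224, 3))
x = augmentation(inputs)  # prepended to the classifier model
]]></preformat>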
        <p>We used a Gaussian filter implemented within the SciPy library (scipy.ndimage.gaussian_filter) with the standard deviation (sigma factor) set to 1.5 to create blurry images. The 2D Gaussian kernel we used is defined as:</p>
        <disp-formula><tex-math><![CDATA[ G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}, ]]></tex-math></disp-formula>
        <p>where x and y denote the distance from the origin (at the center (0, 0) of the filter) in the horizontal and vertical axes, and σ denotes the standard deviation.</p>
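        <p>For illustration, blurring one RGB image with the quoted sigma could look like this (a sketch of ours around the named SciPy routine):</p>
        <preformat><![CDATA[
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_image(image: np.ndarray, sigma: float = 1.5) -> np.ndarray:
    """Blur an HxWx3 RGB image with the given sigma."""
    # A per-axis sigma of (1.5, 1.5, 0) smooths along height and width
    # only, leaving the three color channels independent.
    return gaussian_filter(image, sigma=(sigma, sigma, 0))
]]></preformat>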
        <p>Image rotation was performed with the Keras RandomRotation layer, which involves the respective rotation matrices. In order to make the model noise-robust, we considered random rotations of up to 54 degrees. The newly appeared empty regions near the image borders are filled using reflection (by mirroring the closest image pixels).</p>
        <p>To apply horizontal flips, we used the Keras layer called RandomFlip, which performs flipping with a 50% chance. Similarly, using the methods from SciPy, we implemented also the remaining layers such as RandomContrast, RandomZoom, and RandomTranslation.</p>
        <p>We attached the augmentation layers to the beginning of the models, which allows modifying the images using GPU acceleration. The augmentation layers also become part of the SavedModel during serialization; they thus do not have to be created separately after loading the model [9].</p>
        <p>An alternative option would be to use a TensorFlow image augmentation pipeline. The augmentation methods are, namely, part of the tf.image library and can be applied to the dataset using the tf.data.Dataset.map function. The advantages of this approach are that the data for the following epoch can be prepared by the CPU in advance during the current epoch, and the model itself does not have to be further modified. The CNN would therefore be a little bit smaller, but the image augmentation pipeline has to be manually constructed every time before the training of the model starts.</p>
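        <p>A sketch of this alternative pipeline follows; the concrete tf.image operations and the placeholder arrays are illustrative assumptions of ours:</p>
        <preformat><![CDATA[
import numpy as np
import tensorflow as tf

# Placeholder arrays standing in for the real image dataset.
images = np.zeros((256, 224, 224, 3), dtype=np.float32)
labels = np.zeros((256,), dtype=np.int32)

def augment(image, label):
    # Per-example tf.image ops; the CPU prepares the next batch while
    # the accelerator trains on the current one.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    return image, label

train_ds = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(2048)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
]]></preformat>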
        <p>For the first five epochs, we trained just the last classifier layers of the networks. Afterward, we kept adjusting the top 10% of the networks' layers for an additional ten epochs (with the weights of the last classification layer fixed).</p>
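        <p>A schematic version of this two-stage schedule, reusing train_ds from the sketch above, could read as follows; the optimizer, loss, and the exact fraction of unfrozen layers are our illustrative choices:</p>
        <preformat><![CDATA[
import tensorflow as tf

# Stage 1: freeze the pretrained base and train only the classifier head.
base = tf.keras.applications.EfficientNetB5(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False
model = tf.keras.Sequential([base, tf.keras.layers.Dense(8, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)

# Stage 2: unfreeze roughly the top 10% of the base layers, keep the
# classifier weights fixed, and continue for ten more epochs.
for layer in base.layers[-len(base.layers) // 10:]:
    layer.trainable = True
model.layers[-1].trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10)
]]></preformat>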
        <p>The results are summarized in Table 3 together with the necessary memory requirements; the reported accuracy is averaged over all 5-fold cross-validation steps, and CI specifies the 95% confidence intervals. The table shows that EfficientNetB7, EfficientNetB5, and InceptionResNetV2 belong to the most accurate models.</p>
        <p>Some models, however, did not achieve the accuracy rates measured on the ImageNet dataset due to the limited number of training epochs, e.g., Xception (with a top-1 accuracy of 79.0% reported for ImageNet). In such a case, (re)training an additional top 20% to 30% of the network layers with early stopping and patience set to 10 indicated a significant improvement in the final top-1 accuracy (up to 85.7% for the EfficientNetB7 model and roughly 80% for the other networks).</p>
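        <p>Purely for illustration, the early-stopping setup could look as follows, reusing model and the datasets from the previous sketches; the monitored metric and the epoch cap are our assumptions:</p>
        <preformat><![CDATA[
import tensorflow as tf

# Stop when the validation metric has not improved for 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=10, restore_best_weights=True)
model.fit(train_ds, validation_data=test_ds,
          epochs=100,  # upper bound only; early stopping ends training sooner
          callbacks=[early_stop])
]]></preformat>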
        <sec id="sec-3-2-1">
          <title>4.2. The Robustness Tests</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>With these tests, we wanted to assess the resilience of</title>
        <p>the considered networks to noise corruption and various
image modifications. We prepared a unique validation
set of 287 images not used previously in training for
the experiments. The dataset contains images selected
both from the Stanford Cars dataset [6], and from the
1 The accuracy is averaged over all 5-fold cross validation
steps, CI specifies 95% confidence intervals.
1 The top-1 accuracy of 5 diferent checkpoints trained on the original data set with early stopping was considered, its
mean was calculated together with the corresponding 95 % confidence intervals (in %).
2 RGB stands for RGB noise. Random pixels were set to a random color value with probability p (%).</p>
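        <p>The RGB-noise corruption can be sketched in NumPy as follows; the function name and the RNG handling are ours:</p>
        <preformat><![CDATA[
import numpy as np

def add_rgb_noise(image: np.ndarray, p: float,
                  rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Set each pixel of an HxWx3 uint8 image to a random color with probability p."""
    noisy = image.copy()
    mask = rng.random(image.shape[:2]) < p  # pick pixels independently
    noisy[mask] = rng.integers(0, 256, size=(int(mask.sum()), 3), dtype=np.uint8)
    return noisy
]]></preformat>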
        <p>Regarding the robustness to random pixel color changes (see, e.g., Fig. 4), EfficientNetB7 and EfficientNetB5 are the most robust but, at the same time, memory-intensive models. The smallest network, MobileNetV2, demonstrates, on the other hand, the worst results in this test.</p>
        <p>As an acceptable compromise, we can thus pick the EfficientNetB0 model with just 31.4MB memory requirements and results outperforming many more extensive networks, e.g., Xception, NASNetLarge, and even InceptionV3 (except for highly noised images).</p>
        <p>On the blurred and grayscale image sets, the models performed better than in the RGB noise test, and the MobileNetV2 model achieved even higher accuracy than EfficientNetB0. The other two EfficientNet models are significantly more accurate than both MobileNetV2 and EfficientNetB0. Yet, as the top-1 accuracy fell significantly for grayscale images compared to the original validation dataset, color proves to play an essential role in the classification process. Also, classification is more accurate for blurred images than for grayscale ones. Although blurring does not improve the overall accuracy of the networks, it can sometimes emphasize significant image characteristics and improve the classification. For example, let us consider the case illustrated in Fig. 5: the EfficientNetB5 model misclassified the shown vehicle as a sports car but correctly classified the blurred one as a hatchback. Blurring emphasized the edge separating the back of the vehicle from the background for the model, thus better indicating a hatchback.</p>
        <p>For many CNN architectures, cropping of images results in a higher top-1 accuracy compared to the original validation set. During training, the augmentation layer prepares the network for this test scenario, and cropping removes the image's noisy edges, thus focusing better on the main object, see, e.g., Fig. 6. On the other hand, cropping can also cause unwanted results, see, e.g., Fig. 7, where a hatchback was misclassified as an SUV after cropping caused the car to fill the whole image and appear more spacious.</p>
        <fig id="fig7"><caption><p>Figure 7: A negative cropping test done with EfficientNetB5. The upper image is the original one, correctly classified as a hatchback. The bottom one was cropped from all sides with a factor of 12 (cropping amount: width and height divided by 12), yet misclassified as an SUV.</p></caption></fig>
      </sec>
    </sec>
      <sec id="sec-3-5">
        <title>Contemporary Android devices are powerful enough for</title>
        <p>real-time image processing based on neural networks.</p>
        <p>In this paper, we studied the accuracy, evaluation speed,
robustness, and size of 10 considered CNN models and
selected the best-performing ones to upload to the
developed Android smartphone application.</p>
        <p>Table 5 summarizes the results obtained for the car dataset. According to top-1 accuracy, the most accurate models are EfficientNetB7 (84.5%), InceptionResNetV2 (81.8%), and EfficientNetB5 (80.4%). These models are the biggest ones in terms of their TensorFlow Lite size (243, 207, and 108MB, resp.), yet remain easy to train.</p>
        <p>After only 15 training epochs, the networks achieved adequate accuracy in the 5-fold CV test. The aforementioned networks are robust against random RGB noise and various image distortions. All of them can be converted to the TensorFlow Lite format, although EfficientNetB7 and InceptionResNetV2 do not fit into an Android application. Due to its size, the most accurate and noise-robust model suitable for an Android Studio application seems to be EfficientNetB5.</p>
        <p>Should the model be as small and as fast as possible, EfficientNetB0 might pose a better choice. It achieves satisfiable accuracy and robustness results, and it is the second smallest model among the considered ones. Further, it can achieve better results than many bigger models like Xception, NASNetLarge, and even InceptionV3. The main contribution of this study thus consists in:</p>
        <list list-type="bullet">
          <list-item><p>the development of a mobile Android application that facilitates the classification of car images according to the car's body style;</p></list-item>
          <list-item><p>the choice of the EfficientNetB5 model for the developed smartphone application. Extensive testing of the CNN models in question justifies this decision, which constitutes an acceptable compromise for all the criteria, particularly concerning the model's accuracy, robustness, and the required time and memory costs. Only for EfficientNetB5, we obtained good results (although not the very good ones) conforming to all three considered criteria. None of the other models meets all of them. The other candidate models, EfficientNetB7 and InceptionResNetV2, while achieving acceptable accuracy results, do not fit into the mobile application.</p></list-item>
        </list>
        <p>While working with the standard TensorFlow library, we did not encounter any significant problems. But to assess the viability of the networks for future on-device fine-tuning, we also measured the memory requirements of EfficientNetB5, EfficientNetB7, and InceptionResNetV2 during training (see Table 6; we averaged the obtained results over five training sessions). EfficientNetB5 consumed 5.3GB of GPU memory and 2.5GB of RAM during each training session.</p>
        <table-wrap id="tbl6">
          <label>Table 6</label>
          <caption><p>Memory usage during training</p></caption>
          <table>
            <thead>
              <tr><th>Model</th><th>GPU memory used (GB)</th><th>Max. used RAM (GB)<sup>1</sup></th><th>Model size (MB)<sup>2</sup></th></tr>
            </thead>
            <tbody>
              <tr><td>EfficientNetB5</td><td>5.3</td><td>2.5</td><td>251.5</td></tr>
              <tr><td>EfficientNetB7</td><td>5.4</td><td>2.7</td><td>556.7</td></tr>
              <tr><td>InceptionResNetV2</td><td>3.0</td><td>2.6</td><td>425.3</td></tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <p><sup>1</sup> A possible bias can be caused by programs running in the background.</p>
            <p><sup>2</sup> The size of the model located on the GPU during training (including the training metadata).</p>
          </table-wrap-foot>
        </table-wrap>
        <p>Our computer needed 251.5MB of GPU memory to store EfficientNetB5 and its training metadata. The other two models were more demanding. Yet, even if we focused only on EfficientNetB5, we would need a cutting-edge category smartphone like the Galaxy S21 Ultra 5G equipped with 12GB of RAM to launch on-device training. The conversion to Lite reduces the models' size up to two times without reducing their accuracy.</p>
        <p>The TensorFlow Lite library was, on the other hand, built to operate on portable devices with low computational power. Originally, the Lite library did not allow on-device training of Lite models. Meanwhile, this limitation has been removed, and on-device training is already supported. Despite the well-written TensorFlow documentation, training of Lite models still remains quite cumbersome, at least from the programming point of view.</p>
        <p>Another limitation for our research comes from Android Studio, which we have used to implement the trained CNNs in mobile applications. It has an inbuilt size limit of 200MB for external files to be uploaded to a project. As a result of this restriction, we were not able to upload Lite models bigger than 200MB to mobile applications. The last limitation is that Android Studio does not officially support uploading of TensorFlow models saved in formats different from TensorFlow Lite. On the other hand, TensorFlow Lite supports also other operating systems such as iOS, so developers are not limited to writing their applications just for Android.</p>
        <p>Further research could enhance the developed application both with on-device training and with federated learning. Federated learning enables robust training across several decentralized edge devices or servers holding local data samples without sharing them. This way, the built-in CNN classifier could be retrained more easily on new data to keep the implementation up-to-date. Other intriguing options for future research comprise the area of architecture optimization for the trained networks and the involvement of nature-inspired heuristics in the process of CNN design.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This research was supported by SVV project No. 260 575.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><mixed-citation>[1] Xiaomi Czech, “Xiaomi Redmi 9A”, link: https://www.xiaomi.cz/xiaomi-redmi-9a-2gb-32gb-sky-blue/, accessed: 27 Jan. 2022.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] Alza.cz, “Samsung Galaxy A52s 5G”, link: https://www.alza.cz/samsungu-galaxy-a52s-5g?dq=6667487, accessed: 27 Jan. 2022.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] Samsung, “Samsung Galaxy S21 Ultra 5G 256GB”, link: https://www.samsung.com/cz/smartphones/galaxy-s21-5g/buy/, accessed: 27 Jan. 2022.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] M. Gupta, “Google Tensor is a milestone for machine learning”, 19 Oct. 2021, accessed: 14 Nov. 2021, link: https://blog.google/products/pixel/introducing-google-tensor/.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] Google LLC, “Edge TPU performance benchmarks”, 2020, accessed: 16 Nov. 2021, link: https://coral.ai/docs/edgetpu/benchmarks/.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization”, 4th IEEE Workshop on 3D Representation and Recognition, ICCV 2013.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] J. Huang, B. Chen, L. Luo, S. Yue, and I. Ounis, “DVM-CAR: A large-scale automotive dataset for visual marketing research and applications”, 2021, arXiv:2109.00881.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] J. Deng et al., “ImageNet: A large-scale hierarchical image database”, CVPR, 2009, pp. 248–255.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems”, 2015, software available from tensorflow.org.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] TensorFlow, “Deploy machine learning models on mobile and IoT devices”, accessed: 6 Feb. 2022, link: https://www.tensorflow.org/lite.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] F. Chollet et al., “Keras”, 2015, link: https://keras.io.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] C. R. Harris et al., “Array programming with NumPy”, Nature 585, 2020, pp. 357–362.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] The pandas development team, “pandas-dev/pandas: Pandas 1.3.4”, 2021, doi: 10.5281/zenodo.5574486.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] W. McKinney et al., “Data structures for statistical computing in Python”, Proc. of the 9th Python in Science Conference, Vol. 445, 2010, pp. 51–56.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception architecture for computer vision”, CVPR, 2016, pp. 2818–2826.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning”, AAAI, 2017, pp. 4278–4284.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] F. Chollet, “Xception: Deep learning with depthwise separable convolutions”, CVPR, 2017, pp. 1800–1807.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks”, CVPR, 2017, pp. 2261–2269.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks”, CVPR, 2018, pp. 4510–4520.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition”, CVPR, 2018, pp. 8697–8710.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks”, Proc. of the 36th International Conference on Machine Learning, PMLR 97:6105–6114, 2019.</mixed-citation></ref>
    </ref-list>
  </back>
</article>