Benchmarking Neural Networks on Heterogeneous Hardware Resources

Christopher Noel Hesse (a, b), Holger Eichelberger (b)
(a) Aptiv Services Deutschland GmbH, Hildesheim, Germany
(b) Universität Hildesheim, 31141 Hildesheim, Germany

Abstract

In recent years, artificial intelligence (AI) has become a key enabling technology for many domains. To achieve the best performance, modern AI methods have high resource demands, e.g., GPU servers for the training of neural networks. With the advent of further processor technologies, such as tensor processors or re-wirable processors, AI methods may be executed in less time while even saving energy. For many application domains such as autonomous driving or unmanned aerial vehicles, real-time constraints mandate low end-to-end latencies in AI processing. In this paper, we present a combined micro- and macro-benchmarking approach to analyze the performance as well as the power demands of modern processor architectures using convolutional neural networks as workload. We discuss tradeoffs among the different processor types and indicate issues and challenges that arise when performing such benchmarks on heterogeneous hardware resources. We show that FPGAs allow for an increase of 7x up to 45x in performance over high-end GPUs while using only 10% of the power. In the consumer space, novel architectures such as the Apple M1 are able to offer 3-5x better performance at 10-20% of the power draw of current x86 CPU or GPU hardware.

Keywords: Latency, power, benchmarking, Artificial Intelligence, neural networks, CNN, GPU, TPU, FPGA

SSP'21: Symposium on Software Performance, November 09–10, 2021, Leipzig, Germany
Contact: christopher.hesse@aptiv.de, https://aptiv.com/ (C. N. Hesse); eichelberger@sse.uni-hildesheim.de, http://www.sse.uni-hildesheim.de, ORCID 0000-0002-2584-5558 (H. Eichelberger)

1. Introduction

Artificial intelligence (AI) is a key enabling technology to address challenging problems, like self-driving cars. The increasing capabilities of AI also require more powerful compute resources, e.g., graphics processing units (GPUs) are often utilized for training neural networks. In contrast, a typical expectation is that prediction using an already trained AI model (commonly referred to as inference) can also be performed on less powerful resources. More recently, further hardware architectures such as tensor processing units (TPUs) or re-wirable processors like field-programmable gate arrays (FPGAs) have also been applied. While GPUs and TPUs can be programmed through specific software libraries, FPGAs typically require deep hardware knowledge as well as a different development approach. Some AI frameworks have started to fill this gap and offer support for write-once-run-everywhere code. Examples include OpenCL (with GPU, some FPGA, and some CPU backends) or Intel's OneAPI.

When developing AI-enabled, hardware-accelerated applications, which in the extreme case even rely on multiple hardware architectures, the question arises which of these architectures should be applied to which AI method for the best performance.
Besides performance measures such as throughput, latency, or memory utilization, acquisition costs and energy consumption are also relevant, e.g., to trade off the benefits of AI usage against its environmental impact.

This work was conducted in the context of HAISEM-Lab (Hardware-optimized AI Applications using modern Software Engineering Methods, http://haisem-lab.de/), a BMBF-funded AI lab in which the Universities of Hannover and Hildesheim collaborate on AI, software engineering, and hardware acceleration. We report on an evaluation of the performance and the power consumption of specialized hardware architectures for AI workloads. Our research question is whether neural networks can be used as micro- or macro-benchmarks to compare such systems. We aim at benchmarking CPU-, GPU-, TPU- and FPGA-based systems as well as at identifying bottlenecks and problems. Related evaluations [1, 2] or benchmarks like MLPerf (https://mlperf.org/), cnn-benchmark (https://github.com/jcjohnson/cnn-benchmarks), or mixbench (https://github.com/ekondis/mixbench) usually focus on other workloads or hardware architectures. Our results show that FPGA hardware can lead to an increase of 7x up to 45x in performance over a high-end GPU system while using only 10% of the power. Novel consumer-space architectures such as the Apple M1 are able to offer 3-5x better performance at 10-20% of the power draw compared to current x86 hardware.

Structure of the paper: In Section 2, we discuss related work. We present our benchmarking approach in Section 3 and discuss results in Section 4. We conclude the paper in Section 5.

2. Related Work

In heterogeneous computing, many works on benchmarking have been published, e.g., [3] introduces the Rodinia benchmark suite. Although similar hardware resources or frameworks are used, most of these works differ from ours as they do not take AI workloads into account.

Some publications target the analysis or benchmarking of AI methods on different types of processors. Among them, Qasaimeh et al. [1] report on a performance and energy evaluation of AI computer vision kernels: for simple kernels, GPUs provide the best performance, while for more complex kernels FPGAs outperform the GPUs. In [2], Karki et al. present a benchmark for deep neural network applications using CUDA and OpenCL on a server GPU, a mobile GPU, and a mobile FPGA. They show that, e.g., the FPGA is more power efficient than the mobile GPU platform. Moreover, there are several (industrial) macro-benchmarking suites such as MLPerf, cnn-benchmark, or Yolo (https://github.com/pjreddie/darknet) as well as the micro-benchmark mixbench. However, the works discussed above do not cover all hardware resources that we aim for, or they impose limitations, e.g., through the programming approach or the workload type.

3. Approach

To perform a systematic comparison for the major types of AI accelerators (ASICs, i.e., application-specific chips, are out of our scope), we developed a combined micro- and macro-benchmarking approach for different hardware architectures. As the target hardware is installed on the premises of the HAISEM-Lab partners, power measurements using energy meters are also possible.

We selected Artificial Neural Networks (ANNs) as workload due to the widespread interest in this AI method. As there are various types of ANNs, we focus on Convolutional Neural Networks (CNNs), which are designed for image analysis, e.g., to realize driver support systems.
CNNs are intrinsically parallel, so they can put stress on the highly parallel compute cores of GPUs, TPUs, or FPGAs. Moreover, CNNs are well understood and implementation approaches are mature, i.e., frameworks such as TensorFlow or OpenCV support CNNs on different processor types. A CNN is typically structured into three kinds of layers:

1. Dimensionality reduction through convolution matrices, also called kernels or filters. In this layer, filters of fixed size, e.g., 3x3 or 5x5, are used to reduce neighbouring input pixel values to a single value. Filters can be trained, e.g., to detect edges.
2. Introduction of non-linearity through activation functions, which operate on individual pixels. Examples are the sigmoid or the rectified linear unit (ReLU) activation function.
3. Reduction of spatial dimensions by pooling (subsampling). This layer reduces the computational complexity while preserving as much information as possible. Example techniques are maximizing or averaging sliding two-dimensional windows of fixed size.

Figure 1 illustrates our synthetic CNN pipeline involving these three layer types.

Figure 1: Basic CNN pipeline used as micro- and macro-benchmarks (input → convolution → activation → pooling → convolution → activation → pooling → output, classifying, e.g., bicycle, car, boat, or bird).

The custom CNN used for the macro-benchmark looks like the following: convolution → pooling → convolution → pooling → convolution → flatten → dense → dense. In total, it features 122,570 trainable parameters. Please note that real CNNs may employ a larger number of layers or other kinds of processing steps. In our approach (cf. [4] for details), each CNN layer is made up of multiple invocations of the same function (convolution, pooling, activation) and each function forms a micro-benchmark workload. As an additional stress test, we apply a 2D depthwise convolution, where each input channel of a colour input is processed individually. An entire execution of a CNN represents a macro-benchmark that also targets interactions among the layers.

As input for the convolution, we use artificial single-channel gray-scale and multi-channel (RGB) images of sizes 100x100, 1000x1000, and 3000x3000 with usual convolution kernels of size 3x3 or 5x5. For analyzing the pooling as well as the ReLU activation, we utilize the single-channel input. For the macro-benchmarks, we make use of the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) for training. To enable comparability among different hardware resources, we base the implementation of the benchmarks on TensorFlow 2.3.0 wherever feasible. For FPGAs, which are not directly supported by TensorFlow, we use Intel OpenVino (https://docs.openvinotoolkit.org/) to convert the models for the FPGAs. For the macro-benchmarks, we use two similar CNNs: one developed from scratch in TensorFlow and, to exploit the power of state-of-the-art frameworks, one in YOLOv3 (https://github.com/pjreddie/darknet) with model weights converted to TensorFlow.
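For illustration, the custom macro-benchmark CNN can be sketched in TensorFlow/Keras as follows. This is a minimal sketch, not the original implementation: the filter counts (32/64/64), the dense-layer width (64), and the in-model ReLU activations are assumptions; they are chosen such that, with CIFAR-10 inputs of 32x32x3 pixels, the sketch reproduces the reported 122,570 trainable parameters.

import tensorflow as tf
from tensorflow.keras import layers, models

# Sketch of the custom macro-benchmark CNN:
# convolution -> pooling -> convolution -> pooling -> convolution -> flatten -> dense -> dense.
# Filter counts (32/64/64) and the dense width (64) are assumed; with 32x32x3 CIFAR-10
# inputs this configuration yields 122,570 trainable parameters.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10),  # one logit per CIFAR-10 class
])

model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.summary()  # reports 122,570 trainable parameters for this configuration

Training on CIFAR-10 then amounts to a plain model.fit call on the data loaded via tf.keras.datasets.cifar10.load_data().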
During a benchmark, we collect one data point every 10 seconds containing CPU and RAM usage (perfstat) and GPU usage (nvidia-smi). However, nvidia-smi does not run on embedded devices, so we additionally monitor the system-level power consumption via an external Voltcraft Energy Logger EL4000, which records one data point per second.

We use an intentionally wide range of hardware subjects as shown in Table 1, which includes server (DGX-1, P100, V100, V100-SXM2, A100, Arria), desktop (M1), laptop (Omen), and embedded (Jetson) resources.

Table 1: Utilized hardware in the experiments (as provided by the HAISEM-Lab partners).

Id          Type  CPU                       RAM     Accelerator
DGX-1       GPU   2x Intel Xeon E5-2698     512 GB  8x Nvidia Tesla V100
P100        GPU   2x Intel Xeon E5-2697     256 GB  2x Nvidia Tesla P100
V100        GPU   2x Intel Xeon E5-2697     256 GB  2x Nvidia Tesla V100
V100-SXM2   GPU   2x Intel Xeon Gold 6148   384 GB  2x Nvidia Tesla V100-SXM2
A100        GPU   2x AMD Epyc 7662          1 TB    4x Nvidia Tesla A100-SXM4
Arria       FPGA  2x Intel Xeon Gold 6248   384 GB  4x Intel Arria 10 GX PAC
Omen        GPU   1x Intel Core i7 9750H    16 GB   1x Nvidia GeForce RTX 2070 Max-Q
M1          GPU   1x Apple M1               16 GB   1x Apple M1
Jetson      GPU   1x ARM A57 SoC            4 GB    1x Nvidia Maxwell GPU

Most subjects run Ubuntu Linux (version 18.04 or 20.04); the Omen is installed with Fedora Workstation 33 and the Mac Mini M1 with macOS 11.2. Initial plans to also utilize TPUs did not work out due to software incompatibilities. As an alternative, we aimed for TPUs in Google Cloud, which were not available to us due to license/region restrictions.

A single benchmark consists of setup (including startup of the monitoring), execution, and teardown to save the results and to prepare the next iteration. Each iteration runs for 10 minutes so that potential warm-up phases can be excluded during the analysis and sufficient information can be recorded by the external energy logger(s).

4. Results and Discussion

We now summarize the results of conducting the micro- and macro-benchmarks (a replication package is available at https://doi.org/10.5281/zenodo.5572610; cf. [4] for more insights). The gathered data hints at CNNs lending themselves very well to parallelization. At the same time, a large number of low-complexity cores (such as those used in contemporary GPUs) wins over a smaller number of high-complexity ones (such as modern CPU cores). This is hardly surprising, as CPU cores are optimized for serialized computing. In most of our micro-benchmarks, sufficiently large input sizes such as RGB images containing 3000x3000 pixels are able to fully saturate even high-end GPUs. This is not the case with, e.g., the ReLU benchmark, as it performs too little computation per pixel to benefit from massive parallelization. Here, the subjects were very close in performance.

Figure 2 illustrates some micro-benchmarking results for inputs with 3000x3000 pixels.

Figure 2: Micro-benchmark iteration times in ms (left) and power draw in W (right) for image size 3000x3000, comparing the Nvidia DGX-1, Apple M1, HP Omen, and Nvidia Jetson on Conv2D (3000x3000x3 * 5x5x3), MaxPool2D (3000x3000 * 5x5), and ReLU (3000x3000).

For the power draw, we show the watts as reported by our energy logger (at a sample rate of 1 Hz). Since each benchmark was executed for 60 seconds, we can calculate the approximate energy consumption as the mean power draw times the duration, e.g., for Conv2D on the DGX-1: 1235 W * 60 s = 74,100 Ws = 74,100 J.

The Apple M1 SoC is able to deliver impressive levels of performance while being composed of only CPU and GPU cores. Compared to traditional x86 platforms with dedicated accelerator chips, we find the unified memory architecture to be a decisive advantage. On the other hand, many of the supporting libraries and frameworks such as TensorFlow are still in beta for Apple silicon: we encountered issues where the performance was invariant with regard to the input size, which makes little sense. No other tested platform exhibited such issues.
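As an illustration of how the three micro-benchmark operations of Figure 2 map to plain TensorFlow calls, the following minimal timing sketch measures mean iteration times on synthetic inputs. It is a simplified stand-in for the monitored benchmark iterations described in Section 3; the number of output channels of the convolution kernel, the pooling stride, and the repetition count are assumptions.

import time
import tensorflow as tf

# Synthetic inputs as described in Section 3 (batch of one, NHWC layout).
rgb  = tf.random.uniform((1, 3000, 3000, 3))   # multi-channel (RGB) image
gray = tf.random.uniform((1, 3000, 3000, 1))   # single-channel gray-scale image
kernel = tf.random.uniform((5, 5, 3, 3))       # 5x5 kernel; 3 output channels are an assumption

ops = {
    "Conv2D":    lambda: tf.nn.conv2d(rgb, kernel, strides=1, padding="SAME"),
    "MaxPool2D": lambda: tf.nn.max_pool2d(gray, ksize=5, strides=5, padding="VALID"),
    "ReLU":      lambda: tf.nn.relu(gray),
}

def mean_iteration_time_ms(op, repeats=100):
    op().numpy()                       # warm-up and lazy initialization
    start = time.perf_counter()
    for _ in range(repeats):
        result = op()
    result.numpy()                     # force device synchronization before stopping the clock
    return (time.perf_counter() - start) / repeats * 1000.0

for name, op in ops.items():
    print(f"{name}: {mean_iteration_time_ms(op):.2f} ms")

On a GPU, the final conversion to a NumPy array forces device synchronization, so queued kernels are included in the measured time.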
The various servers equipped with Nvidia GPUs performed in line with our expectations and their specifications. For that reason, we only display the DGX-1 system results in the result diagrams. For all collected benchmark data, the variances were generally low given a sufficient input size. For example, for the 2D convolution benchmark, the DGX-1 system showed standard deviations around 0.001 ms for a mean value of 0.09 ms at an input size of 100x100x3, and a standard deviation of 0.03 ms for a mean of 3.9 ms at an input size of 3000x3000x3 pixels.

The macro-benchmark results can be summarized as follows. Using the CIFAR-10 dataset to train our own network resulted in low resource usage on most of the server-class systems (for both training and inference). This can most likely be attributed to the small input size of 32x32x3 pixels. During the YOLOv3 inference benchmarks, the FPGA system outperformed all other tested systems by a large margin with a mean forward pass time of just 0.9 milliseconds. Curiously, the Apple M1 performed rather poorly here, with a mean time of 401.6 ms when using the CPU only and 1214.9 ms when using the GPU cores. The Nvidia GPU servers yielded numbers in the range of 35 to 55 ms for a full forward pass. Interestingly, the system equipped with the V100 GPUs outperformed the one with the A100 GPUs.

5. Conclusions

Artificial Intelligence is a key enabling technology in many domains, but the success of AI depends on supporting hardware architectures. In this paper, we provided experimental insights into the performance and power consumption of recent hardware architectures through micro- and macro-benchmarks, showing that CNNs can be utilized for system comparisons. From our results, it is evident that generic, re-wirable hardware architectures such as FPGAs outmatch the performance of other architectures such as GPUs and CPUs. At the same time, they consume less power and, thus, allow for higher efficiency. Specialized ASICs such as TPUs, which were unfortunately not available to us for technical and license reasons, may perform even better.

A major problem with ASIC and even FPGA programming remains the complexity of the software stack. Programming can be very complex and requires highly specialized domain experts. Heterogeneous computing stacks such as OpenCL aim to solve this issue, but the actual quality of the implementation and the proper mapping to hardware vary greatly between vendors. One curious observation is that novel architectures such as the Apple M1 leverage traditional components (CPU and GPU) but achieve much higher efficiency and performance through a unified memory architecture. This avoids costly buffer copies and reduces latency by a great deal, as can be seen in our micro-benchmark results. Overall, the Apple M1 ends up outperforming all other platforms when it comes to performance per dollar and even performance per watt, with FPGA systems being a possible exception.

Acknowledgments

HAISEM-Lab is partially funded by the Federal Ministry of Education and Research (BMBF) under grant 01IS19063B.

References

[1] M. Qasaimeh, K. Denolf, J. Lo, K. Vissers, J. Zambreno, P. H. Jones, Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels, in: Intl. Conference on Embedded Software and Systems (ICESS'19), 2019, pp. 1–8.
[2] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, H. Jeon, Tango: A Deep Neural Network Benchmark Suite for Various Accelerators, CoRR abs/1901.04987 (2019).
[3] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, K. Skadron, Rodinia: A Benchmark Suite for Heterogeneous Computing, in: Intl. Symposium on Workload Characterization (IISWC'09), 2009, pp. 44–54.
[4] C. N. Hesse, Analysis and Comparison of Performance and Power Consumption of Neural Networks on CPU, GPU, TPU and FPGA, 2021. URL: https://sse.uni-hildesheim.de/studium-lehre/beispiele-guter-ausarbeitungen/.