<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Performance of Raspberry Pi microclusters for Edge Machine Learning in Tourism</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Komninos</string-name>
          <email>akomninos@ceid.upatras.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioulia Simou</string-name>
          <email>simo@ceid.upatras.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaos Gkorgkolis</string-name>
          <email>gkorgkolis@ceid.upatras.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Garofalakis</string-name>
          <email>garofala@ceid.upatras.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Technology Institute and Press "Diophantus"</institution>
          ,
          <addr-line>Rio, Patras, 26504</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>82</volume>
      <issue>90</issue>
      <abstract>
        <p>While a range of computing equipment has been developed or proposed for use to solve machine learning problems in edge computing, one of the least-explored options is the use of clusters of low-resource devices, such as the Raspberry Pi. Although such hardware configurations have been discussed in the past, their performance for ML tasks remains unexplored. In this paper, we discuss the performance of a Raspberry Pi micro-cluster, configured with industry-standard platforms, using Hadoop for distributed file storage and Spark for machine learning. Using the latest Raspberry Pi 4 model (quad-core 1.5GHz, 4GB RAM), we find encouraging results for the use of such micro-clusters both for local training of ML models and for the execution of ML-based predictions. Our aim is to use such computing resources in a distributed architecture to serve tourism applications through the analysis of big data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Interest in edge deployments is strong, since the drawbacks can be mitigated using alternative deployment
approaches. For example, edge nodes can forward data to a cloud server so that complex and powerful ML
models can be built. These models can then be saved and distributed back to the edge nodes for use. Additionally,
edge nodes can pre-process data locally before forwarding them to cloud servers, helping with the distribution
of data cleansing and transformation workloads. Edge computing architectures also lend themselves particularly
well to certain types of application, where users might be more interested in locally pertinent data. Edge
computing hardware form factors vary, ranging from server-class hardware to simple desktop-class computers,
and even IoT-class devices (e.g. Arduino, Raspberry Pi) can perform edge computing roles. Recently, dedicated
edge nodes for ML have been developed (e.g. Google Coral, Nvidia Jetson, Intel Neural Compute Stick), and hardware
accelerator add-ons for existing platforms are also on the market (e.g. Raspberry Pi AI hat, Intel 101).</p>
      <p>One interesting configuration which has not gained much attention is the ability of IoT devices running Linux
operating systems to work together in a cluster computing configuration. This ability leverages many of the
known advantages of cloud computing (e.g. using a distributed file system such as HDFS, running big data
analytics engines such as Spark), providing a scalable solution for powerful and resilient local edge components,
while keeping deployment costs low. In this paper, we explore the performance of small Raspberry Pi (RPi)
clusters in the role of an IoT ML edge server, using the RPi-4 model, which is the latest release in this product
line. Although Pi clusters have been reported in previous literature, the RPi-4 model is newly released (Q2 2019)
and its significant hardware improvements make it a more realistic option for this role than before.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The performance of Raspberry Pi clusters has been investigated in past literature, mostly using Model
1B and 2B devices (see Table 1). The first known paper to report findings on such a deployment is [TWJ+], with
a configuration of 56 Model-B RPis. No specific performance evaluations were reported, but the advantages of
low cost and the ability to use such clusters for research and educational purposes were highlighted in this paper.
An even larger Model-B cluster (300 nodes) is reported in [AHP+], although again no performance evaluation is
discussed. A smaller Model-B cluster (64 nodes) is discussed in [CCB+], demonstrating that network bandwidth
and limited memory are barriers for such clusters. The performance advantages in computing power depend on
the size of the computational task, with smaller problems not benefiting from additional computing resources
(nodes) due to communication overheads, and memory size limiting the size of the computational task that can
be performed. Similar results demonstrating the drop in computational performance from the theoretical linear
increase line as nodes are added are obtained by [CPW] using a 24-unit Model 2B cluster, and also in a 12-node
Model 2B [HT] and an 8-node Model 2B cluster [MEM+]. In [KS] the performance using external SSD storage
(attached via USB) is evaluated, demonstrating that big data applications on such clusters (20 x Model 2B) are
bound by CPU limitations. In [QJK+], a 20-node RPi Model 2B cluster is investigated for real-time image
analysis. Its performance was lower than that of virtual machines running on traditional PCs; however, the small form
factor, paired with the relatively good performance, makes such clusters ideal for mobile application scenarios.</p>
      <p>With regard to RPi cluster performance in the execution of ML algorithms, [SGS+] describe the performance
of an 8-node Model 2B cluster. The researchers concluded that RPi clusters offer good tradeoffs between energy
consumption and execution time, though better support for parallel execution is needed to improve performance.
Such optimisations are demonstrated in [CBLM], with the 28-node Model 2B cluster outperforming a traditional
multicore server, in terms of processing power and power consumption, using only 12 nodes.</p>
      <p>RPi clusters have been proposed for use in educational settings, to teach distributed computing [DZ, PD, Tot],
as servers for container-based applications [PHM+] and as research instruments, e.g. to analyse data from social
networks [dABV] or in security research [DAP+].</p>
      <p>Overall, while significant interest in the research community has been shown towards RPi clusters, there is
presently no work demonstrating their performance in ML roles except [SGS+]. Hence, our goal for this paper
is to investigate the performance of RPi clusters in an edge ML role, using the latest RPi Model 4B, which
overcomes some of the previous network and memory constraints.</p>
    </sec>
    <sec id="sec-3">
      <title>RPi cluster configuration</title>
      <p>Our cluster consists of 6 RPi4 Model B devices (Fig. 1), with 4GB RAM available on each node. Additionally,
the devices were equipped with a 64GB MicroSD card with a V30 write speed rating (30MB/s). The devices
were connected to a Gigabit Ethernet switch to take full advantage of the speed improvements in the network
card. The cluster was configured with a Hadoop distributed file system (3.2.0) and Apache Spark (2.4.3). As
such, we are able to leverage Spark's MLlib algorithms for distributed machine learning.
For the distributed file storage, we set the number of file replicas to 4. Since Hadoop is installed in the cluster,
we use the YARN resource manager for the execution of Spark jobs. Each YARN container was configured
with 1920MB of RAM (1.5GB + 384MB overhead), leaving 1GB available for YARN master execution and 1GB
of RAM for the operating system (Raspbian 10 - Buster). Although the cluster can be configured to run with
multiple Spark executors on each device, we opted for a "fat executor" strategy, meaning one executor per device.
Additionally, as a baseline configuration scenario (S-base), we opted to reserve one processor core on each
device for use by the operating system, thus resulting in 3 cores per executor. Jobs were submitted from within
the cluster; therefore, one device always played the role of the client and one device played the role of the application
master (YARN), thus up to 4 devices were available as executors to run the Spark jobs.</p>
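      <p>For illustration, the baseline allocation described above corresponds roughly to a spark-submit invocation along the following lines. This is a sketch only: the application class and jar names are hypothetical, and we assume client deploy mode since jobs were submitted from within the cluster.</p>
      <preformat># HDFS block replication (hdfs-site.xml): dfs.replication = 4

# S-base: 4 "fat" executors, 3 cores each,
# 1.5GB executor heap + 384MB overhead per YARN container
spark-submit \
  --master yarn --deploy-mode client \
  --num-executors 4 \
  --executor-cores 3 \
  --executor-memory 1536m \
  --conf spark.executor.memoryOverhead=384 \
  --class TourismApp tourism-app.jar</preformat>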
      <sec id="sec-3-1">
        <title>Datasets and ML algorithms</title>
        <p>We used two datasets for our experiments. First, we perform experiments using a large dataset, in this case
the used car classifieds dataset1. Secondly, since we aim to apply the RPi cluster in a tourism recommender
system for Greece, as part of an ongoing project, we used the global-scale check-ins dataset [YZQ] and the Greek
weather dataset2. From the former, we selected all check-ins made in the Attica region of Greece, and we fused
the resulting data with the historical weather information from the latter dataset. Both datasets were uploaded
to the Hadoop distributed file system in the cluster.</p>
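        <p>A minimal Scala sketch of this selection and fusion step follows; the HDFS paths, column names and the Attica bounding box are assumptions for illustration, not the actual schemas of the two datasets.</p>
        <preformat>// Sketch: filter check-ins to Attica and fuse with historical weather.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("data-fusion").getOrCreate()

val checkins = spark.read.option("header", "true")
  .option("inferSchema", "true").csv("hdfs:///data/checkins.csv")
val weather = spark.read.option("header", "true")
  .option("inferSchema", "true").csv("hdfs:///data/greek_weather.csv")

// Keep only check-ins inside a rough bounding box for the Attica region
val attica = checkins
  .filter(col("lat").between(37.7, 38.4))
  .filter(col("lon").between(23.2, 24.1))

// Fuse with the weather data on the calendar date of each check-in
val fused = attica
  .withColumn("date", to_date(col("utcTimestamp")))
  .join(weather, Seq("date"), "inner")

fused.write.mode("overwrite").parquet("hdfs:///data/checkins_weather")</preformat>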
        <p>The used cars dataset was used to perform simple linear regression, using car model year and odometer reading
to predict its sale price. The check-ins dataset was used with the decision tree classification algorithm. In this
case, by providing the geographical coordinates, month, day, hour and daily mean, high and low temperatures,
the target was to determine the type of venue a user might check into (a multi-class classification task). This
can be used to recommend types of venue that a user might like to visit depending on their current context.
1 https://www.kaggle.com/austinreese/craigslist-carstrucks-data
2 https://www.kaggle.com/spirospolitis/greek-weather-data</p>
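        <p>For the check-ins task, the pipeline differs mainly in the label encoding and the classifier. A hedged Scala sketch, continuing from the fused dataset above (again, column names are assumptions):</p>
        <preformat>// Venue-type classification sketch; columns are assumed, not the real schema.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.DecisionTreeClassifier

val data = spark.read.parquet("hdfs:///data/checkins_weather")
  .withColumn("month", month(col("date")))
  .withColumn("day", dayofmonth(col("date")))
  .withColumn("hour", hour(col("utcTimestamp")))

// Encode the venue category string as a numeric class label (> 300 classes)
val indexed = new StringIndexer()
  .setInputCol("venueCategory").setOutputCol("label")
  .fit(data).transform(data)

// Assemble the context features into a single feature vector
val prepared = new VectorAssembler()
  .setInputCols(Array("lat", "lon", "month", "day", "hour",
                      "meanTemp", "highTemp", "lowTemp"))
  .setOutputCol("features")
  .transform(indexed)

val Array(train, test) = prepared.randomSplit(Array(0.7, 0.3), seed = 42)
val dtModel = new DecisionTreeClassifier().fit(train) // MLlib defaults</preformat>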
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment 1 - programming language performance</title>
      <p>Apache Spark provides programming interfaces for the Python language, popular amongst ML developers, and
also uses Scala natively. Since Python is a dynamically typed and interpreted language, whereas Scala is statically
typed and compiled, Scala should provide a performance advantage for applications in our cluster; however, this
depends on the type and volume of data used. In this first experiment, we compare the performance of a simple
application written in both languages. The application includes the following tasks in sequence (a minimal Scala
sketch of the full sequence follows the list):
1. Load the dataset from HDFS into a Spark DataFrame
2. Select feature and label columns
3. Filter out samples with NULL values
4. Assemble the feature columns into a single feature vector and append it to the dataset
5. Split the dataset into training and test datasets (0.7, 0.3)
6. Train a machine learning model
7. Perform predictions on the test dataset</p>
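      <p>A minimal Scala sketch of steps 1-7 for the linear regression case is given below; the HDFS path and column names are assumptions rather than the dataset's exact schema.</p>
      <preformat>// Steps 1-7 as a minimal Spark MLlib sketch (path and columns assumed).
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

val spark = SparkSession.builder.appName("cars-lr").getOrCreate()

// 1. Load the dataset from HDFS into a Spark DataFrame
val raw = spark.read.option("header", "true")
  .option("inferSchema", "true").csv("hdfs:///data/vehicles.csv")

// 2.-3. Select the feature and label columns, drop samples with NULLs
val df = raw.select("year", "odometer", "price").na.drop()

// 4. Assemble the feature columns into a single vector column
val prepared = new VectorAssembler()
  .setInputCols(Array("year", "odometer")).setOutputCol("features")
  .transform(df)

// 5. Split the dataset into training and test sets
val Array(train, test) = prepared.randomSplit(Array(0.7, 0.3), seed = 42)

// 6. Train a linear regression model with the sale price as label
val model = new LinearRegression().setLabelCol("price").fit(train)

// 7. Run predictions on the held-out test set
val predictions = model.transform(test)</preformat>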
        <p>For this first experiment, we used the cars dataset to train a linear regression model. As a result of the data
cleansing process, the final dataset contained 9,585,316 samples. We measured the time taken to complete the
data loading, the model training and the predictions tasks, as shown in Fig. 2. From these results we can see that
the Scala program executes more slowly when the number of executors is small; however, it achieves parity or
even outperforms Python in execution time as the number of executors increases. Based on these results, we
chose to proceed with the rest of the experimentation using Scala as the programming language.
[Fig. 2: (a) Model training; (b) Predictions on the test set]</p>
    </sec>
    <sec id="sec-4-1">
      <title>Experiment 2 - ML algorithm performance</title>
      <p>Next, we implemented two ML algorithms to assess the cluster's performance. In this case, we were
interested in determining how the number of executors and the number of cores per executor affect performance in
the implementation of different ML algorithms. As such, we retain the baseline configuration scenario S-base
(fixed n_cores/executor = 3, variable n_executors ∈ [1, 4]) and add a further scenario (fixed n_executors = 4, variable
n_cores/executor ∈ [1, 4]). This scenario is termed S-core henceforth. Note that, despite standard practice, in
S-core we allocate up to 4 cores (the maximum available), to investigate full resource utilisation contesting
the OS requirements. The sequence of tasks was identical to the previous experiment, changing of course the
type of model to be trained. Additionally, we implemented three extra steps (sketched below):
8. Write the trained model to distributed storage
9. Load the pre-trained model from distributed storage
10. Perform a single prediction given a random feature vector from the test set</p>
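      <p>Continuing the earlier sketch, steps 8-10 map directly onto MLlib's model persistence API (the model path is hypothetical):</p>
      <preformat>// Steps 8-10, continuing the linear regression sketch above.
import org.apache.spark.ml.regression.LinearRegressionModel

// 8. Write the trained model to distributed storage
model.write.overwrite().save("hdfs:///models/cars-lr")

// 9. Load the pre-trained model from distributed storage
val reloaded = LinearRegressionModel.load("hdfs:///models/cars-lr")

// 10. Predict for a single feature vector drawn from the test set
reloaded.transform(test.limit(1)).select("prediction").show()</preformat>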
      <p>These additional tasks emulate the concept of batch training at regular intervals on the edge node, and of using
the pre-trained models for application purposes. For the second experiment, we used both datasets. We note
that as a result of the data pre-processing task, the final check-ins dataset contains 87,908 samples.</p>
        <sec id="sec-4-1-1">
          <title>Application startup overhead</title>
          <p>First, we measured the overhead time taken to obtain the necessary SparkContext environment (i.e. assigning a
YARN application master and attaching executor nodes to the process). From Fig. 3 we note that the required
overhead fluctuates but remains roughly constant in all cases (S-base cars: μ = 29.322s, σ = 2.419s; check-ins: μ = 27.089s, σ = 1.278s. S-core cars: μ = 30.271s, σ = 1.791s; check-ins: μ = 28.013s, σ = 0.466s). Of course,
this overhead is required only at application startup and is not incurred for every request, when the application
is written as a server waiting to receive ML result requests.</p>
          <p>Another metric is the time taken to load the dataset from HDFS storage as a Spark Dataframe. As seen from
Fig. 4, the dataset loading time is almost constant, demonstrating that any overhead comes from the HDFS
access process and not Spark itself (S-base cars: μ = 20.259s, σ = 0.243s; check-ins: μ = 20.374s, σ = 0.263s. S-core cars: μ = 20.077s, σ = 0.182s; check-ins: μ = 20.372s, σ = 0.071s). Notably, both datasets fit comfortably
within the memory allocated to each executor container.</p>
          <p>After data is loaded into Spark, the first operation on the data is transformation, including cleansing null
samples, re-casting data columns to appropriate data types, assembling the feature vector column and encoding
the prediction label (for multi-class classification). As seen in Fig. 5, data transformation is more intensive for the
check-ins dataset, as a result of the encoding of the prediction label (&gt; 300 labels). Interestingly, while in all other
cases the transformation times remain constant, for the S-base configuration we note that the transformation
time increases with more than 2 executors. This is the result of the distribution of the mapping operation of
the dataset across multiple nodes and the overhead caused by the communication requirements in reducing and
aggregating results across more nodes (S-base cars: μ = 0.828s, σ = 0.251s; check-ins: μ = 16.518s, σ = 5.1923s. S-core cars: μ = 0.804s, σ = 0.259s; check-ins: μ = 21.736s, σ = 0.544s).</p>
          <p>Next, we investigate the time required to train the models in each scenario. Notably, in all cases, increasing the
number of executors or cores per executor yields a performance advantage, even if small. This effect is significantly
more pronounced for the cars dataset, which is much larger in size (S-base cars: μ = 371.218s, σ = 197.369s;
check-ins: μ = 60.329s, σ = 9.636s. S-core cars: μ = 108.332s, σ = 108.332s; check-ins: μ = 56.893s, σ = 11.684s). A further observation is that allowing access to the 4th core (S-core) that is typically reserved for the
operating system doesn't particularly improve performance. Notably, the average time to train the models is
not prohibitive, even in the least favourable conditions, and does not exceed a few minutes of execution time.
This demonstrates that periodic training of the edge-based models is feasible and can be performed comfortably
in times of low resource demand, even when using large datasets.</p>
          <p>In terms of the time required to evaluate test sets, again we note that the larger dataset (cars) benefits from multiple
executors and more cores per executor (Fig. 7). As before, allowing access to the additional core in the
S-core scenario doesn't improve performance. Finally, it is noteworthy that for the smaller check-ins dataset,
the execution time for the prediction set is sub-second (S-base cars: μ = 256.117s, σ = 157.592s; check-ins: μ = 0.159s, σ = 0.114s. S-core cars: μ = 173.103s, σ = 68.294s; check-ins: μ = 0.176s, σ = 0.156s). These results
demonstrate that the cluster is able to resolve predictions on even very large sets in under 2 minutes.</p>
          <p>Related to these results, we report that the prediction of a single feature vector is almost instantaneous, across
all conditions, often requiring sub-millisecond execution time (S-base cars: μ = 0.004s, σ = 0.001s; check-ins: μ = 0s, σ = 0.001s. S-core cars: μ = 0.003s, σ = 0.001s; check-ins: μ = 0s, σ = 0s). Additionally, the time
required to store and load the trained models is very short, as can be seen in Fig. 8 (Saving: S-base cars: μ = 5.152s, σ = 0.538s; check-ins: μ = 6.323s, σ = 0.145s. S-core cars: μ = 4.951s, σ = 0.489s).</p>
          <p>Finally, we used the SparkLint package to gather statistics from job execution history about CPU utilisation.
From these results we note that as the number of executors increases (S-base), the level of CPU utilisation across
the entire job decreases, meaning that an increased number of executors affords the cluster more capacity to run
additional parallel tasks, as can be expected (Fig. 9). However, with the maximum number of executors (S-core),
additional core allocation does not affect CPU utilisation when more than 2 cores are allocated, showing that the
additional resources are indeed utilised to decrease overall execution time. Further, plotting the two extremes of
this scenario (1 core/executor and 4 cores/executor), we see that the resource utilisation varies significantly. In
the former case, most work is carried out using 1 or 2 cores in the cluster (i.e. 1 - 2 executors), while in the latter
case, the majority of task execution is spread across all available 16 cores, leading to the reduced execution
time (Fig. 10). In this figure, the grey area is idle time (data has been transferred to the driver node), yellow
is node-local (data and code reside on the same node), orange is rack-local (processing where data needs to be
fetched from another node) and finally green represents local (in-memory) execution time. As part of the job
analysis, the main aim here is to minimise the grey area, which means that the cluster resources are not being
utilised, and as can be seen, the greater number of cores per executor achieves this goal.</p>
        </sec>
      </sec>
    <sec id="sec-5">
      <title>Training vs. Accuracy tradeoffs</title>
      <p>In the preceding analyses, the ML model parameters used were the default values set by Spark's MLlib.
Specifically for the decision tree (check-ins dataset) case, the generated model is quite simple (max depth: 5, min
information gain: 0, min instances per node: 1, information gain measure: gini index). To assess the cluster's
performance, we considered a scenario where types of venue to check into could be recommended for a large number
of cases, roughly corresponding to the number of venues in a typical city center. We took 5% of the dataset
for this purpose (4395 cases) and trained the decision tree on the remaining 95% of the data, using another ML
environment (RapidMiner Studio) for convenience, and found (using random parameter search) that a very good
performance of 89.15% accuracy can be achieved (max depth: 30, min information gain: 0.01, min instances per
node: 2, information gain measure: entropy). Running this experiment on the RPi cluster, however, yielded a
surprise. While in the RapidMiner environment training took a few seconds, on the RPi cluster the
process failed due to insufficient memory on the Java heap, after several minutes of processing. As a reference,
YARN containers on the cluster consume up to 2.5GB RAM, including 2GB for Spark executors and the related
overhead (384MB). To lighten the load, we experimented to find a smaller tree complexity (depth) and training
set size that would yield comparable performance using RapidMiner. We found that a max depth of 15 and
a sample size at 40% of the original (33405 samples) yielded a good compromise (see Fig. 11). Thus, to assess
cluster performance in a more realistic scenario, we ran the experiment again for different sizes of the training
dataset between 10 and 40%, predicting on the same number of cases, to assess training time and performance
tradeoffs. As shown in Fig. 12, an increase of the training set size leads to expected increases in training time,
but not necessarily accuracy. For reference, a fair performance of 72.33% accuracy is attainable with 5m48s of
training time using 20% of the original training set (16702 samples). A sketch of the tuned decision tree
configuration follows.</p>
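      <p>The tuned configuration translates directly into MLlib's DecisionTreeClassifier parameters; a sketch of the depth-15 compromise, reusing the train split from the earlier check-ins sketch:</p>
      <preformat>// The depth-15 compromise configuration, expressed as MLlib parameters.
import org.apache.spark.ml.classification.DecisionTreeClassifier

val tunedTree = new DecisionTreeClassifier()
  .setMaxDepth(15)           // down from the depth-30 tree that exhausted the heap
  .setMinInfoGain(0.01)
  .setMinInstancesPerNode(2)
  .setImpurity("entropy")    // instead of the default gini

// Train on a 40% sample of the training data (the identified compromise)
val sampled = train.sample(withReplacement = false, fraction = 0.4, seed = 42)
val compromiseModel = tunedTree.fit(sampled)</preformat>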
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>In the preceding sections, we have demonstrated the ability to run a small RPi cluster as an edge computing
resource, using industry standards such as the Hadoop distributed file system, and Apache Spark for machine
learning and data analytics. To the best of our knowledge, this is the first work to present an analytical evaluation
of the RPi Model 4B in a cluster configuration for ML tasks. We have demonstrated that the performance of this
cluster is sufficient for the purposes of both training and executing ML models in an edge computing context.
One of the most encouraging results from our analysis is the short time required to load pre-trained ML models
and execute predictions with very fast speed. However, significant additional work remains.</p>
      <p>Firstly, we have run two popular, albeit "lightweight" ML algorithms (linear regression and decision trees).
The performance of the cluster should be evaluated using more complex models supported by Spark's MLlib
(e.g. gradient boosted trees). Additionally, third-party ML libraries such as DeepLearning4J and TensorFlow
should be tested for performance, since they support various implementations of artificial neural networks which
are better suited for heavier tasks (e.g. image classification and NLP tasks). Another aspect to examine is the
size of dataset that can be handled by the cluster. Even though we have tested with a relatively large and a
smaller dataset, more analysis is required to identify the performance tradeoffs between dataset size and speed
of model training. Additionally, we need to test the cluster's capacity to serve under various request loads.
We have noted that single feature vectors can be regressed or classified with sub-millisecond timing, but a real
on-line application processing multiple simultaneous user/device requests or handling streaming data (e.g. from
IoT devices or social network feeds) will place strain on the cluster's ability to respond in real-time. As a final
note, we highlight that our cluster is quite a small setup. This is intentional, since applications
requiring edge computing infrastructures may have strict form factor and physical size limitations. However, it
would be interesting to see how performance scales with additional nodes in the cluster.</p>
      <sec id="sec-6-1">
        <title>Acknowledgements</title>
        <p>Mr. Antonis Frengkou and Mr. Spyros Drimalas helped with the resources necessary for this experiment. Research
in this paper was funded by the Hellenic Government NSRF 2014-2020 (Filoxeno 2.0 project, T1EDK-00966).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[AHP+] P. Abrahamsson, S. Helmer, N. Phaphoom, L. Nicolodi, N. Preda, L. Miori, M. Angriman, J. Rikkila, X. Wang, K. Hamily, and S. Bugoloni. Affordable and Energy-Efficient Cloud Computing Clusters: The Bolzano Raspberry Pi Cloud Cluster Experiment. In 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, volume 2, pages 170–175.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[APZ] Yuan Ai, Mugen Peng, and Kecheng Zhang. Edge computing technologies for Internet of Things: A primer. 4(2):77–86.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[CBLM] K. Candelario, C. Booth, A. S. Leger, and S. J. Matthews. Investigating a Raspberry Pi cluster for detecting anomalies in the smart grid. In 2017 IEEE MIT Undergraduate Research Technology Conference (URTC), pages 1–4.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[CPW] Michael F. Cloutier, Chad Paradis, and Vincent M. Weaver. A Raspberry Pi Cluster Instrumented for Fine-Grained Power Measurement. 5(4):61.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[CCB+] Simon J. Cox, James T. Cox, Richard P. Boardman, Steven J. Johnston, Mark Scott, and Neil S. O'Brien. Iridis-pi: A low-cost, compact demonstration cluster. 17(2):349–358.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[dABV] Mariano d'Amore, Rodolfo Baggio, and Enrico Valdani. A Practical Approach to Big Data in Tourism: A Low Cost Raspberry Pi Cluster. In Iis Tussyadiah and Alessandro Inversini, editors, Information and Communication Technologies in Tourism 2015, pages 169–181. Springer International Publishing.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[DZ] Kevin Doucet and Jian Zhang. Learning Cluster Computing by Creating a Raspberry Pi Cluster. In Proceedings of the SouthEast Conference, ACM SE '17, pages 191–194. ACM.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[HT] Wajdi Hajji and Fung Po Tso. Understanding the Performance of Low Power Raspberry Pi Cloud for Big Data. 5(2):29.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[KS] C. Kaewkasi and W. Srisuruk. A study of big data processing constraints on a low-power Hadoop cluster. In 2014 International Computer Science and Engineering Conference (ICSEC), pages 267–272.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[LOD] H. Li, K. Ota, and M. Dong. Learning IoT in Edge: Deep Learning for the Internet of Things with Edge Computing. 32(1):96–101.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[MEM+] A. Mappuji, N. Effendy, M. Mustaghfirin, F. Sondok, R. P. Yuniar, and S. P. Pangesti. Study of Raspberry Pi 2 quad-core Cortex-A7 CPU cluster as a mini supercomputer. In 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), pages 1–4.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[PHM+] C. Pahl, S. Helmer, L. Miori, J. Sanin, and B. Lee. A Container-Based Edge Cloud PaaS Architecture Based on Raspberry Pi Clusters. In 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), pages 117–124.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[PD] A. M. Pfalzgraf and J. A. Driscoll. A low-cost computer cluster for high-performance computing education. In IEEE International Conference on Electro/Information Technology, pages 362–366.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[QJK+] Basit Qureshi, Yasir Javed, Anis Koubâa, Mohamed-Foued Sriti, and Maram Alajlan. Performance of a Low Cost Hadoop Cluster for Image Analysis in Cloud Robotics Environment.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[SGS+] João Saffran, Gabriel Garcia, Matheus A. Souza, Pedro H. Penna, Márcio Castro, Luís F. W. Góes, and Henrique C. Freitas. A Low-Cost Energy-Efficient Raspberry Pi Cluster for Data Mining Algorithms. In Frédéric Desprez, Pierre-François Dutot, Christos Kaklamanis, Loris Marchal, Korbinian Molitorisz, Laura Ricci, Vittorio Scarano, Miguel A. Vega-Rodríguez, Ana Lucia Varbanescu, Sascha Hunold, Stephen L. Scott, Stefan Lankes, and Josef Weidendorfer, editors, Euro-Par 2016: Parallel Processing Workshops, Lecture Notes in Computer Science, pages 788–799. Springer International Publishing.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[Tot] D. Toth. A Portable Cluster for Each Student. In 2014 IEEE International Parallel Distributed Processing Symposium Workshops, pages 1130–1134.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[TWJ+] F. P. Tso, D. R. White, S. Jouet, J. Singer, and D. P. Pezaros. The Glasgow Raspberry Pi Cloud: A Scale Model for Cloud Computing Infrastructures. In 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops, pages 108–112.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[YZQ] Dingqi Yang, Daqing Zhang, and Bingqing Qu. Participatory Cultural Mapping Based on Collective Behavior Data in Location-Based Social Networks. 7(3):30:1–30:23.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>