11th International Workshop on Science Gateways (IWSG 2019), 12-14 June 2019

Big data and machine learning framework for clouds and its usage for text classification

István Pintye, Eszter Kail, Péter Kacsuk
Institute for Computer Science and Control, Hungarian Academy of Sciences, Budapest, Hungary
pintye.istvan@sztaki.mta.hu

Péter Kacsuk
University of Westminster, London, UK
P.Kacsuk@westminster.ac.uk

Abstract—The paper describes a big data and AI application development and execution framework that was originally developed for MTA Cloud (an OpenStack based cloud) but can also be used on other clouds, including Amazon, OpenStack, OpenNebula and CloudSigma. The paper explains the concept and components of the big data and AI environment and illustrates its usage by a text classification application.

Keywords—machine learning; big data; parallel and distributed execution; cloud

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

I. INTRODUCTION

Research in different scientific fields (natural sciences, physics, political science) often requires huge computational resources and storage capacity to handle real Big Data. Traditional sequential data processing algorithms are not sufficient to analyze such large volumes of data; efficient processing and analysis require new approaches, techniques and tools.

Moreover, cloud infrastructures and services are becoming ever more popular and play an increasingly important role in addressing the computational needs of many scientific and commercial Big Data applications. Their widespread usage is a consequence of the dynamic and scalable nature of the services provided by cloud providers.

However, data scientists face several problems once they start planning the use or deployment of any Big Data platform on clouds. On one hand, the selection of the appropriate cloud provider is always a cumbersome process, since the potential user community has to weigh several factors and trade-offs even if they need only a generic Infrastructure-as-a-Service (IaaS) provider: a private institutional cloud (e.g. SZTAKI Cloud [1]), a federated cloud (e.g. MTA Cloud [2] or the pan-European EGI FedCloud [3]) or a public cloud (e.g. Amazon [4]).

The Hungarian Academy of Sciences (MTA) provides free IaaS cloud (MTA Cloud) services for research communities: easy-to-use, dynamic infrastructures adapted to the actual project requirements. MTA Cloud was established to accelerate research for the scientists of MTA. Nearly 100 projects have been run on MTA Cloud since its opening, and more and more projects require Big Data and machine learning applications. However, the many AI tools available for clouds are very complex, and their proper deployment and configuration require significant learning of both the tools and the underlying cloud. Furthermore, tools supporting different layers (user interface layer, language layer, machine learning layer, deep learning layer) are not always compatible, and hence further skill is required to select the right tools from each layer in a way that they are able to work together in an AI environment.

Recognizing this problem, we have decided to develop so-called AI reference architectures that support the solution of certain AI application classes, run in the cloud in a reliable and robust way, and can easily be deployed and used by the end-user scientists. The ultimate goal is to develop a large set of AI reference architectures for a large set of various AI problem classes.

The AI reference architectures have been created in three steps:

1. Development and publication of a cloud orchestrator called Occopus that enables the fast creation of complex application frameworks in the cloud, based on Occopus infrastructure descriptors, even by novice cloud users.

2. Development and publication of the Occopus infrastructure descriptors for generic AI reference architectures, for example: Jupyter, Python, Spark ML, Spark cluster and HDFS.
3. Development and publication of application-oriented environments for various AI application domains.

To demonstrate the third step, we use a text classification application provided by the POLTEXT (Text Mining of Political and Legal Texts) Incubator Project of the MTA Centre for Social Sciences. This problem is complex enough to demonstrate the advantages of using the framework we have created for supporting big data and AI applications.

The structure of the paper is as follows. The next section introduces the IaaS MTA Cloud and the major services used to create the big data and AI development and execution framework. Section III introduces our text classification example with a detailed stepwise specification. Section IV summarizes the lessons learned from this real use case performed on the MTA community cloud.

II. COMPONENTS AND SERVICES OF THE BIG DATA AND AI FRAMEWORK

A. MTA Cloud and Occopus

MTA Cloud was founded in 2015, when the Wigner Data Center and the Institute for Computer Science and Control (MTA SZTAKI) collaborated to establish a community cloud for the member institutes of the Hungarian Academy of Sciences. MTA Cloud currently has more than 80 active projects from over 20 research institutes, including among others the Institute for Nuclear Research, the Research Centre for Astronomy and Earth Sciences and other academic and research institutes.

In order to raise the abstraction level of the IaaS MTA Cloud, we have developed Occopus, a cloud orchestrator and manager tool by which complex infrastructures like Hadoop or Spark clusters can easily be built based on predeveloped and published Occopus infrastructure descriptors.
The Occopus cloud orchestrator can be deployed in MTA Cloud by any user, and once Occopus is deployed it can be used to build the selected infrastructure (e.g. a Spark cluster) in MTA Cloud. A tutorial explaining the deployment of Occopus is available on the web page of MTA Cloud (in Hungarian) [2]. The novelty of Occopus was described and compared with similar cloud orchestrators in [6]. Here we mention only one of its main advantages: its plugin architecture enables the use of plugins for various cloud systems, and hence AI reference architectures created by Occopus are easily portable among cloud systems such as Amazon, Azure, OpenStack, OpenNebula and CloudSigma.

B. Support for parallel data storage and processing – Apache Hadoop

Apache Hadoop is an open source software platform for distributed storage and processing of very large data sets on computer clusters. Thanks to its special storage method, based on a distributed file system (HDFS, Hadoop Distributed File System [7]), Hadoop can efficiently process terabytes of data in just minutes, and even petabytes in hours.

Hadoop uses the MapReduce [8] paradigm, which was proposed by Google and has found widespread popularity. HDFS has a master/slave architecture: apart from the client machine, the nodes are either Master nodes or Slave nodes. The Master nodes supervise the mechanism of data storage in HDFS and the parallel computations (MapReduce) running on that data. An HDFS cluster consists of a single NameNode and a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. The NameNode oversees and coordinates the data storage function. Internally, a file is split into one or more blocks, which are stored in a set of DataNodes; the NameNode provides a map of where the data blocks are in the cluster. The JobTracker oversees and coordinates the parallel processing of data using MapReduce. Slave nodes make up the vast majority of machines; they store the data and run the computations. Each slave runs both a DataNode and a TaskTracker daemon that communicate with and receive instructions from their Master nodes.

The Occopus infrastructure descriptors for such a Hadoop/HDFS cluster have been developed at SZTAKI and are published on the web page of MTA Cloud [2] as well as on the web page of Occopus [9].

C. Support for high performance, distributed data processing – Apache Spark

Apache Spark [10] is an open source, fast and general-purpose cluster framework designed to run high performance data analysis applications. Instead of Apache Hadoop's MapReduce programming paradigm [8], it performs its internal computational data processing in memory, which results in more flexible and faster runs. Spark uses a parallel data processing framework that stores data in memory and, only if necessary, on disk. This approach can be up to ten times faster than Hadoop MapReduce data processing [8].

Apache Spark is written in Scala, and among its most important features are its highly developed, easy-to-use APIs for Scala, Java, Python and R, designed specifically for handling large data sets. From an engineering perspective these APIs provide the biggest advantage and are the main reason for choosing the Spark framework. In addition to the Spark Core API, there are further libraries in the Spark ecosystem providing additional capabilities for large-scale data analysis and machine learning. These include Spark SQL for structured data processing, MLlib [11] for machine learning, etc.

It is important to emphasize that Apache Spark is not a substitute for Apache Hadoop but rather a kind of extension of it. Spark has been designed to read and write data from Hadoop's own distributed file system (HDFS) as well as from other storage systems such as HBase or Amazon S3.

D. Spark Machine Learning library

Apache Spark MLlib [11] is the Apache Spark machine learning library, consisting of common learning algorithms and utilities. As a core component library of Apache Spark, MLlib offers numerous supervised and unsupervised learning algorithms, from logistic regression to k-means clustering, collaborative filtering, dimensionality reduction, and the underlying optimization primitives.

As the next step of building a Big Data and AI oriented environment for MTA Cloud users, we have developed the Occopus infrastructure descriptors for Spark/HDFS clusters and published them both on the web page of MTA Cloud [2] and on the web page of Occopus [9].
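To give an impression of how the Python API mentioned above is typically used on such a Spark/HDFS cluster, the following minimal sketch creates a Spark session and caches a data set read from HDFS. The master URL, host names and path are illustrative assumptions, not values taken from the MTA Cloud deployment.

```python
# Minimal PySpark sketch: connect to a Spark cluster and cache data read
# from HDFS. Master URL, host names and path are assumed placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mllib-on-spark-hdfs")
         .master("spark://spark-master:7077")   # assumed cluster address
         .getOrCreate())

# Spark keeps the data in memory across jobs and spills to disk only
# when it does not fit, as described in Section II.C.
corpus = spark.read.text("hdfs://namenode:9000/user/demo/corpus/*.txt")
corpus.cache()
print(corpus.count())   # the first action materializes and caches the data
```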
E. Interactive Development Environments

With the above-mentioned frameworks, big data and machine learning algorithms can easily be executed in a parallel manner. In order to support scientists from different research fields, we also provide interactive development environments that are easy to use with various programming languages and are very popular among the research communities.

RStudio [12] is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. RStudio Desktop is a standalone desktop application that in no way requires or connects to RStudio Server. RStudio Web Server is a Linux server application that provides a web browser based interface to the version of R running on the server. Deploying R and RStudio on a server has a number of benefits: the ability to access the R workspace from any computer at any location; sharing of code, data and other files with colleagues; allowing multiple users to share access to the more powerful computing resources available on a server; controlling access to data in a centralized manner; and centralized installation and configuration of R, R packages and other libraries.

Jupyter Notebooks [13] are becoming extremely popular, especially in education and empirical research. Jupyter's great success stems from the clear advantages of literate programming and improved web browser technologies. Literate programming is a software development style pioneered by the Stanford computer scientist Donald Knuth. It allows users to formulate and describe their thoughts in prose, supplemented by mathematical equations, as they prepare to write code blocks. It excels at demonstration, research and teaching objectives, especially for science.

There are many free and open source Jupyter Notebook codes on numerous topics in many scientific disciplines, such as machine learning, social sciences, physics and computer science. Notebooks support LaTeX mathematical equations through MathJax, a web browser enhancement for displaying mathematics. Notebooks can be saved and easily shared in the ipynb JSON format, and they can be committed to version control repositories such as git and the code sharing site GitHub. Jupyter notebooks can also be viewed with the nbviewer technology, which is supported by GitHub. Moreover, because these notebook environments are made for writing and developing code, they offer many niceties available in typical IDEs, such as code completion and easy access to help.

As part of the second step of providing generic big data and AI platforms for scientists, we have extended the Spark/HDFS cluster with both RStudio Web Server and Jupyter Notebook and created the necessary Occopus infrastructure descriptors. As a result, two types of Spark-oriented reference architecture can be deployed by Occopus on MTA Cloud, depending on the actual needs of the users:

1. RStudio Web Server, Spark, HDFS for R users
2. Jupyter Notebook, Spark, HDFS for Python, Scala and Java (from version 9) users

These reference architectures are the starting points for the actual big data or AI applications.
III. TEXT CLASSIFICATION SCENARIO

The third step was the usage of the developed reference architectures for various big data and AI application domains. In this paper we have selected the text classification domain to illustrate the usage of the Spark-oriented reference architecture.

The MTA Centre for Social Sciences wanted to solve the following problem on MTA Cloud. The coding of public policy major topics on various legal and media corpora serves as an important input for testing a wide range of hypotheses and models in political science. This fundamental work has until recently mostly been conducted by double-blind human coding, which is still considered the gold standard for categorizing text in this field. This method, however, is both rather expensive and increasingly unfeasible with the growing size of the available corpora. Different forms of automated coding, such as dictionary-based and supervised learning methods, offer a solution to these problems. However, these methods are themselves reliant on appropriate dictionaries and/or training sets, which need to be compiled and developed first.

We have provided the architecture described in Section II/E for them and at the same time demonstrated how to use this architecture for solving their problem. After the demonstration they started to use the RStudio version of the framework, while we also investigated possible solutions for the problem using the Jupyter Notebook version. Here we show our approach to solving the problem. The steps of solving the above described text classification problem are shown in Figure 1. This simple figure in fact represents several different execution pipelines, depending on the choice of the user. With the use of the Jupyter Notebook, Spark, HDFS architecture we were able to execute and evaluate the different classification pipelines in parallel. In the next paragraphs the stages of our Spark-based pipelines are detailed.

[Figure 1. Text processing pipeline: Read Text (raw text files from the Hadoop Distributed File System) → Structuring (Spark RDD to Spark DataFrame) → Tokenization (split text into arrays of strings) → Feature Vectorization (Bag of Words / CountVectorizer, TF-IDF, Word2Vec) → ML Processing (Naive Bayes, Random Forest, Logistic Regression, Neural Net / Convolutional Net) → Evaluation (measuring the accuracy of a particular machine learning method on unseen data) → Categorized Text; the best method is chosen.]

1) Distribute the data

The first stage is to upload the data (text) into the HDFS system in an appropriate form. First a Resilient Distributed Dataset (RDD), the basic data structure of Spark, is built by dividing the dataset into logical partitions. These partitions can be computed in parallel on different nodes of the cluster.

2) Data structuring

The second stage is the data structuring step. Spark SQL is the Spark module for structured data processing. It supports operating on a variety of data sources through the DataFrame API. A DataFrame is a distributed collection of data organized into named columns; it corresponds to a table in a relational database system or a data frame in R/Python. A DataFrame contains rows with a schema, and it can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster. A DataFrame can be operated on using relational transformations such as filter, select, group by and sort. Like Apache Spark in general, Spark SQL in particular is all about distributed in-memory computations at scale.
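A minimal sketch of what these first two stages might look like in PySpark on the Jupyter Notebook version of the architecture follows. The file layout (one tab-separated label and document per line), the path, the partition count, and the assumption that labels are already indexed 0..k-1 are all made up for illustration.

```python
# Hypothetical sketch of stages 1-2: distribute raw text as an RDD, then
# impose a schema and switch to the DataFrame API. The path, the input
# format ("<label>\t<document>") and the partition count are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("poltext-structuring").getOrCreate()
sc = spark.sparkContext

# Stage 1: build an RDD; Spark splits it into logical partitions that the
# worker nodes can process in parallel.
lines = sc.textFile("hdfs:///user/poltext/labelled_corpus.tsv", minPartitions=80)

# Stage 2: structure the data into a DataFrame with named columns.
# Labels are assumed to be already indexed as 0..k-1.
rows = (lines.map(lambda line: line.split("\t", 1))
             .map(lambda parts: (float(parts[0]), parts[1])))
df = spark.createDataFrame(rows, ["label", "text"])

df.printSchema()
df.groupBy("label").count().show()   # relational transformations on the DataFrame
```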
3) Text pre-processing

The stored and structured data should be transformed into an appropriate input form for the machine learning algorithm (e.g. a neural network). This is called text pre-processing, and it can have several sub-steps, including tokenization, stop-word removal and stemming. We restricted our pipeline to the tokenization sub-step only.

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. For example, in the text string of a sentence, the raw input (a series of characters) must be explicitly split into tokens at a given space delimiter, in the same way a natural language speaker would do. The Spark machine learning library has many built-in functions for text mining, such as RegexTokenizer. Therefore, users of the Spark environment shown in Figure 1 do not have to develop any new software for tokenization; they can simply use the Spark ML RegexTokenizer function.

4) Feature vectorization

Features in our text classification problem are words or terms that represent some special characteristic of the input text. Of course, these features should be represented in the form of a vector. Accordingly, the next stage in our pipeline of Fig. 1 is feature vectorization. There are different kinds of feature vectorization algorithms, and many of them are supported by the Spark ML library. In the next paragraphs the applied feature vectorization and word embedding methods are briefly introduced.

a) Bag-of-Words

The bag-of-words (BOW) algorithm provides feature extraction capabilities. As the name suggests, it does not keep the words structured, just as a "bag" of words: it gives back a histogram of the words within the text, i.e. it considers each word count as a feature. The algorithm consists of two phases: first it builds a vocabulary of the known words, and then it measures the presence of these words in the different documents of the corpus. The CountVectorizer function of Spark ML implements this concept by converting a collection of text documents into vectors of token counts. It can be used to extract the vocabulary and to generate an array of strings from the document.

b) TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Terms with high frequency within a document receive high weights, while terms frequently appearing in all documents of the corpus receive lower weights. TF-IDF has traditionally been applied in information retrieval systems, because it is capable of highlighting documents that are closely related to a term without requiring an exact string match. The Spark ML function that supports this method is IDF.

c) Word2Vec

Bag-of-words and TF-IDF hold no information about the meaning of a word, how it is used in language and what its usual context is (i.e. what other words it generally appears close to). Word embeddings try to "compress" large one-hot word vectors into much smaller vectors (a few hundred elements) which preserve some of the meaning and context of the word.

Word2Vec is a sophisticated word embedding technique, based on the idea that words that occur in the same contexts tend to have similar meanings. The training objective of Word2Vec is to learn word representations that can predict the context of a word in the same sentence or in the given corpus. The model maps each word to a unique, fixed-size vector that can be used as features for document similarity calculations and classification, respectively.

The context of a word is the key measure of meaning utilized in Word2Vec: words which have similar contexts share meaning under Word2Vec, and their reduced vector representations will be similar. The built-in Word2Vec algorithm uses the skip-gram neural network model. In the skip-gram version of Word2Vec, the goal is to take a target word, e.g. "sat", and predict the surrounding context words. This involves an iterative learning process, which in our case was performed by a neural network with one hidden layer consisting of 300 neurons. The end product of this learning is an embedding layer in a network; this embedding layer is a kind of lookup table in which the rows are the vector representations of each word in our vocabulary.
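The pre-processing and vectorization stages can be chained with the Spark ML functions named above, roughly as in the sketch below. The column names and parameter values (token pattern, vector size, minimum word count) are illustrative assumptions; `df` is the (label, text) DataFrame of the previous sketch.

```python
# Sketch of stages 3-4 with the Spark ML functions named in the text.
# Column names and parameter values are illustrative assumptions.
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF, Word2Vec

# 3) Tokenization: split each document into an array of word tokens.
tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+")
words = tokenizer.transform(df)

# 4a) Bag-of-words: build the vocabulary, then count tokens per document.
cv_model = CountVectorizer(inputCol="words", outputCol="counts").fit(words)
bow_df = cv_model.transform(words)

# 4b) TF-IDF: reweight the raw counts by inverse document frequency.
idf_model = IDF(inputCol="counts", outputCol="tfidf").fit(bow_df)
tfidf_df = idf_model.transform(bow_df)

# 4c) Word2Vec: skip-gram embeddings, averaged per document into one dense
# 300-dimensional vector, as described above.
w2v_model = Word2Vec(vectorSize=300, minCount=5,
                     inputCol="words", outputCol="features").fit(words)
w2v_df = w2v_model.transform(words)
```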
5) ML methods in text classification – supervised learning

In this phase of the work we used the different built-in machine learning algorithms, which are shortly introduced in the next paragraphs.

a) Random Forest

Random forests are ensembles of decision trees [14]. Decision trees and their ensembles are very popular methods for classification and regression tasks, since they are easy to interpret, handle categorical features, can be extended to the multiclass classification setting, and are able to capture non-linearities and feature interactions. The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances. In spark.ml the decision tree classifier is available via the DecisionTreeClassifier() method [15].

Random forests combine many decision trees in order to reduce the risk of overfitting. Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different; combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data. In the spark.ml implementation random forests are available via the RandomForest() method [16].

b) Naïve Bayes

"Bayes" refers to the famous Bayes' theorem in probability, and "naive" refers to the strong (naive) independence assumption between every pair of features. A feature's value is the frequency of the word (in multinomial Naive Bayes) or a zero or one indicating whether the word was found in the document. The Naïve Bayes method in Spark computes the conditional probability distribution of each feature given each label, and applies Bayes' theorem to compute the conditional probability distribution of each label given an observation [15].

c) Multinomial logistic regression

In terms of its structure, logistic regression can be thought of as a neural network with no hidden layer and just one output node. Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. In our case the number of inputs was equal to the number of words coming from the bag-of-words or the TF-IDF model [15].

d) Multi-Layer Perceptron (Neural Network)

We aggregated the word vectors of each word in a document, calculating their mean to get one vector representation per document, so each document is represented by a vector with 300 dimensions. The values of this vector are the inputs of our fully connected neural network (feedforward artificial neural network). The neural net consists of multiple layers of nodes, where each layer is fully connected to the next layer in the network. The input layer represents the input data; all other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b, followed by an activation function. In Spark the nodes in the intermediate layers use the sigmoid (logistic) function, and this property is not changeable. The nodes in the output layer use the softmax function, and the number of nodes in the output layer corresponds to the number of classes.

e) Convolutional Neural Net (CNN)

In order to feed data to a CNN, we have to ensure that each word vector is fed to the model in a sequence that matches the original document. The dimension we have for the whole document is the length of the document (the number of words in it) times the dimension of the vector representing the current word (in our case 300). It is important to note that each word has a fixed-size vector representation of the same length.

A neural network model expects all the data to have the same dimension, but different documents have different lengths. This can be handled with padding: we decide the maximum length (in words) of a document and zero-pad the rest if the input is shorter than the designated length. If a document exceeds the maximum length, it is truncated, either from the beginning or from the end. In other words, each document is represented as a matrix, where the rows are the words and the columns are the Word2Vec features. This transformation enables our data to be fed into a Convolutional Neural Net (CNN).
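Spark ML has no built-in convolutional network, so the hedged sketch below covers classifiers a) to d) only. It reuses the illustrative DataFrames and column names of the previous sketches; the hidden layer size of the MLP is an assumption.

```python
# Sketch of stage 5: fitting the built-in Spark ML classifiers (a-d).
# The CNN (e) is not part of Spark ML and is therefore omitted here.
from pyspark.ml.classification import (NaiveBayes, RandomForestClassifier,
                                       LogisticRegression,
                                       MultilayerPerceptronClassifier)

train, test = tfidf_df.randomSplit([0.8, 0.2], seed=42)
w2v_train, w2v_test = w2v_df.randomSplit([0.8, 0.2], seed=42)
num_classes = int(train.select("label").distinct().count())

# a) Random forest: the trees are trained independently, hence in parallel.
rf = RandomForestClassifier(featuresCol="tfidf", labelCol="label", numTrees=100)

# b) Naive Bayes: requires non-negative features (counts/TF-IDF, not Word2Vec).
nb = NaiveBayes(featuresCol="tfidf", labelCol="label")

# c) Multinomial logistic regression: softmax over the classes, no hidden layer.
lr = LogisticRegression(featuresCol="tfidf", labelCol="label",
                        family="multinomial")

# d) MLP on the averaged 300-dimensional Word2Vec vectors: sigmoid hidden
# layer (assumed size 150), softmax output layer with one node per class.
mlp = MultilayerPerceptronClassifier(featuresCol="features", labelCol="label",
                                     layers=[300, 150, num_classes])

models = [est.fit(train) for est in (rf, nb, lr)] + [mlp.fit(w2v_train)]
```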
6) Evaluation

The final stage is testing, measuring, evaluating and ranking the classification models, and then choosing the best algorithm to classify the new incoming documents.

In our experiment we were able to combine all the feature vectorization methods with all machine learning algorithms (shown in Fig. 1), with two exceptions:

A) In the Naïve Bayes method feature values must be non-negative, while the Word2Vec method produces real numbers.

B) A convolutional neural net as a classifier can only handle data of identical dimensions. As we discussed earlier, only the Word2Vec method can produce a proper input for a convolutional net; the bag-of-words and TF-IDF methods cannot.

All the experiments were conducted on an eleven-node cluster, with one master and ten worker nodes. Each node had 8 virtual CPU cores and 16 GB of RAM, so the overall computing capacity consisted of 80 virtual CPUs and 160 GB of RAM.
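The ranking step can be expressed with Spark ML's multiclass evaluator, roughly as sketched below. The accuracy metric and the model/test-set pairing follow the assumptions of the previous sketches.

```python
# Sketch of stage 6: score each fitted model on held-out data and rank them.
# Metric choice and the model/test-set pairing are illustrative assumptions.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")

candidates = [("random_forest", models[0], test),
              ("naive_bayes",   models[1], test),
              ("logistic",      models[2], test),
              ("mlp",           models[3], w2v_test)]

scores = {name: evaluator.evaluate(model.transform(test_df))
          for name, model, test_df in candidates}

best = max(scores, key=scores.get)   # the winning pipeline classifies new texts
print(scores, "-> best:", best)
```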
We found that the Word2Vec features combined with the Convolutional Neural Net machine learning algorithm gave the best performance.

IV. CONCLUSION

We have developed a big data and AI application development and execution framework whose creation required three major steps:

1. Occopus to define and deploy the required infrastructure in the target cloud. This was created by SZTAKI.

2. Occopus infrastructure descriptors for the generic big data and AI tools and environments like Hadoop, HDFS, Spark, Jupyter Notebook and RStudio Web Server. These are provided as AI reference architectures developed by SZTAKI and can be used according to the actual AI application class.

3. A concrete application-oriented big data and AI application development and execution framework that is built by Occopus according to the Occopus infrastructure descriptors that are selected, customized and parameterized by the user. The customization and parameterization process is described in detail in the tutorials on the reference architectures provided at the Occopus web page.

In this paper we have demonstrated how to use a big data and AI application development and execution reference architecture tailored to text classification applications. Due to the fast creation of the required Spark environment and the available resources in MTA Cloud, we were able to try and test all the possible text classification pipelines presented in Fig. 1.

Although the presented big data and AI application development and execution framework was created and tested on MTA Cloud, it can be used on other clouds, including Amazon, Azure, OpenStack, OpenNebula and CloudSigma, due to the plugin architecture of the underlying Occopus cloud orchestrator [6]. Many components of the described AI reference architectures are available on the Occopus web page as executable tutorials, and the following two Spark-oriented reference architectures are available at the MTA Cloud web page as tutorials:

1. RStudio Web Server, Spark, HDFS for R users
2. Jupyter Notebook, Spark, HDFS for Python, Scala and Java (from version 9) users

Future work will aim to create further AI reference architectures to cover further AI application classes.

ACKNOWLEDGMENT

We are grateful for the usage of MTA Cloud (https://cloud.mta.hu), which significantly helped us achieve the results published in this paper. We would also like to acknowledge the support of the Text Mining of Political and Legal Texts (POLTEXT) Incubator Project of the MTA Centre for Social Sciences.

REFERENCES

[1] "SZTAKI Cloud home - SZTAKI Cloud." [Online]. Available: https://cloud.sztaki.hu/en/home. [Accessed: 01-Apr-2019].
[2] "MTA Cloud | MTA Cloud." [Online]. Available: https://cloud.mta.hu/. [Accessed: 01-Apr-2019].
[3] E. Fernández-del-Castillo, D. Scardaci, and Á. L. García, "The EGI Federated Cloud e-Infrastructure," Procedia Comput. Sci., vol. 68, pp. 196–205, Jan. 2015.
[4] "Whitepapers – Amazon Web Services (AWS)." [Online]. Available: https://aws.amazon.com/whitepapers/. [Accessed: 01-Apr-2019].
[5] "Laboratory of Parallel and Distributed Systems | MTA SZTAKI." [Online]. Available: https://www.sztaki.hu/en/science/departments/lpds. [Accessed: 01-Apr-2019].
[6] J. Kovács and P. Kacsuk, "Occopus: a Multi-Cloud Orchestrator to Deploy and Manage Complex Scientific Infrastructures," J. Grid Comput., vol. 16, no. 1, pp. 19–37, Mar. 2018.
[7] "HDFS Architecture Guide." [Online]. Available: https://hadoop.apache.org/docs/current1/hdfs_design.html. [Accessed: 01-Apr-2019].
[8] "MapReduce Tutorial." [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html. [Accessed: 01-Apr-2019].
[9] "Welcome - Occopus." [Online]. Available: http://occopus.lpds.sztaki.hu/de/. [Accessed: 01-Apr-2019].
[10] "Apache Spark™ - Unified Analytics Engine for Big Data." [Online]. Available: https://spark.apache.org/. [Accessed: 01-Apr-2019].
[11] "MLlib | Apache Spark." [Online]. Available: https://spark.apache.org/mllib/. [Accessed: 01-Apr-2019].
[12] "Open source and enterprise-ready professional software for data science - RStudio." [Online]. Available: https://www.rstudio.com/. [Accessed: 01-Apr-2019].
[13] "The Jupyter Notebook — IPython." [Online]. Available: https://ipython.org/notebook.html. [Accessed: 01-Apr-2019].
[14] L. Breiman, "Random Forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[15] "Classification and regression - MLlib main guide." [Online]. Available: https://spark.apache.org/docs/latest/ml-classification-regression.html. [Accessed: 01-Apr-2019].
[16] "Ensembles - RDD-based API - Spark 2.4.0 Documentation." [Online]. Available: https://spark.apache.org/docs/latest/mllib-ensembles.html. [Accessed: 01-Apr-2019].