                             11th International Workshop on Science Gateways (IWSG 2019), 12-14 June 2019



  Big data and machine learning framework for clouds
           and its usage for text classification


          István Pintye, Eszter Kail, Péter Kacsuk
          Institute for Computer Science and Control
          Hungarian Academy of Sciences
          Budapest, Hungary
          pintye.istvan@sztaki.mta.hu

          Péter Kacsuk
          University of Westminster
          London, UK
          P.Kacsuk@westminster.ac.uk


    Abstract— The paper describes a big data and AI application development and execution framework that was originally developed for MTA Cloud (an OpenStack based cloud) but can also be used on other clouds including Amazon, OpenStack, OpenNebula and CloudSigma. The paper explains the concept and components of the big data and AI environment and illustrates its usage by a text classification application.

    Keywords—machine learning; big data; parallel and distributed execution; cloud

                          I. INTRODUCTION

    Research in different scientific fields (Natural Sciences, Physics, Political Science) often requires huge computational resources and storage capacity to handle real Big Data. Traditional sequential data processing algorithms are not sufficient to analyze this large volume of data. For efficient processing and analysis, new approaches, techniques and tools are needed.

    Moreover, cloud infrastructures and services are becoming ever more popular and play a widely used role in addressing the computational needs of many scientific and commercial Big Data applications. Their widespread usage is a consequence of the dynamic and scalable nature of the services provided by cloud providers.

    However, data scientists face several problems once they start planning the use or deployment of any Big Data platform on cloud(s). On the one hand, the selection of the appropriate cloud provider(s) is always a cumbersome process, since the potential user community has to take into consideration several factors and trade-offs even if they need only a generic Infrastructure-as-a-Service (IaaS) provider: private institutional (e.g. SZTAKI Cloud [1]), federated cloud (e.g. MTA Cloud [2] or the pan-European EGI FedCloud [3]) or public cloud (e.g. Amazon [4]).

    The Hungarian Academy of Sciences (MTA) provides free IaaS cloud (MTA Cloud) services for research communities, offering easy-to-use, dynamic infrastructures adapted to the actual project requirements. MTA Cloud was established to accelerate research for the scientists of MTA. Nearly 100 projects have been run on MTA Cloud since its opening, and more and more projects require Big Data and machine learning applications. However, the many AI tools available for clouds are very complex, and their proper deployment and configuration requires significant learning of both the tools and the underlying cloud. Furthermore, tools supporting the different layers (user interface layer, language layer, machine learning layer, deep learning layer) are not always compatible, and hence it requires further skill to select the right tools from each layer in such a way that they can work together in an AI environment.

    Recognizing this problem, we have decided to develop so-called AI reference architectures that support the solution of certain AI application classes, run in the cloud in a reliable and robust way, and can easily be deployed and used by the end-user scientists. The ultimate goal is to develop a large set of AI reference architectures for a large set of various AI problem classes.

    The AI reference architectures have been created in three steps:

       1. Development and publication of a cloud orchestrator called Occopus that enables the fast creation of complex application frameworks in the cloud based on Occopus infrastructure descriptors, even by novice cloud users.

       2. Development and publication of the Occopus infrastructure descriptors for generic AI reference architectures, for example: Jupyter, Python, Spark ML, Spark cluster and HDFS.

       3. Development and publication of application-oriented environments for various AI application domains.

    To demonstrate the third step, we use a text classification application provided by the POLTEXT (Text Mining of Political and Legal Texts) Incubator Project of the MTA Centre for Social Sciences. This problem is complex enough to demonstrate the advantages of using the framework we have created for supporting big data and AI applications.

Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).


    The structure of the paper is as follows. The next section introduces the IaaS MTA Cloud and its major services that make up the big data and AI development and execution framework. Section III introduces our text classification example with a detailed stepwise specification. Section IV summarizes the lessons learned from this real use case performed on the MTA community cloud.

      II. COMPONENTS AND SERVICES OF THE BIG DATA AND AI FRAMEWORK

    A. MTA Cloud and Occopus

    MTA Cloud was founded in 2015, when the Wigner Data Center and the Institute for Computer Science and Control (MTA SZTAKI) collaborated to establish a community cloud for the member institutes of the Hungarian Academy of Sciences. MTA Cloud currently has more than 80 active projects from over 20 research institutes, including the Institute for Nuclear Research, the Research Centre for Astronomy and Earth Sciences and other academic and research institutes.

    In order to raise the abstraction level of the IaaS MTA Cloud, we have developed Occopus, a cloud orchestrator and manager tool by which complex infrastructures like Hadoop or Spark clusters can easily be built from predeveloped and published Occopus infrastructure descriptors. The Occopus cloud orchestrator can be deployed in MTA Cloud by any user, and once Occopus is deployed it can be used to build the selected infrastructure (e.g. a Spark cluster) in MTA Cloud. A tutorial explaining the deployment of Occopus is available on the web page of MTA Cloud (in Hungarian) [2]. The novelty of Occopus was described and compared with similar cloud orchestrators in [6]. Here we mention only one of its main advantages: its plugin architecture enables the use of plugins for various cloud systems, and hence AI reference architectures created by Occopus are easily portable among cloud systems like Amazon, Azure, OpenStack, OpenNebula and CloudSigma.

    B. Support for parallel data storage and processing – Apache Hadoop

    Apache Hadoop is an open source software platform for the distributed storage and processing of very large data sets on computer clusters. Due to its special storage method, which is based on a distributed file system (HDFS, Hadoop Distributed File System [7]), Hadoop can efficiently process terabytes of data in just minutes, and even petabytes in hours.

    Hadoop uses the MapReduce [8] paradigm that was proposed by Google and found widespread popularity. HDFS has a master/slave architecture: the nodes apart from the client machine are master nodes and slave nodes. The master node supervises the mechanism of storing data in HDFS and of running parallel computations (MapReduce) on that data. An HDFS cluster consists of a single NameNode and a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. The NameNode oversees and coordinates the data storage function. Internally, a file is split into one or more blocks, which are stored in a set of DataNodes. The NameNode provides a map of where the data blocks are in the cluster. The JobTracker oversees and coordinates the parallel processing of data using MapReduce. Slave nodes make up the vast majority of machines; they store the data and run the computations. Each slave runs both a DataNode and a TaskTracker daemon that communicate with and receive instructions from their master nodes.

    The Occopus infrastructure descriptors for such a Hadoop/HDFS cluster have been developed at SZTAKI and are published on the web page of MTA Cloud [2] as well as on the web page of Occopus [9].
    C. Support for high performance, distributed data processing – Apache Spark

    Apache Spark [10] is an open source, fast and general-purpose cluster framework designed to run high performance data analysis applications. Instead of Apache Hadoop's MapReduce programming paradigm [8], it performs its data processing in memory, which results in more flexible and faster execution. Spark uses a parallel data processing framework that stores data in memory and, if necessary, on disk. This approach can exceed the speed of Hadoop MapReduce data processing by up to ten times [8].

    Apache Spark is written in Scala, and among its most important features are its highly developed, easy-to-use APIs for Scala, Java, Python and R, designed specifically for handling large data sets. From an engineering perspective these APIs provide the biggest advantage and the main reason for choosing the Spark framework. In addition to the Spark Core API, there are further libraries in the Spark ecosystem providing additional opportunities for large-scale data analysis and machine learning. These include Spark SQL for structured data processing, MLlib [11] for machine learning, etc.

    It is important to emphasize that Apache Spark is not a substitute for Apache Hadoop but rather a kind of extension of it. Spark has been designed to read and write data from Hadoop's own distributed file system (HDFS) as well as from other storage systems such as HBase or Amazon S3.
    D. Spark machine learning library

    Apache Spark MLlib [11] is the Apache Spark machine learning library, consisting of common learning algorithms and utilities. As a core component library of Apache Spark, MLlib offers numerous supervised and unsupervised learning algorithms, from logistic regression to k-means clustering, collaborative filtering, dimensionality reduction, and the underlying optimization primitives.

    As the next step of building a Big Data and AI oriented environment for MTA Cloud users, we have developed the Occopus infrastructure descriptors for Spark/HDFS clusters and published them both on the web page of MTA Cloud [2] and on the web page of Occopus [9].
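    All the Spark code sketches in the remainder of this paper assume a Python (PySpark) session connected to such a Spark/HDFS cluster, e.g. from the Jupyter Notebook reference architecture introduced below. The snippet is a minimal sketch of opening such a session; the application name and master URL are illustrative placeholders, not values prescribed by the reference architecture.

        from pyspark.sql import SparkSession

        # Connect to the Spark cluster deployed by Occopus. The master URL
        # "spark://spark-master:7077" is a placeholder for the actual
        # endpoint of the deployed reference architecture.
        spark = (SparkSession.builder
                 .appName("text-classification")       # illustrative name
                 .master("spark://spark-master:7077")  # placeholder URL
                 .getOrCreate())

        print(spark.version)  # sanity check that the session is alive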


    E. Interactive Development Environments

    With the above-mentioned frameworks, big data and machine learning algorithms can easily be executed in a parallel manner. In order to support scientists from different research fields, we also provide interactive development environments that are easy to use with various programming languages and are very popular among the research communities.

    RStudio [12] is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. RStudio Desktop is a standalone desktop application that in no way requires or connects to RStudio Server.

    RStudio Web Server is a Linux server application that provides a web browser based interface to the version of R running on the server. Deploying R and RStudio on a server has a number of benefits: the ability to access the R workspace from any computer at any location; sharing of code, data and other files with colleagues; allowing multiple users to share access to the more powerful computing resources available on a server; controlling access to data in a centralized manner; and centralized installation and configuration of R, R packages and other libraries.

    Jupyter Notebooks [13] are becoming extremely popular, especially in education and in the field of empirical research. The reason for Jupyter's great success stems from the clear advantages of literate programming and improved web browser technologies. Literate programming is a software development style pioneered by the Stanford computer scientist Donald Knuth. It allows users to formulate and describe their thoughts with prose, supplemented by mathematical equations, as they prepare to write code blocks. It excels at demonstration, research and teaching objectives, especially for science.

    There are many free and open source Jupyter Notebook codes on numerous topics in many scientific disciplines, such as machine learning, social sciences, physics, computer science, etc. Jupyter Notebooks have LaTeX support for mathematical equations with MathJax, a web browser enhancement for the display of mathematics. Notebooks can be saved and easily shared in the ipynb JSON format. They can also be committed to version control repositories such as git and the code sharing site GitHub.

    Jupyter notebooks can be viewed with the nbviewer technology, which is supported by GitHub. Moreover, because these notebook environments are for writing and developing code, they offer many niceties available in typical integrated development environments (IDEs), such as code completion and easy access to help.

    As part of the second step of providing generic big data and AI platforms for scientists, we have extended the Spark/HDFS cluster with both RStudio Web Server and Jupyter Notebook and created the necessary Occopus infrastructure descriptors. As a result, two types of Spark-oriented reference architecture can be deployed by Occopus on MTA Cloud, depending on the actual needs of the users:
       1. RStudio Web Server, Spark, HDFS for R users
       2. Jupyter Notebook, Spark, HDFS for Python, Scala and Java (from version 9) users
    These reference architectures are the starting points for the actual big data or AI applications.

                  III. TEXT CLASSIFICATION SCENARIO

    The third step was the usage of the developed reference architectures for various big data and AI application domains. In this paper we have selected the text classification domain to illustrate the usage of the Spark-oriented reference architecture.

    The MTA Centre for Social Sciences wanted to solve the following problem on MTA Cloud: the coding of public policy major topics on various legal and media corpora serves as an important input for testing a wide range of hypotheses and models in political science. This fundamental work has until recently mostly been conducted by double-blind human coding, which is still considered the gold standard for categorizing text in this field. This method, however, is both rather expensive and increasingly unfeasible with the growing size of available corpora. Different forms of automated coding, such as dictionary-based and supervised learning methods, offer a solution to these problems. But these methods are themselves reliant on appropriate dictionaries and/or training sets, which need to be compiled and developed first.

    We have provided for them the architecture described in Section II/E and at the same time demonstrated how to use this architecture for solving their problem. After the demonstration they started to use the RStudio version of the framework, while we also investigated possible solutions for the problem using the Jupyter Notebook version. Here we show our approach to solving the problem. The steps of solving the above-described text classification problem are shown in Figure 1. This simple figure in fact represents several different execution pipelines, depending on the choices of the user. With the use of the Jupyter Notebook, Spark, HDFS architecture we were able to execute and evaluate the different classification pipelines in parallel. In the next paragraphs the different stages of our Spark-based pipelines are detailed.
    1) Distribute the data

    The first stage is to upload the data (text) into the HDFS system in an appropriate form. First a Resilient Distributed Dataset (RDD), the basic data structure of Spark, is built by dividing the dataset into logical partitions. These partitions may be computed in parallel on different nodes of the cluster.
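    As an illustration, the following PySpark sketch loads raw text files from HDFS into an RDD and inspects its partitioning. The HDFS path is a hypothetical location for the corpus, and spark is the session sketched in Section II.

        # Load raw text files from HDFS into an RDD; each element is one line.
        raw_rdd = spark.sparkContext.textFile("hdfs://namenode:9000/corpus/*.txt")

        # The RDD is divided into logical partitions that can be processed
        # in parallel on different worker nodes of the cluster.
        print(raw_rdd.getNumPartitions())
        print(raw_rdd.take(2))  # peek at the first two lines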


    [Figure 1: Text processing pipeline. Stages: Read Text (raw text files from the Hadoop Distributed File System) -> Structuring (Spark RDD to Spark DataFrame) -> Tokenization (split text into arrays of strings) -> Feature Vectorization (Bag of Words / CountVectorizer; TF-IDF / IDF; Word2Vec / Word2Vec) -> ML Processing (Naive Bayes; Random Forest; Logistic Regression; Neural Net / Convolutional Net) -> Evaluation (measuring the accuracy of a particular machine learning method on unseen data) -> Categorized Text (the best method is chosen).]

                     Figure 1. Text processing pipeline

    2) Data structuring

    The second stage is the data structuring step. Apache Spark SQL is a module for structured data processing in Spark. The Spark SQL module supports operating on a variety of data sources through the DataFrame API. A DataFrame is a distributed collection of data organized into named columns. It is essentially equivalent to a table in a relational database system or a dataframe in R/Python. A DataFrame contains rows with a schema. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster. A DataFrame can be operated on using relational transformations such as filter, select, group by, sort, etc. Like Apache Spark in general, Spark SQL in particular is all about distributed in-memory computations at scale.
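    The sketch below illustrates this stage under the assumptions of the previous snippets: the line-oriented raw_rdd is converted into a DataFrame with label and text columns. The tab-separated "<label>\t<text>" line format is purely an illustrative assumption, not the format used in the project.

        from pyspark.sql import Row

        # Assume (for illustration only) that each line is "<label>\t<text>".
        def parse_line(line):
            label, text = line.split("\t", 1)
            return Row(label=label, text=text)

        # Convert the RDD into a DataFrame of named columns.
        df = spark.createDataFrame(raw_rdd.map(parse_line))
        df.printSchema()
        df.groupBy("label").count().show()  # a relational transformation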
    3) Text pre-processing

    The stored and structured data should be transformed into an appropriate input form for the machine learning algorithm (e.g. neural networks). This is called text pre-processing. The next stage is therefore text pre-processing, which can have several sub-steps including tokenization, stop-word removal and stemming. We restricted our pipeline to the tokenization sub-step only.

    Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. For example, in the text string of a sentence the raw input (a series of characters) must be explicitly split into tokens with a given space delimiter, in the same way as a natural language speaker would do. The Spark machine learning library (MLlib) has many built-in functions for text mining, such as RegexTokenizer. Therefore, users of the Spark environment shown in Figure 1 do not have to develop any new software for tokenization; they can simply use the Spark ML RegexTokenizer function.
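    A minimal tokenization sketch with the Spark ML RegexTokenizer, continuing from the df DataFrame assumed above (the column names are illustrative):

        from pyspark.ml.feature import RegexTokenizer

        # Split each document into an array of lowercase word tokens;
        # pattern="\\W" treats any non-word character as a delimiter.
        tokenizer = RegexTokenizer(inputCol="text", outputCol="words",
                                   pattern="\\W")
        tokenized = tokenizer.transform(df)
        tokenized.select("words").show(3, truncate=60)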
    4) Feature vectorization

    Features in our text classification problem are words or terms that can represent some special characteristic of the input text. Of course, these features should be represented in the form of a vector. Accordingly, the next stage in our pipeline of Fig. 1 is feature vectorization. There are different kinds of feature vectorization algorithms, and many of them are supported by the Spark ML library. In the next paragraphs the applied feature vectorization and word embedding methods are briefly introduced.

        a) Bag-of-Words
    The bag-of-words (BOW) algorithm provides feature extraction capabilities. As the name suggests, it does not keep the words structured, just a "bag" of words. It gives back a histogram of the words within the text, i.e., it considers each word count as a feature. The algorithm consists of two phases: first it builds a vocabulary of the known words, and then it measures the presence of these words in the different documents of the corpora. The CountVectorizer function of Spark ML implements this concept by converting a collection of text documents into vectors of token counts. It can be used to extract the vocabulary and to generate the token count vectors from the documents.

        b) TF-IDF
    Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Terms with high frequency within a document receive high weights, while terms appearing frequently in all documents of the corpus receive lower weights. TF-IDF has traditionally been applied in information retrieval systems, because it is capable of highlighting documents that are closely related to a term without requiring an exact string match. The Spark ML function that supports this method is IDF.
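    The following sketch applies both vectorizers to the tokenized DataFrame from the previous snippet; the vocabulary size and column names are illustrative choices, not the settings used in the project.

        from pyspark.ml.feature import CountVectorizer, IDF

        # Bag-of-words: token arrays -> sparse vectors of token counts.
        cv = CountVectorizer(inputCol="words", outputCol="tf",
                             vocabSize=10000)
        cv_model = cv.fit(tokenized)
        tf_df = cv_model.transform(tokenized)

        # TF-IDF: rescale the raw counts by the inverse document frequency.
        idf = IDF(inputCol="tf", outputCol="tfidf")
        tfidf_df = idf.fit(tf_df).transform(tf_df)

        print(cv_model.vocabulary[:10])  # a peek at the learned vocabulary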
        c) Word2Vec
    Bag-of-words and TF-IDF hold no information about the meaning of a word, how it is used in language and what its usual context is (i.e. what other words it generally appears close to). Word embeddings try to "compress" large one-hot word vectors into much smaller vectors (a few hundred elements) which preserve some of the meaning and context of the word.

    Word2Vec is a sophisticated word embedding technique, based on the idea that words that occur in the same contexts tend to have similar meanings. The training objective of Word2Vec is to learn word representations that can predict the context of a word in the same sentence or in the given corpus. The model maps each word to a unique, fixed-size vector that can be used as features for document similarity calculations and for classification, respectively.

    The context of the word is the key measure of meaning that is utilized in Word2Vec. Words which have similar contexts share meaning under Word2Vec, and their reduced vector representations will be similar.


    The built-in Word2Vec algorithm uses the skip-gram neural network model. In the skip-gram version of Word2Vec, the goal is to take a target word, e.g. "sat", and predict the surrounding context words. This involves an iterative learning process, which in our case was performed by a neural network with one hidden layer consisting of 300 neurons.

    The end product of this learning is an embedding layer in the network. This embedding layer is a kind of lookup table: its rows are the vector representations of each word in our vocabulary.
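    A minimal Word2Vec sketch with Spark ML, again starting from the tokenized DataFrame. The 300-dimensional setting follows the paper; the other parameters are illustrative. Note that the transform of Spark's Word2VecModel already averages the word vectors of a document into a single document vector, which is exactly the aggregation used for the multilayer perceptron below.

        from pyspark.ml.feature import Word2Vec

        # Learn 300-dimensional word embeddings.
        w2v = Word2Vec(vectorSize=300, minCount=2,
                       inputCol="words", outputCol="w2v")
        w2v_model = w2v.fit(tokenized)
        w2v_df = w2v_model.transform(tokenized)  # one mean vector per document

        # Words with similar contexts get similar vectors:
        w2v_model.findSynonyms("parliament", 5).show()  # hypothetical query word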
    5) ML methods in text classification – supervised learning

    In this phase of the work we use the different built-in machine learning algorithms, which are shortly introduced in the next paragraphs.
        a) Random Forest
    Random forests are ensembles of decision trees [14]. Decision trees and their ensembles are very popular methods for classification and regression type tasks, since they are easy to interpret, handle categorical features, can be extended to the multiclass classification setting, and are able to capture non-linearities and feature interactions.

    The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions the data by rows, allowing distributed training with millions or even billions of instances. In spark.ml the decision tree classifier is available via the DecisionTreeClassifier() method [15].

    Random forests combine many decision trees in order to reduce the risk of overfitting. Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data. In the spark.ml implementation random forests are available via the RandomForestClassifier() method [16].
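    A hedged sketch of the random forest branch of the pipeline. It assumes the tfidf_df DataFrame from the TF-IDF snippet; the StringIndexer step producing a numeric label column and the forest size are illustrative.

        from pyspark.ml.feature import StringIndexer
        from pyspark.ml.classification import RandomForestClassifier

        # Map string class labels to the numeric indices classifiers expect.
        indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
        data = indexer.fit(tfidf_df).transform(tfidf_df)
        train, test = data.randomSplit([0.8, 0.2], seed=42)

        # Train a distributed random forest on the TF-IDF features.
        rf = RandomForestClassifier(featuresCol="tfidf", labelCol="labelIndex",
                                    numTrees=100)  # illustrative ensemble size
        rf_pred = rf.fit(train).transform(test)
        rf_pred.select("labelIndex", "prediction").show(5)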
        b) Naïve Bayes
    "Bayes" refers to the famous Bayes' theorem in probability, and "naive" refers to the strong (naive) independence assumptions between every pair of features. A feature's value is the frequency of the word (in multinomial Naive Bayes) or a zero or one indicating whether the word was found in the document (in Bernoulli Naive Bayes). The Naïve Bayes method in Spark computes the conditional probability distribution of each feature given each label. It applies Bayes' theorem to compute the conditional probability distribution of each label given an observation [15].
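    The Naïve Bayes branch, sketched with the same train/test split as the random forest snippet. The token count and TF-IDF features are non-negative, which the multinomial model requires (this is why the Word2Vec features cannot be used with Naïve Bayes, as noted in the evaluation discussion below).

        from pyspark.ml.classification import NaiveBayes

        # Multinomial Naive Bayes over the non-negative TF-IDF features.
        nb = NaiveBayes(modelType="multinomial",
                        featuresCol="tfidf", labelCol="labelIndex")
        nb_pred = nb.fit(train).transform(test)
        nb_pred.select("labelIndex", "prediction", "probability").show(5)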
        c) Multinomial logistic regression
    In terms of its structure, logistic regression can be thought of as a neural network with no hidden layer and just one output node. Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. In our case the number of inputs was equal to the number of words coming from the bag-of-words or TF-IDF model [15].
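    A corresponding sketch with Spark ML's LogisticRegression, which fits the multinomial case when the label has more than two classes; the regularization and iteration values are illustrative.

        from pyspark.ml.classification import LogisticRegression

        # Multinomial logistic regression over the bag-of-words/TF-IDF features.
        lr = LogisticRegression(featuresCol="tfidf", labelCol="labelIndex",
                                maxIter=100, regParam=0.01,  # illustrative
                                family="multinomial")
        lr_pred = lr.fit(train).transform(test)
        lr_pred.select("labelIndex", "prediction").show(5)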
        d) Multilayer Perceptron (Neural Network)
    We have aggregated the word vectors of each word in a document, calculating their mean, to get one vector representation of each document. Now each document is represented by a vector with 300 dimensions. The values of these vectors are the inputs of our fully connected neural network (or feedforward artificial neural network).

    The neural net consists of multiple layers of nodes, where each layer is fully connected to the next layer in the network. The input layer represents the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b, and by then applying an activation function. In Spark the nodes of the intermediate layers use the sigmoid (logistic) function, and this property is not changeable. The nodes of the output layer use the softmax function, and the number of nodes in the output layer corresponds to the number of classes.
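    A sketch of this branch with Spark ML's MultilayerPerceptronClassifier, using the w2v_df document vectors from the Word2Vec snippet. The hidden layer size and the assumed number of classes (20 here) are illustrative; in Spark the intermediate layers are sigmoid and the output layer is softmax, as described above.

        from pyspark.ml.feature import StringIndexer
        from pyspark.ml.classification import MultilayerPerceptronClassifier

        # Index the labels and split, this time over the document vectors.
        data = StringIndexer(inputCol="label", outputCol="labelIndex") \
            .fit(w2v_df).transform(w2v_df)
        train_v, test_v = data.randomSplit([0.8, 0.2], seed=42)

        # layers: 300 inputs -> one hidden layer -> one output node per class.
        mlp = MultilayerPerceptronClassifier(featuresCol="w2v",
                                             labelCol="labelIndex",
                                             layers=[300, 128, 20],  # 20 classes assumed
                                             maxIter=200)
        mlp_pred = mlp.fit(train_v).transform(test_v)
        mlp_pred.select("labelIndex", "prediction").show(5)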
        e) Convolutional Neural Net (CNN)
    In order to feed data to a CNN, we have to ensure that each word vector is fed to the model in a sequence that matches the original document. The dimension of the input for the whole document is the length of the document (the number of words in it) times the dimension of the vector (in our case 300) which represents the current word. It is important to note that each word has a fixed, identical-length vector representation.

    A neural network model expects all the data to have the same dimensions, but different documents have different lengths. This can be handled with padding. By padding the inputs we decide on a maximum document length in words and zero-pad the rest if the input is shorter than the designated length. If the input exceeds the maximum length, it is truncated either from the beginning or from the end. In other words, each document is represented as a matrix, where the rows are the words and the columns are the Word2Vec features. This transformation enables our data to be fed into a convolutional neural net (CNN).
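    Spark ML itself does not provide a convolutional network, so this step of the pipeline necessarily relies on an external deep learning library, which the paper does not name. The NumPy sketch below only illustrates the padding/truncation described above, turning a variable-length document into a fixed-size matrix; the maximum length is an illustrative choice.

        import numpy as np

        EMB_DIM = 300   # Word2Vec vector size used in the paper
        MAX_LEN = 200   # illustrative maximum document length in words

        def to_padded_matrix(word_vectors):
            """Stack per-word vectors into a MAX_LEN x EMB_DIM matrix,
            zero-padding short documents and truncating long ones at the end."""
            m = np.zeros((MAX_LEN, EMB_DIM), dtype=np.float32)
            for i, vec in enumerate(word_vectors[:MAX_LEN]):
                m[i] = vec
            return m

        # Example: a hypothetical 3-word document becomes a 200 x 300 matrix.
        doc = [np.random.rand(EMB_DIM) for _ in range(3)]
        print(to_padded_matrix(doc).shape)  # (200, 300)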
    6) Evaluation

    The final stage is testing, measuring, evaluating and ranking the classification models, and then choosing the best algorithm to classify new incoming documents.
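    A minimal sketch of how such a ranking can be computed with Spark ML's MulticlassClassificationEvaluator, reusing the prediction DataFrames produced by the classifier snippets above (accuracy is one of several metrics the evaluator supports):

        from pyspark.ml.evaluation import MulticlassClassificationEvaluator

        evaluator = MulticlassClassificationEvaluator(labelCol="labelIndex",
                                                      predictionCol="prediction",
                                                      metricName="accuracy")

        # Rank the candidate models by accuracy on the held-out test data.
        for name, pred in [("random forest", rf_pred), ("naive bayes", nb_pred),
                           ("logistic regression", lr_pred), ("mlp", mlp_pred)]:
            print(name, evaluator.evaluate(pred))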


    In our experiment we were able to combine all the feature vectorization methods with all the machine learning algorithms (shown in Fig. 1), with two exceptions:
    A) In the Naïve Bayes method the feature values must be non-negative, while the Word2Vec method produces real numbers.
    B) A convolutional neural net as a classifier can only handle inputs of identical dimensions. As we discussed earlier, only the Word2Vec method can produce a proper input for the convolutional net; the bag-of-words and TF-IDF methods cannot.

    All the experiments were conducted on an eleven-node cluster, with one master and ten worker nodes. Each node had 8 virtual CPU cores and 16 GB of RAM; the overall computing capacity consisted of 80 virtual CPUs and 160 GB of RAM. We found that the Word2Vec feature combined with the convolutional neural net machine learning algorithm gave the best performance.

                          IV. CONCLUSION

    We have developed a big data and AI application development and execution framework, the creation of which required three major steps:
       1. Occopus to define and deploy the required infrastructure in the target cloud. This was created by SZTAKI.
       2. Occopus infrastructure descriptors for the generic big data and AI tools and environments like Hadoop, HDFS, Spark, Jupyter Notebook and RStudio Web Server. These are provided as AI reference architectures developed by SZTAKI and can be used according to the actual AI application class.
       3. A concrete application-oriented big data and AI application development and execution framework that is built by Occopus according to the Occopus infrastructure descriptors that are selected, customized and parameterized by the user. The customization and parameterization process is described in detail in the tutorials on the reference architectures provided at the Occopus web page.

    In this paper we have demonstrated how to use a big data and AI application development and execution reference architecture tailored for text classification applications. Due to the fast creation of the required Spark environment and the available resources in MTA Cloud, we were able to try and test all the possible text classification pipelines that are presented in Fig. 1.

    Although the presented big data and AI application development and execution framework was created and tested on MTA Cloud, it can be used on other clouds including Amazon, Azure, OpenStack, OpenNebula and CloudSigma due to the plugin architecture of the underlying Occopus cloud orchestrator [6]. Many components of the described AI reference architectures are available on the Occopus web page as executable tutorials, and the following two Spark-oriented reference architectures are available at the MTA Cloud web page as tutorials:
       1. RStudio Web Server, Spark, HDFS for R users
       2. Jupyter Notebook, Spark, HDFS for Python, Scala and Java (from version 9) users
    Future work will aim to create further AI reference architectures to cover further AI application classes.

                          ACKNOWLEDGMENT

    We are grateful for the usage of MTA Cloud (https://cloud.mta.hu), which significantly helped us achieve the results published in this paper. We would also like to acknowledge the support of the Text Mining of Political and Legal Texts (POLTEXT) Incubator Project, MTA Centre for Social Sciences.

                          REFERENCES

    [1]  "SZTAKI Cloud home - SZTAKI Cloud." [Online]. Available: https://cloud.sztaki.hu/en/home. [Accessed: 01-Apr-2019].
    [2]  "MTA Cloud | MTA Cloud." [Online]. Available: https://cloud.mta.hu/. [Accessed: 01-Apr-2019].
    [3]  E. Fernández-del-Castillo, D. Scardaci, and Á. L. García, "The EGI Federated Cloud e-Infrastructure," Procedia Comput. Sci., vol. 68, pp. 196–205, Jan. 2015.
    [4]  "Whitepapers – Amazon Web Services (AWS)." [Online]. Available: https://aws.amazon.com/whitepapers/. [Accessed: 01-Apr-2019].
    [5]  "Laboratory of Parallel and Distributed Systems | MTA SZTAKI." [Online]. Available: https://www.sztaki.hu/en/science/departments/lpds. [Accessed: 01-Apr-2019].
    [6]  J. Kovács and P. Kacsuk, "Occopus: a Multi-Cloud Orchestrator to Deploy and Manage Complex Scientific Infrastructures," J. Grid Comput., vol. 16, no. 1, pp. 19–37, Mar. 2018.
    [7]  "HDFS Architecture Guide." [Online]. Available: https://hadoop.apache.org/docs/current1/hdfs_design.html. [Accessed: 01-Apr-2019].
    [8]  "MapReduce Tutorial." [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html. [Accessed: 01-Apr-2019].
    [9]  "Welcome - Occopus." [Online]. Available: http://occopus.lpds.sztaki.hu/de/. [Accessed: 01-Apr-2019].
    [10] "Apache Spark™ - Unified Analytics Engine for Big Data." [Online]. Available: https://spark.apache.org/. [Accessed: 01-Apr-2019].
    [11] "MLlib | Apache Spark." [Online]. Available: https://spark.apache.org/mllib/. [Accessed: 01-Apr-2019].
    [12] "Open source and enterprise-ready professional software for data science - RStudio." [Online]. Available: https://www.rstudio.com/. [Accessed: 01-Apr-2019].
    [13] "The Jupyter Notebook — IPython." [Online]. Available: https://ipython.org/notebook.html. [Accessed: 01-Apr-2019].
    [14] L. Breiman, "Random Forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
    [15] "Classification and regression - MLlib main guide." [Online]. Available: https://spark.apache.org/docs/latest/ml-classification-regression.html. [Accessed: 01-Apr-2019].
    [16] "Ensembles - RDD-based API - Spark 2.4.0 Documentation." [Online]. Available: https://spark.apache.org/docs/latest/mllib-ensembles.html. [Accessed: 01-Apr-2019].