=Paper=
{{Paper
|id=Vol-2975/paper2
|storemode=property
|title=Big Data and Machine Learning Framework for Clouds and Its Usage for Text Classification
|pdfUrl=https://ceur-ws.org/Vol-2975/paper2.pdf
|volume=Vol-2975
|authors=István Pintye,Eszter Kail,Peter Kacsuk
|dblpUrl=https://dblp.org/rec/conf/iwsg/PintyeKK19
}}
==Big Data and Machine Learning Framework for Clouds and Its Usage for Text Classification==
11th International Workshop on Science Gateways (IWSG 2019), 12-14 June 2019

István Pintye, Eszter Kail, Péter Kacsuk
Institute for Computer Science and Control, Hungarian Academy of Sciences, Budapest, Hungary
pintye.istvan@sztaki.mta.hu

Péter Kacsuk
University of Westminster, London, UK
P.Kacsuk@westminster.ac.uk

Abstract — The paper describes a big data and AI application development and execution framework that was originally developed for MTA Cloud (an OpenStack based cloud) but could be used on other clouds including Amazon, OpenStack, OpenNebula and CloudSigma. The paper explains the concept and components of the big data and AI environment and illustrates its usage by a text classification application.

Keywords — machine learning; big data; parallel and distributed execution; cloud

I. INTRODUCTION

Research in different scientific fields (Natural Sciences, Physics, Political Science) often requires huge computational resources and storage capacity to handle real Big Data. Traditional sequential data processing algorithms are not sufficient to analyze this large volume of data; for efficient processing and analysis, new approaches, techniques and tools are needed.

Moreover, cloud infrastructures and services are becoming ever more popular and play a widely used role in addressing the computation needs of many scientific and commercial Big Data applications. Their widespread usage is a consequence of the dynamic and scalable nature of the services provided by cloud providers.

However, data scientists face several problems once they start planning the use or deployment of any Big Data platform on cloud(s). On one hand, the selection of the appropriate cloud provider(s) is always a cumbersome process, since the potential user community has to take into consideration several factors and trade-offs even if they need only a generic Infrastructure-as-a-Service (IaaS) provider: private institutional (e.g. SZTAKI Cloud [1]), federated (e.g. MTA Cloud [2] or the pan-European EGI FedCloud [3]) or public (e.g. Amazon [4]).

The Hungarian Academy of Sciences (MTA) provides free IaaS cloud (MTA Cloud) services for research communities: easy to use, dynamic infrastructures adapted to the actual project requirements. MTA Cloud was established to accelerate research for the scientists of MTA. Nearly 100 projects have been run on MTA Cloud since its opening, and more and more projects require the use of Big Data and machine learning applications. However, the AI tools available for clouds are numerous and very complex, and their proper deployment and configuration requires significant learning of both the tools and the underlying cloud. Furthermore, tools supporting different layers (user interface layer, language layer, machine learning layer, deep learning layer) are not always compatible, and hence it requires further skill to select the right tools from each layer in a way that they are able to work together in an AI environment.

Recognizing this problem, we have decided to develop so-called AI reference architectures that can support the solution of certain AI application classes, can run in the cloud in a reliable and robust way, and can easily be deployed and used by the end-user scientists. The ultimate goal is to develop a large set of AI reference architectures for a large set of various AI problem classes.

The AI reference architectures have been created in three steps:

1. Development and publication of a cloud orchestrator called Occopus that enables the fast creation of complex application frameworks in the cloud, based on Occopus infrastructure descriptors, even by novice cloud users.

2. Development and publication of the Occopus infrastructure descriptors for generic AI reference architectures, for example: Jupyter, Python, Spark ML, Spark cluster and HDFS.

3. Development and publication of application-oriented environments for various AI application domains.

To demonstrate the third step, we use a text classification application provided by the POLTEXT (Text Mining of Political and Legal Texts) Incubator Project of MTA Centre for Social Sciences. This problem is complex enough to demonstrate the advantages of using the framework we have created for supporting big data and AI applications.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The structure of the paper is as follows. The next section introduces the IaaS MTA Cloud and its major services to create the big data and AI development and execution framework. Section III introduces our text-classification example with a detailed stepwise specification. Section IV summarizes the lessons learned from this real use case performed on the MTA community cloud.

II. COMPONENTS AND SERVICES OF THE BIG DATA AND AI FRAMEWORK

A. MTA cloud and Occopus

MTA Cloud was founded in 2015, when the Wigner Data Center and the Institute for Computer Science and Control (MTA SZTAKI) collaborated to establish a community cloud for the member institutes of the Hungarian Academy of Sciences. MTA Cloud currently has more than 80 active projects from over 20 research institutes, including among others the Institute for Nuclear Research, the Research Centre for Astronomy and Earth Sciences and other academic and research institutes.

In order to raise the abstraction level of the IaaS MTA Cloud we have developed Occopus, a cloud orchestrator and manager tool by which complex infrastructures like Hadoop or Spark clusters can easily be built based on predeveloped and published Occopus infrastructure descriptors. The Occopus cloud orchestrator can be deployed in MTA Cloud by any user, and once Occopus is deployed it can be used to build the selected infrastructure (e.g. a Spark cluster) in MTA Cloud. A tutorial explaining the deployment of Occopus is available on the web page of MTA Cloud (in Hungarian) [2]. The novelty of Occopus was described and compared with similar cloud orchestrators in [6]. Here we mention only one of its main advantages: its plugin architecture enables the use of plugins for various cloud systems, and hence AI reference architectures created by Occopus are easily portable among cloud systems like Amazon, Azure, OpenStack, OpenNebula and CloudSigma.

B. Support for parallel data storage and processing - Apache Hadoop

Apache Hadoop is an open source software platform for distributed storage and processing of very large data sets on computer clusters. Due to its special storage method, which is based on a distributed file system (HDFS, Hadoop Distributed File System [7]), Hadoop can efficiently process terabytes of data in just minutes, and even petabytes in hours.

Hadoop uses the MapReduce [8] paradigm that was proposed by Google and found wide-spread popularity. HDFS has a master/slave architecture: the nodes apart from the Client machine are Master nodes and Slave nodes. The Master node supervises the mechanism of data storage in HDFS and the running of parallel computations (MapReduce) on all that data. An HDFS cluster consists of a single NameNode and a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. The NameNode oversees and coordinates the data storage function. Internally, a file is split into one or more blocks, which are stored in a set of DataNodes. The NameNode provides a map of where the data blocks are in the cluster. The JobTracker oversees and coordinates the parallel processing of data using MapReduce. Slave Nodes make up the vast majority of machines; they store the data and run the computations. Each slave runs both a DataNode and a TaskTracker daemon that communicate with and receive instructions from their Master nodes.

The Occopus infrastructure descriptors for such a Hadoop/HDFS cluster have been developed at SZTAKI and are published on the web page of MTA Cloud [2] as well as on the web page of Occopus [9].

C. Support for high performance, distributed data processing - Apache Spark

Apache Spark [10] is an open source, fast and general-purpose cluster framework, designed to run high performance data analysis applications. Instead of Apache Hadoop's MapReduce programming paradigm [8], it performs its internal computational data processing in a way that results in more flexible and faster runs. The framework processes data in parallel, storing it in memory and, if necessary, on disk. This approach can be up to ten times faster than Hadoop MapReduce data processing [8].

Apache Spark was written in Scala, and the most important of its features are its highly developed, easy-to-use APIs for Scala, Java, Python and R, designed specifically for handling large data sets. From an engineering perspective these APIs provide the biggest advantage and are the main reason for choosing the Spark framework. In addition to the Spark Core API, there are other libraries in the Spark ecosystem, providing additional opportunities for large-scale data analysis and machine learning. These include Spark SQL for structured data processing, MLlib [11] for machine learning, etc.

It is important to emphasize that Apache Spark is not a substitute for Apache Hadoop, but a kind of extension of it. Spark has been designed to be able to read and write data from Hadoop's own distributed file system (HDFS) and other storage systems such as HBase or Amazon S3.

D. Spark Machine learning library

Apache Spark MLlib [11] is the Apache Spark machine learning library, consisting of common learning algorithms and utilities. As the core machine learning library in Apache Spark, MLlib offers numerous supervised and unsupervised learning algorithms, from logistic regression to k-means clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.

As the next step of building a Big Data and AI oriented environment for MTA Cloud users we have developed the Occopus infrastructure descriptors for Spark/HDFS clusters and published them both on the web page of MTA Cloud [2] and on the web page of Occopus [9].
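To make the MapReduce paradigm referred to throughout this section concrete, here is a toy word count sketched in plain Python. It mimics the map, shuffle and reduce phases of the programming model only; it is not Hadoop's actual API, and all function names are our own.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in an input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values belonging to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "splits" as they might be stored in separate HDFS blocks.
splits = ["big data on clouds", "machine learning on clouds"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["clouds"])  # prints 2 (each split mentions "clouds" once)
```

In Hadoop the map and reduce tasks run on the TaskTracker daemons of the slave nodes described above, with the shuffle performed by the framework between the two phases.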
E. Interactive Development Environments

With the above-mentioned frameworks, big data and machine learning algorithms can easily be executed in a parallel manner. In order to support scientists from different research fields we also support interactive development environments that are easy to use with various programming languages and are very popular among the research communities.

RStudio [12] is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. RStudio Desktop is a standalone desktop application that in no way requires or connects to RStudio Server.

RStudio Web Server is a Linux server application that provides a web browser-based interface to the version of R running on the server. Deploying R and RStudio on a server has a number of benefits: the ability to access the R workspace from any computer at any location; sharing of code, data, and other files with colleagues; allowing multiple users to share access to the more powerful computing resources available on a server; controlling access to data in a centralized manner; and centralized installation and configuration of R, R packages and other libraries.

Jupyter Notebooks [13] are becoming extremely popular, especially in education and in empirical research. The reason for Jupyter's great success stems from the clear advantages of literate programming and improved web browser technologies. Literate programming is a software development style pioneered by the Stanford computer scientist Donald Knuth. Literate programming allows users to formulate and describe their thoughts with prose, supplemented by mathematical equations, as they prepare to write code blocks. It excels at demonstration, research, and teaching objectives, especially for science.

There are many free and open source Jupyter Notebooks on numerous topics in many scientific disciplines, such as machine learning, social sciences, physics, computer science, etc. They have LaTeX support for mathematical equations with MathJax, a web browser enhancement for displaying mathematics. These notebooks can be saved and easily shared in the ipynb JSON format. They can also be committed to version control repositories such as git and the code sharing site GitHub.

Jupyter notebooks can be viewed with the nbviewer technology, which is supported by GitHub. Moreover, because these notebook environments are for writing and developing code, they offer many niceties available in typical Interactive Development Environments (IDEs) such as code completion and easy access to help.

As part of the second step of providing generic big data and AI platforms for scientists, we have extended the Spark/HDFS cluster with both RStudio Web Server and Jupyter Notebook and created the necessary Occopus infrastructure descriptors. As a result, two types of Spark-oriented reference architecture can be deployed by Occopus on MTA Cloud, depending on the actual needs of the users:

1. RStudio Web Server, Spark, HDFS for R users
2. Jupyter Notebook, Spark, HDFS for Python, Scala and Java (from version 9) users

These reference architectures are the starting points for the actual big data or AI applications.

III. TEXT CLASSIFICATION SCENARIO

The third step was the usage of the developed reference architectures for various big data and AI application domains. In this paper we have selected the text classification domain to illustrate the usage of the Spark-oriented reference architecture.

MTA Centre for Social Sciences wanted to solve the following problem on MTA Cloud: The coding of public policy major topics on various legal and media corpora serves as an important input for testing a wide range of hypotheses and models in political science. This fundamental work has until recently mostly been conducted by double-blind human coding, which is still considered the gold standard for categorizing text in this field. This method, however, is both rather expensive and increasingly unfeasible with the growing size of available corpora. Different forms of automated coding, such as dictionary-based and supervised learning methods, offer a solution to these problems. But these methods are themselves also reliant on appropriate dictionaries and/or training sets, which need to be compiled and developed first.

We have provided the architecture described in Section II/E for them and at the same time demonstrated how to use this architecture for solving their problem. After the demonstration they started to use the RStudio version of the framework, while we also investigated possible solutions to the problem using the Jupyter Notebook version. Here we show our approach to solving the problem. The steps of solving the above-described text classification problem are shown in Figure 1. This simple figure in fact represents several different execution pipelines, depending on the choice of the user. With the Jupyter Notebook, Spark, HDFS architecture we were able to execute and evaluate the different classification pipelines in parallel. In the next paragraphs the different stages of our Spark-based pipelines are detailed.

1) Distribute the data

The first stage is to upload the data (text) into the HDFS system in an appropriate form. First a Resilient Distributed Dataset (RDD), the basic data structure of Spark, is built by dividing the dataset into logical partitions. These partitions may be computed in parallel on different nodes of the cluster.
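The idea of dividing a dataset into logical partitions can be sketched in plain Python. This is an illustration of the partitioning concept only, not the PySpark API (where one would build the RDD with SparkContext methods such as textFile or parallelize); the helper name is ours.

```python
def partition(records, num_partitions):
    """Divide a dataset into logical partitions, round-robin style,
    mimicking how an RDD is split across the nodes of a cluster."""
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        parts[i % num_partitions].append(record)
    return parts

documents = [f"doc-{i}" for i in range(10)]
parts = partition(documents, 3)    # e.g. one partition per worker node
print([len(p) for p in parts])     # prints [4, 3, 3]
```

Each partition can then be processed independently, which is what allows the later pipeline stages to run in parallel on the cluster.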
[Figure 1. Text processing pipeline: Read Text (raw text files from the Hadoop Distributed File System) → Structuring (Spark RDD to Spark DataFrame) → Tokenization (split text into arrays of strings) → Feature Vectorization (Bag of Words / CountVectorizer, TF-IDF, Word2Vec) → ML Processing (Naive Bayes, Random Forest, Logistic Regression, Neural Net / Convolutional Net) → Evaluation (measuring the accuracy of a particular machine learning method on unseen data) → Categorized Text (the best method is chosen).]

2) Data structuring

The second stage is the data structuring step. Apache Spark SQL is a module for structured data processing in Spark. The Spark SQL module supports operating on a variety of data sources through the DataFrame API. A DataFrame is a distributed collection of data organized into named columns. It is equivalent to the table concept in relational database systems or a dataframe in R/Python. A DataFrame contains rows with a schema. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster. A DataFrame can be operated on using relational transformations such as filter, select, group by, sort, etc. Like Apache Spark in general, Spark SQL in particular is all about distributed in-memory computations at scale.

3) Text pre-processing

The stored and structured data should be transformed into an appropriate input form for the machine learning algorithm (e.g. neural networks). This is called text pre-processing, and it can have several sub-steps including tokenization, stop-word removal and stemming. We restricted our pipeline to the tokenization sub-step.

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. For example, in the text string of a sentence the raw input (a series of characters) must be explicitly split into tokens at a given space delimiter, in the same way as a natural language speaker would do. The Spark machine learning library (MLlib) has many built-in functions for text mining, such as RegexTokenizer. Therefore, users of the Spark environment shown in Figure 1 do not have to develop any new software for tokenization; they can simply use the Spark ML RegexTokenizer function.
4) Feature vectorization

Features in our text-classification problem are words or terms that can represent some special characteristic of the input text. Such a feature should, of course, be represented in the form of a vector. Accordingly, the next stage in the pipeline of Fig. 1 is feature vectorization. There are different kinds of feature vectorization algorithms and many of them are supported by the Spark ML library. In the next paragraphs the applied feature vectorization and word embedding methods are briefly introduced.
a) Bag-of-Words

The bag-of-words (BOW) algorithm provides feature extraction capabilities. As the name suggests, it does not keep the words structured, just a "bag" of words. It gives back a histogram of the words within the text, i.e., it considers each word count as a feature. The algorithm consists of two phases: first it builds a vocabulary of the known words, and then it measures the presence of these words in the different documents of the corpus.

The CountVectorizer function of Spark ML implements this concept by converting a collection of text documents to vectors of token counts. It can be used to extract the vocabulary and to generate an array of strings from the document.

b) TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Terms with high frequency within a document have high weights, while terms frequently appearing in all documents of the corpus have lower weights. TF-IDF has traditionally been applied in information retrieval systems, because it is capable of highlighting documents that are closely related to a term rather than to an exact string match. The Spark ML function that supports this method is IDF.
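The two count-based representations above can be sketched together in plain Python. This is an illustration of the concepts, not the Spark ML CountVectorizer/IDF API; the helper names are ours, and the smoothed IDF formula shown, log((N + 1) / (df + 1)), is a common variant (to our knowledge it is also the one documented for Spark ML's IDF).

```python
import math
from collections import Counter

def build_vocabulary(token_lists):
    """Phase 1 of bag-of-words: collect the known words."""
    return sorted({tok for tokens in token_lists for tok in tokens})

def count_vector(tokens, vocab):
    """Phase 2 of bag-of-words: a histogram of word counts."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

def tfidf_vectors(token_lists, vocab):
    """Reweight the raw counts by a smoothed inverse document
    frequency, log((N + 1) / (df + 1)); terms present in every
    document of the corpus receive weight zero."""
    n = len(token_lists)
    vecs = [count_vector(tokens, vocab) for tokens in token_lists]
    df = [sum(1 for v in vecs if v[j] > 0) for j in range(len(vocab))]
    idf = [math.log((n + 1) / (d + 1)) for d in df]
    return [[tf * w for tf, w in zip(v, idf)] for v in vecs]

docs = [["the", "law", "was", "amended"],
        ["the", "media", "reported", "the", "law"]]
vocab = build_vocabulary(docs)  # ['amended', 'law', 'media', 'reported', 'the', 'was']
bow = [count_vector(d, vocab) for d in docs]
print(bow[1])                   # prints [0, 1, 1, 1, 2, 0]
weighted = tfidf_vectors(docs, vocab)
print(weighted[0][vocab.index("the")])  # prints 0.0: "the" appears in every document
```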
c) Word2Vec

Bag-of-Words and TF-IDF hold no information about the meaning of a word, how it is used in language and what its usual context is (i.e. what other words it generally appears close to). Word embeddings try to "compress" large one-hot word vectors into much smaller vectors (a few hundred elements) which preserve some of the meaning and context of the word.

Word2Vec is a sophisticated word embedding technique, based on the idea that words that occur in the same contexts tend to have similar meanings. The training objective of Word2Vec is to learn word representations that can predict the word's context in the same sentence or in the given corpus. This model maps each word to a unique and fixed-size vector that can be used as features for document similarity calculations and classification, respectively.

The context of the word is the key measure of meaning that is utilized in Word2Vec. Words which have similar contexts share meaning under Word2Vec, and their reduced vector representations will be similar.
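The notion of "context" that Word2Vec learns from can be illustrated by generating the (target, context) training pairs of its skip-gram variant. A plain-Python sketch with a context window of one word on each side, not the Spark ML Word2Vec implementation:

```python
def skipgram_pairs(tokens, window=1):
    """For each target word, emit (target, context) pairs for the
    words at most `window` positions away: the training examples
    from which the skip-gram model learns word representations."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"]))
# prints [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Words that keep appearing in the same pairs end up with similar learned vectors, which is the property exploited in the classification pipelines below.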
The built-in Word2Vec algorithm uses the skip-gram neural network model. In the skip-gram model version of Word2Vec, the goal is to take a target word, e.g. "sat", and predict the surrounding context words. This involves an iterative learning process, which was performed by a neural network with one hidden layer consisting of 300 neurons.

The end product of this learning is an embedding layer in a network. This embedding layer is a kind of lookup table: the rows are vector representations of each word in our vocabulary.

5) ML methods in text classification - supervised learning

In this phase of the work we use the different built-in machine learning algorithms that are shortly introduced in the next paragraphs.
a) Random Forest

Random forests are ensembles of decision trees [14]. Decision trees and their ensembles are very popular methods for classification and regression type tasks, since they are easy to interpret, handle categorical features, can be extended to the multiclass classification setting, and are able to capture non-linearities and feature interactions.

The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances. In spark.ml the decision tree classifier is available via the DecisionTreeClassifier() method [15].

Random forests combine many decision trees in order to reduce the risk of overfitting. Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data. In the spark.ml implementation random forests are available via the RandomForest() method [16].
b) Naïve Bayes

"Bayes" refers to the famous Bayes' theorem in probability, and "naive" to the strong (naive) independence assumptions between every pair of features. A feature's value is the frequency of the word (in multinomial Naive Bayes) or a zero or one indicating whether the word was found in the document. The Naïve Bayes method in Spark computes the conditional probability distribution of each feature given each label, and applies Bayes' theorem to compute the conditional probability distribution of each label given an observation [15].

c) Multinomial logistic regression

In terms of its structure, logistic regression can be thought of as a neural network with no hidden layer and just one output node. Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. In our case the number of inputs was equal to the number of words coming from the bag-of-words or TF-IDF model [15].
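The squeezing behaviour just described fits in a few lines of plain Python. This illustrates the logistic function itself and a one-node "network", not Spark's LogisticRegression implementation; the toy weights are hypothetical.

```python
import math

def logistic(z):
    """Squeeze the output of a linear equation into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """A one-node 'network': a linear combination of the inputs,
    then the logistic activation, yielding a class probability."""
    z = sum(x * w for x, w in zip(features, weights)) + bias
    return logistic(z)

print(logistic(0.0))                          # prints 0.5
p = predict_proba([1.0, 2.0], [0.4, -0.1], bias=0.0)
print(0.0 < p < 1.0)                          # prints True
```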
d) Multi Layer Perceptron (Neural Network)

We have aggregated the word vectors of each word in a document, calculating the mean to get one vector representation of each document. Each document is thus represented by a vector with 300 dimensions. The values of this vector are the inputs of our fully connected neural network (feedforward artificial neural network).

A neural net consists of multiple layers of nodes, each layer fully connected to the next layer in the network. The input layer represents the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b, and then applying an activation function. In Spark the nodes in intermediate layers use the sigmoid (logistic) function, and this property is not changeable. The nodes in the output layer use the softmax function, and the number of nodes in the output layer corresponds to the number of classes.

e) Convolutional Neural Net (CNN)

In order to feed data to a CNN, we have to ensure that each word vector is fed to the model in a sequence that matches the original document. The dimension of the representation of a whole document is its length (the number of words in the document) times the dimension of the vector representing each word (in our case 300). It is important to note that every word has a fixed-size vector representation of the same length.

A neural network model expects all inputs to have the same dimension, but different documents have different lengths. This can be handled with padding. By padding the inputs, we decide the maximum number of words in a document and zero-pad the rest if the input is shorter than the designated length. Where a document exceeds the maximum length, it is truncated either at the beginning or at the end. In other words, each document is represented as a matrix, where the rows are the words and the columns are the Word2Vec features. This transformation enables our data to be fed into a Convolutional Neural Net (CNN).
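The two input constructions described above — mean-pooling word vectors for the feedforward network, and padding or truncating the word-vector sequence to a fixed-size matrix for the CNN — can be sketched as follows. This is a plain-Python illustration with toy 3-dimensional vectors standing in for the 300-dimensional Word2Vec embeddings used in our experiments.

```python
def mean_pool(word_vectors):
    """One fixed-size document vector: the element-wise mean of its
    word vectors (input construction for the fully connected net)."""
    dim = len(word_vectors[0])
    return [sum(vec[j] for vec in word_vectors) / len(word_vectors)
            for j in range(dim)]

def pad_or_truncate(word_vectors, max_len, dim):
    """Document matrix for the CNN: rows are words, columns are
    embedding features; zero-pad short documents, truncate long
    ones from the end."""
    rows = word_vectors[:max_len]
    rows += [[0.0] * dim] * (max_len - len(rows))
    return rows

doc = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]   # toy 3-d "word vectors"
print(mean_pool(doc))                       # prints [2.0, 1.0, 1.0]
matrix = pad_or_truncate(doc, max_len=4, dim=3)
print(len(matrix), len(matrix[0]))          # prints 4 3
```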
6) Evaluation

The final stage is testing, measuring, evaluating and ranking the classification models, and then choosing the best algorithm to classify new incoming documents.
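The ranking of models by their accuracy on unseen data can be expressed in a few lines. This is a hedged sketch with hypothetical labels and predictions; in Spark ML one would typically use a MulticlassClassificationEvaluator on the held-out DataFrame instead.

```python
def accuracy(predicted, actual):
    """Fraction of held-out documents whose predicted class is correct."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical held-out labels and per-pipeline predictions.
actual = ["policy", "media", "policy", "law"]
pipelines = {
    "bow+naive_bayes": ["policy", "media", "law", "law"],
    "word2vec+cnn":    ["policy", "media", "policy", "law"],
}
scores = {name: accuracy(pred, actual) for name, pred in pipelines.items()}
best = max(scores, key=scores.get)   # the pipeline chosen for new documents
print(best, scores[best])            # prints word2vec+cnn 1.0
```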
In our experiment we were able to combine all the feature vectorization methods with all machine learning algorithms (shown in Fig. 1), with two exceptions:

A) In the Naïve Bayes method feature values must be non-negative, while the Word2Vec method produces real numbers.

B) A convolutional neural net as a classifier can only handle inputs of identical dimensions. As discussed earlier, only the Word2Vec method can produce a proper input for the convolutional net; the bag-of-words and TF-IDF methods cannot.

All the experiments were conducted on an eleven-node cluster, with one master and ten worker nodes. Each node has 8 virtual CPU cores and 16GB of RAM. The overall computing capacity consisted of 80 virtual CPUs and 160GB of RAM. We found that the Word2Vec feature combined with the Convolutional Neural Net machine learning algorithm gave the best performance.

IV. CONCLUSION

We have developed a big data and AI application development and execution framework that needs three major steps to be created:

1. Occopus to define and deploy the required infrastructure in the target cloud. This was created by SZTAKI.

2. Occopus infrastructure descriptors for the generic big data and AI tools and environments like Hadoop, HDFS, Spark, Jupyter Notebook and RStudio Web Server. These are provided as AI reference architectures developed by SZTAKI and can be used according to the actual AI application class.

3. A concrete application-oriented big data and AI application development and execution framework that is built by Occopus according to the Occopus infrastructure descriptors that are selected, customized and parameterized by the user. The customization and parameterization process is described in detail in the tutorials on the reference architectures provided at the Occopus web page.

In this paper we have demonstrated how to use a big data and AI application development and execution reference architecture tailored for text classification applications. Due to the fast creation of the required Spark environment and the available resources in MTA Cloud we were able to try and test all the possible text classification pipelines that are presented in Fig. 1.

Although the presented big data and AI application development and execution framework was created and tested on MTA Cloud, it can be used on other clouds including Amazon, Azure, OpenStack, OpenNebula and CloudSigma due to the plugin architecture of the underlying Occopus cloud orchestrator [6]. Many components of the described AI reference architectures are available on the Occopus web page as executable tutorials, and the following two Spark-oriented reference architectures are available at the MTA Cloud web page as tutorials:

1. RStudio Web Server, Spark, HDFS for R users
2. Jupyter Notebook, Spark, HDFS for Python, Scala and Java (from version 9) users

Future work will create further AI reference architectures to cover further AI application classes.

ACKNOWLEDGMENT

We thank the operators of MTA Cloud (https://cloud.mta.hu), which significantly helped us achieve the results published in this paper. We would also like to acknowledge the support of the Text Mining of Political and Legal Texts (POLTEXT) Incubator Project, MTA Centre for Social Sciences.

REFERENCES

[1] "SZTAKI Cloud home - SZTAKI Cloud." [Online]. Available: https://cloud.sztaki.hu/en/home. [Accessed: 01-Apr-2019].
[2] "MTA Cloud | MTA Cloud." [Online]. Available: https://cloud.mta.hu/. [Accessed: 01-Apr-2019].
[3] E. Fernández-del-Castillo, D. Scardaci, and Á. L. García, "The EGI Federated Cloud e-Infrastructure," Procedia Comput. Sci., vol. 68, pp. 196–205, Jan. 2015.
[4] "Whitepapers – Amazon Web Services (AWS)." [Online]. Available: https://aws.amazon.com/whitepapers/. [Accessed: 01-Apr-2019].
[5] "Laboratory of Parallel and Distributed Systems | MTA SZTAKI." [Online]. Available: https://www.sztaki.hu/en/science/departments/lpds. [Accessed: 01-Apr-2019].
[6] J. Kovács and P. Kacsuk, "Occopus: a Multi-Cloud Orchestrator to Deploy and Manage Complex Scientific Infrastructures," J. Grid Comput., vol. 16, no. 1, pp. 19–37, Mar. 2018.
[7] "HDFS Architecture Guide." [Online]. Available: https://hadoop.apache.org/docs/current1/hdfs_design.html. [Accessed: 01-Apr-2019].
[8] "MapReduce Tutorial." [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html. [Accessed: 01-Apr-2019].
[9] "Welcome - Occopus." [Online]. Available: http://occopus.lpds.sztaki.hu/de/. [Accessed: 01-Apr-2019].
[10] "Apache Spark™ - Unified Analytics Engine for Big Data." [Online]. Available: https://spark.apache.org/. [Accessed: 01-Apr-2019].
[11] "MLlib | Apache Spark." [Online]. Available: https://spark.apache.org/mllib/. [Accessed: 01-Apr-2019].
[12] "Open source and enterprise-ready professional software for data science - RStudio." [Online]. Available: https://www.rstudio.com/. [Accessed: 01-Apr-2019].
[13] "The Jupyter Notebook — IPython." [Online]. Available: https://ipython.org/notebook.html. [Accessed: 01-Apr-2019].
[14] L. Breiman, "Random Forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[15] "Classification and regression - MLlib main guide." [Online]. Available: https://spark.apache.org/docs/latest/ml-classification-regression.html. [Accessed: 01-Apr-2019].
[16] "Ensembles - RDD-based API - Spark 2.4.0 Documentation." [Online]. Available: https://spark.apache.org/docs/latest/mllib-ensembles.html. [Accessed: 01-Apr-2019].