<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Early Performance Evaluation of Supervised Graph Anomaly Detection Problem Implemented in Apache Spark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artem Mazeev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Semenov</string-name>
          <email>semenovg@nicevt.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Doropheev</string-name>
          <email>dmitry@dorofeev.su</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timur Yusubaliev</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>JSC NICEVT</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Moscow Institute of Physics and Technology</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Quality Software Solutions ltd</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>84</fpage>
      <lpage>91</lpage>
      <abstract>
<p>Apache Spark is one of the most popular Big Data frameworks. Performance evaluation of Big Data frameworks is a topic of interest due to the increasing number and importance of data analytics applications within the context of HPC and Big Data convergence. In this paper we present an early performance evaluation of a typical supervised graph anomaly detection problem implemented using the GraphX and MLlib libraries in Apache Spark on a cluster.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>MLlib</kwd>
        <kwd>Spark</kwd>
        <kwd>graph processing</kwd>
        <kwd>supervised anomaly detection</kwd>
        <kwd>performance evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, data-intensive applications have become widespread and
appear in many science and engineering areas (biology, bioinformatics, medicine,
cosmology, finance, social network analysis, cryptanalysis, etc.). They are
characterized by large amounts of data, irregular workloads, unbalanced computations
and low sustained performance of computing systems. New algorithmic approaches
and programming technologies are urgently needed to boost the efficiency of
HPC systems on such applications, thus advancing
HPC and Big Data convergence [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Spark [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a framework that optimizes the programming and execution
models of MapReduce [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] by introducing the resilient distributed dataset (RDD)
abstraction. Users can trade off the cost of storing an RDD, the speed of
accessing it, the probability of losing part of it, and the cost of recomputing it.
Apache Spark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a popular open-source implementation of Spark. It supports
a rich set of high-level tools including MLlib for machine learning and GraphX
for graph processing.
      </p>
      <p>Anomaly detection in graphs arises in many application areas, for example,
in the analysis of financial markets, in spam filtering, and in the detection of
cyber attacks.</p>
      <p>
        In this paper we evaluate the performance of a typical supervised anomaly
detection problem implemented using GraphX [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and MLlib libraries in Apache
Spark. We use a synthetic graph generator for performance evaluation. In our
approach we calculate features based on community extraction. Then we fit a model
using supervised machine learning techniques. One can apply this model to new
objects, thus performing anomaly detection on new data sets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        Performance evaluation of Big Data frameworks is drawing more attention due
to the increasing number and importance of data analytics applications within the
context of HPC and Big Data convergence. Many papers present
performance evaluation of Spark, e.g. [
        <xref ref-type="bibr" rid="ref11 ref3 ref7">3, 11, 7</xref>
        ]. Chaimov et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] ported, tuned
and evaluated Spark on Cray XC systems used in production at a large
supercomputing center. They achieved scalability up to 10,000 cores of a
Cray XC system.
      </p>
      <p>
        Performance evaluation of the initial version of the GraphX library is
considered in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The paper presents strong scaling of the PageRank algorithm.
      </p>
      <p>
        Some papers consider performance evaluation of machine learning
applications implemented in Spark. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] the MLlib library is not used. In [
        <xref ref-type="bibr" rid="ref12 ref8">12, 8</xref>
        ] there
are no scalability results for the MLlib library.
      </p>
      <p>In this paper, we present performance and scalability results for a typical
machine learning application implemented using the latest version (2.1.1, May
2017) of the standard GraphX and MLlib libraries in Apache Spark on a cluster
equipped with the Angara and 1 Gbit/s Ethernet interconnects.</p>
    </sec>
    <sec id="sec-3">
      <title>Supervised Graph Anomaly Detection Problem</title>
      <p>We consider an anomaly detection problem for synthetic graphs to evaluate
the performance and scalability of graph processing and machine learning techniques
in Apache Spark.</p>
      <p>
        We consider a random uniform weighted directed graph G = (V, E) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
|V| = N, |E| = M. Each edge connects two random vertices of the graph G
so that there are no self-loops. Each edge has attributes. The list of attributes
includes an integer edge weight (a random value in [0, 10^5)) and other integer
values; max_degree is the maximal degree of a vertex.
      </p>
      <p>An edge is considered anomalous if its weight is greater than a given
threshold. We mark ANOMALY_EDGES_FRACTION · M random edges as
anomalous by adding random values in [0, 10^9) to their weights. We set
ANOMALY_EDGES_FRACTION = 0.05. Other edges are normal.</p>
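The generation and labeling scheme above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming a simple edge-list representation; the authors' actual generator is implemented in Scala on top of GraphX.

```python
import random

def generate_graph(n, m, anomaly_fraction=0.05, seed=42):
    """Generate a random uniform directed graph with no self-loops,
    then mark a fraction of edges as anomalous by inflating their weights.
    Illustrative sketch of the paper's synthetic-data scheme."""
    rng = random.Random(seed)
    edges = []
    for _ in range(m):
        u = rng.randrange(n)
        v = rng.randrange(n)
        while v == u:                        # forbid self-loops
            v = rng.randrange(n)
        w = rng.randrange(10**5)             # base weight in [0, 10^5)
        edges.append({"src": u, "dst": v, "weight": w, "anomaly": False})
    # Mark ANOMALY_EDGES_FRACTION * M random edges as anomalous.
    for e in rng.sample(edges, int(anomaly_fraction * m)):
        e["weight"] += rng.randrange(10**9)  # add a random value in [0, 10^9)
        e["anomaly"] = True
    return edges
```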
      <p>The edge weight is an opaque anomaly feature; it allows us to build a training
set and a test set for our synthetic supervised problem. In the problem we
have to fit a model using supervised machine learning techniques. The model
must classify whether an edge is anomalous or normal.</p>
      <p>Eventually, the computation process consists of two stages: feature
calculation and supervised (machine) learning. First, it is necessary to calculate the
features for each edge. Feature calculation includes community extraction
procedure.</p>
      <p>We define a community around a vertex u as the set of vertices v such that
dist(u, v) ≤ R, where dist(u, v) is the shortest-path distance between u and v.
Edges are treated as undirected during community extraction. We extract two
communities, one around each vertex of each edge of the training set. We consider
communities with R ∈ {1, 2}.</p>
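The community definition above amounts to a bounded-depth breadth-first search. A minimal sketch, assuming an undirected adjacency dictionary (not the GraphX representation the paper uses):

```python
from collections import deque

def community(adj, u, radius=2):
    """Return the set of vertices within shortest-path distance `radius`
    of u, treating edges as undirected. Distances are computed by a
    breadth-first search truncated at the given radius."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == radius:        # do not expand beyond the radius
            continue
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)
```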
      <p>Feature calculation is heavily based on the extracted community. The total
number of features is 52. The set of features for an edge includes:
- degrees of the edge vertices, the weight of the edge, and other parameters from the
edge attributes,
- the minimum, maximum and average of degree, indegree and outdegree over the
community,
- the number of edges and vertices in the community,
- the sum of weights of all edges in the community.</p>
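A few of the community aggregates listed above can be sketched as follows. This is a simplified illustration over a plain edge list (undirected degrees only, a handful of the 52 features), not the authors' Scala implementation:

```python
def community_features(vertices, edges):
    """Aggregate features over an extracted community: min/max/average
    degree, vertex and edge counts, and total edge weight. Degrees are
    counted within the community subgraph. Illustrative subset of the
    paper's 52-feature set."""
    deg = {v: 0 for v in vertices}
    total_weight = 0
    n_edges = 0
    for u, v, w in edges:
        if u in deg and v in deg:    # keep only edges inside the community
            deg[u] += 1
            deg[v] += 1
            total_weight += w
            n_edges += 1
    degrees = list(deg.values())
    return {
        "min_degree": min(degrees),
        "max_degree": max(degrees),
        "avg_degree": sum(degrees) / len(degrees),
        "num_vertices": len(vertices),
        "num_edges": n_edges,
        "sum_weights": total_weight,
    }
```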
      <p>After the feature calculation stage, a machine learning stage is performed.</p>
      <p>We believe that our synthetic supervised graph anomaly detection problem
is typical, because during data mining research one needs to calculate graph
features many times and then fit a model. This is especially true at the
beginning of a study, while choosing a suitable set of features and
selecting an appropriate machine learning technique.</p>
      <sec id="sec-3-1">
        <title>Time Complexity</title>
        <p>We consider the time complexity of the feature calculation stage. We calculate
features for each edge of the graph. Each edge has two incident vertices. The
complexity of feature calculation for one vertex is proportional to the number of
vertices in the community with R = 2 around this vertex, i.e. max_degree^2 operations
in the worst case. Since max_degree is bounded by a constant, the time complexity
for one vertex is O(1).</p>
        <p>The number of vertices in the graph for which we calculate features is O(min(N, M)),
because if M &lt; N, then we calculate features only for the relevant vertices. So,
the theoretical time complexity of the feature calculation stage is O(min(N, M)). Of
course, we can run it in parallel; with p processes the time complexity becomes
O(min(N, M)/p).</p>
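The complexity argument above can be condensed into three lines (a restatement of the reasoning in this subsection, not a new result):

```latex
% Worst-case cost of the feature-calculation stage.
% Per vertex: the R = 2 community has at most max_degree^2 vertices,
% and max_degree is a constant, so per-vertex work is O(1).
\begin{align*}
T_{\text{vertex}}   &= O(\mathrm{max\_degree}^2) = O(1), \\
T_{\text{stage}}    &= O(\min(N, M)), \\
T_{\text{stage}}(p) &= O\bigl(\min(N, M)/p\bigr).
\end{align*}
```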
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Implementation</title>
      <p>We implement the algorithm and the synthetic graph generator in the
Scala language using the GraphX system and the MLlib library on top of Apache
Spark, version 2.1.1.</p>
      <p>In our implementation each edge has a string which stores integer attributes
delimited by the ',' symbol. After generating vertices and edges we create a graph
using the Graph method from the GraphX library.</p>
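The attribute encoding described above is a plain comma-delimited integer string. A minimal parsing sketch; the field layout shown ("weight first, then the rest") is an assumption for illustration, not the paper's actual schema:

```python
def parse_edge_attributes(attr_string):
    """Split an edge's comma-delimited attribute string into integers.
    By convention here the first value is the edge weight; the remaining
    fields are kept as-is. Hypothetical layout for illustration."""
    values = [int(s) for s in attr_string.split(",")]
    return {"weight": values[0], "extra": values[1:]}
```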
      <p>
        In this work we use Spark's RDD programming interface. A resilient distributed
dataset (RDD) is the main abstraction in Spark; it represents a read-only
collection of objects partitioned across a set of machines. Users can explicitly
cache an RDD in memory across machines and reuse it in multiple MapReduce-like
parallel operations [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The latest Spark program interface DataFrame [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] seems
to be more efficient; we plan to use it in future work.
      </p>
      <p>The feature calculation stage works as follows. We use the degrees, inDegrees
and outDegrees methods from the GraphX library; in the current implementation
we calculate the other features using join and map RDD operations. We
calculate features for all edges of the graph, because it costs almost the same
time as calculating features only for the edges used in the machine
learning stage.</p>
      <p>Our implementation uses simple operations (for example, map, filter) that
can be performed independently for each element of a dataset. The
implementation also uses expensive operations: distinct, subtract, join, and groupBy.
These operations are potentially expensive because they include a shuffle. The
shuffle is a Spark mechanism for re-distributing data so that it is grouped
differently across partitions. This typically involves copying data across the
cluster, making the shuffle a complex and costly operation. In our
implementation we often use the cache method to store data in main memory.</p>
      <p>In the machine learning stage we use the LogisticRegressionWithLBFGS,
SVMWithSGD and RandomForest methods from the MLlib library. Currently there
is no feature selection stage in our solution, but we plan to add one in future
work.</p>
      <sec id="sec-4-1">
        <title>Time Complexity in Apache Spark</title>
        <p>The sort algorithm used for the shuffle operation is not specified in the Apache
Spark documentation. We assume that the optimal complexity of parallel sort
algorithms is O(n log(n)/p), where p is the number of processes, which cannot
exceed n. We use this as a rough bound on the time complexity of the
sort algorithm inside the shuffle operation. Therefore, the complexity of a shuffle
is O(n log(n)/p), where n is the number of elements in the RDD or DataFrame and p
is the number of processes.</p>
        <p>The program's hot spot is the calculation of communities with R = 2. This
operation contains a join of two RDDs: the first RDD consists of pairs (a
neighbour of a vertex, the vertex) and the second consists of the reversed
pairs (the vertex, a neighbour of the vertex); i.e., after the join operation we
have the vertices with dist = 2 for every vertex in the graph. Both RDDs consist of
O(max_degree · min(N, M)) = O(min(N, M)) elements. So, the complexity of the
shuffle operation is O(min(N, M) log(min(N, M))/p).</p>
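The self-join described above can be emulated without Spark: joining (neighbour, vertex) pairs with (vertex, neighbour) pairs on the shared middle vertex yields, for each vertex, the vertices reachable through one intermediate vertex. A pure-Python stand-in for the GraphX join, shown for intuition only:

```python
from collections import defaultdict

def distance_two_pairs(edges):
    """For each vertex, collect the vertices reachable through exactly one
    intermediate vertex (i.e. via a path a - m - b), emulating the join of
    the two RDDs described in the text. Edges are treated as undirected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Joining on the middle vertex m pairs up every two of its neighbours.
    result = defaultdict(set)
    for m, neigh in adj.items():
        for a in neigh:
            for b in neigh:
                if a != b:
                    result[a].add(b)
    return result
```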
        <p>The complexity of the local calculations after the shuffle operation is
O(min(N, M)/p), where p is the number of parallel processes, because for each
element of the first RDD there exist at most max_degree elements of the second
RDD in the worst case (we extend the community with R = 1 into the community
with R = 2).</p>
        <p>Our theoretical and profiling analysis shows that the time of the remaining
program operations is insignificant. Eventually, our time complexity evaluation
of the feature calculation stage implemented in Apache Spark is O(min(N, M)
log(min(N, M))/p).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Performance Evaluation</title>
      <p>
        All presented results have been obtained on the Angara-K1 cluster. It has
36 nodes, but in this paper we use only 8 of them. Table 1 provides information
about the architecture and software of the Angara-K1 partition. All
Angara-K1 nodes are connected to each other by the Angara and 1 Gbit/s
Ethernet interconnects. The high-speed Angara interconnect is developed at NICEVT;
performance evaluation of the Angara-K1 cluster with the Angara interconnect on
scientific workloads is presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>We use the following default graph parameters: N = 2^19, M = 2^22. We
believe that this graph size is large enough for scalability evaluation, while
performance evaluation consumes a reasonable amount of time.</p>
      <p>In the figures the dashed line shows the theoretical evaluation. We plot this
line as follows: we take the leftmost point of the corresponding measured line and
multiply its value by the ratio of the corresponding asymptotic values. For example,
for weak scaling, if we measure time X at the single-core point, then for the 64-core
point the theoretical time is X · ((64N) log(64N)/64) / (N log(N)/1) =
X · log(64N)/log(N), where N is the number of vertices.</p>
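The scaling rule above reduces to a one-line formula. A small sketch (function name and parameters are illustrative):

```python
import math

def theoretical_weak_scaling_time(x_single, n_vertices, p):
    """Theoretical weak-scaling time for p cores, given the measured
    single-core time x_single on a graph with n_vertices vertices.
    Follows the paper's rule:
    X * ((p*N) log(p*N) / p) / (N log(N) / 1) = X * log(p*N) / log(N)."""
    return x_single * math.log(p * n_vertices) / math.log(n_vertices)
```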
      <p>Strong scaling on the default graph is shown in Fig. 1a and Fig. 1b. The
speedup of feature calculation from 1 to 8 cores on a single node is 3.41, but the
speedup from 1 to 8 nodes using 8 cores per node is 5.08. The reason for the poor
single-node scalability is that feature calculation is a memory-bound Spark
application. Fig. 2 confirms this by showing strong scaling on the same problem
with different numbers of cores per node.</p>
      <p>From the results in Fig. 1a, we can see that on a single node the feature
calculation stage requires more time than the machine learning stage, but feature
calculation scales fairly well.</p>
      <p>Strong scaling of the different machine learning algorithms is shown in Fig. 3.
We consider the LogisticRegressionWithLBFGS, SVMWithSGD and RandomForest
methods of the MLlib library. These methods scale only up to 4 cores of a
single cluster node.</p>
      <p>Weak scaling is shown in Fig. 4. For p processes (cores) we use a graph with
N = 2^14 · p vertices and M = 2^17 · p edges. The weak scaling results are poor.
One possible reason is that the Spark configuration is not optimal; future tuning
can address the problem.</p>
      <p>Fig. 5 shows the execution time of the program on graphs of different sizes.
We use 8 nodes of the Angara-K1 cluster. The feature calculation results are close
to the optimal theoretical line.</p>
      <p>The paper presents a performance evaluation of a typical supervised graph
anomaly detection problem implemented using GraphX and MLlib in Apache Spark
on a commodity cluster equipped with the Angara and 1 Gbit/s Ethernet
interconnects.</p>
      <p>The considered anomaly detection problem consists of feature calculation
and supervised machine learning. Feature calculation requires more time than
the machine learning stage on a single cluster node, but it scales well. The
machine learning stage implemented using the MLlib library does not scale
beyond a single cluster node.</p>
      <p>Our theoretical analysis shows that the performance results for strong
scaling, and for scaling over graphs of different sizes on a fixed cluster
configuration, are relatively good. It seems that Apache Spark applications are
memory bound, and running many cores per cluster node leads to lower efficiency.
This fact and the poor weak scaling results are subjects of future research.</p>
      <p>Acknowledgments. This research is being conducted with the financial support of
the Ministry of Education and Science of the Russian Federation; unique ID for
Applied Scientific Research (project) RFMEFI57816X0218. The data presented,
the statements made, and the views expressed are solely the responsibility of the
authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Apache Spark Homepage, http://spark.apache.org/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agarkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ismagilov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makagon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semenov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Performance evaluation of the Angara interconnect</article-title>
          .
          <source>In: Proceedings of the International Conference Russian Supercomputing Days</source>
          . pp.
          <volume>626</volume>
          {
          <issue>639</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Armbrust</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghodsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Or</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wendell</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Scaling spark in the real world: performance and usability</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>8</volume>
          (
          <issue>12</issue>
          ),
          <year>1840</year>
          {
          <year>1843</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Armbrust</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lian</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaftan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghodsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Spark sql: Relational data processing in spark</article-title>
          .
          <source>In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <volume>1383</volume>
          {
          <fpage>1394</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chaimov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malony</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iancu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibrahim</surname>
            ,
            <given-names>K.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivasan</surname>
          </string-name>
          , J.:
          <source>Scaling Spark on HPC systems</source>
          pp.
          <volume>97</volume>
          {
          <issue>110</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>MapReduce: Simplified data processing on large clusters</article-title>
          .
          <source>In: Proceedings of the 6th Conference on Symposium on Opearting Systems Design and Implementation - Volume 6. OSDI'04</source>
          ,
          USENIX Association
          , Berkeley, CA, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Dunner,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Parnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Atasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Sifalakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Pozidis</surname>
          </string-name>
          , H.:
          <article-title>High-performance distributed machine learning using Apache Spark</article-title>
          .
          <source>arXiv preprint arXiv:1612.01437</source>
          (
          <year>2016</year>
          ), https://arxiv.org/pdf/1612.01437.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Dunner,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Parnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.P.</given-names>
            ,
            <surname>Atasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Sifalakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Pozidis</surname>
          </string-name>
          , H.:
          <article-title>High-performance distributed machine learning using Apache Spark</article-title>
          .
          <source>CoRR abs/1612</source>
          .01437 (
          <year>2016</year>
          ), http://arxiv.org/abs/1612.01437
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Erdős</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rényi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On random graphs</article-title>
          .
          <source>Publicationes Mathematicae Debrecen</source>
          <volume>6</volume>
          ,
          <issue>290</issue>
          {
          <fpage>297</fpage>
          (
          <year>1959</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crankshaw</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
          </string-name>
          , I.:
          <article-title>GraphX: Graph processing in a distributed dataflow framework</article-title>
          .
          <source>OSDI 14</source>
          ,
          <issue>599</issue>
          {
          <fpage>613</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>I.s.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Na</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>W.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
          </string-name>
          , S.y.:
          <article-title>Performance evaluation of apache spark according to the number of nodes using principal component analysis</article-title>
          .
          <source>In: Proceedings of the 2015 International Conference on Big Data Applications and Services</source>
          . pp.
          <volume>98</volume>
          {
          <fpage>103</fpage>
          . BigDAS '15,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yavuz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sparks</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venkataraman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsai</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Owen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwalkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>MLlib: Machine learning in Apache Spark</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>17</volume>
          (
          <issue>34</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dongarra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Exascale computing and big data: The next frontier</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>7</issue>
          ),
          <fpage>56</fpage>
          -
          <lpage>68</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          :
          <article-title>Benchmarking Apache Spark with machine learning applications</article-title>
          (
          <year>2016</year>
          ), http://www.pdl.cmu.edu/PDL-FTP/BigLearning/CMUPDL-16-107.pdf
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Spark: Cluster computing with working sets</article-title>
          .
          <source>HotCloud 10</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>