<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A. Agarkov</string-name>
          <email>a.agarkov</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Semenov</string-name>
          <email>semenovg@nicevt.ru</email>
        </contrib>
        <aff>JSC NICEVT, Moscow, Russia</aff>
      </contrib-group>
      <fpage>92</fpage>
      <lpage>101</lpage>
      <abstract>
        <p>In this paper we consider an association problem with constraints for two dynamically enlarging tables. We consider a base full association algorithm and propose a partial association algorithm that improves efficiency of the base algorithm. We implement and evaluate the algorithms in Apache Spark for a particular case on a cluster with the Angara interconnect.</p>
      </abstract>
      <kwd-group>
        <kwd>association problem</kwd>
        <kwd>dynamically enlarging tables</kwd>
        <kwd>Apache Spark</kwd>
        <kwd>Angara interconnect</kwd>
        <kwd>performance evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years data-intensive applications have become widespread and
appear in many science and engineering areas (biology, bioinformatics, medicine,
cosmology, finance, social network analysis, cryptology, etc.). They are
characterized by a large amount of data, irregular workloads, unbalanced computations
and low sustained performance of computing systems. New
algorithmic approaches and programming technologies are urgently needed to boost the
efficiency of HPC systems for such applications, thus advancing the
convergence of HPC and Big Data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>In this paper we consider an association problem with constraints for two
dynamically enlarging tables. We have two large tables and an ordered set of
rule groups which determine associations between entries of the first table
and the second table. When two table entries compose an association by a rule
in the current rule group, these entries must be excluded from the association
process for the following rule groups. Each entry is associated with other entries
from both tables directly or indirectly through the other associations. It is
required to determine the association type and the list of associated entries for
each entry. Since the tables are dynamically enlarging, the goal is to improve the
potential performance of the association process by reusing the associations built on the
original tables.</p>
      <p>
        Apache Spark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a popular open-source implementation of Spark. Spark
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a framework that optimizes the programming and execution models of
MapReduce [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The current implementation of Apache Spark cannot efficiently
use advanced features (e.g. RDMA) of clusters with high-performance
interconnects. Researchers from Ohio State University proposed a high-performance
RDMA-based design for accelerating the Spark framework [
        <xref ref-type="bibr" rid="ref8 ref9">9, 8</xref>
        ]. Chaimov et al.
ported and tuned Spark on Cray XC systems in
production at a large supercomputing center [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The Mellanox company presented an
open-source Spark RDMA implementation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We consider the high-speed Angara
interconnect [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as a target of Spark optimization, but in the current work we
run Apache Spark through the TCP/IP interface on the Angara interconnect.
      </p>
      <p>In this paper we describe a base full association approach to the problem,
propose a partial association approach that improves efficiency of the base
approach, implement the corresponding algorithms using Apache Spark and present
evaluation results on a cluster with the Angara interconnect.</p>
      <p>We consider two large tables with M rows each. The tables have an identical structure;
each table has N fields. The unique key of a table consists of all N fields of the table.</p>
      <p>We consider an ordered set of rule groups, which determine associations
between entries of the first table and entries of the second table:</p>
      <list list-type="bullet">
        <list-item>
          <p>A rule is a set of fields which are used to compare two table entries. It is
required to build associations between the tables: to find matches between
different table entries by the rule.</p>
        </list-item>
        <list-item>
          <p>A group is a set of rules; rules of a group are applied to the table entries
independently of each other. When two table entries compose an association
by a rule in the current rule group, these entries are marked by the
current group number and must be excluded from the association process for
the following rule groups.</p>
        </list-item>
      </list>
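<p>The per-rule comparison above can be sketched in plain Java (outside Spark). The entry layout as a long[] of field values and the names RuleMatch and matches are illustrative assumptions, not taken from the paper's implementation:</p>

```java
// Illustrative sketch: an entry is an array of field values, and a rule
// is the set of field indices whose values must be equal in both entries.
class RuleMatch {
    // Two entries match a rule if they agree on every field the rule names.
    static boolean matches(long[] a, long[] b, int[] ruleFields) {
        for (int f : ruleFields) {
            if (a[f] != b[f]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long[] e1 = {7, 42, 100, 3, 555};
        long[] e2 = {7, 42, 100, 9, 777};   // differs in the last two fields
        int[] ruleAll   = {0, 1, 2, 3, 4};  // all five fields must be equal
        int[] ruleFirst = {0, 1, 2};        // only the first three fields must match
        System.out.println(matches(e1, e2, ruleAll));   // false
        System.out.println(matches(e1, e2, ruleFirst)); // true
    }
}
```

<p>Rules of one group would simply be applied as independent calls to matches over the same pair of entries.</p>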
      <p>Each table entry can be associated with one or many entries of the other table.
Moreover, association is a transitive relation. The associations of each entry can be
classified into one of four association types: one entry of the first table to one entry of
the second table, one to many, many to one, and many to many (1:1, 1:M,
M:1, M:N). Therefore each entry is associated with other entries from both
tables directly or indirectly through the other associations.</p>
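<p>The four association types can be recovered from the association edges alone: a connected component with one left-table vertex and one right-table vertex is 1:1, one left and many right is 1:M, and so on. A minimal union-find sketch (the vertex encoding and class name are our illustrative assumptions, not the paper's code):</p>

```java
import java.util.*;

// Illustrative sketch: classify each connected component of the bipartite
// association graph as 1:1, 1:M, M:1 or M:N by counting how many distinct
// left-table ("L<id>") and right-table ("R<id>") vertices it contains.
class AssocType {
    static Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        while (!parent.get(x).equals(x)) {
            parent.put(x, parent.get(parent.get(x))); // path halving
            x = parent.get(x);
        }
        return x;
    }

    static void union(String a, String b) { parent.put(find(a), find(b)); }

    // edges: (leftId, rightId) association pairs; returns component root -> type
    static Map<String, String> classify(long[][] edges) {
        parent.clear();
        for (long[] e : edges) union("L" + e[0], "R" + e[1]);
        Map<String, Set<String>> lefts = new HashMap<>(), rights = new HashMap<>();
        for (long[] e : edges) {
            String root = find("L" + e[0]);
            lefts.computeIfAbsent(root, k -> new HashSet<>()).add("L" + e[0]);
            rights.computeIfAbsent(root, k -> new HashSet<>()).add("R" + e[1]);
        }
        Map<String, String> type = new HashMap<>();
        for (String root : lefts.keySet()) {
            int l = lefts.get(root).size(), r = rights.get(root).size();
            type.put(root, (l == 1 ? "1" : "M") + ":" + (r == 1 ? "1" : (l == 1 ? "M" : "N")));
        }
        return type;
    }
}
```

<p>For example, the edges (1,10), (1,11) form one 1:M component, while (2,20), (3,20) form an M:1 component.</p>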
      <p>The goal of the table association problem is to determine the association
type and the list of associated entries for each entry.</p>
      <p>After we build associations between the tables, K new entries are added to
each table. The added entries differ from the original entries in a given subset of fields.
The association process needs to be repeated so that the augmented tables are associated
too.</p>
      <p>The full association approach generates associations between the given tables by
the mentioned set of rules from scratch; it has to build associations
between the augmented tables anew.</p>
      <sec id="sec-1-1">
        <title>Goal of the Dynamically Enlarging Table Association Problem</title>
        <p>The goal of the dynamically enlarging table association problem is to
improve the potential performance of the full association approach on the augmented
tables by reusing the associations built on the original tables.</p>
        <p>For the sake of simplicity, in this paper we consider a particular case of the
problem.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Data Structure and Association Rules</title>
        <p>In our work each table entry has 5 fields, where the first and the second fields
are integer identifiers and the three other fields are data fields. The unique key for every
entry is the set of all five fields. Each entry has a unique synthetic identifier.</p>
        <p>In this work the considered ordered set of rule groups consists of 5 groups
and 15 rules, see Table 1. The symbol `+' denotes an equality requirement on the
corresponding fields of the two table entries. The symbol `&#x2013;' denotes that the fields of the two
table entries are not matched. Thus, a key that determines an association
between two table entries is specified for each rule and consists of the fields
marked with the `+' symbol.</p>
        <p>The full association approach matches each entry of the first table
with each entry of the second table by each rule of the current rule group.
If the matching is successful, we create and store an association between the
entries; the association is marked by the current group number. Entries that do
not have any associations are marked by group number six.</p>
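<p>A minimal sequential sketch of this matching stage, assuming entries are long[] arrays, a rule is an int[] of field indices, and groups are applied in order (a hash join per rule stands in for Spark's distributed join; all names are our illustrative choices):</p>

```java
import java.util.*;

// Illustrative sketch of the matching stage: groups are applied in order,
// and entries matched in one group are excluded from all following groups.
class FullAssoc {
    // Build a join key from the fields a rule names.
    static String key(long[] e, int[] rule) {
        StringBuilder sb = new StringBuilder();
        for (int f : rule) sb.append(e[f]).append('|');
        return sb.toString();
    }

    // returns triples (i, j, groupNo): entry i of t1 associates with entry j of t2
    static List<int[]> match(long[][] t1, long[][] t2, int[][][] groups) {
        boolean[] done1 = new boolean[t1.length], done2 = new boolean[t2.length];
        List<int[]> assoc = new ArrayList<>();
        for (int g = 0; g < groups.length; g++) {
            Set<Integer> hit1 = new HashSet<>(), hit2 = new HashSet<>();
            for (int[] rule : groups[g]) {           // rules of a group are independent
                Map<String, List<Integer>> byKey = new HashMap<>();
                for (int i = 0; i < t1.length; i++)
                    if (!done1[i]) byKey.computeIfAbsent(key(t1[i], rule), k -> new ArrayList<>()).add(i);
                for (int j = 0; j < t2.length; j++) {
                    if (done2[j]) continue;
                    for (int i : byKey.getOrDefault(key(t2[j], rule), Collections.emptyList())) {
                        assoc.add(new int[]{i, j, g + 1});
                        hit1.add(i); hit2.add(j);
                    }
                }
            }
            for (int i : hit1) done1[i] = true;      // exclude matched entries
            for (int j : hit2) done2[j] = true;      // from the following groups
        }
        return assoc;
    }
}
```

<p>Note that matched entries are excluded only after the whole group has been processed, since rules of one group are applied independently of each other.</p>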
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Algorithms</title>
      <sec id="sec-2-1">
        <title>Full Association Algorithm</title>
        <p>The full association algorithm consists of two stages: association matching and
transitive closure. The first stage actually implements the full association
approach. All possible pairs are found by the first group of rules, then every entry
that is included in the pairs is excluded from the tables. This procedure is
repeated for each group of rules. The result of the stage is a set of associations (they
will be graph edges) between entries (they will be graph vertices).</p>
        <p>At the second stage the transitive closure (TC) algorithm is executed for each
selected group. At first, we construct a bipartite graph. Vertices in the left vertex
set are unique identifiers of the entries of the first table; in the right vertex set
there are unique identifiers of the entries of the second table. There is an
edge between two vertices of the different graph parts if the association between the
corresponding entries has been found during the first stage of the algorithm.</p>
        <p>
          Transitive closure [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] TC is computed by the following formula:
        </p>
        <p>TC = &#x222A;<sub>i=1,2,&#x2026;</sub> R<sub>i</sub>, where R<sub>i+1</sub> = R<sub>i</sub> &#x22C8; E, R<sub>1</sub> = E, and E is the set of graph edges. (1)</p>
        <p>The transitive closure is built by repeatedly merging the result of a join operation
between the previous resulting set of edges and the original set of graph edges until the
result stops changing, i.e. a fixed point is reached. Thus, for each vertex in TC
there exist vertex pairs that connect the current vertex with the other vertices in its
connected component.</p>
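<p>The fixed-point iteration of formula (1) can be sketched sequentially as follows; the symmetric pair encoding and the class name are our illustrative choices, not the paper's implementation:</p>

```java
import java.util.*;

// Illustrative fixed-point sketch of formula (1): start from the edge set E,
// repeatedly join the current pair set with E, and stop when nothing new is
// produced. Vertices are strings such as "L1" / "R1", and pairs are kept
// symmetric, so the result connects every vertex to every other vertex in
// its connected component.
class TransClosure {
    static Set<List<String>> closure(String[][] e) {
        Set<List<String>> tc = new HashSet<>();
        Map<String, Set<String>> adj = new HashMap<>();
        for (String[] p : e) {                      // R_1 = E (both directions)
            tc.add(List.of(p[0], p[1])); tc.add(List.of(p[1], p[0]));
            adj.computeIfAbsent(p[0], k -> new HashSet<>()).add(p[1]);
            adj.computeIfAbsent(p[1], k -> new HashSet<>()).add(p[0]);
        }
        boolean changed = true;
        while (changed) {                           // iterate until fixed point
            changed = false;
            for (List<String> pair : new ArrayList<>(tc)) {
                for (String next : adj.getOrDefault(pair.get(1), Set.of())) {
                    // R_{i+1} = R_i join E may add a new pair
                    if (!next.equals(pair.get(0)) && tc.add(List.of(pair.get(0), next)))
                        changed = true;
                }
            }
        }
        return tc;
    }
}
```

<p>On the edge set {(L1, R1), (R1, L2)} this produces, among others, the pair (L1, L2), connecting the two left-table vertices of the component.</p>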
        <p>Finally, the association type of each vertex is defined (1:1, 1:M, M:1, M:N).</p>
      </sec>
      <sec id="sec-2-2">
        <title>Partial Association Algorithm</title>
        <p>There are the original (old) tables, the associations that have been built for the old
tables, and new tables that are smaller than the original ones. The added entries
differ from the original entries in the #3 field.</p>
        <p>When new entries are added to the original tables, one can apply the full
association algorithm to the augmented tables from scratch. We propose a partial
association algorithm that improves performance by reusing the associations
built for the original tables.</p>
        <p>The partial association algorithm is also executed in two stages: association
matching and transitive closure.</p>
        <p>It is important that the added (new) entries differ from the original entries in
the #3 field. The main idea of the first stage is to match only new entries of
the tables for each rule with a matching requirement on the #3 field; in that case
there will be no associations between new and old entries. For each rule without a
matching requirement on the last field, new and old entries must be matched, see
Figure 1.</p>
        <p>The association matching stage differs from the same stage of the full
association algorithm. Each entry included in new associations must be excluded
from old associations. As seen in Figure 2, if a new entry is associated with an
old entry, and the group number new_gn of this association is smaller than the group
number old_gn of the association between the old entry and another entry, then
these old associations must be removed; if new_gn is equal to old_gn, then the
new entry should be added to the component.</p>
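<p>The invalidation rule of Figure 2 can be condensed into a few lines; assuming, for illustration only, that each old entry carries the group number of its existing association:</p>

```java
import java.util.*;

// Illustrative sketch of the invalidation rule in the partial algorithm's
// matching stage: oldGn maps an old entry id to the group number of its
// existing association; newAssoc lists (oldEntryId, newGn) pairs produced
// by matching new entries against old ones. All names are illustrative.
class PartialAssoc {
    // returns ids of old entries whose old associations must be removed
    static Set<Long> invalidate(Map<Long, Integer> oldGn, long[][] newAssoc) {
        Set<Long> removed = new HashSet<>();
        for (long[] a : newAssoc) {
            long oldEntry = a[0];
            int newGn = (int) a[1];
            Integer oldG = oldGn.get(oldEntry);
            if (oldG != null && newGn < oldG) removed.add(oldEntry);
            // if newGn == oldG the new entry simply joins the component,
            // so nothing is removed
        }
        return removed;
    }
}
```

<p>Only a strictly smaller new group number invalidates old associations; on equality the new entry just joins the existing component.</p>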
        <p>The transitive closure stage is executed only for new associations. The
resulting graph of the transitive closure is combined with the old graph, with invalid
associations excluded during the matching stage.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation Details</title>
      <p>
        Apache Spark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a popular open-source implementation of Spark. It provides
programmers with an application programming interface centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of
data items distributed over a cluster of machines that is maintained in a
fault-tolerant way [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It was developed in response to limitations of the MapReduce
cluster computing paradigm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which forces a particular linear dataflow
structure on distributed programs: MapReduce programs read input data from disk,
map a function across the data, reduce the results of the map, and store the
reduction results on disk. Spark's RDDs function as a working set for distributed
programs that offers a (deliberately) restricted form of distributed shared
memory [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The latest Spark program interface, DataFrame [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], seems to be more
efficient than the RDD interface, but in the current work we use RDDs and
plan to use DataFrames in future research.
      </p>
      <p>We use Java 8 and Apache Spark 1.6.1 to implement the full and
partial association algorithms. We use an RDD of type Tuple5&lt;Long, Long, Long, Long,
Long&gt; for the table structure representation; the sequence of types in Tuple5
corresponds to the table fields ID1, ID2, #1, #2, #3. We attach a unique identifier
(Long) to the Tuple5 of each entry.</p>
      <p>After the association stage we have an RDD of Tuple2&lt;Long, Long&gt; that
represents an association between the unique identifiers of two table entries.</p>
      <sec id="sec-3-1">
        <title>Synthetic Table Generator</title>
        <p>The synthetic table generator creates distributed random tables and works as follows.
First, two identical tables of the required size are generated. Each field value is
a uniformly distributed random integer number in the following intervals:</p>
        <list list-type="bullet">
          <list-item><p>ID1, ID2 &#x2013; [0; 10000),</p></list-item>
          <list-item><p>#1 &#x2013; [0; 1000000),</p></list-item>
          <list-item><p>#3 &#x2013; [0; 1000).</p></list-item>
        </list>
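<p>The interval scheme above can be sketched as a tiny sequential generator (the class name and seeding are our illustrative choices; the real generator is distributed):</p>

```java
import java.util.Random;

// Illustrative sketch of the generator's field intervals: each original entry
// draws ID1, ID2 in [0, 10000), #1 in [0, 1000000) and #3 in [0, 1000);
// #2 is the entry's position number.
class TableGen {
    static long[][] generate(int m, long seed) {
        Random rnd = new Random(seed);
        long[][] table = new long[m][5];
        for (int pos = 0; pos < m; pos++) {
            table[pos][0] = rnd.nextInt(10000);    // ID1
            table[pos][1] = rnd.nextInt(10000);    // ID2
            table[pos][2] = rnd.nextInt(1000000);  // #1
            table[pos][3] = pos;                   // #2: position number
            table[pos][4] = rnd.nextInt(1000);     // #3
        }
        return table;
    }
}
```

<p>Generating the two tables from the same seed yields identical copies, after which the second table is randomly modified as described below.</p>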
        <p>The value of the #2 field is the position number of the entry.</p>
        <p>We randomly modify second-table entries in order to create the possibility of
association between entries of the first and the second tables for each rule. We
modify the entry fields that are marked with the `&#x2013;' symbol in Table 1. The distribution
of modifications over the rules is shown in Table 2. 72% of the second-table entries
remain unchanged. In 2% of the table entries there are random modifications of
the #3 field values. In 6% of the table entries there are random modifications of
the #2 field values, and so on. As a result, 72% of the table entries correspond to
the first rule, 2% to the second rule, 6% to the third rule, and so on.</p>
        <p>We generate the augmented tables as follows. First, we add new entries to the
first table; each field value of a new entry is a uniformly distributed random integer
number in the following intervals:</p>
        <list list-type="bullet">
          <list-item><p>ID1, ID2 &#x2013; [0; 10000),</p></list-item>
          <list-item><p>#1 &#x2013; [0; 1000000),</p></list-item>
          <list-item><p>#3 &#x2013; [1000; 2000).</p></list-item>
        </list>
        <p>The value of the #2 field is the position number of the new entry in the whole
table. As can be seen, the old table and the new table have different values of
the #3 field.</p>
        <p>Second, we copy the augmented part of the first table to the second table and
randomly modify it as described in Table 2.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Performance Evaluation</title>
      <p>
        All presented results are obtained on the Angara-K1 cluster. We use 12 out of 36
nodes. All Angara-K1 nodes are linked by the Angara interconnect. The Russian
high-speed Angara interconnect is developed at NICEVT; a performance evaluation of
the Angara-K1 cluster with the Angara interconnect on scientific workloads is
presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In the current work we run Apache Spark through the TCP/IP
interface on the Angara interconnect. Table 3 provides an architecture and
software overview of the Angara-K1 partition.
      </p>
      <p>Figures 3, 4, 5 and 6 show the comparison results of the full and partial
association algorithms. The reported running times do not include reading input data
and writing the result to the filesystem. Table 4 presents the total table sizes in
GB during execution for different entry numbers.
The algorithm running times are shown in Figure 3; we use 8 cores per node
and 8 nodes of the cluster, and the table size is varied. The old table size is 300 million
entries, the new table size is 75 million entries. The figure shows that the performance
difference between the algorithms grows with the table size.</p>
      <p>Strong scaling is shown in Figure 4. The old table size is 100 million entries, the new
table size is 25 million entries. The speedup of the full and partial association
algorithms is approximately 3 on 8 nodes. A likely reason for the moderate
performance is that the Spark configuration is not optimal; further
tuning can address the problem. The horizontal line from 8 to 12 nodes indicates
that the table size is too small for further performance increase.</p>
      <p>Profiling results are shown in Figure 5. The shaded color denotes the association
matching stage (stage #1), the normal color denotes the transitive closure stage
(stage #2). The old table size is 300 million entries, the new table size is 50 million
entries. It can be seen that the partial algorithm optimizes primarily the transitive
closure stage.</p>
      <p>The running time on 4 nodes is more than two times smaller than on 2 nodes,
because the problem size is too large for 2 nodes and the garbage collector occupies
a significant part of the time.</p>
      <p>The dependence of the algorithm running times on the amount of new data is
shown in Figure 6. We use 6 nodes and 300 million entries in each table; the
fraction of new table entries varies from 12.5 to 100 percent of the total table
size. The smaller the percentage of new data, the faster the partial association
algorithm executes. The running time of the full association algorithm does
not change, because the total amount of data does not change.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we propose the partial association algorithm for the table
association problem of two dynamically enlarging tables with specific constraints.
For the sake of simplicity we consider a particular case of the problem. We
implement the base full association algorithm and the proposed algorithm
using Apache Spark and present a performance evaluation of the algorithms on a
cluster equipped with the Angara interconnect. The performance of the proposed
algorithm exceeds the performance of the full association algorithm for a variety of
data sets.</p>
      <p>In future work we plan to perform detailed profiling of the implemented algorithms
in terms of Apache Spark internal operations and to optimize Apache Spark for
the Angara interconnect.</p>
      <p>Acknowledgments. The work was supported by the grant No. 17-07-01592A
of the Russian Foundation for Basic Research (RFBR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Apache Spark Homepage, http://spark.apache.org/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Mellanox SparkRDMA, https://github.com/Mellanox/SparkRDMA</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Abiteboul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hull</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vianu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (eds.):
          <article-title>Foundations of Databases: The Logical Level</article-title>
          . Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edn. (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Agarkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ismagilov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makagon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semenov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Performance evaluation of the Angara interconnect</article-title>
          .
          <source>In: Proceedings of the International Conference Russian Supercomputing Days</source>
          . pp.
          <fpage>626</fpage>
          &#x2013;
          <lpage>639</lpage>
          (
          <year>2016</year>
          ), http://www.dislab.org/docs/rsd2016-angara-bench.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Armbrust</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lian</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaftan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghodsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Spark SQL: Relational data processing in Spark</article-title>
          .
          <source>In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <fpage>1383</fpage>
          &#x2013;
          <lpage>1394</lpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chaimov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malony</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iancu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibrahim</surname>
            ,
            <given-names>K.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Scaling Spark on HPC systems</source>
          . pp.
          <fpage>97</fpage>
          &#x2013;
          <lpage>110</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>MapReduce: Simplified data processing on large clusters</article-title>
          .
          <source>In: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation - Volume 6. OSDI'04</source>
          ,
          <string-name>
            <given-names>USENIX</given-names>
            <surname>Association</surname>
          </string-name>
          , Berkeley, CA, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gugnani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panda</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>High-performance design of Apache Spark with RDMA and its benefits on various workloads</article-title>
          (
          <year>December 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>M.W.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Islam</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panda</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          :
          <article-title>Accelerating Spark with RDMA for big data processing: Early experiences</article-title>
          .
          <source>In: Proceedings of the 2014 IEEE 22Nd Annual Symposium on High-Performance Interconnects</source>
          . pp.
          <fpage>9</fpage>
          &#x2013;
          <lpage>16</lpage>
          . HOTI '14, IEEE Computer Society, Washington, DC, USA (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dongarra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Exascale computing and Big Data: The next frontier</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>7</issue>
          ),
          <fpage>56</fpage>
          &#x2013;
          <lpage>68</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCauley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing</article-title>
          .
          <source>In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation</source>
          .
          <source>NSDI'12</source>
          ,
          <string-name>
            <given-names>USENIX</given-names>
            <surname>Association</surname>
          </string-name>
          , Berkeley, CA, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Spark: Cluster computing with working sets</article-title>
          .
          <source>HotCloud 10</source>
          ,
          <issue>7</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>