<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A. Agarkov</string-name>
          <email>a.agarkov</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Semenov</string-name>
          <email>semenovg@nicevt.ru</email>
        </contrib>
        <aff>JSC NICEVT, Moscow, Russia</aff>
      </contrib-group>
      <fpage>92</fpage>
      <lpage>101</lpage>
      <abstract>
        <p>In this paper we consider an association problem with constraints for two dynamically enlarging tables. We consider a base full association algorithm and propose a partial association algorithm that improves efficiency of the base algorithm. We implement and evaluate the algorithms in Apache Spark for a particular case on a cluster with the Angara interconnect.</p>
      </abstract>
      <kwd-group>
        <kwd>association problem</kwd>
        <kwd>dynamically enlarging tables</kwd>
        <kwd>Apache Spark</kwd>
        <kwd>Angara interconnect</kwd>
        <kwd>performance evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years data-intensive applications have become widespread and
appear in many science and engineering areas (biology, bioinformatics, medicine,
cosmology, finance, social network analysis, cryptology, etc.). They are
characterized by a large amount of data, irregular workloads, unbalanced computations
and low sustained performance of computing systems. New
algorithmic approaches and programming technologies are urgently needed to boost the
efficiency of HPC systems for such applications, thus advancing the
convergence of HPC and Big Data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>In this paper we consider an association problem with constraints for two
dynamically enlarging tables. We have two large tables and an ordered set of
rule groups which determine associations between entries of the first table
and the second table. When two table entries compose an association by a rule
in the current rule group, these entries must be excluded from the association
process for the following rule groups. Each entry is associated with other entries
from both tables directly or indirectly through the other associations. It is
required to determine the association type and the list of associated entries for
each entry. Since the tables are dynamically enlarging, the goal is to improve the
potential performance of the association process by reusing the associations built on the
original tables.</p>
      <p>
        Apache Spark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a popular open-source implementation of Spark. Spark
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a framework that optimizes the programming and execution models of
MapReduce [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The current implementation of Apache Spark cannot efficiently
use advanced features (e.g. RDMA) of clusters with high-performance
interconnects. Researchers from Ohio State University proposed a high-performance
RDMA-based design for accelerating the Spark framework [
        <xref ref-type="bibr" rid="ref8 ref9">9, 8</xref>
        ]. Chaimov et al.
ported and tuned Spark on Cray XC systems in
production at a large supercomputing center [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The Mellanox company presented an
open-source Spark RDMA implementation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We consider the high-speed Angara
interconnect [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as a target of Spark optimization, but in the current work we
run Apache Spark through the TCP/IP interface on the Angara interconnect.
      </p>
      <p>In this paper we describe a base full association approach to the problem,
propose a partial association approach that improves efficiency of the base
approach, implement the corresponding algorithms using Apache Spark and present
evaluation results on a cluster with the Angara interconnect.</p>
      <p>We consider two large tables with M rows each. The tables have an identical structure;
each table has N fields. The unique key of a table consists of all N fields of the table.</p>
      <p>We consider an ordered set of rule groups, which determine associations
between entries of the first table and entries of the second table:</p>
      <list list-type="bullet">
        <list-item>
          <p>A rule is a set of fields which are used to compare two table entries. It is
required to build associations between the tables: to find matches between
different table entries by the rule.</p>
        </list-item>
        <list-item>
          <p>A group is a set of rules; rules of a group are applied to the table entries
independently of each other. When two table entries compose an association
by a rule in the current rule group, these entries are marked by the
current group number and must be excluded from the association process for
the following rule groups.</p>
        </list-item>
      </list>
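<p>The per-rule comparison above can be sketched in plain Java (outside Spark). The entry layout as a long[] of field values and the names RuleMatch and matches are illustrative assumptions, not taken from the paper's implementation:</p>

```java
// Illustrative sketch: an entry is an array of field values, and a rule
// is the set of field indices whose values must be equal in both entries.
class RuleMatch {
    // Two entries match a rule if they agree on every field the rule names.
    static boolean matches(long[] a, long[] b, int[] ruleFields) {
        for (int f : ruleFields) {
            if (a[f] != b[f]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long[] e1 = {7, 42, 100, 3, 555};
        long[] e2 = {7, 42, 100, 9, 777};   // differs in the last two fields
        int[] ruleAll   = {0, 1, 2, 3, 4};  // all five fields must be equal
        int[] ruleFirst = {0, 1, 2};        // only the first three fields must match
        System.out.println(matches(e1, e2, ruleAll));   // false
        System.out.println(matches(e1, e2, ruleFirst)); // true
    }
}
```

<p>Rules of one group would simply be applied as independent calls to matches over the same pair of entries.</p>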
      <p>Each table entry can be associated with one or many entries of the other table.
Moreover, association is a transitive relation. The associations of each entry can be
classified into one of four association types: one entry of the first table to one entry of
the second table, one to many, many to one, and many to many (1:1, 1:M,
M:1, M:N). Therefore each entry is associated with other entries from both
tables directly or indirectly through the other associations.</p>
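<p>The four association types can be recovered from the association edges alone: a connected component with one left-table vertex and one right-table vertex is 1:1, one left and many right is 1:M, and so on. A minimal union-find sketch (the vertex encoding and class name are our illustrative assumptions, not the paper's code):</p>

```java
import java.util.*;

// Illustrative sketch: classify each connected component of the bipartite
// association graph as 1:1, 1:M, M:1 or M:N by counting how many distinct
// left-table ("L<id>") and right-table ("R<id>") vertices it contains.
class AssocType {
    static Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        while (!parent.get(x).equals(x)) {
            parent.put(x, parent.get(parent.get(x))); // path halving
            x = parent.get(x);
        }
        return x;
    }

    static void union(String a, String b) { parent.put(find(a), find(b)); }

    // edges: (leftId, rightId) association pairs; returns component root -> type
    static Map<String, String> classify(long[][] edges) {
        parent.clear();
        for (long[] e : edges) union("L" + e[0], "R" + e[1]);
        Map<String, Set<String>> lefts = new HashMap<>(), rights = new HashMap<>();
        for (long[] e : edges) {
            String root = find("L" + e[0]);
            lefts.computeIfAbsent(root, k -> new HashSet<>()).add("L" + e[0]);
            rights.computeIfAbsent(root, k -> new HashSet<>()).add("R" + e[1]);
        }
        Map<String, String> type = new HashMap<>();
        for (String root : lefts.keySet()) {
            int l = lefts.get(root).size(), r = rights.get(root).size();
            type.put(root, (l == 1 ? "1" : "M") + ":" + (r == 1 ? "1" : (l == 1 ? "M" : "N")));
        }
        return type;
    }
}
```

<p>For example, the edges (1,10), (1,11) form one 1:M component, while (2,20), (3,20) form an M:1 component.</p>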
      <p>The goal of the table association problem is to determine the association
type and the list of associated entries for each entry.</p>
      <p>After we build associations between the tables, K new entries are added to
each table. The added entries differ from the original entries in a given subset of fields.
The association process needs to be repeated so that the augmented tables are associated
too.</p>
      <p>The full association approach generates associations between the given tables by
the mentioned set of rules from scratch; it has to build associations
between the augmented tables anew.</p>
      <sec id="sec-1-1">
        <title>Goal of the Dynamically Enlarging Table Association Problem</title>
        <p>The goal of the dynamically enlarging table association problem is to
improve the potential performance of the full association approach on the augmented
tables by reusing the associations built on the original tables.</p>
        <p>For the sake of simplicity, in this paper we consider a particular case of the
problem.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Data Structure and Association Rules</title>
        <p>In our work each table entry has 5 fields, where the first and the second fields
are integer identifiers and the three other fields are data fields. The unique key for every
entry is the set of all five fields. Each entry has a unique synthetic identifier.</p>
        <p>In this work the considered ordered set of rule groups consists of 5 groups
and 15 rules, see Table 1. The symbol `+' denotes an equality requirement on the
corresponding fields of the two table entries. The symbol `&#x2013;' denotes that the fields of the two
table entries are not matched. Thus, a key that determines an association
between two table entries is specified for each rule and consists of the fields
marked with the `+' symbol.</p>
        <p>The full association approach matches each entry of the first table
with each entry of the second table by each rule of the current rule group.
If the matching is successful, we create and store an association between the
entries; the association is marked by the current group number. Entries that do
not have any associations are marked by group number six.</p>
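<p>A minimal sequential sketch of this matching stage, assuming entries are long[] arrays, a rule is an int[] of field indices, and groups are applied in order (a hash join per rule stands in for Spark's distributed join; all names are our illustrative choices):</p>

```java
import java.util.*;

// Illustrative sketch of the matching stage: groups are applied in order,
// and entries matched in one group are excluded from all following groups.
class FullAssoc {
    // Build a join key from the fields a rule names.
    static String key(long[] e, int[] rule) {
        StringBuilder sb = new StringBuilder();
        for (int f : rule) sb.append(e[f]).append('|');
        return sb.toString();
    }

    // returns triples (i, j, groupNo): entry i of t1 associates with entry j of t2
    static List<int[]> match(long[][] t1, long[][] t2, int[][][] groups) {
        boolean[] done1 = new boolean[t1.length], done2 = new boolean[t2.length];
        List<int[]> assoc = new ArrayList<>();
        for (int g = 0; g < groups.length; g++) {
            Set<Integer> hit1 = new HashSet<>(), hit2 = new HashSet<>();
            for (int[] rule : groups[g]) {           // rules of a group are independent
                Map<String, List<Integer>> byKey = new HashMap<>();
                for (int i = 0; i < t1.length; i++)
                    if (!done1[i]) byKey.computeIfAbsent(key(t1[i], rule), k -> new ArrayList<>()).add(i);
                for (int j = 0; j < t2.length; j++) {
                    if (done2[j]) continue;
                    for (int i : byKey.getOrDefault(key(t2[j], rule), Collections.emptyList())) {
                        assoc.add(new int[]{i, j, g + 1});
                        hit1.add(i); hit2.add(j);
                    }
                }
            }
            for (int i : hit1) done1[i] = true;      // exclude matched entries
            for (int j : hit2) done2[j] = true;      // from the following groups
        }
        return assoc;
    }
}
```

<p>Note that matched entries are excluded only after the whole group has been processed, since rules of one group are applied independently of each other.</p>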
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Algorithms</title>
      <sec id="sec-2-1">
        <title>Full Association Algorithm</title>
        <p>The full association algorithm consists of two stages: association matching and
transitive closure. The first stage actually implements the full association
approach. All possible pairs are found by the first group of rules, then every entry
that is included in the pairs is excluded from the tables. This procedure is
repeated for each group of rules. The result of the stage is a set of associations (they
will be graph edges) between entries (they will be graph vertices).</p>
        <p>At the second stage the transitive closure (TC) algorithm is executed for each
selected group. At first, we construct a bipartite graph. Vertices in the left vertex
set are unique identifiers of the entries of the first table; in the right vertex set
there are unique identifiers of the entries of the second table. There is an
edge between two vertices of the different graph parts if the association between the
corresponding entries has been found during the first stage of the algorithm.</p>
        <p>
          Transitive closure [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] TC is computed by the following formula:
        </p>
        <p>TC = &#x222A;<sub>i=1,2,&#x2026;</sub> R<sub>i</sub>, where R<sub>i+1</sub> = R<sub>i</sub> &#x22C8; E, R<sub>1</sub> = E, and E is the set of graph edges. (1)</p>
        <p>The transitive closure is built by repeatedly merging the result of a join operation
between the previous resulting set of edges and the original set of graph edges until the
result stops changing, i.e. a fixed point is reached. Thus, for each vertex in TC
there exist vertex pairs that connect the current vertex with the other vertices in its
connected component.</p>
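<p>The fixed-point iteration of formula (1) can be sketched sequentially as follows; the symmetric pair encoding and the class name are our illustrative choices, not the paper's implementation:</p>

```java
import java.util.*;

// Illustrative fixed-point sketch of formula (1): start from the edge set E,
// repeatedly join the current pair set with E, and stop when nothing new is
// produced. Vertices are strings such as "L1" / "R1", and pairs are kept
// symmetric, so the result connects every vertex to every other vertex in
// its connected component.
class TransClosure {
    static Set<List<String>> closure(String[][] e) {
        Set<List<String>> tc = new HashSet<>();
        Map<String, Set<String>> adj = new HashMap<>();
        for (String[] p : e) {                      // R_1 = E (both directions)
            tc.add(List.of(p[0], p[1])); tc.add(List.of(p[1], p[0]));
            adj.computeIfAbsent(p[0], k -> new HashSet<>()).add(p[1]);
            adj.computeIfAbsent(p[1], k -> new HashSet<>()).add(p[0]);
        }
        boolean changed = true;
        while (changed) {                           // iterate until fixed point
            changed = false;
            for (List<String> pair : new ArrayList<>(tc)) {
                for (String next : adj.getOrDefault(pair.get(1), Set.of())) {
                    // R_{i+1} = R_i join E may add a new pair
                    if (!next.equals(pair.get(0)) && tc.add(List.of(pair.get(0), next)))
                        changed = true;
                }
            }
        }
        return tc;
    }
}
```

<p>On the edge set {(L1, R1), (R1, L2)} this produces, among others, the pair (L1, L2), connecting the two left-table vertices of the component.</p>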
        <p>Finally, the association type of each vertex is defined (1:1, 1:M, M:1, M:N).</p>
      </sec>
      <sec id="sec-2-2">
        <title>Partial Association Algorithm</title>
        <p>There are the original (old) tables, the associations that have been built for the old
tables, and new tables that are smaller than the original ones. The added entries
differ from the original entries in the #3 field.</p>
        <p>When new entries are added to the original tables, one can apply the full
association algorithm to the augmented tables from scratch. We propose a partial
association algorithm that improves performance by reusing the associations
built for the original tables.</p>
        <p>The partial association algorithm is also executed in two stages: association
matching and transitive closure.</p>
        <p>It is important that the added (new) entries differ from the original entries in
the #3 field. The main idea of the first stage is to match only new entries of
the tables for each rule with a matching requirement on the #3 field; in that case
there will be no associations between new and old entries. For each rule without a
matching requirement on the last field, new and old entries must be matched, see
Figure 1.</p>
        <p>The association matching stage differs from the same stage of the full
association algorithm. Each entry included in new associations must be excluded
from old associations. As seen in Figure 2, if a new entry is associated with an
old entry, and the group number new_gn of this association is smaller than the group
number old_gn of the association between the old entry and another entry, then
these old associations must be removed; if new_gn is equal to old_gn, then the
new entry should be added to the component.</p>
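<p>The invalidation rule of Figure 2 can be condensed into a few lines; assuming, for illustration only, that each old entry carries the group number of its existing association:</p>

```java
import java.util.*;

// Illustrative sketch of the invalidation rule in the partial algorithm's
// matching stage: oldGn maps an old entry id to the group number of its
// existing association; newAssoc lists (oldEntryId, newGn) pairs produced
// by matching new entries against old ones. All names are illustrative.
class PartialAssoc {
    // returns ids of old entries whose old associations must be removed
    static Set<Long> invalidate(Map<Long, Integer> oldGn, long[][] newAssoc) {
        Set<Long> removed = new HashSet<>();
        for (long[] a : newAssoc) {
            long oldEntry = a[0];
            int newGn = (int) a[1];
            Integer oldG = oldGn.get(oldEntry);
            if (oldG != null && newGn < oldG) removed.add(oldEntry);
            // if newGn == oldG the new entry simply joins the component,
            // so nothing is removed
        }
        return removed;
    }
}
```

<p>Only a strictly smaller new group number invalidates old associations; on equality the new entry just joins the existing component.</p>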
        <p>The transitive closure stage is executed only for new associations. The
resulting graph of the transitive closure is combined with the old graph, with invalid
associations excluded during the matching stage.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation Details</title>
      <p>
        Apache Spark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a popular open-source implementation of Spark. It provides
programmers with an application programming interface centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of
data items distributed over a cluster of machines that is maintained in a
fault-tolerant way [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It was developed in response to limitations of the MapReduce
cluster computing paradigm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which forces a particular linear dataflow
structure on distributed programs: MapReduce programs read input data from disk,
map a function across the data, reduce the results of the map, and store the
reduction results on disk. Spark's RDDs function as a working set for distributed
programs that offers a (deliberately) restricted form of distributed shared
memory [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The latest Spark program interface, DataFrame [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], seems to be more
efficient than the RDD interface, but in the current work we use RDDs and
plan to use DataFrames in future research.
      </p>
      <p>We use Java 8 and Apache Spark 1.6.1 to implement the full and
partial association algorithms. We use an RDD of type Tuple5&lt;Long, Long, Long, Long,
Long&gt; for the table structure representation; the sequence of types in Tuple5
corresponds to the table fields ID1, ID2, #1, #2, #3. We attach a unique identifier
(Long) to the Tuple5 of each entry.</p>
      <p>After the association stage we have an RDD of Tuple2&lt;Long, Long&gt; that
represents an association between the unique identifiers of two table entries.</p>
      <sec id="sec-3-1">
        <title>Synthetic Table Generator</title>
        <p>The synthetic table generator creates distributed random tables and works as follows.
First, two identical tables of the required size are generated. Each field value is
a uniformly distributed random integer number in the following intervals:</p>
        <list list-type="bullet">
          <list-item><p>ID1, ID2 &#x2013; [0; 10000),</p></list-item>
          <list-item><p>#1 &#x2013; [0; 1000000),</p></list-item>
          <list-item><p>#3 &#x2013; [0; 1000).</p></list-item>
        </list>
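<p>The interval scheme above can be sketched as a tiny sequential generator (the class name and seeding are our illustrative choices; the real generator is distributed):</p>

```java
import java.util.Random;

// Illustrative sketch of the generator's field intervals: each original entry
// draws ID1, ID2 in [0, 10000), #1 in [0, 1000000) and #3 in [0, 1000);
// #2 is the entry's position number.
class TableGen {
    static long[][] generate(int m, long seed) {
        Random rnd = new Random(seed);
        long[][] table = new long[m][5];
        for (int pos = 0; pos < m; pos++) {
            table[pos][0] = rnd.nextInt(10000);    // ID1
            table[pos][1] = rnd.nextInt(10000);    // ID2
            table[pos][2] = rnd.nextInt(1000000);  // #1
            table[pos][3] = pos;                   // #2: position number
            table[pos][4] = rnd.nextInt(1000);     // #3
        }
        return table;
    }
}
```

<p>Generating the two tables from the same seed yields identical copies, after which the second table is randomly modified as described below.</p>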
        <p>The value of the #2 field is the position number of the entry.</p>
        <p>We randomly modify second-table entries in order to create the possibility of
association between entries of the first and the second tables for each rule. We
modify the entry fields that are marked with the `&#x2013;' symbol in Table 1. The distribution
of modifications over the rules is shown in Table 2. 72% of the second-table entries
remain unchanged. In 2% of the table entries there are random modifications of
the #3 field values. In 6% of the table entries there are random modifications of
the #2 field values, and so on. As a result, 72% of the table entries correspond to
the first rule, 2% to the second rule, 6% to the third rule, and so on.</p>
        <p>We generate the augmented tables as follows. First, we add new entries to the
first table; each field value of a new entry is a uniformly distributed random integer
number in the following intervals:</p>
        <list list-type="bullet">
          <list-item><p>ID1, ID2 &#x2013; [0; 10000),</p></list-item>
          <list-item><p>#1 &#x2013; [0; 1000000),</p></list-item>
          <list-item><p>#3 &#x2013; [1000; 2000).</p></list-item>
        </list>
        <p>The value of the #2 field is the position number of the new entry in the whole
table. As can be seen, the old table and the new table have different values of
the #3 field.</p>
        <p>Second, we copy the augmented part of the first table to the second table and
randomly modify it as described in Table 2.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Performance Evaluation</title>
      <p>
        All presented results are obtained on the Angara-K1 cluster. We use 12 out of 36
nodes. All Angara-K1 nodes are linked by the Angara interconnect. The Russian
high-speed Angara interconnect is developed at NICEVT; a performance evaluation of
the Angara-K1 cluster with the Angara interconnect on scientific workloads is
presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In the current work we run Apache Spark through the TCP/IP
interface on the Angara interconnect. Table 3 provides an architecture and
software overview of the Angara-K1 partition.
      </p>
      <p>Figures 3, 4, 5 and 6 show the comparison results of the full and partial
association algorithms. The reported running times do not include reading input data
and writing the result to the filesystem. Table 4 presents the total table sizes in
GB during execution for different entry numbers.
The algorithm running times are shown in Figure 3; we use 8 cores per node
and 8 nodes of the cluster, and the table size is varied. The old table size is 300 million
entries, the new table size is 75 million entries. The figure shows that the performance
difference between the algorithms grows with the table size.</p>
      <p>Strong scaling is shown in Figure 4. The old table size is 100 million entries, the new
table size is 25 million entries. The speedup of the full and partial association
algorithms is approximately 3 on 8 nodes. A likely reason for the moderate
performance is that the Spark configuration is not optimal; further
tuning can address the problem. The horizontal line from 8 to 12 nodes indicates
that the table size is too small for further performance increase.</p>
      <p>Profiling results are shown in Figure 5. The shaded color denotes the association
matching stage (stage #1), the normal color denotes the transitive closure stage
(stage #2). The old table size is 300 million entries, the new table size is 50 million
entries. It can be seen that the partial algorithm optimizes primarily the transitive
closure stage.</p>
      <p>The running time on 4 nodes is more than two times smaller than on 2 nodes,
because the problem size is too large for 2 nodes and the garbage collector occupies
a significant part of the time.</p>
      <p>The dependence of the algorithm running times on the amount of new data is
shown in Figure 6. We use 6 nodes and 300 million entries in each table; the
fraction of new table entries varies from 12.5 to 100 percent of the total table
size. The smaller the percentage of new data, the faster the partial association
algorithm executes. The running time of the full association algorithm does
not change, because the total amount of data does not change.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we propose the partial association algorithm for the table
association problem of two dynamically enlarging tables with specific constraints.
For the sake of simplicity we consider a particular case of the problem. We
implement the base full association algorithm and the proposed algorithm
using Apache Spark and present a performance evaluation of the algorithms on a
cluster equipped with the Angara interconnect. The performance of the proposed
algorithm exceeds the performance of the full association algorithm for a variety of
data sets.</p>
      <p>In future work we plan to perform detailed profiling of the implemented algorithms
in terms of Apache Spark internal operations and to optimize Apache Spark for
the Angara interconnect.</p>
      <p>Acknowledgments. The work was supported by the grant No. 17-07-01592A
of the Russian Foundation for Basic Research (RFBR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Apache Spark Homepage, http://spark.apache.org/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Mellanox SparkRDMA, https://github.com/Mellanox/SparkRDMA</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Abiteboul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hull</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vianu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (eds.):
          <article-title>Foundations of Databases: The Logical Level</article-title>
          . Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edn. (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Agarkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ismagilov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makagon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semenov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Performance evaluation of the Angara interconnect</article-title>
          .
          <source>In: Proceedings of the International Conference Russian Supercomputing Days</source>
          . pp.
          <fpage>626</fpage>
          &#x2013;
          <lpage>639</lpage>
          (
          <year>2016</year>
          ), http://www.dislab.org/docs/rsd2016-angara-bench.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Armbrust</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lian</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaftan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghodsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Spark SQL: Relational data processing in Spark</article-title>
          .
          <source>In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <fpage>1383</fpage>
          &#x2013;
          <lpage>1394</lpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chaimov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malony</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iancu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibrahim</surname>
            ,
            <given-names>K.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Scaling Spark on HPC systems</source>
          . pp.
          <fpage>97</fpage>
          &#x2013;
          <lpage>110</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>MapReduce: Simplified data processing on large clusters</article-title>
          .
          <source>In: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation - Volume 6. OSDI'04</source>
          ,
          <string-name>
            <given-names>USENIX</given-names>
            <surname>Association</surname>
          </string-name>
          , Berkeley, CA, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gugnani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panda</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>High-performance design of Apache Spark with RDMA and its benefits on various workloads</article-title>
          (
          <year>December 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>M.W.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Islam</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panda</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          :
          <article-title>Accelerating Spark with RDMA for big data processing: Early experiences</article-title>
          .
          <source>In: Proceedings of the 2014 IEEE 22Nd Annual Symposium on High-Performance Interconnects</source>
          . pp.
          <fpage>9</fpage>
          &#x2013;
          <lpage>16</lpage>
          . HOTI '14, IEEE Computer Society, Washington, DC, USA (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dongarra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Exascale computing and Big Data: The next frontier</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <issue>7</issue>
          ),
          <fpage>56</fpage>
          &#x2013;
          <lpage>68</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCauley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing</article-title>
          .
          <source>In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation</source>
          .
          <source>NSDI'12</source>
          ,
          <string-name>
            <given-names>USENIX</given-names>
            <surname>Association</surname>
          </string-name>
          , Berkeley, CA, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Spark: Cluster computing with working sets</article-title>
          .
          <source>HotCloud 10</source>
          ,
          <issue>7</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>