An Evaluation of Large Set Intersection Techniques on GPUs

Christos Bellas, Anastasios Gounaris
Aristotle University of Thessaloniki, Thessaloniki, Greece
(chribell,gounaria)@csd.auth.gr

ABSTRACT
In this paper, we focus on large set intersection, which is a pivotal operation in information retrieval, graph analytics and database systems. We aim to experimentally detect under which conditions using a graphics processing unit (GPU) is beneficial over CPU techniques, and which exact techniques are capable of yielding improvements. We cover and adapt techniques initially proposed for graph analytics, while we investigate new hybrids for completeness. Finally, we present a comprehensive evaluation highlighting the main characteristics of the techniques examined, both when a single pair of two large sets is processed and when all pairs in a dataset are examined.

1 INTRODUCTION
Set intersection is an essential operation in numerous application domains, such as information retrieval for text search engines using an inverted index [2, 5], graph analytics for triangle counting and community detection [11, 17, 19], and database systems for merging RID-lists [15] and performing fast (potentially bitwise) operations on data in columnar format [9].

Modern GPUs offer a high-level parallel environment at low cost. As a result, over the past years, there has been considerable research work on improving graph analytics on a GPU, mostly in the context of graph triangle counting, where set intersection dominates the running time [4, 7, 8, 10, 13, 18]. The majority of these studies focus on improving the level of parallelism by reducing redundant comparisons and distributing the workload evenly among GPU threads. Set intersection on GPUs has also been examined in the context of set similarity joins [3].

In this work, we compare and evaluate state-of-the-art GPU techniques for set intersection when the sets are large, i.e., they contain millions of elements. More specifically, we gather together techniques belonging to five different rationales, namely intersect path, optimized binary search, hash, bitmap and set similarity join-based solutions, while we include novel promising hybrid solutions that combine existing techniques to better fit our setting.

We perform experiments both when the intersection of one pair of sets is computed (single-instance problem) and when the intersections among all set pairs in a database are computed (multi-instance problem). We derive useful insights; more specifically, we provide experimental evidence of the superiority of two intersect path flavors for the single-instance problem, and of either bitmap-based or similarity join-based solutions for the multi-instance problem, depending on the size of the sets.[1]

The rest of the paper is organized as follows. Section 2 gives background on the problem and an overview of the related work. Section 3 describes the evaluated GPU techniques for set intersection. In Section 4, we present the experiments and discuss the results. Last, in Section 5 we conclude and discuss possible future work.

[1] The source code is available at https://github.com/chribell/set_intersection

Β© Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 PRELIMINARIES
In this section, we first give a formal notation for the set intersection problem. Next, we give an overview of the related work on set intersection, after explaining the exact scope of this paper.

2.1 Set Intersection Problem Description
Single-instance set intersection (SISI) problem: given i) a finite universe of elements E, and ii) a set A = {e_1^A, . . . , e_n^A} of size n and a set B = {e_1^B, . . . , e_m^B} of size m, with all elements drawn from E, the set intersection A ∩ B produces a new set S containing all the elements common to A and B.

Multi-instance set intersection (MISI) problem: in the multi-instance set intersection problem, we are given a collection of k sets C = {S_1, . . . , S_k}, and we aim to find the set intersection among all k(kβˆ’1)/2 pairs (S_i, S_j), where 0 < i < j ≀ k.

Obviously, the solutions of SISI are of Ξ©(n + m) complexity, whereas MISI solutions are of quadratic complexity in k, without allowing for any pair pruning as, for example, in problems such as set similarity joins [3]. Therefore, the focus is not on evaluating the behavior of algorithms differing in asymptotic complexity, since all techniques are of similar complexity; rather, we aim to assess the impact of different, potentially low-level, engineering techniques when n and m are on the order of millions and the size of E is orders of magnitude larger.

2.2 Scope of our work
Over the past years, there has been a lot of research that encapsulates GPU techniques for the set intersection problem, as will be discussed shortly. In the majority of cases, the problem of set intersection is tackled in the context of triangle counting to accelerate graph analytics. Triangle counting is a special case of set intersection, which stems from the need to quickly find intersection counts among vertex adjacency lists of small lengths. As a result, the proposed techniques are tailored to specific algorithmic optimizations. In the context of this work, we extract, adapt and evaluate these techniques under a more demanding large set intersection scenario, also performing a comparison against all proposed methodologies that can address the SISI and MISI problems for large sets.

2.3 Overview of existing solutions
We present the relevant techniques mostly in chronological order. Ding et al. [6] were the first to implement a parallel GPU set intersection algorithm. In their proposed solution, they use the so-called Parallel Merge Find (PMF) algorithm to compute a set intersection. In essence, given two sets, the shorter one is partitioned into disjoint segments, with each segment assigned to a different GPU thread. Then, the last element of each segment is searched in the longer set to find the corresponding or closest positions for the partitioning of the longer set. As a result, each GPU thread becomes capable of computing its own intersection among segments in parallel.

In [20], Wu et al. highlight the inefficiency of the approach followed in [6] for sets of smaller size. Subsequently, they propose a GPU technique for set intersection that takes advantage of the fast on-chip shared memory. First, an empty array of size equal to the size of the shorter set is allocated in shared memory. Next, each GPU thread is assigned a specific number of elements from the shorter set and conducts binary searches in the longer set. If an element is found, its corresponding cell in the shared memory array is set to 1. Finally, a scan operation on the shared memory array is performed along with a stream compaction procedure, so that threads produce the final intersection. In [21], Wu et al. extend their original work described above by introducing heuristic strategies to balance the workload across thread blocks more efficiently. Additionally, the proposal in [1] further extends the original work in [21] by introducing a linear regression approach to reduce the search range of a binary search, and a hash segmentation approach as an alternative to binary search.
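To illustrate the rationale of [20], the following CUDA sketch shows the flagging step only: each thread binary-searches its share of the shorter set in the longer one and raises a flag in shared memory. This is a minimal sketch under the assumption that the shorter set fits in a single block's shared memory; the kernel name and the final write-back are ours, and the scan plus stream compaction of the original proposal are omitted.

```cuda
// Minimal sketch of the flagging step of [20] (names and layout are ours).
// Assumes the shorter set fits into one block's dynamic shared memory.
__device__ int binarySearch(const int* data, int len, int key) {
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >> 1;
        if (data[mid] == key) return mid;
        if (data[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

__global__ void flagMatches(const int* shorter, int n,
                            const int* longer, int m, int* outFlags) {
    extern __shared__ int flags[];  // one flag per element of the shorter set
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        flags[i] = (binarySearch(longer, m, shorter[i]) >= 0) ? 1 : 0;
    __syncthreads();
    // In [20], a scan and stream compaction over the flags produce the final
    // intersection; here the flags are simply written back for clarity.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        outFlags[i] = flags[i];
}
```

A launch such as flagMatches<<<1, 256, n * sizeof(int)>>>(dShort, n, dLong, m, dFlags) reflects the single-execution-unit setting that, as discussed later, underutilizes the GPU when the sets contain millions of elements.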
In [8], Green et al. present the Intersect Path (IP) algorithm for set intersection, which is a variation of the well-established GPU Merge Path algorithm for merging lists. In addition, IP can be considered an extension of PMF, since it encapsulates a similar set partition logic. First, the input sets are partitioned into segments that can be intersected independently by multiple thread blocks. Second, within each thread block, the workload is distributed in such a way that a near-equal number of elements to intersect is allocated to each thread. In [7], the authors extend the work of [8] by proposing an adaptive load balancing technique that dynamically assigns work to GPU threads based on work estimations.

The authors of [14] and [18] follow a similar approach to set intersection. More specifically, their main concept is, for each GPU thread, to sequentially compute a set intersection by using a two-pointer merge algorithm (a minimal sketch of this per-thread logic is given at the end of this subsection). In the context of triangle counting, such an approach is applicable, since the requirement is (i) for each GPU thread to compute a two-set intersection count independently, and (ii) for these partial counts to be accumulated into the final global count. However, in the setting of set intersection over two large sets, their solution is inferior and similar to a sequential CPU approach.

In [4], Bisson et al. propose a different approach, namely, GPU set intersection based on bitmaps and atomic operations. Given two sets, to compute their intersection, the GPU threads create the bitmap representation of the first set in parallel and then iterate over the elements of the second set to search for the corresponding set bits. Based on the average set size, the workload allocation is per thread, per warp or per block.

In [10], Hu et al. demonstrate that set intersection is faster when employing efficient binary search-based techniques rather than IP-based techniques, arguing that the latter suffer from the nontrivial overhead of partitioning the input sets and from non-coalesced memory accesses. Accordingly, their proposed algorithm optimizes binary search at warp level to achieve coalesced memory access and to alleviate the need for workload balancing. In addition, by caching the first levels of the binary search tree they employ in shared memory, they achieve additional speedup gains.

In [13], Pandey et al. propose a hash-based technique for set intersection. In brief, the algorithm first hashes the shorter set into buckets, and then iterates over the larger set and hashes each element to the corresponding bucket. Afterwards, a linear search is conducted within each bucket to find the intersections.

Last, in the context of set similarity join, the authors of [16] propose a technique for conducting set intersections by using a static inverted index and atomic operations. By setting the similarity degree equal to zero, their solution can be used to solve the SISI and MISI problems.

As already mentioned, some of these works, such as [4, 8, 10, 14, 18], are more complete proposals targeting the triangle counting problem; here we narrow down our attention only to the part that is relevant to the SISI and MISI problems.
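For concreteness, the per-thread rationale of [14, 18] mentioned above boils down to a sequential two-pointer merge over two sorted lists. The following device function is our minimal sketch of that logic, not the authors' code.

```cuda
// Minimal sketch of a per-thread two-pointer merge count over two sorted
// lists, in the spirit of [14, 18]; each thread calls it for its own pair.
__device__ int mergeCount(const int* a, int aLen, const int* b, int bLen) {
    int i = 0, j = 0, count = 0;
    while (i < aLen && j < bLen) {
        if (a[i] < b[j])      ++i;       // advance in a
        else if (a[i] > b[j]) ++j;       // advance in b
        else { ++count; ++i; ++j; }      // common element found
    }
    return count;
}
```

Executed by a single thread over two sets with millions of elements, this is essentially a sequential CPU algorithm, which is why such techniques are excluded from our evaluation in the next section.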
3 TECHNIQUES
In the previous section, we have identified five different methodologies, based on (1) IP, (2) optimized binary search, (3) bitmap operations, (4) hashing and (5) set similarity joins, respectively. Here, we present five state-of-the-art GPU techniques for set intersection, one for each methodology, in more detail. In addition, we discuss some modifications to these techniques to target large set intersection, yielding two novel hybrid solutions. We do not consider works such as [14, 18] that mostly rely on preprocessing of the input data and conduct the intersection with a simple merge-fashion algorithm. In contrast, we evaluate techniques that are more adaptable to a general set intersection scenario. We give a concise presentation of each evaluated technique below.

Intersect Path (IP) [8]. Given two ordered sets A and B, IP considers the traversal of a grid, denoted as the Intersect Matrix, of size |A| Γ— |B|. Beginning from the top left corner of the grid, the path can move downwards if A[i] > B[j], to the right if A[i] < B[j], and diagonally if A[i] = B[j], until it eventually reaches the bottom right corner. There are two partitioning stages, one at kernel grid level and the other at block level. At grid level, equidistant cross diagonals are placed on the Intersect Matrix. The number of diagonals is equal to the number of thread blocks plus one, in order to delimit the boundaries of each block. Using binary search, the point of intersection between a cross diagonal and the path is found. As a result, each thread block is assigned to intersect disjoint subsets of the input. In case a cross diagonal intersects with the path inside a matrix cell, a count augmentation is required beforehand, since this intersection is not assigned to any thread block. An example of this scenario is shown in Figure 1. At block level, the same cross-diagonal approach is applied in order to distribute the workload among threads. Each thread may conduct a serial or binary search-based intersection.

Figure 1: Intersect Path example using 4 thread blocks.
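The grid-level partitioning relies on the cross-diagonal binary search of Merge Path. The sketch below is our illustration of that step, assuming both sets are sorted in ascending order; the function name and the convention that the return value counts elements of A consumed before the diagonal are ours.

```cuda
// Minimal sketch of the cross-diagonal split used for grid-level partitioning
// (our illustration of the Merge Path-style search that IP builds on).
// For cross diagonal d (0 <= d <= |A|+|B|), return how many elements of A
// lie before the point where the path crosses d.
__device__ int diagonalSplit(const int* A, int aLen,
                             const int* B, int bLen, int diag) {
    int lo = max(0, diag - bLen);  // smallest feasible A-offset on this diagonal
    int hi = min(diag, aLen);      // largest feasible A-offset on this diagonal
    while (lo < hi) {
        int mid = (lo + hi) >> 1;
        if (A[mid] < B[diag - 1 - mid]) lo = mid + 1;  // path crosses further along A
        else                            hi = mid;
    }
    return lo;  // the block handles A[lo ...) paired with B[diag - lo ...)
}
```

Thread block b would compute its bounding diagonals as b Β· (|A| + |B|) / gridDim.x and (b + 1) Β· (|A| + |B|) / gridDim.x, call diagonalSplit for each, and then intersect its two disjoint subranges independently; the same logic is reused at block level to split work among threads. The count augmentation case of Figure 1 (a diagonal crossing inside a matrix cell) is not handled in this sketch.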
Optimized Binary Search (OBS) [10]. Given two ordered sets A and B, OBS caches the first levels of the binary search tree of the larger set in shared memory to reduce expensive global memory reads. For example, as shown in Figure 2, the higher-level nodes of the binary tree reside in shared memory, whereas the leaves are located in global memory. The smaller set, which is B in the example, is used for lookups and, as a result, there are |B| binary search lookups in total. Each thread is assigned a lookup and, in each iteration, up to 32 (i.e., the size of a warp) lookups are executed simultaneously. For the multi-instance problem, a simple optimization is to cache each set in shared memory one at a time, and then iterate over every subsequent one to perform the intersection for every pair. This requires the sets to be sorted by their size.

Figure 2: Optimized Binary Search example.

Hash-based Intersection (HI) [13]. Given two sets A and B, HI first hashes the shorter set and constructs buckets in parallel, and then iterates over and hashes every element of the larger set into the corresponding bucket, as already described in Section 2. Hashing the smaller set first is preferred in order to reduce the number of collisions. Buckets are statically allocated once in linear global memory space. The size of each bucket, i.e., the number of entries from the shorter set in the bucket, is stored in shared memory. Thus, to ensure correctness for the MISI problem, we only need to clear the bucket sizes in shared memory and let the hashing of every next short set overwrite the previous one.

Bitmap-based Intersection (BI). Given two sets A and B, BI conducts the set intersection on their bitmap representations, with each bitmap requiring |E|/8 bytes of memory regardless of the set sizes. More specifically, there are two flavors of BI, namely (i) naive, where all bitmaps fit in global memory, and (ii) dynamic, where bitmaps are built on the fly in case we cannot store all of them in the available global memory. The latter is similar to the work of [4]. Bitmaps are constructed in parallel using the atomicOr function. In the naive scenario, each GPU thread fetches two bitmap words, one from each set, and conducts a popcount operation on the resulting logical AND word to compute the subset intersection. An example of the naive scenario is shown in Figure 3. In the dynamic scenario, each GPU thread block first constructs the bitmap representation of the current set, and then, for every next set, the threads iterate over its elements and check whether the respective bits are set (see the example in Figure 4).

Figure 3: Naive Bitmap-based Intersection example.

Figure 4: Dynamic Bitmap-based Intersection example.
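The two kernels below sketch the naive BI flavor with 32-bit bitmap words; they are our minimal illustration (kernel names and the global atomic accumulation are ours), not the exact implementation.

```cuda
// Minimal sketch of BI-naive with 32-bit words; a bitmap has ceil(|E|/32) words.
__global__ void buildBitmap(const int* set, int n, unsigned int* bitmap) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int e = set[i];
        atomicOr(&bitmap[e >> 5], 1u << (e & 31));  // set bit e
    }
}

__global__ void bitmapIntersectionCount(const unsigned int* bmA,
                                        const unsigned int* bmB,
                                        int words, unsigned long long* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < words) {
        int c = __popc(bmA[i] & bmB[i]);  // common elements within this word
        if (c) atomicAdd(count, (unsigned long long)c);
    }
}
```

In practice, one would first reduce the per-thread counts within each block (e.g., in shared memory or with warp shuffles) and issue a single atomicAdd per block; the sketch keeps the accumulation as simple as possible.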
Static-index Intersection (sf-gssjoin) [16]. Given k sets, sf-gssjoin first constructs a static inverted index over all set elements. For each set, the inverted lists of all its elements are concatenated into a logical vector. Next, this vector is partitioned evenly, with each partition assigned to a specific GPU thread. Thus, each thread processes its corresponding partition independently and contributes to the intersections of the current set with the next ones. To ensure correctness, threads increment the intersection counts by using atomic operations. An example of sf-gssjoin is shown in Figure 5.

Figure 5: Static-index Intersection example using 4 GPU threads (Adapted from [3]).

All of the presented techniques, with the exception of BI-naive and IP, conduct a single-instance set intersection on a single block or warp. This results in severe GPU underutilization, especially for sets of size in the order of millions, since a single execution unit is assigned the complete workload. To tackle this issue, we investigate the integration of the kernel grid-level partitioning of IP with OBS and HI, which has not been proposed previously in the literature. We denote these novel hybrid techniques as IP-OBS and IP-HI, respectively. For the single-instance scenario, in BI-dynamic, we modify the algorithm by constructing the bitmap of the first set across multiple blocks and then partitioning the second set evenly into disjoint subsets, with each one assigned to a specific block.

4 EVALUATION
In this section, we evaluate the techniques of the previous section, while, in our experiments, we also include CPU variants for completeness. More specifically, we compare the GPU techniques against three CPU alternatives, namely, (i) SIMD, which performs intersection on sorted integers using SIMD instructions [12], (ii) std::set_intersection, which uses the C++ standard library, and (iii) boost::dynamic_bitset, which uses the Boost library in a similar fashion to BI-dynamic. We measure the total intersection time for the CPU techniques using the std::chrono library. For the GPU operations, we measure the total time, including transfers and memory allocations, by using the CUDA event API. We do not measure any preprocessing time. The experiments were conducted on a machine with an Intel i7-5820K clocked at 3.3 GHz, 32 GB RAM at 2400 MHz and an NVIDIA Titan Xp on CUDA 11.0. This GPU has 30 Streaming Multiprocessors, with a total of 3840 cores, 12 GB of global memory and a 384-bit memory bus width.
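As a reference for how the GPU timings are obtained, the following is a minimal CUDA event timing sketch in the spirit described above; the transfers and kernel launches inside the timed region are placeholders for whichever technique is being measured.

```cuda
#include <cuda_runtime.h>

// Minimal sketch of timing a GPU run (allocations, transfers and kernels)
// with CUDA events; returns the elapsed time in milliseconds.
float timeGpuRun() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... cudaMalloc / cudaMemcpy of the input sets and the intersection
    // kernel launches of the measured technique go here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```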
We experiment with artificial datasets, where set elements follow the normal, uniform and zipf-like distributions. To distinguish the datasets, we use the notation Distribution-Element universe-Average set size. We vary the element universe from 10^8 up to 10^9. Respectively, we vary the average set size from 10^6 up to 10^7. For the multi-instance scenario, we also experiment with three real-world datasets previously employed in [3]. More specifically, from each sorted dataset we extract the last k = 10000 sets, i.e., the largest sets. Table 1 gives an overview of the real-world datasets' characteristics.

Table 1: Real world dataset characteristics.
Dataset   Cardinality   Avg set size   Element universe size
ENRON     1.0 Β· 10^4    794            1.1 Β· 10^6
ORKUT     1.0 Β· 10^4    1001           8.7 Β· 10^6
TWITTER   1.0 Β· 10^4    150            3.7 Β· 10^4

4.1 SISI experiments
We examine the single-instance scenario using twelve artificial datasets (combinations of 3 distributions, 2 element universe sizes and 2 set sizes). We have also experimented with several block sizes, but due to lack of space, we present the results of the best configuration for each technique.

As shown in Figure 6, IP and IP-OBS are the clear winners in the SISI experiments, in the sense that one of them is the best overall performing technique across every dataset. Moreover, the difference between their performance is relatively small: up to 12% for the normal and zipf distributions, and up to 35% for the uniform distribution. Overall, these techniques achieve an average 2.3X speedup compared to the best performing CPU technique; the best performing CPU technique differs between the cases, but on average, SIMD is the most robust. In general, the speedup is similar for all distribution types and increases with higher universe and average set size, reaching 2.92X. The speedup of IP or IP-OBS over the worst performing CPU or GPU technique exceeds an order of magnitude; we have observed speedups of up to 48.67X over a CPU technique and up to 10.97X over a GPU technique.

In addition, IP-HI exhibits the worst performance among the GPU techniques and seems unable to perform better than the CPU in many settings; this shows that simply relying on the IP workload allocation rationale is not adequate. Finally, BI-based techniques behave better for smaller universe sizes.

Figure 6: Single-instance experiments on artificial datasets.

4.2 MISI experiments
For the multi-instance evaluation, we conduct two sets of experiments. In the first one, we use artificially generated datasets consisting of large sets that follow the zipf distribution, which is the closest to real-world data. In the second one, we use real-world datasets and evaluate the intersection techniques on smaller set sizes.

As shown in Figure 7, in both artificial datasets with k = 1000, BI-dynamic is the best performing technique and achieves an average 35X speedup compared to the best performing CPU technique. Furthermore, sf-gssjoin is the second best performing technique when it manages to launch, i.e., when the static index fits in global memory, which is not the case when the element universe is 10^9 in our experiments. On the other hand, IP-OBS is more robust and its performance is the closest to BI-dynamic across both datasets. In addition, we observe an 80% performance increase for the hybrid IP-OBS technique over the standalone OBS. Last, even though IP and IP-HI are the worst performing GPU techniques, they manage to achieve an average 5.7X speedup compared to the best performing CPU technique.

Figure 7: Multi-Instance experiments on artificial datasets with k = 1000.

As shown in Figure 8, for the real-world datasets with k = 10000, sf-gssjoin is the most efficient technique across every dataset and achieves an average 147X speedup over the best performing CPU technique. Moreover, the small average set size leads to small static indices, which yields an optimal workload distribution among GPU threads and, overall, results in better GPU utilization. In contrast, BI-dynamic, OBS and HI conduct set intersection at block level. Thus, there is an inherent workload imbalance among GPU thread blocks. We note that IP, and consequently its two workload allocation rationales, IP-OBS and IP-HI, cannot always execute due to memory constraints. More specifically, the memory required to store the diagonals for the grid-level partitioning of IP is 2 Γ— k(kβˆ’1)/2 Γ— (blocks + 1). As a result, there is an upper bound on the number of blocks to comply with the global memory restriction. However, due to the small set sizes, we consider IP partitioning an excessive approach to conduct set intersection on these datasets.

Figure 8: Multi-Instance experiments on real world datasets with k = 10000.

Based on our experimental evaluation, we conclude that for large MISI problems, BI-dynamic and IP-OBS are preferable. On the other hand, when dealing with multiple instances of small set intersections, the grid-level partitioning of IP adds overhead and results in severe GPU underutilization. In such cases, standalone OBS and HI perform better, but are surpassed by sf-gssjoin, which achieves optimal GPU utilization.

5 CONCLUSION
In this work, we adapt and evaluate GPU set intersection techniques that were previously applied to graph analytics. Although these techniques are better suited for intersecting sets of smaller size, we experimentally show that, through certain enhancements, they can be easily adapted for large set intersection. We explain which are the best approaches in the single- and multi-instance cases, and we introduce a novel hybrid, namely, the combination of IP with OBS, which proves to be a dominant solution for the former case and competitive in the latter. Also, employing bitmap-based solutions pays off in the multi-instance case. Finally, if the set sizes are relatively smaller, multi-instance set intersection stands to benefit from adapting a set similarity join technique.

Acknowledgments. The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1154). In addition, the authors gratefully acknowledge the support of NVIDIA through the donation of the GPU (GPU Grant Program).
REFERENCES
[1] Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin. 2011. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment 4, 8 (2011), 470–481.
[2] JΓ©rΓ©my Barbay, Alejandro LΓ³pez-Ortiz, and Tyler Lu. 2006. Faster adaptive set intersections for text searching. In International Workshop on Experimental and Efficient Algorithms. Springer, 146–157.
[3] Christos Bellas and Anastasios Gounaris. 2020. An empirical evaluation of exact set similarity join techniques using GPUs. Inf. Syst. 89 (2020), 101485.
[4] Mauro Bisson and Massimiliano Fatica. 2017. High performance exact triangle counting on GPUs. IEEE Transactions on Parallel and Distributed Systems 28, 12 (2017), 3501–3510.
[5] Erik D. Demaine, Alejandro LΓ³pez-Ortiz, and J. Ian Munro. 2001. Experiments on adaptive set intersections for text retrieval systems. In Workshop on Algorithm Engineering and Experimentation. Springer, 91–104.
[6] Shuai Ding, Jinru He, Hao Yan, and Torsten Suel. 2009. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web. 421–430.
[7] James Fox, Oded Green, Kasimir Gabert, Xiaojing An, and David A. Bader. 2018. Fast and adaptive list intersections on the GPU. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[8] Oded Green, Pavan Yalamanchili, and LluΓ­s-Miquel MunguΓ­a. 2014. Fast triangle counting on the GPU. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms. 1–8.
[9] Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.
[10] Yang Hu, Hang Liu, and H. Howie Huang. 2018. TriCore: Parallel triangle counting on GPUs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 171–182.
[11] Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. 2014. Querying k-truss community in large and dynamic graphs. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 1311–1322.
[12] Daniel Lemire, Leonid Boytsov, and Nathan Kurz. 2016. SIMD compression and the intersection of sorted integers. Software: Practice and Experience 46, 6 (2016), 723–749.
[13] Santosh Pandey, Xiaoye Sherry Li, Aydin Buluc, Jiejun Xu, and Hang Liu. 2019. H-INDEX: Hash-indexing for parallel triangle counting on GPUs. In 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[14] Adam Polak. 2016. Counting triangles in large graphs on GPU. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 740–746.
[15] Vijayshankar Raman, Lin Qiao, Wei Han, Inderpal Narang, Ying-Lin Chen, Kou-Horng Yang, and Fen-Ling Ling. 2007. Lazy, adaptive rid-list intersection, and its application to index anding. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 773–784.
[16] Sidney Ribeiro-Junior, Rafael David Quirino, Leonardo Andrade Ribeiro, and Wellington Santos Martins. 2017. Fast parallel set similarity joins on many-core architectures. Journal of Information and Data Management 8, 3 (2017), 255–255.
[17] Thomas Schank and Dorothea Wagner. 2005. Finding, counting and listing all triangles in large graphs, an experimental study. In International Workshop on Experimental and Efficient Algorithms. Springer, 606–609.
[18] Leyuan Wang, Yangzihao Wang, Carl Yang, and John D. Owens. 2016. A comparative study on exact triangle counting algorithms on the GPU. In Proceedings of the ACM Workshop on High Performance Graph Processing. 1–8.
[19] Nan Wang, Jingbo Zhang, Kian-Lee Tan, and Anthony K. H. Tung. 2010. On triangulation-based dense neighborhood graph discovery. Proceedings of the VLDB Endowment 4, 2 (2010), 58–68.
[20] Di Wu, Fan Zhang, Naiyong Ao, Fang Wang, Xiaoguang Liu, and Gang Wang. 2009. A batched GPU algorithm for set intersection. In 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks. IEEE, 752–756.
[21] Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, and Jing Liu. 2010. Efficient lists intersection by CPU-GPU cooperative computing. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW). IEEE, 1–8.