    An Evaluation of Large Set Intersection Techniques on GPUs
                                                        Christos Bellas, Anastasios Gounaris
                                                              Aristotle University of Thessaloniki
                                                                      Thessaloniki, Greece
                                                               (chribell,gounaria)@csd.auth.gr

ABSTRACT
In this paper, we focus on large set intersection, which is a pivotal operation in information retrieval, graph analytics and database systems. We aim to experimentally detect under which conditions using a graphics processing unit (GPU) is beneficial over CPU techniques, and which exact techniques are capable of yielding improvements. We cover and adapt techniques initially proposed for graph analytics, while we investigate new hybrids for completeness. Finally, we present a comprehensive evaluation highlighting the main characteristics of the techniques examined, both when a single pair of large sets is processed and when all pairs in a dataset are examined.
1 INTRODUCTION
Set intersection is an essential operation in numerous application domains, such as information retrieval for text search engines using an inverted index [2, 5], graph analytics for triangle counting and community detection [11, 17, 19], and database systems for merging RID-lists [15] and performing fast (potentially bitwise) operations on data in columnar format [9].

Modern GPUs offer a high-level parallel environment at low cost. As a result, over the past years, there has been considerable research work on improving graph analytics on GPUs, mostly in the context of graph triangle counting, where set intersection dominates the running time [4, 7, 8, 10, 13, 18]. The majority of these studies focus on improving the level of parallelism by reducing redundant comparisons and distributing the workload evenly among GPU threads. Set intersection on GPUs has also been examined in the context of set similarity joins [3].

In this work, we compare and evaluate state-of-the-art GPU techniques for set intersection when the sets are large, i.e., they contain millions of elements. More specifically, we gather together techniques belonging to five different rationales, namely intersect path, optimized binary search, hash, bitmap and set similarity join-based solutions, while we include novel promising hybrid solutions that combine existing techniques to better fit our setting.

We perform experiments both when the intersection of a single pair of sets is computed (single-instance problem) and when the intersections among all set pairs in a database are computed (multi-instance problem). We derive useful insights; more specifically, we provide experimental evidence about the superiority of two intersect path flavors for the single-instance problem and of either bitmap-based or similarity join-based solutions for the multi-instance problem, depending on the size of the sets.ΒΉ

The rest of the paper is organized as follows. Section 2 gives background on the problem and an overview of the related work. Section 3 describes the evaluated GPU techniques for set intersection. In Section 4, we present the experiments and discuss the results. Last, in Section 5 we conclude and discuss possible future work.
ΒΉ The source code is available at https://github.com/chribell/set_intersection

Β© Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 PRELIMINARIES
In this section, we first give a formal notation for the set intersection problem. Next, we give an overview of the related work about set intersection, after explaining the exact scope of this paper.

2.1 Set Intersection Problem Description
Single-instance set intersection (SISI) problem: given i) a finite universe of elements 𝐸, and ii) a set 𝐴 = {𝑒_1^𝐴, …, 𝑒_𝑛^𝐴} of size 𝑛 and a set 𝐡 = {𝑒_1^𝐡, …, 𝑒_π‘š^𝐡} of size π‘š, where every element of 𝐴 and 𝐡 belongs to 𝐸, the set intersection 𝐴 ∩ 𝐡 produces a new set 𝑆 containing all the common elements of 𝐴 and 𝐡.

Multi-instance set intersection (MISI) problem: in the multi-instance set intersection problem, we are given a collection of π‘˜ sets 𝐢 = {𝑆_1, …, 𝑆_π‘˜}, and we aim to find the set intersection of all π‘˜(π‘˜ βˆ’ 1)/2 pairs (𝑆_𝑖, 𝑆_𝑗), where 0 < 𝑖 < 𝑗 ≀ π‘˜.

Obviously, the solutions of SISI are of Ξ©(𝑛 + π‘š) complexity, whereas MISI solutions are of quadratic complexity in π‘˜ without allowing for any pair pruning, as, for example, in problems such as set similarity joins [3]. Therefore, the focus is not on evaluating the behavior of algorithms differing in asymptotic complexity, since all techniques are of similar complexity; rather, we aim to assess the impact of different, potentially low-level, engineering techniques when 𝑛 and π‘š are in the order of millions and the size of 𝐸 is orders of magnitude larger.
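To make the Ξ©(𝑛 + π‘š) baseline concrete, the following minimal sketch (ours, not taken from the paper or its repository) shows the standard sequential two-pointer intersection of two sorted sets, which inspects each element at most once; the GPU techniques discussed below parallelize this operation in different ways.

    #include <cstddef>
    #include <vector>

    // Sequential SISI baseline: merge-style intersection of two sorted sets.
    // Every element of A and B is inspected at most once, hence Theta(n + m) work.
    std::vector<int> intersect(const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> out;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])       ++i;                      // advance the smaller head
            else if (a[i] > b[j])  ++j;
            else { out.push_back(a[i]); ++i; ++j; }          // common element
        }
        return out;
    }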
2.2 Scope of our work
Over the past years, there has been a lot of research proposing GPU techniques for the set intersection problem, as will be discussed shortly. In the majority of cases, the problem of set intersection is tackled in the context of triangle counting to accelerate graph analytics. Triangle counting is a special case of set intersection, which stems from the need to quickly find intersection counts among vertex adjacency lists of small length. As a result, the techniques proposed are tailored to specific algorithmic optimizations. In the context of this work, we extract, adapt and evaluate these techniques under a more demanding large set intersection scenario, also performing a comparison against all proposed methodologies that can address the SISI and MISI problems for large sets.

2.3 Overview of existing solutions
We present the relevant techniques mostly in chronological order. Ding et al. [6] were the first to implement a parallel GPU set intersection algorithm. In their proposed solution, they use the so-called Parallel Merge Find (PMF) algorithm to compute a set intersection. In essence, given two sets, the shorter one is partitioned into disjoint segments, with each segment assigned to a different GPU thread. Then, the last element of each segment is searched in the longer set to find the corresponding or closest positions for the partitioning of the longer set. As a result, each GPU thread becomes capable of computing its own intersection among segments in parallel. In [20], Wu et al. highlight the inefficiency of the approach followed in [6] for sets of smaller size. Subsequently, they propose a GPU technique for set intersection that takes advantage of the fast on-chip shared memory. First, an empty array of size equal to the size of the shorter set is allocated in shared memory. Next, each GPU thread is assigned a specific number of elements from the shorter set and conducts binary searches in the longer set. If an element is found, its corresponding cell in the shared memory array is set to 1. Finally, a scan operation on the shared memory array is performed along with a stream compaction procedure, so that threads produce the final intersection. In [21], Wu et al. extend their original work described above by introducing heuristic strategies to balance the workload across thread blocks more efficiently. Additionally, the proposal in [1] further extends the original work in [21] by introducing a linear regression approach to reduce the search range of a binary search and a hash segmentation approach as an alternative to binary search.
In [8], Green et al. present the Intersect Path (IP) algorithm for set intersection, which is a variation of the well-established GPU Merge Path algorithm for merging lists. In addition, IP can be considered as an extension of PMF, since it encapsulates a similar set partition logic. First, input sets are partitioned into segments that can be intersected independently by multiple thread blocks. Second, for each thread block, the workload is distributed in such a way that a nearly equal number of elements to intersect is allocated to each thread. In [7], the authors extend the work of [8] by proposing an adaptive load balancing technique and dynamically assign work to GPU threads based on work estimations. The authors of [14] and [18] follow a similar approach to set intersection. More specifically, their main concept is, for each GPU thread, to sequentially compute a set intersection by using a two-pointer merge algorithm. In the context of triangle counting, such an approach is applicable since the requirement is (i) for each GPU thread to compute a two-set intersection count independently, and (ii) for these partial counts to be accumulated to compute the final global count. However, in the setting of set intersection over two large sets, their solution is inferior and similar to a sequential CPU approach.

In [4], Bisson et al. propose a different approach, namely, GPU set intersection based on bitmaps and atomic operations. Given two sets, to compute their intersection, the GPU threads create the bitmap representation of the first set in parallel and then iterate over the elements of the second set to search for the corresponding set bits. Based on the average set size, the workload allocation is per thread, per warp or per block.

In [10], Hu et al. demonstrate that set intersection is faster employing efficient binary search-based techniques than IP-based techniques, arguing that the latter suffer from the nontrivial overhead of partitioning the input sets and from non-coalesced memory accesses. On the other hand, their proposed algorithm optimizes binary search at a warp level to achieve coalesced memory access and to alleviate the need for workload balancing. In addition, by caching the first levels of the binary search tree they employ in shared memory, they can achieve additional speedup gains.

In [13], Pandey et al. propose a hash-based technique for set intersection. In brief, the algorithm first hashes the shorter set into buckets, and then iterates over the larger set and hashes each element to the corresponding bucket. Afterwards, a linear search is conducted within each bucket to find the intersections.

Last, in the context of set similarity join, the authors of [16] propose a technique for conducting set intersections by using a static inverted index and atomic operations. By setting the similarity degree equal to zero, their solution can be used to solve the SISI and MISI problems.

As already mentioned, some of these works, such as [4, 8, 10, 14, 18], are more complete proposals targeting the triangle counting problem; here we narrow down our attention only to the part that is relevant to the SISI and MISI problems.
3 TECHNIQUES
In the previous section, we have identified five different methodologies, based on (1) IP, (2) optimized binary search, (3) bitmap operations, (4) hashing and (5) set similarity joins, respectively. Here, we present five state-of-the-art GPU techniques for set intersection, one for each methodology, in more detail. In addition, we discuss some modifications to these techniques to target large set intersection, yielding two novel hybrid solutions. We do not consider works such as [14, 18] that mostly rely on the preprocessing of input data and conduct the intersection with a simple merge-style algorithm. In contrast, we evaluate techniques that are more adaptable to a general set intersection scenario. We give a concise presentation of each evaluated technique below.

Intersect Path (IP) [8]. Given two ordered sets 𝐴 and 𝐡, IP considers the traversal of a grid, denoted as the Intersect Matrix, of size |𝐴| Γ— |𝐡|. Beginning from the top left corner of the grid, the path can move downwards if 𝐴[𝑖] > 𝐡[𝑗], to the right if 𝐴[𝑖] < 𝐡[𝑗], and diagonally if 𝐴[𝑖] = 𝐡[𝑗], until it eventually reaches the bottom right corner. There are two partitioning stages, one on kernel grid level and the other on block level. On grid level, equidistant cross diagonals are placed on the Intersect Matrix. The number of diagonals is equal to the number of thread blocks plus one, in order to delimit the boundaries of each block. Using binary search, the point of intersection between a cross diagonal and the path is found. As a result, each thread block is assigned to intersect disjoint subsets of the input. In case a cross diagonal intersects with the path inside a matrix cell, a count augmentation is required beforehand, since this intersection is not assigned to any thread block. An example of this scenario is shown in Figure 1. On block level, the same cross diagonal approach applies in order to distribute the workload among threads. Each thread may conduct a serial or binary-search-based intersection.

[Figure 1: Intersect Path example using 4 thread blocks.]
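As an illustration of the grid-level partitioning, the sketch below follows the standard GPU Merge Path diagonal search on which IP builds: for a cross diagonal 𝑖 + 𝑗 = diag, a binary search locates where the path crosses it, so that thread block 𝑏 can independently intersect the sub-ranges delimited by diagonals 𝑏 and 𝑏 + 1. Function and variable names are ours, and the count-augmentation corner case is omitted; this is a simplified sketch, not the authors' implementation.

    // Diagonal split for the grid-level partitioning of IP (Merge Path style).
    // Returns the number of elements of A that lie before cross diagonal `diag`;
    // the corresponding B position is diag - result.
    __device__ int diagonal_split(const int* a, int n, const int* b, int m, int diag) {
        int lo = (diag > m) ? diag - m : 0;   // feasible range of the A index on this diagonal
        int hi = (diag < n) ? diag : n;
        while (lo < hi) {
            int i = (lo + hi) / 2;            // candidate split: i elements from A, diag - i from B
            if (a[i] < b[diag - 1 - i])
                lo = i + 1;                   // path crosses further along A
            else
                hi = i;
        }
        return lo;
    }

Each block 𝑏 would call this helper for its two delimiting diagonals and intersect only the resulting pieces of 𝐴 and 𝐡, applying the same scheme again at block level to distribute work among its threads.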
Optimized Binary Search (OBS) [10]. Given two ordered sets 𝐴 and 𝐡, OBS caches the first levels of the binary search tree of the larger set in shared memory to reduce expensive global memory reads. For example, as shown in Figure 2, the higher level nodes of the binary tree reside in shared memory, whereas the leaves are located in global memory. The smaller set, which is 𝐡 in the example, is used for lookups and, as a result, there are |𝐡| total binary search lookups. Each thread is assigned a lookup and, in each iteration, up to 32 (i.e., the warp size) lookups are executed simultaneously. For the multi-instance problem, a simple optimization is to cache each set in shared memory one at a time, and then iterate over every subsequent one to perform the intersection for every pair. This requires sets to be sorted by their size.

[Figure 2: Optimized Binary Search example.]
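A rough sketch of the OBS idea follows: a few evenly spaced "upper level" pivots of the larger sorted set are cached in shared memory, each thread resolves one lookup from the smaller set against the cached pivots first, and only then finishes the search in the narrowed global-memory range. The kernel below is our simplification (it merely accumulates the intersection count, uses a fixed pivot count and assumes distinct, sorted, non-empty inputs); it is not the implementation of [10].

    #define PIVOTS 1024   // number of cached upper-level pivots (assumed value)

    // OBS-style lookup: coarse binary search over shared-memory pivots of the
    // larger set A, then a fine binary search in the delimited global range.
    __global__ void obs_count(const int* a, int n, const int* b, int m,
                              unsigned long long* count) {
        __shared__ int pivots[PIVOTS];
        int stride = (n + PIVOTS - 1) / PIVOTS;
        for (int i = threadIdx.x; i < PIVOTS; i += blockDim.x)
            pivots[i] = a[min(i * stride, n - 1)];
        __syncthreads();

        for (int t = blockIdx.x * blockDim.x + threadIdx.x; t < m;
             t += gridDim.x * blockDim.x) {
            int x = b[t];
            int lo = 0, hi = PIVOTS;              // coarse search in shared memory
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (pivots[mid] <= x) lo = mid + 1; else hi = mid;
            }
            int gl = (lo == 0) ? 0 : min((lo - 1) * stride, n - 1);
            int gh = min(lo * stride + 1, n);     // x, if present, lies in a[gl, gh)
            while (gl < gh) {                     // fine search in global memory
                int mid = (gl + gh) / 2;
                if (a[mid] < x) gl = mid + 1; else gh = mid;
            }
            if (gl < n && a[gl] == x) atomicAdd(count, 1ULL);
        }
    }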
Hash-based Intersection (HI) [13]. Given two sets 𝐴 and 𝐡, HI first hashes the shorter set and constructs buckets in parallel, and then iterates over the larger set and hashes every element into the corresponding bucket, as already described in Section 2.3. The initial hashing of the smaller set is preferred in order to reduce the number of collisions. Buckets are statically allocated once in linear global memory space. The size of each bucket, i.e., the number of entries from the shorter set in the bucket, is stored in shared memory. Thus, to ensure correctness for the MISI problem, we only need to clear the bucket sizes in shared memory and let the hashing of every next short set overwrite the previous one.
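The single-block sketch below illustrates this build/probe pattern: the shorter set is scattered into fixed-capacity buckets kept in global memory, with the bucket sizes in shared memory, and every element of the larger set then probes its bucket linearly. The bucket count, the capacity, the modulo hash, the assumption of non-negative element ids and the fact that overflowing entries are simply dropped are our simplifications for illustration, not details of [13].

    #define NUM_BUCKETS 1024
    #define BUCKET_CAP  64    // assumed fixed capacity per bucket

    // Single-block HI-style intersection count.
    __global__ void hi_count(const int* a, int n,        // shorter set
                             const int* b, int m,        // larger set
                             int* buckets,               // NUM_BUCKETS * BUCKET_CAP ints
                             unsigned long long* count) {
        __shared__ int bucket_size[NUM_BUCKETS];
        for (int i = threadIdx.x; i < NUM_BUCKETS; i += blockDim.x)
            bucket_size[i] = 0;
        __syncthreads();

        for (int i = threadIdx.x; i < n; i += blockDim.x) {   // build: hash the shorter set
            int h = a[i] % NUM_BUCKETS;
            int slot = atomicAdd(&bucket_size[h], 1);
            if (slot < BUCKET_CAP)                            // overflow is ignored in this sketch
                buckets[h * BUCKET_CAP + slot] = a[i];
        }
        __syncthreads();

        unsigned long long local = 0;
        for (int i = threadIdx.x; i < m; i += blockDim.x) {   // probe: scan the larger set
            int x = b[i], h = x % NUM_BUCKETS;
            int len = min(bucket_size[h], BUCKET_CAP);
            for (int s = 0; s < len; ++s)                     // linear search inside the bucket
                if (buckets[h * BUCKET_CAP + s] == x) { ++local; break; }
        }
        if (local) atomicAdd(count, local);
    }

Such a kernel would be launched on a single block, e.g. hi_count<<<1, 256>>>(…), mirroring the single-execution-unit behavior discussed later in this section.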
Bitmap-based Intersection (BI). Given two sets 𝐴 and 𝐡, BI conducts the set intersection on their bitmap representations, with each bitmap requiring |𝐸|/8 bytes of memory regardless of the set sizes. More specifically, there are two flavors of BI, namely (i) naive, where all bitmaps fit in global memory, and (ii) dynamic, where bitmaps are built on the fly in the case that we cannot store all of them in the available global memory. The latter is similar to the work of [4]. Bitmaps are constructed in parallel using the atomicOr function. In the naive scenario, each GPU thread fetches two bitmap words, one from each set, and conducts a popcount operation on the resulting logical AND word to compute the subset intersection. An example of the naive scenario is shown in Figure 3. In the dynamic scenario, each GPU thread block first constructs the bitmap representation of the current set, and then, for every next set, the threads iterate over its elements and check whether the respective bits are set (see the example in Figure 4).

[Figure 3: Naive Bitmap-based Intersection example.]
[Figure 4: Dynamic Bitmap-based Intersection example.]
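The core of BI-naive can be summarized by the two kernels sketched below, which are our minimal reading of the description above rather than the evaluated implementation: one kernel sets the bits of a set's bitmap with atomicOr, and one ANDs the word pairs of two bitmaps and accumulates the popcount.

    // Build a |E|-bit bitmap for one set in parallel (32-bit words, atomicOr).
    __global__ void build_bitmap(const int* set, int n, unsigned int* bitmap) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicOr(&bitmap[set[i] >> 5], 1u << (set[i] & 31));
    }

    // BI-naive: AND the word pairs of two bitmaps and popcount the result.
    __global__ void bi_naive_count(const unsigned int* bitmap_a,
                                   const unsigned int* bitmap_b,
                                   int words,                       // ceil(|E| / 32)
                                   unsigned long long* count) {
        unsigned long long local = 0;
        for (int w = blockIdx.x * blockDim.x + threadIdx.x; w < words;
             w += gridDim.x * blockDim.x)
            local += __popc(bitmap_a[w] & bitmap_b[w]);              // common elements in this word
        if (local) atomicAdd(count, local);
    }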
Static-index Intersection (sf-gssjoin) [16]. Given π‘˜ sets, sf-gssjoin first constructs a static inverted index over all set elements. For each set, the inverted lists of all its elements are concatenated into a logical vector. Next, this vector is partitioned evenly, with each partition assigned to a specific GPU thread. Thus, each thread independently processes its corresponding partition and contributes to the intersections of the current set with the next ones. To ensure correctness, threads increment the intersection counts by using atomic operations. An example of sf-gssjoin is shown in Figure 5.

[Figure 5: Static-index Intersection example using 4 GPU threads (Adapted from [3]).]
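The kernel below sketches this counting step for one probe set π‘ž under an assumed CSR-like layout of the inverted index (list_offsets/list_sets); for simplicity it assigns one block per probe element and splits that element's inverted list across the block's threads, whereas the actual technique flattens the concatenated vector and splits it evenly across threads. It is an illustration, not the code of [16].

    // Count the intersections of probe set q with all later sets, using a
    // static inverted index in a CSR-like layout (assumed names and layout).
    __global__ void static_index_count(const int* probe_elements, int probe_size,
                                       const int* list_offsets,  // per element: start of its inverted list
                                       const int* list_sets,     // concatenated inverted lists (set ids)
                                       int q,                     // id of the probe set
                                       unsigned int* pair_count)  // pair_count[j] accumulates |S_q ∩ S_j|
    {
        for (int e = blockIdx.x; e < probe_size; e += gridDim.x) {
            int elem  = probe_elements[e];
            int begin = list_offsets[elem];
            int end   = list_offsets[elem + 1];
            for (int p = begin + threadIdx.x; p < end; p += blockDim.x) {
                int j = list_sets[p];
                if (j > q)                        // count each unordered pair once
                    atomicAdd(&pair_count[j], 1u);
            }
        }
    }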
All of the presented techniques, with the exception of BI-naive and IP, conduct a single-instance set intersection on a single block or warp. This results in severe GPU underutilization, especially for sets of size in the order of millions, since a single execution unit is assigned the complete workload. To tackle this issue, we investigate the integration of the kernel grid-level partitioning of IP with OBS and HI; such combinations have not been proposed previously in the literature. We denote these novel hybrid techniques as IP-OBS and IP-HI, respectively. For the single-instance scenario, in BI-dynamic, we modify the algorithm by constructing the bitmap of the first set across multiple blocks and then partitioning the second set evenly into disjoint subsets, each assigned to a specific block.
4 EVALUATION
In this section, we evaluate the techniques presented in the previous section; for completeness, we also include CPU variants in our experiments. More specifically, we compare the GPU techniques against three CPU alternatives, namely, (i) SIMD, which performs intersection on sorted integers using SIMD instructions [12], (ii) std::set_intersection, which uses the C++ standard library, and (iii) boost::dynamic_bitset, which uses the Boost library in a similar fashion to BI-dynamic. We measure the total intersection time for the CPU techniques using the std::chrono library. For the GPU techniques, we measure the total time, including transfers and memory allocations, by using the CUDA event API. We do not measure any preprocessing time.
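For reference, GPU timings of this kind are typically collected with a pattern such as the following minimal sketch, where run_intersection is a placeholder for any of the evaluated techniques and error checking is omitted; this merely illustrates the CUDA event API usage and is not the measurement code of the paper.

    #include <cuda_runtime.h>

    // Minimal CUDA-event timing pattern: the events bracket the allocations,
    // transfers and kernel launches of one intersection run.
    float time_gpu_run() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        // cudaMalloc / cudaMemcpy inputs, launch the intersection kernel(s),
        // cudaMemcpy the result back: e.g. run_intersection(...);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);               // wait until all bracketed work is done

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }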
The experiments were conducted on a machine with an Intel i7 5820K clocked at 3.3 GHz, 32 GB RAM at 2400 MHz and an NVIDIA Titan Xp on CUDA 11.0. This GPU has 30 Streaming Multiprocessors, with a total of 3840 cores, 12 GB of global memory and a 384-bit memory bus width.

We experiment with artificial datasets, where set elements follow the normal, uniform and zipf-like distributions. To distinguish each dataset, we use the notation Distribution-Element universe-Average set size. We vary the element universe from 10⁸ up to 10⁹. Respectively, we vary the average set size from 10⁢ up to 10⁷. For the multi-instance scenario, we also experiment with three real world datasets previously employed in [3]. More specifically, for each sorted dataset we extract the last π‘˜ = 10000 sets, i.e., the largest sets. Table 1 gives an overview of the real-world datasets' characteristics.

Dataset    Cardinality   Avg. set size   Element universe
ENRON      1.0 Β· 10⁴     794             1.1 Β· 10⁢
ORKUT      1.0 Β· 10⁴     1001            8.7 Β· 10⁢
TWITTER    1.0 Β· 10⁴     150             3.7 Β· 10⁴
Table 1: Real world dataset characteristics.

4.1 SISI experiments
We examine the single-instance scenario using twelve artificial datasets (combinations of 3 distributions, 2 element universe sizes and 2 set sizes). We have also experimented with several block sizes, but due to lack of space, we present the results of the best configuration for each technique.

As shown in Figure 6, IP and IP-OBS are the clear winners for the SISI experiments, in the sense that one of them is the best overall performing technique across every dataset. Moreover, the difference between their performances is relatively small: up to 12% for the normal and zipf distributions, and up to 35% for the uniform distribution. Overall, these techniques achieve an average 2.3X speedup compared to the best performing CPU technique; the best performing CPU technique differs between the cases, but on average, SIMD is the most robust. In general, the speedup is similar for all distribution types and increases with higher universe and average set size, reaching 2.92X. The speedup of IP or IP-OBS over the worst performing CPU or GPU technique exceeds an order of magnitude; we have observed speedups up to 48.67X over a CPU technique and up to 10.97X over a GPU technique.

In addition, IP-HI exhibits the worst performance among the GPU techniques and seems unable to perform better than the CPU for many settings; this shows that simply relying on the IP workload allocation rationale is not adequate. Finally, BI-based techniques behave better for smaller universe sizes.

[Figure 6: Single-instance experiments on artificial datasets.]
4.2 MISI experiments
For the multi-instance evaluation, we conduct two sets of experiments. In the first one, we use artificially generated datasets consisting of large sets that follow the zipf distribution, which is the closest to the real world. In the second one, we use real world datasets and evaluate the intersection techniques on smaller set sizes.

As shown in Figure 7, in both artificial datasets with π‘˜ = 1000, BI-dynamic is the best performing technique and achieves an average 35X speedup compared to the best performing CPU technique. Furthermore, sf-gssjoin is the second best performing technique whenever it manages to launch, i.e., when the static index fits in the global memory, which is not the case when the element universe is 10⁹ in our experiments. On the other hand, IP-OBS is more robust and its performance is the closest to BI-dynamic across both datasets. In addition, we observe an 80% performance increase for the hybrid IP-OBS technique over the standalone OBS. Last, even though IP and IP-HI are the worst performing GPU techniques, they manage to achieve an average 5.7X speedup compared to the best performing CPU technique.

[Figure 7: Multi-Instance experiments on artificial datasets with π‘˜ = 1000.]

As shown in Figure 8, for the real world datasets with π‘˜ = 10000, sf-gssjoin is the most efficient technique across every dataset and achieves an average 147X speedup over the best performing CPU technique. Moreover, the small average set size leads to small static indices, which yields an optimal workload distribution among GPU threads and, overall, results in better GPU utilization. In contrast, BI-dynamic, OBS and HI conduct set intersection at block level. Thus, there is an inherent workload imbalance among GPU thread blocks. We note that IP, and consequently its two workload allocation rationales, IP-OBS and IP-HI, cannot always execute due to memory constraints. More specifically, the required memory to store the diagonals for the grid-level partitioning of IP is 2 Γ— π‘˜(π‘˜ βˆ’ 1)/2 Γ— (π‘π‘™π‘œπ‘π‘˜π‘  + 1). As a result, there is an upper bound to the number of blocks to comply with the global memory restriction. However, due to the small set sizes, we consider IP partitioning to be an excessive approach to conduct set intersection on these datasets.

[Figure 8: Multi-Instance experiments on real world datasets with π‘˜ = 10000.]

Based on our experimental evaluation, we conclude that for large MISI problems, BI-dynamic and IP-OBS are preferable. On the other hand, when dealing with multi-instance small set intersections, the grid-level partitioning of IP adds overhead and results in severe GPU underutilization. In such cases, standalone OBS and HI perform better but are surpassed by sf-gssjoin, which achieves optimal GPU utilization.

5 CONCLUSION
In this work, we adapt and evaluate GPU set intersection techniques that were previously applied to graph analytics. Although these techniques are better suited for intersecting sets of smaller size, we experimentally show that, through certain enhancements, they can be easily adapted for large set intersection. We explain which are the best approaches in the single- and multi-instance cases, and we introduce a novel hybrid, namely, combining IP with OBS, which proves to be a dominant solution for the former case, and competitive in the latter one. Also, employing bitmap-based solutions pays off in the multi-instance case. Finally, if the set sizes are relatively small, multi-instance set intersection stands to benefit from adapting a set similarity join technique.

Acknowledgments. The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1154). In addition, the authors gratefully acknowledge the support of NVIDIA through the donation of the GPU (GPU Grant Program).
REFERENCES
[1] Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin. 2011. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment 4, 8 (2011), 470–481.
[2] JΓ©rΓ©my Barbay, Alejandro LΓ³pez-Ortiz, and Tyler Lu. 2006. Faster adaptive set intersections for text searching. In International Workshop on Experimental and Efficient Algorithms. Springer, 146–157.
[3] Christos Bellas and Anastasios Gounaris. 2020. An empirical evaluation of exact set similarity join techniques using GPUs. Inf. Syst. 89 (2020), 101485.
[4] Mauro Bisson and Massimiliano Fatica. 2017. High performance exact triangle counting on GPUs. IEEE Transactions on Parallel and Distributed Systems 28, 12 (2017), 3501–3510.
[5] Erik D. Demaine, Alejandro LΓ³pez-Ortiz, and J. Ian Munro. 2001. Experiments on adaptive set intersections for text retrieval systems. In Workshop on Algorithm Engineering and Experimentation. Springer, 91–104.
[6] Shuai Ding, Jinru He, Hao Yan, and Torsten Suel. 2009. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web. 421–430.
[7] James Fox, Oded Green, Kasimir Gabert, Xiaojing An, and David A. Bader. 2018. Fast and adaptive list intersections on the GPU. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[8] Oded Green, Pavan Yalamanchili, and LluΓ­s-Miquel MunguΓ­a. 2014. Fast triangle counting on the GPU. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms. 1–8.
[9] Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.
[10] Yang Hu, Hang Liu, and H. Howie Huang. 2018. TriCore: Parallel triangle counting on GPUs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 171–182.
[11] Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. 2014. Querying k-truss community in large and dynamic graphs. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 1311–1322.
[12] Daniel Lemire, Leonid Boytsov, and Nathan Kurz. 2016. SIMD compression and the intersection of sorted integers. Software: Practice and Experience 46, 6 (2016), 723–749.
[13] Santosh Pandey, Xiaoye Sherry Li, Aydin Buluc, Jiejun Xu, and Hang Liu. 2019. H-INDEX: Hash-indexing for parallel triangle counting on GPUs. In 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[14] Adam Polak. 2016. Counting triangles in large graphs on GPU. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 740–746.
[15] Vijayshankar Raman, Lin Qiao, Wei Han, Inderpal Narang, Ying-Lin Chen, Kou-Horng Yang, and Fen-Ling Ling. 2007. Lazy, adaptive rid-list intersection, and its application to index anding. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 773–784.
[16] Sidney Ribeiro-Junior, Rafael David Quirino, Leonardo Andrade Ribeiro, and Wellington Santos Martins. 2017. Fast parallel set similarity joins on many-core architectures. Journal of Information and Data Management 8, 3 (2017), 255–255.
[17] Thomas Schank and Dorothea Wagner. 2005. Finding, counting and listing all triangles in large graphs, an experimental study. In International Workshop on Experimental and Efficient Algorithms. Springer, 606–609.
[18] Leyuan Wang, Yangzihao Wang, Carl Yang, and John D. Owens. 2016. A comparative study on exact triangle counting algorithms on the GPU. In Proceedings of the ACM Workshop on High Performance Graph Processing. 1–8.
[19] Nan Wang, Jingbo Zhang, Kian-Lee Tan, and Anthony K. H. Tung. 2010. On triangulation-based dense neighborhood graph discovery. Proceedings of the VLDB Endowment 4, 2 (2010), 58–68.
[20] Di Wu, Fan Zhang, Naiyong Ao, Fang Wang, Xiaoguang Liu, and Gang Wang. 2009. A batched GPU algorithm for set intersection. In 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks. IEEE, 752–756.
[21] Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, and Jing Liu. 2010. Efficient lists intersection by CPU-GPU cooperative computing. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW). IEEE, 1–8.