    An Evaluation of Large Set Intersection Techniques on GPUs
                                                        Christos Bellas, Anastasios Gounaris
                                                              Aristotle University of Thessaloniki
                                                                      Thessaloniki, Greece
                                                               (chribell,gounaria)@csd.auth.gr

ABSTRACT
In this paper, we focus on large set intersection, which is a pivotal operation in information retrieval, graph analytics and database systems. We aim to experimentally detect under which conditions using a graphics processing unit (GPU) is beneficial over CPU techniques, and which exact techniques are capable of yielding improvements. We cover and adapt techniques initially proposed for graph analytics, while we investigate new hybrids for completeness. Finally, we present a comprehensive evaluation highlighting the main characteristics of the techniques examined, both when a single pair of large sets is processed and when all pairs in a dataset are examined.
1 INTRODUCTION
Set intersection is an essential operation in numerous application domains, such as information retrieval for text search engines using an inverted index [2, 5], graph analytics for triangle counting and community detection [11, 17, 19], and database systems for merging RID-lists [15] and performing fast (potentially bitwise) operations on data in columnar format [9].

Modern GPUs offer a high-level parallel environment at low cost. As a result, over the past years, there has been considerable research work on improving graph analytics on GPUs, mostly in the context of graph triangle counting, where set intersection dominates the running time [4, 7, 8, 10, 13, 18]. The majority of these studies focus on improving the level of parallelism by reducing redundant comparisons and distributing the workload evenly among GPU threads. Set intersection on GPUs has also been examined in the context of set similarity joins [3].

In this work, we compare and evaluate state-of-the-art GPU techniques for set intersection when the sets are large, i.e., they contain millions of elements. More specifically, we gather together techniques belonging to five different rationales, namely intersect path, optimized binary search, hash, bitmap and set similarity join-based solutions, while we include novel promising hybrid solutions that combine existing techniques to better fit our setting.

We perform experiments both when the intersection of a single pair of sets is computed (single-instance problem) and when the intersections among all set pairs in a database are computed (multi-instance problem). We derive useful insights; more specifically, we provide experimental evidence about the superiority of two intersect path flavors for the single-instance problem and of either bitmap-based or similarity join-based solutions for the multi-instance problem, depending on the size of the sets.ΒΉ

The rest of the paper is organized as follows. Section 2 gives background on the problem and an overview of the related work. Section 3 describes the evaluated GPU techniques for set intersection. In Section 4, we present the experiments and discuss the results. Last, in Section 5 we conclude and discuss possible future work.
ΒΉ The source code is available at https://github.com/chribell/set_intersection

Β© Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 PRELIMINARIES
In this section, we first give a formal notation for the set intersection problem. Next, we give an overview of the related work about set intersection, after explaining the exact scope of this paper.

2.1 Set Intersection Problem Description
Single-instance set intersection (SISI) problem: given i) a finite universe of elements 𝐸, and ii) a set 𝐴 = {𝑒_1^𝐴, …, 𝑒_𝑛^𝐴} of size 𝑛 and a set 𝐡 = {𝑒_1^𝐡, …, 𝑒_π‘š^𝐡} of size π‘š, where every element of 𝐴 and 𝐡 belongs to 𝐸, the set intersection 𝐴 ∩ 𝐡 produces a new set 𝑆 containing all the common elements of 𝐴 and 𝐡.

Multi-instance set intersection (MISI) problem: in the multi-instance set intersection problem, we are given a collection of π‘˜ sets 𝐢 = {𝑆_1, …, 𝑆_π‘˜}, and we aim to find the set intersection of all π‘˜(π‘˜ βˆ’ 1)/2 pairs (𝑆_𝑖, 𝑆_𝑗), where 0 < 𝑖 < 𝑗 ≀ π‘˜.

Obviously, the solutions of SISI are of Ξ©(𝑛 + π‘š) complexity, whereas MISI solutions are of quadratic complexity in π‘˜ without allowing for any pair pruning, as, for example, in problems such as set similarity joins [3]. Therefore, the focus is not on evaluating the behavior of algorithms differing in asymptotic complexity, since all techniques are of similar complexity; rather, we aim to assess the impact of different, potentially low-level, engineering techniques when 𝑛 and π‘š are in the order of millions and the size of 𝐸 is orders of magnitude larger.
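To make the Ξ©(𝑛 + π‘š) baseline concrete, the following minimal sketch (ours, not taken from the paper or its repository) shows the standard sequential two-pointer intersection of two sorted sets, which inspects each element at most once; the GPU techniques discussed below parallelize this operation in different ways.

    #include <cstddef>
    #include <vector>

    // Sequential SISI baseline: merge-style intersection of two sorted sets.
    // Every element of A and B is inspected at most once, hence Theta(n + m) work.
    std::vector<int> intersect(const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> out;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])       ++i;                      // advance the smaller head
            else if (a[i] > b[j])  ++j;
            else { out.push_back(a[i]); ++i; ++j; }          // common element
        }
        return out;
    }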
2.2 Scope of our work
Over the past years, there has been a lot of research proposing GPU techniques for the set intersection problem, as will be discussed shortly. In the majority of cases, the problem of set intersection is tackled in the context of triangle counting to accelerate graph analytics. Triangle counting is a special case of set intersection, which stems from the need to quickly find intersection counts among vertex adjacency lists of small length. As a result, the techniques proposed are tailored to specific algorithmic optimizations. In the context of this work, we extract, adapt and evaluate these techniques under a more demanding large set intersection scenario, also performing a comparison against all proposed methodologies that can address the SISI and MISI problems for large sets.

2.3 Overview of existing solutions
We present the relevant techniques mostly in chronological order. Ding et al. [6] were the first to implement a parallel GPU set intersection algorithm. In their proposed solution, they use the so-called Parallel Merge Find (PMF) algorithm to compute a set intersection. In essence, given two sets, the shorter one is partitioned into disjoint segments, with each segment assigned to a different GPU thread. Then, the last element of each segment is searched in the longer set to find the corresponding or closest positions for the partitioning of the longer set. As a result, each GPU thread becomes capable of computing its own intersection among segments in parallel. In [20], Wu et al. highlight the inefficiency of the approach followed in [6] for sets of smaller size. Subsequently, they propose a GPU technique for set intersection that takes advantage of the fast on-chip shared memory. First, an empty array of size equal to the size of the shorter set is allocated in shared memory. Next, each GPU thread is assigned a specific number of elements from the shorter set and conducts binary searches in the longer set. If an element is found, its corresponding cell in the shared memory array is set to 1. Finally, a scan operation on the shared memory array is performed along with a stream compaction procedure, so that threads produce the final intersection. In [21], Wu et al. extend their original work described above by introducing heuristic strategies to balance the workload across thread blocks more efficiently. Additionally, the proposal in [1] further extends the original work in [21] by introducing a linear regression approach to reduce the search range of a binary search and a hash segmentation approach as an alternative to binary search.
In [8], Green et al. present the Intersect Path (IP) algorithm for set intersection, which is a variation of the well-established GPU Merge Path algorithm for merging lists. In addition, IP can be considered as an extension of PMF, since it encapsulates a similar set partition logic. First, input sets are partitioned into segments that can be intersected independently by multiple thread blocks. Second, for each thread block, the workload is distributed in such a way that a nearly equal number of elements to intersect is allocated to each thread. In [7], the authors extend the work of [8] by proposing an adaptive load balancing technique and dynamically assign work to GPU threads based on work estimations. The authors of [14] and [18] follow a similar approach to set intersection. More specifically, their main concept is, for each GPU thread, to sequentially compute a set intersection by using a two-pointer merge algorithm. In the context of triangle counting, such an approach is applicable since the requirement is (i) for each GPU thread to compute a two-set intersection count independently, and (ii) for these partial counts to be accumulated to compute the final global count. However, in the setting of set intersection over two large sets, their solution is inferior and similar to a sequential CPU approach.

In [4], Bisson et al. propose a different approach, namely, GPU set intersection based on bitmaps and atomic operations. Given two sets, to compute their intersection, the GPU threads create the bitmap representation of the first set in parallel and then iterate over the elements of the second set to search for the corresponding set bits. Based on the average set size, the workload allocation is per thread, per warp or per block.

In [10], Hu et al. demonstrate that set intersection is faster employing efficient binary search-based techniques than IP-based techniques, arguing that the latter suffer from the nontrivial overhead of partitioning the input sets and from non-coalesced memory accesses. On the other hand, their proposed algorithm optimizes binary search at a warp level to achieve coalesced memory access and to alleviate the need for workload balancing. In addition, by caching the first levels of the binary search tree they employ in shared memory, they can achieve additional speedup gains.

In [13], Pandey et al. propose a hash-based technique for set intersection. In brief, the algorithm first hashes the shorter set into buckets, and then iterates over the larger set and hashes each element to the corresponding bucket. Afterwards, a linear search is conducted within each bucket to find the intersections.

Last, in the context of set similarity join, the authors of [16] propose a technique for conducting set intersections by using a static inverted index and atomic operations. By setting the similarity degree equal to zero, their solution can be used to solve the SISI and MISI problems.

As already mentioned, some of these works, such as [4, 8, 10, 14, 18], are more complete proposals targeting the triangle counting problem; here we narrow down our attention only to the part that is relevant to the SISI and MISI problems.
3 TECHNIQUES
In the previous section, we have identified five different methodologies, based on (1) IP, (2) optimized binary search, (3) bitmap operations, (4) hashing and (5) set similarity joins, respectively. Here, we present five state-of-the-art GPU techniques for set intersection, one for each methodology, in more detail. In addition, we discuss some modifications to these techniques to target large set intersection, yielding two novel hybrid solutions. We do not consider works such as [14, 18] that mostly rely on the preprocessing of input data and conduct the intersection with a simple merge-style algorithm. In contrast, we evaluate techniques that are more adaptable to a general set intersection scenario. We give a concise presentation of each evaluated technique below.

Intersect Path (IP) [8]. Given two ordered sets 𝐴 and 𝐡, IP considers the traversal of a grid, denoted as the Intersect Matrix, of size |𝐴| Γ— |𝐡|. Beginning from the top left corner of the grid, the path can move downwards if 𝐴[𝑖] > 𝐡[𝑗], to the right if 𝐴[𝑖] < 𝐡[𝑗], and diagonally if 𝐴[𝑖] = 𝐡[𝑗], until it eventually reaches the bottom right corner. There are two partitioning stages, one on kernel grid level and the other on block level. On grid level, equidistant cross diagonals are placed on the Intersect Matrix. The number of diagonals is equal to the number of thread blocks plus one, in order to delimit the boundaries of each block. Using binary search, the point of intersection between a cross diagonal and the path is found. As a result, each thread block is assigned to intersect disjoint subsets of the input. In case a cross diagonal intersects with the path inside a matrix cell, a count augmentation is required beforehand, since this intersection is not assigned to any thread block. An example of this scenario is shown in Figure 1. On block level, the same cross diagonal approach applies in order to distribute the workload among threads. Each thread may conduct a serial or binary-search-based intersection.

[Figure 1: Intersect Path example using 4 thread blocks.]
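As an illustration of the grid-level partitioning, the sketch below follows the standard GPU Merge Path diagonal search on which IP builds: for a cross diagonal 𝑖 + 𝑗 = diag, a binary search locates where the path crosses it, so that thread block 𝑏 can independently intersect the sub-ranges delimited by diagonals 𝑏 and 𝑏 + 1. Function and variable names are ours, and the count-augmentation corner case is omitted; this is a simplified sketch, not the authors' implementation.

    // Diagonal split for the grid-level partitioning of IP (Merge Path style).
    // Returns the number of elements of A that lie before cross diagonal `diag`;
    // the corresponding B position is diag - result.
    __device__ int diagonal_split(const int* a, int n, const int* b, int m, int diag) {
        int lo = (diag > m) ? diag - m : 0;   // feasible range of the A index on this diagonal
        int hi = (diag < n) ? diag : n;
        while (lo < hi) {
            int i = (lo + hi) / 2;            // candidate split: i elements from A, diag - i from B
            if (a[i] < b[diag - 1 - i])
                lo = i + 1;                   // path crosses further along A
            else
                hi = i;
        }
        return lo;
    }

Each block 𝑏 would call this helper for its two delimiting diagonals and intersect only the resulting pieces of 𝐴 and 𝐡, applying the same scheme again at block level to distribute work among its threads.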
Optimized Binary Search (OBS) [10]. Given two ordered sets 𝐴 and 𝐡, OBS caches the first levels of the binary search tree of the larger set in shared memory to reduce expensive global memory reads. For example, as shown in Figure 2, the higher level nodes of the binary tree reside in shared memory, whereas the leaves are located in global memory. The smaller set, which is 𝐡 in the example, is used for lookups and, as a result, there are |𝐡| total binary search lookups. Each thread is assigned a lookup and, in each iteration, up to 32 (i.e., the warp size) lookups are executed simultaneously. For the multi-instance problem, a simple optimization is to cache each set in shared memory one at a time, and then iterate over every subsequent one to perform the intersection for every pair. This requires sets to be sorted by their size.

[Figure 2: Optimized Binary Search example.]
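A rough sketch of the OBS idea follows: a few evenly spaced "upper level" pivots of the larger sorted set are cached in shared memory, each thread resolves one lookup from the smaller set against the cached pivots first, and only then finishes the search in the narrowed global-memory range. The kernel below is our simplification (it merely accumulates the intersection count, uses a fixed pivot count and assumes distinct, sorted, non-empty inputs); it is not the implementation of [10].

    #define PIVOTS 1024   // number of cached upper-level pivots (assumed value)

    // OBS-style lookup: coarse binary search over shared-memory pivots of the
    // larger set A, then a fine binary search in the delimited global range.
    __global__ void obs_count(const int* a, int n, const int* b, int m,
                              unsigned long long* count) {
        __shared__ int pivots[PIVOTS];
        int stride = (n + PIVOTS - 1) / PIVOTS;
        for (int i = threadIdx.x; i < PIVOTS; i += blockDim.x)
            pivots[i] = a[min(i * stride, n - 1)];
        __syncthreads();

        for (int t = blockIdx.x * blockDim.x + threadIdx.x; t < m;
             t += gridDim.x * blockDim.x) {
            int x = b[t];
            int lo = 0, hi = PIVOTS;              // coarse search in shared memory
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (pivots[mid] <= x) lo = mid + 1; else hi = mid;
            }
            int gl = (lo == 0) ? 0 : min((lo - 1) * stride, n - 1);
            int gh = min(lo * stride + 1, n);     // x, if present, lies in a[gl, gh)
            while (gl < gh) {                     // fine search in global memory
                int mid = (gl + gh) / 2;
                if (a[mid] < x) gl = mid + 1; else gh = mid;
            }
            if (gl < n && a[gl] == x) atomicAdd(count, 1ULL);
        }
    }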
Hash-based Intersection (HI) [13]. Given two sets 𝐴 and 𝐡, HI first hashes the shorter set and constructs buckets in parallel, and then iterates over the larger set and hashes every element into the corresponding bucket, as already described in Section 2.3. The initial hashing of the smaller set is preferred in order to reduce the number of collisions. Buckets are statically allocated once in linear global memory space. The size of each bucket, i.e., the number of entries from the shorter set in the bucket, is stored in shared memory. Thus, to ensure correctness for the MISI problem, we only need to clear the bucket sizes in shared memory and let the hashing of every next short set overwrite the previous one.
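The single-block sketch below illustrates this build/probe pattern: the shorter set is scattered into fixed-capacity buckets kept in global memory, with the bucket sizes in shared memory, and every element of the larger set then probes its bucket linearly. The bucket count, the capacity, the modulo hash, the assumption of non-negative element ids and the fact that overflowing entries are simply dropped are our simplifications for illustration, not details of [13].

    #define NUM_BUCKETS 1024
    #define BUCKET_CAP  64    // assumed fixed capacity per bucket

    // Single-block HI-style intersection count.
    __global__ void hi_count(const int* a, int n,        // shorter set
                             const int* b, int m,        // larger set
                             int* buckets,               // NUM_BUCKETS * BUCKET_CAP ints
                             unsigned long long* count) {
        __shared__ int bucket_size[NUM_BUCKETS];
        for (int i = threadIdx.x; i < NUM_BUCKETS; i += blockDim.x)
            bucket_size[i] = 0;
        __syncthreads();

        for (int i = threadIdx.x; i < n; i += blockDim.x) {   // build: hash the shorter set
            int h = a[i] % NUM_BUCKETS;
            int slot = atomicAdd(&bucket_size[h], 1);
            if (slot < BUCKET_CAP)                            // overflow is ignored in this sketch
                buckets[h * BUCKET_CAP + slot] = a[i];
        }
        __syncthreads();

        unsigned long long local = 0;
        for (int i = threadIdx.x; i < m; i += blockDim.x) {   // probe: scan the larger set
            int x = b[i], h = x % NUM_BUCKETS;
            int len = min(bucket_size[h], BUCKET_CAP);
            for (int s = 0; s < len; ++s)                     // linear search inside the bucket
                if (buckets[h * BUCKET_CAP + s] == x) { ++local; break; }
        }
        if (local) atomicAdd(count, local);
    }

Such a kernel would be launched on a single block, e.g. hi_count<<<1, 256>>>(…), mirroring the single-execution-unit behavior discussed later in this section.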
Bitmap-based Intersection (BI). Given two sets 𝐴 and 𝐡, BI conducts the set intersection on their bitmap representations, with each bitmap requiring |𝐸|/8 bytes of memory regardless of the set sizes. More specifically, there are two flavors of BI, namely (i) naive, where all bitmaps fit in global memory, and (ii) dynamic, where bitmaps are built on the fly in the case that we cannot store all of them in the available global memory. The latter is similar to the work of [4]. Bitmaps are constructed in parallel using the atomicOr function. In the naive scenario, each GPU thread fetches two bitmap words, one from each set, and conducts a popcount operation on the resulting logical AND word to compute the subset intersection. An example of the naive scenario is shown in Figure 3. In the dynamic scenario, each GPU thread block first constructs the bitmap representation of the current set, and then, for every next set, the threads iterate over its elements and check whether the respective bits are set (see the example in Figure 4).

[Figure 3: Naive Bitmap-based Intersection example.]
[Figure 4: Dynamic Bitmap-based Intersection example.]
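The core of BI-naive can be summarized by the two kernels sketched below, which are our minimal reading of the description above rather than the evaluated implementation: one kernel sets the bits of a set's bitmap with atomicOr, and one ANDs the word pairs of two bitmaps and accumulates the popcount.

    // Build a |E|-bit bitmap for one set in parallel (32-bit words, atomicOr).
    __global__ void build_bitmap(const int* set, int n, unsigned int* bitmap) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicOr(&bitmap[set[i] >> 5], 1u << (set[i] & 31));
    }

    // BI-naive: AND the word pairs of two bitmaps and popcount the result.
    __global__ void bi_naive_count(const unsigned int* bitmap_a,
                                   const unsigned int* bitmap_b,
                                   int words,                       // ceil(|E| / 32)
                                   unsigned long long* count) {
        unsigned long long local = 0;
        for (int w = blockIdx.x * blockDim.x + threadIdx.x; w < words;
             w += gridDim.x * blockDim.x)
            local += __popc(bitmap_a[w] & bitmap_b[w]);              // common elements in this word
        if (local) atomicAdd(count, local);
    }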
Static-index Intersection (sf-gssjoin) [16]. Given π‘˜ sets, sf-gssjoin first constructs a static inverted index over all set elements. For each set, the inverted lists of all its elements are concatenated into a logical vector. Next, this vector is partitioned evenly, with each partition assigned to a specific GPU thread. Thus, each thread independently processes its corresponding partition and contributes to the intersections of the current set with the next ones. To ensure correctness, threads increment the intersection counts by using atomic operations. An example of sf-gssjoin is shown in Figure 5.

[Figure 5: Static-index Intersection example using 4 GPU threads (Adapted from [3]).]
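The kernel below sketches this counting step for one probe set π‘ž under an assumed CSR-like layout of the inverted index (list_offsets/list_sets); for simplicity it assigns one block per probe element and splits that element's inverted list across the block's threads, whereas the actual technique flattens the concatenated vector and splits it evenly across threads. It is an illustration, not the code of [16].

    // Count the intersections of probe set q with all later sets, using a
    // static inverted index in a CSR-like layout (assumed names and layout).
    __global__ void static_index_count(const int* probe_elements, int probe_size,
                                       const int* list_offsets,  // per element: start of its inverted list
                                       const int* list_sets,     // concatenated inverted lists (set ids)
                                       int q,                     // id of the probe set
                                       unsigned int* pair_count)  // pair_count[j] accumulates |S_q ∩ S_j|
    {
        for (int e = blockIdx.x; e < probe_size; e += gridDim.x) {
            int elem  = probe_elements[e];
            int begin = list_offsets[elem];
            int end   = list_offsets[elem + 1];
            for (int p = begin + threadIdx.x; p < end; p += blockDim.x) {
                int j = list_sets[p];
                if (j > q)                        // count each unordered pair once
                    atomicAdd(&pair_count[j], 1u);
            }
        }
    }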
All of the presented techniques, with the exception of BI-naive and IP, conduct a single-instance set intersection on a single block or warp. This results in severe GPU underutilization, especially for sets of size in the order of millions, since a single execution unit is assigned the complete workload. To tackle this issue, we investigate the integration of the kernel grid-level partitioning of IP with OBS and HI; such combinations have not been proposed previously in the literature. We denote these novel hybrid techniques as IP-OBS and IP-HI, respectively. For the single-instance scenario, in BI-dynamic, we modify the algorithm by constructing the bitmap of the first set across multiple blocks and then partitioning the second set evenly into disjoint subsets, each assigned to a specific block.
4 EVALUATION
In this section, we evaluate the techniques presented in the previous section; for completeness, we also include CPU variants in our experiments. More specifically, we compare the GPU techniques against three CPU alternatives, namely, (i) SIMD, which performs intersection on sorted integers using SIMD instructions [12], (ii) std::set_intersection, which uses the C++ standard library, and (iii) boost::dynamic_bitset, which uses the Boost library in a similar fashion to BI-dynamic. We measure the total intersection time for the CPU techniques using the std::chrono library. For the GPU techniques, we measure the total time, including transfers and memory allocations, by using the CUDA event API. We do not measure any preprocessing time.
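For reference, GPU timings of this kind are typically collected with a pattern such as the following minimal sketch, where run_intersection is a placeholder for any of the evaluated techniques and error checking is omitted; this merely illustrates the CUDA event API usage and is not the measurement code of the paper.

    #include <cuda_runtime.h>

    // Minimal CUDA-event timing pattern: the events bracket the allocations,
    // transfers and kernel launches of one intersection run.
    float time_gpu_run() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        // cudaMalloc / cudaMemcpy inputs, launch the intersection kernel(s),
        // cudaMemcpy the result back: e.g. run_intersection(...);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);               // wait until all bracketed work is done

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }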
The experiments were conducted on a machine with an Intel i7 5820K clocked at 3.3 GHz, 32 GB RAM at 2400 MHz and an NVIDIA Titan Xp on CUDA 11.0. This GPU has 30 Streaming Multiprocessors, with a total of 3840 cores, 12 GB of global memory and a 384-bit memory bus width.

We experiment with artificial datasets, where set elements follow the normal, uniform and zipf-like distributions. To distinguish each dataset, we use the notation Distribution-Element universe-Average set size. We vary the element universe from 10⁸ up to 10⁹. Respectively, we vary the average set size from 10⁢ up to 10⁷. For the multi-instance scenario, we also experiment with three real world datasets previously employed in [3]. More specifically, for each sorted dataset we extract the last π‘˜ = 10000 sets, i.e., the largest sets. Table 1 gives an overview of the real-world datasets' characteristics.

Dataset    Cardinality   Avg. set size   Element universe
ENRON      1.0 Β· 10⁴     794             1.1 Β· 10⁢
ORKUT      1.0 Β· 10⁴     1001            8.7 Β· 10⁢
TWITTER    1.0 Β· 10⁴     150             3.7 Β· 10⁴
Table 1: Real world dataset characteristics.

4.1 SISI experiments
We examine the single-instance scenario using twelve artificial datasets (combinations of 3 distributions, 2 element universe sizes and 2 set sizes). We have also experimented with several block sizes, but due to lack of space, we present the results of the best configuration for each technique.

As shown in Figure 6, IP and IP-OBS are the clear winners for the SISI experiments, in the sense that one of them is the best overall performing technique across every dataset. Moreover, the difference between their performances is relatively small: up to 12% for the normal and zipf distributions, and up to 35% for the uniform distribution. Overall, these techniques achieve an average 2.3X speedup compared to the best performing CPU technique; the best performing CPU technique differs between the cases, but on average, SIMD is the most robust. In general, the speedup is similar for all distribution types and increases with higher universe and average set size, reaching 2.92X. The speedup of IP or IP-OBS over the worst performing CPU or GPU technique exceeds an order of magnitude; we have observed speedups up to 48.67X over a CPU technique and up to 10.97X over a GPU technique.

In addition, IP-HI exhibits the worst performance among the GPU techniques and seems unable to perform better than the CPU for many settings; this shows that simply relying on the IP workload allocation rationale is not adequate. Finally, BI-based techniques behave better for smaller universe sizes.

[Figure 6: Single-instance experiments on artificial datasets.]
4.2 MISI experiments
For the multi-instance evaluation, we conduct two sets of experiments. In the first one, we use artificially generated datasets consisting of large sets that follow the zipf distribution, which is the closest to the real world. In the second one, we use real world datasets and evaluate the intersection techniques on smaller set sizes.

As shown in Figure 7, in both artificial datasets with π‘˜ = 1000, BI-dynamic is the best performing technique and achieves an average 35X speedup compared to the best performing CPU technique. Furthermore, sf-gssjoin is the second best performing technique whenever it manages to launch, i.e., when the static index fits in the global memory, which is not the case when the element universe is 10⁹ in our experiments. On the other hand, IP-OBS is more robust and its performance is the closest to BI-dynamic across both datasets. In addition, we observe an 80% performance increase for the hybrid IP-OBS technique over the standalone OBS. Last, even though IP and IP-HI are the worst performing GPU techniques, they manage to achieve an average 5.7X speedup compared to the best performing CPU technique.

[Figure 7: Multi-Instance experiments on artificial datasets with π‘˜ = 1000.]

As shown in Figure 8, for the real world datasets with π‘˜ = 10000, sf-gssjoin is the most efficient technique across every dataset and achieves an average 147X speedup over the best performing CPU technique. Moreover, the small average set size leads to small static indices, which yields an optimal workload distribution among GPU threads and, overall, results in better GPU utilization. In contrast, BI-dynamic, OBS and HI conduct set intersection at block level. Thus, there is an inherent workload imbalance among GPU thread blocks. We note that IP, and consequently its two workload allocation rationales, IP-OBS and IP-HI, cannot always execute due to memory constraints. More specifically, the required memory to store the diagonals for the grid-level partitioning of IP is 2 Γ— π‘˜(π‘˜ βˆ’ 1)/2 Γ— (π‘π‘™π‘œπ‘π‘˜π‘  + 1). As a result, there is an upper bound to the number of blocks to comply with the global memory restriction. However, due to the small set sizes, we consider IP partitioning to be an excessive approach to conduct set intersection on these datasets.

[Figure 8: Multi-Instance experiments on real world datasets with π‘˜ = 10000.]

Based on our experimental evaluation, we conclude that for large MISI problems, BI-dynamic and IP-OBS are preferable. On the other hand, when dealing with multi-instance small set intersections, the grid-level partitioning of IP adds overhead and results in severe GPU underutilization. In such cases, standalone OBS and HI perform better but are surpassed by sf-gssjoin, which achieves optimal GPU utilization.

5 CONCLUSION
In this work, we adapt and evaluate GPU set intersection techniques that were previously applied to graph analytics. Although these techniques are better suited for intersecting sets of smaller size, we experimentally show that, through certain enhancements, they can be easily adapted for large set intersection. We explain which are the best approaches in the single- and multi-instance cases, and we introduce a novel hybrid, namely, combining IP with OBS, which proves to be a dominant solution for the former case, and competitive in the latter one. Also, employing bitmap-based solutions pays off in the multi-instance case. Finally, if the set sizes are relatively small, multi-instance set intersection stands to benefit from adapting a set similarity join technique.

Acknowledgments. The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1154). In addition, the authors gratefully acknowledge the support of NVIDIA through the donation of the GPU (GPU Grant Program).
REFERENCES
[1] Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and Sheng Lin. 2011. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment 4, 8 (2011), 470–481.
[2] JΓ©rΓ©my Barbay, Alejandro LΓ³pez-Ortiz, and Tyler Lu. 2006. Faster adaptive set intersections for text searching. In International Workshop on Experimental and Efficient Algorithms. Springer, 146–157.
[3] Christos Bellas and Anastasios Gounaris. 2020. An empirical evaluation of exact set similarity join techniques using GPUs. Inf. Syst. 89 (2020), 101485.
[4] Mauro Bisson and Massimiliano Fatica. 2017. High performance exact triangle counting on GPUs. IEEE Transactions on Parallel and Distributed Systems 28, 12 (2017), 3501–3510.
[5] Erik D. Demaine, Alejandro LΓ³pez-Ortiz, and J. Ian Munro. 2001. Experiments on adaptive set intersections for text retrieval systems. In Workshop on Algorithm Engineering and Experimentation. Springer, 91–104.
[6] Shuai Ding, Jinru He, Hao Yan, and Torsten Suel. 2009. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web. 421–430.
[7] James Fox, Oded Green, Kasimir Gabert, Xiaojing An, and David A. Bader. 2018. Fast and adaptive list intersections on the GPU. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[8] Oded Green, Pavan Yalamanchili, and LluΓ­s-Miquel MunguΓ­a. 2014. Fast triangle counting on the GPU. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms. 1–8.
[9] Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.
[10] Yang Hu, Hang Liu, and H. Howie Huang. 2018. TriCore: Parallel triangle counting on GPUs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 171–182.
[11] Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. 2014. Querying k-truss community in large and dynamic graphs. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 1311–1322.
[12] Daniel Lemire, Leonid Boytsov, and Nathan Kurz. 2016. SIMD compression and the intersection of sorted integers. Software: Practice and Experience 46, 6 (2016), 723–749.
[13] Santosh Pandey, Xiaoye Sherry Li, Aydin Buluc, Jiejun Xu, and Hang Liu. 2019. H-INDEX: Hash-indexing for parallel triangle counting on GPUs. In 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[14] Adam Polak. 2016. Counting triangles in large graphs on GPU. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 740–746.
[15] Vijayshankar Raman, Lin Qiao, Wei Han, Inderpal Narang, Ying-Lin Chen, Kou-Horng Yang, and Fen-Ling Ling. 2007. Lazy, adaptive rid-list intersection, and its application to index anding. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 773–784.
[16] Sidney Ribeiro-Junior, Rafael David Quirino, Leonardo Andrade Ribeiro, and Wellington Santos Martins. 2017. Fast parallel set similarity joins on many-core architectures. Journal of Information and Data Management 8, 3 (2017), 255–255.
[17] Thomas Schank and Dorothea Wagner. 2005. Finding, counting and listing all triangles in large graphs, an experimental study. In International Workshop on Experimental and Efficient Algorithms. Springer, 606–609.
[18] Leyuan Wang, Yangzihao Wang, Carl Yang, and John D. Owens. 2016. A comparative study on exact triangle counting algorithms on the GPU. In Proceedings of the ACM Workshop on High Performance Graph Processing. 1–8.
[19] Nan Wang, Jingbo Zhang, Kian-Lee Tan, and Anthony K. H. Tung. 2010. On triangulation-based dense neighborhood graph discovery. Proceedings of the VLDB Endowment 4, 2 (2010), 58–68.
[20] Di Wu, Fan Zhang, Naiyong Ao, Fang Wang, Xiaoguang Liu, and Gang Wang. 2009. A batched GPU algorithm for set intersection. In 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks. IEEE, 752–756.
[21] Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, and Jing Liu. 2010. Efficient lists intersection by CPU-GPU cooperative computing. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW). IEEE, 1–8.