<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Evaluation of Large Set Intersection Techniques on GP Us</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christos Bellas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasios Gounaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aristotle University of Thessaloniki Thessaloniki</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <fpage>16</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>In this paper, we focus on large set intersection, which is a pivotal operation in information retrieval, graph analytics and database systems. We aim to experimentally detect under which conditions, using a graphics processing unit (GPU) is beneficial over CPU techniques and which exact techniques are capable of yielding improvements. We cover and adapt techniques initially proposed for graph analytics, while we investigate new hybrids for completeness. Finally, we present a comprehensive evaluation highlighting the main characteristics of the techniques examined when both a single pair of two large sets are processed and all pairs in a dataset are examined.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Set intersection is an essential operation in numerous application
domains, such as information retrieval for text search engines
using an inverted index [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ], graph analytics for triangle counting
and community detection [
        <xref ref-type="bibr" rid="ref11 ref17 ref19">11, 17, 19</xref>
        ], and database systems for
merging RID-lists [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and performing fast (potentially bitwise)
operations on data in columnar format [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Modern GPUs ofer a high-level parallel environment at low
cost. As a result, over the past years, there has been a considerable
research work on improving graph analytics on a GPU, mostly
in the context of graph triangle counting, where set intersection
dominates the running time [
        <xref ref-type="bibr" rid="ref10 ref13 ref18 ref4 ref7 ref8">4, 7, 8, 10, 13, 18</xref>
        ]. The majority
of these studies focus on improving the level of parallelism by
reducing redundant comparisons and distributing the workload
evenly among GPU threads. Set intersection on GPUs has also
been examined in the context of set similarity joins [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In this work, we compare and evaluate state-of-the-art GPU
techniques for set intersection, when the sets are large, i.e., they
contain millions of elements. More specifically, we gather
together techniques belonging to five diferent rationales, namely
intersect path, optimized binary search, hash, bitmap and set
similarity join-based solutions, while we include novel promising
hybrid solutions that combine existing techniques to better fit
into our setting.</p>
      <p>We perform experiments both when the intersection of one
pair of sets (single-instance problem) and the intersections among
all set pairs in a database are computed (multi-instance problem).
We derive useful insights, and more specifically we provide
experimental evidence about the superiority of two intersection
path flavors for the single-instance problem and of either
bitmapbased or similarity join-based solutions for the multi-instance
problem depending on the size of the sets.1</p>
      <p>The rest of the paper is organized as follows. Section 2 gives
a background on the problem and an overview of the related
work. Section 3 describes the evaluated GPU techniques for set
intersection. In Section 4, we present the experiments and discuss
1The source code is available at https://github.com/chribell/set_intersection
the results. Last, in Section 5 we conclude and discuss possible
future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>PRELIMINARIES</title>
      <p>In this section, we first give a formal notation for the set
intersection problem. Next, we give an overview of the related work
about set intersection after explaining the exact scope of this
paper.
2.1</p>
    </sec>
    <sec id="sec-3">
      <title>Set Intersection Problem Description</title>
      <p>Single-instance set intersection (SISI) problem: given i) a finite
universe of elements , and ii) a set  = {1, . . . ,  } of size
 and a set  = {1, . . . ,  } of size , where {, } ∈ , set
intersection ∩ produces a new set  containing all the common
elements among  and .</p>
      <p>Multi-instance set intersection (MISI) problem: in the
multiinstance set intersection problem, we are given a collection of
 sets  = {1, . . . ,  }, and we aim to find the set intersection
among all 2 of pairs ( ,  ), where 0 &lt;  &lt;  ≤ .</p>
      <p>
        Obviously, the solutions of SISI are of Ω( + ) complexity,
whereas MISI solutions are of quadratic complexity in  without
allowing for any pair pruning, as, for example, in problems such
as set similarity joins [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, the focus is not on evaluating
the behavior of algorithms difering in asymptotic complexity,
since all techniques are of similar complexity; rather, we aim to
assess the impact of diferent potentially low-level engineering
techniques when  and  are at the order of millions and the
size of  is orders of magnitude larger.
2.2
      </p>
    </sec>
    <sec id="sec-4">
      <title>Scope of our work</title>
      <p>Over the past years, there has been a lot of research that
encapsulates GPU techniques for the set intersection problem, as will
be discussed shortly. In the majority of cases, the problem of
set intersection is tackled in the context of triangle counting to
accelerate graph analytics. Triangle counting is a special case of
set intersection, which stems from the need to find quickly
intersection counts among vertex adjacency lists of small lengths. As
a result, the techniques proposed are tailored to specific
algorithmic optimizations. In the context of this work, we extract, adapt
and evaluate these techniques under a more demanding large
set intersection scenario performing also a comparison against
all methodologies proposed that can address the SISI and MIMI
problems for large sets.
2.3</p>
    </sec>
    <sec id="sec-5">
      <title>Overview of existing solutions</title>
      <p>
        We present the relevant techniques mostly in chronological
order. Ding et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] were the first to implement a parallel GPU
set intersection algorithm. In their proposed solution, they use
the so-called Parallel Merge Find (PMF) algorithm to compute a
set intersection. In essence, given two sets, the shorter one is
partitioned into disjoint segments, with each segment assigned
to a diferent GPU thread. Then, the last element of each segment
is searched in the longer set to find the corresponding or closest
positions for the partitioning of the longer set. As a result, each
GPU thread becomes capable of computing its own intersection
among segments in parallel. In [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], Wu et al. highlight the
inefifciency of the approach followed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for sets of smaller size.
Subsequently, they propose a GPU technique for set intersection
that takes advantage of the fast on-chip shared memory. First, an
empty array of size equal to the size of the shorter set is allocated
in shared memory. Next, each GPU thread is assigned with a
specific number of elements from the shorter set and conducts
binary searches in the longer set. If an element is found, its
corresponding cell in the shared memory array is set to 1. Finally, a
scan operation on the shared memory array is performed along
with a stream compaction procedure, so that threads produce the
ifnal intersection. In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Wu et al. extend their original work
described above by introducing heuristic strategies to balance
the workload across thread blocks more eficiently. Additionally,
the proposal in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] further extends the original work in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] by
introducing a linear regression approach to reduce the search
range of a binary search and a hash segmentation approach as
an alternative to binary search.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Green et al. present the Intersect Path (IP) algorithm for
set intersection, which is a variation of the well established GPU
Merge Path algorithm for merging lists. In addition, IP can be
considered as an extension of PMF since it encapsulates a similar
set partition logic. First, input sets are partitioned into segments
that can be intersected independently by multiple thread blocks.
Second, for each thread block, workload is distributed in such a
way that near equal number of elements to intersect are allocated
to threads. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors extend the work of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by proposing
an adaptive load balancing technique and dynamically assign
work to GPU threads based on work estimations. The authors
of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] follow a similar approach to set intersection.
More specifically, their main concept is, for each GPU thread, to
sequentially compute a set intersection by using a two-pointers
merge algorithm. In the context of triangle counting, such an
approach is applicable since the requirement is (i) for each GPU
thread to compute a two-set intersection count independently,
and (ii) these partial counts to be accumulated to compute the
ifnal global count. However, in the setting of set intersection over
two large sets, their solution is inferior and similar to a sequential
CPU approach.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Bisson et al. propose a diferent approach, namely, GPU
set intersection to be based on bitmaps and atomic operations.
Given two sets, to compute their intersection, the GPU threads
create the bitmap representation of the first set in parallel and
then, iterate over the elements of the second set to search for
the corresponding set bits. Based on the average set size, the
workload allocation is per thread, per warp or per block.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Hu et al. demonstrate that set intersection is faster
employing eficient binary search-based techniques than IP-based
techniques, arguing that the latter sufer from nontrivial overhead
of partitioning the input sets and non-coalesced memory accesses.
On the other hand, their proposed algorithm optimizes binary
search at a warp level to achieve coalesced memory access and to
alleviate the need for workload balancing. In addition, by caching
the first levels of the binary search tree they employ in the shared
memory, they can achieve additional speedup gains.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Pandey et al. propose a hash-based technique for set
intersection. In brief, first the algorithm hashes the shorter set
into buckets, and then iterates over the larger set and hashes
each element to the corresponding bucket. Afterwards, a linear
search is conducted within each bucket to find the intersections.
9 11 12 13 16 18 19 22
5
      </p>
      <p>Count Augmentation</p>
      <p>
        Last, in the context of set similarity join, the authors of [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
propose a technique for conducting set intersections by using
a static inverted index and atomic operations. By setting the
similarity degree equal to zero, their solution can be used to solve
the SISI and MISI problems.
      </p>
      <p>
        As already mentioned, some of these works, such as [
        <xref ref-type="bibr" rid="ref10 ref14 ref18 ref4 ref8">4, 8,
10, 14, 18</xref>
        ] are more complete proposals targeting the triangle
counting problem; here we narrow down our attention only to
the part that is relevant to the SISI and MISI problems.
3
      </p>
    </sec>
    <sec id="sec-6">
      <title>TECHNIQUES</title>
      <p>
        In the previous section, we have identified five diferent
methodologies, based on (1) IP, (2) optimized binary search, (3) bitmap
operations, (4) hashing and (5) set similarity joins, respectively.
Here, we present five diferent state-of-the-art GPU techniques
for set intersection, one for each methodology, in more detail. In
addition, we discuss some modifications to these techniques to
target large set intersection yielding two novel hybrid solutions.
We do not consider, works such as [
        <xref ref-type="bibr" rid="ref14 ref18">14, 18</xref>
        ] that mostly rely on the
preprocessing of input data and conduct intersection with a
simple merge fashion algorithm. In contrast, we evaluate techniques
that are more adaptable in a general set intersection scenario. We
give a concise presentation for each evaluated technique below.
      </p>
      <p>
        Intersect Path (IP) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Given two ordered sets  and , IP
considers the traversal of a grid, noted as Intersect Matrix, of size
|| × | |. Beginning from the top left corner of the grid, the path
can move downwards if [] &gt;  [  ], to the right if [] &lt;  [  ],
and diagonally if [] =  [  ], until it eventually reaches the
bottom right corner. There are two partitioning stages, one on
kernel grid level and the other on block level. On grid level,
equidistant cross diagonals are placed on the Intersect Matrix.
The number of diagonals is equal to the number of thread blocks
plus one, in order to delimit the boundaries of each block. Using
binary search, the point of intersection between a cross diagonal
and the path is found. As a result, each thread block is assigned
to intersect disjoint subsets of the input. In case a cross diagonal
intersects with the path inside a matrix cell, a count augmentation
is required beforehand, since this intersection is not assigned to
any thread block. An example of this scenario is shown in Figure 1.
On block level, the same cross diagonal approach applies in order
to distribute workload among threads. Each thread may conduct
a serial or binary search based intersection.
      </p>
      <p>
        Optimized Binary Search (OBS) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Given two ordered sets
 and , OBS caches the first levels of the binary search tree
of the larger set in shared memory to reduce expensive global
memory reads. For example, as shown in Figure 2, the higher
level nodes of the binary tree reside in shared memory, whereas
the leafs are located in global memory. The smaller set, which
is  in the example, is used for lookup and as a result there are
| | total binary search lookups. Each thread is assigned a lookup
and, in each iteration, up to 32 (i.e., the size of warp) lookups
are executed simultaneously. For the multi-instance problem, a
simple optimization is to cache each set to shared memory one
at a time, and then iterate every subsequent one to perform the
intersection for every pair. This requires sets to be sorted by their
size.
      </p>
      <p>
        Hash-based Intersection (HI) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Given two sets  and , HI
ifrst hashes the shorter set and constructs buckets in parallel,
and then, iterates and hashes every element of the larger set into
the corresponding bucket as already described in Section 2. The
initial hashing of the smaller set is preferred in order to reduce
the number of collisions. Buckets are statically allocated once
in linear global memory space. The size of each bucket, i.e. the
number of entries from the shorter set in the bucket, is stored in
shared memory. Thus, to ensure correctness for the MISI problem,
we only need to clear the buckets sizes in shared memory, and
let every next short set hashing overwrite the previous one.
      </p>
      <p>
        Bitmap-based Intersection (BI). Given two sets  and , BI
conducts the set intersection on their bitmap representations,
with each bitmap requiring | |/8 bytes of memory regardless
of the set sizes. More specifically, there are two flavors of BI,
namely (i) naive, where all bitmaps fit in global memory, and
(ii) dynamic, where bitmaps are built on the fly in the case that
we cannot store all of them in the available global memory. The
latter is similar to the work of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Bitmaps are constructed in
parallel using the atomicOr function. In the naive scenario, each
GPU thread fetches two bitmap words, one from each set, and
conducts a popcount operation on the resulting logical AND
word to compute the subset intersection. An example of the
naive scenario is shown in Figure 3. In the dynamic scenario,
each GPU thread block first constructs the bitmap representation
of the current set, and then, for every next set, the threads iterate
over its elements and check whether the respective bits are set
(see the example in Figure 4).
      </p>
      <p>
        Static-index Intersection (sf-gssjoin) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Given  sets, sf-gssjoin
ifrst constructs a static inverted index over all set elements. For
each set, the inverted lists for all its elements are concatenated
in a logical vector. Next, this vector is partitioned evenly with
each partition assigned to a specific GPU thread. Thus, each
thread processes independently its corresponding partition and
contributes to the intersections of the current set with the next
ones. To ensure correctness, threads increment the intersection
counts by using atomic operations. An example of sf-gssjoin is
shown in Figure 5.
      </p>
      <p>All of the presented techniques with the exception of BI-naive
and IP, conduct a single instance set intersection on a single block
or warp. This results in severe GPU underutilization, especially
for sets of size in the order of millions, since a single execution
unit is assigned with the complete workload. To tackle this issue,
we investigate the integration of the kernel grid level partitioning
of IP with OBS and HI, which have not been proposed previously
in the literature. We denote these novel hybrid techniques as
IPOBS and IP-HI respectively. For the single instance scenario, in
BI-dynamic, we modify the algorithm by constructing the bitmap
of the first set across multiple blocks and then we partition evenly
the second set into disjoint subsets with each one assigned to a
specific block.
4</p>
    </sec>
    <sec id="sec-7">
      <title>EVALUATION</title>
      <p>
        In this section, we evaluate the techniques in the previous section,
while, in our experiments, we also include CPU variants for
completeness. More specifically, we compare the GPU techniques
against three CPU alternatives, namely, (i) SIMD, which performs
intersection on sorted integers using SIMD instructions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], (ii)
std::set_intersection, which uses the C++ standard library and (iii)
boost::dynamic_bitset which uses the Boost library in a similar
fashion as BI-dynamic. We measure the total intersection time
for the CPU techniques using the std::chrono library. For the GPU
operations, we measure the total time, including transfers and
memory allocations, by using the CUDA event API. We do not
measure any preprocess time. The experiments were conducted
on a machine with an Intel i7 5820k clocked at 3.3 GHz, 32 GB
RAM at 2400 MHz and an NVIDIA Titan XP on CUDA 11.0. This
GPU has 30 Streaming Multiprocessors, with a total of 3840 cores,
12 GB of global memory and a 384-bit memory bus width.
      </p>
      <p>
        We experiment with artificial datasets, where set elements
follow the normal, uniform and zipf like distribution. To distinguish
each dataset, we use the notation Distribution-Element
universeAverage set size. We vary the element universe from 108 up to
109. Respectively, we vary the average set size from 106 up to 107.
For the multi-instance scenario, we also experiment with three
real world datasets previously employed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. More specifically,
for each sorted dataset we extract the last  = 10000 sets, i.e. the
largest sets. Table 1 gives an overview of the real-world datasets’
characteristics.
4.1
      </p>
    </sec>
    <sec id="sec-8">
      <title>SISI experiments</title>
      <p>We examine the single instance scenario using twelve artificial
datasets (combinations of 3 distributions, 2 element universe sizes
and 2 set sizes). We have also experimented with several block
sizes, but due to lack of space, we present the results of the best
configuration for each technique.</p>
      <p>As shown in Figure 6, IP and IP-OBS are the clear winners for
the SISI epxeriments, in the sense that one of them is the best
overall performing technique across every dataset. Moreover, the
diferences between their performance is relatively small: up to
12% for the normal and zipf distribution, and up to 35% for the
uniform distribution. Overall, these techniques achieve average
2.3X speedup compared to the best performing CPU technique;
the best performing CPU technique difers between the cases,
but in average, SIMD is the most robust. In general, the speedup
is similar for all distributions types and increases with higher
universe and average set size, reaching 2.92X. The speedup of
IP or IP-OBS over the worst performing CPU or GPU technique
exceeds an order of magnitude; we have observed speedups up
to 48.67X over a CPU technique and up to 10.97X over a GPU
technique.</p>
      <p>In addition, IP-HI exhibits the worst performance regarding
GPU techniques and seems unable to perform better than CPU for
many settings; this shows that simply relying on the IP workload
allocation rationale is not adequate. Finally, BI -based techniques
behave better for smaller universe sizes.
4.2</p>
    </sec>
    <sec id="sec-9">
      <title>MISI experiments</title>
      <p>For the multi-instance evaluation, we conduct two sets of
experiments. In the first one, we use artificially generated datasets
consisting of large sets that follow the zipf distribution, which
is the closest to real world. In the second one, we use real world
datasets and evaluate the intersection techniques on smaller set
sizes.</p>
      <p>As shown in Figure 7, in both artificial datasets with  =
1000, BI-dynamic is the best performing technique and achieves
average 35X speedup compared to the best performing CPU
technique. Furthermore, sf-gssjoin is the second best performing
technique, when manages to launch, i.e. when the static index
ifts in the global memory, which is not the case when the element
universe is 109 in our experiments. On the other hand, IP-OBS
is more robust and its performance is the closest to BI-dynamic
across both datasets. In addition, we observe a 80% performance
increase for the hybrid IP-OBS technique over the standalone
OBS. Last, even though IP and IP-HI are the worst performing
GPU techniques, they manage to achieve average 5.7X speedup
compared to the best performing CPU technique.</p>
      <p>As shown in Figure 8, for real world datasets with  = 10000,
sf-gssjoin is the most eficient technique across every dataset and
achieves average 147X speedup over the best performing CPU
technique. Moreover, the small average set size leads in small
static indices, which yields an optimal workload distribution
among GPU threads and overall, results in better GPU utilization.
In contrast, BI-dynamic, OBS and HI conduct set intersection
at block level. Thus, there is an inherent workload imbalance
among GPU thread blocks. We note that IP, and consequently
its two workload allocation rationales, IP-OBS and IP-HI cannot
always execute due to memory constraints. More specifically,
the required memory to store the diagonals for the grid level
partitioning of IP is 2 × 2 × ( + 1). As a result, there
is an upperbound to the number of blocks to comply with the
global memory restriction. However, due to the small set sizes,
we consider IP partitioning as an excessive approach to conduct
set intersection on these datasets.</p>
      <p>Based on our experimental evaluation, we conclude that for
large MISI problems, BI-dynamic and IP-OBS are preferable. On
the other hand, when dealing with multiple instance small set
intersections, the grid level partitioning of IP adds overhead and
results in severe GPU underutilization. In such cases, standalone
OBS and HI perform better but are surpassed by sf-gssjoin, which
achieves optimal GPU utilization.
5</p>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION</title>
      <p>In this work, we adapt and evaluate GPU set intersection
techniques that were previously applied to graph analytics. Although
these techniques are better suited for intersecting sets of smaller
size, we experimentally show that, through certain
enhancements, they can be easily adapted for large set intersection. We
explain which are the best approaches in the single- and
multiinstance cases, and we introduce a novel hybrid, namely,
combining   with , which proves to be a dominant solution
for the former cases, and competitive in the latter ones. Also,
employing bitmap-based solutions pays of in the multi-instance
case. Finally, if the set sizes are relatively smaller, multi-instance
set intersection stands to benefit from adapting a set similarity
join technique.</p>
      <p>Acknowledgments. The research work was supported by the
Hellenic Foundation for Research and Innovation (HFRI) under
the HFRI PhD Fellowship grant (Fellowship Number: 1154). In
addition, the authors gratefully acknowledge the support of NVIDIA
through the donation of the GPU (GPU Grant Program).
4000
3500
3000
2500
2000
1500
1000
) 500
s
c
e
s
(e 70
iTm 60
50
40
30
20
10
0
500
400
300
200
iTm 8
6
4
2
0
sf-gssjoin
BI-dynamic
IP</p>
      <p>OBS
IP-OBS
IP-HI</p>
      <p>SIMD
std::set_intersection
boost::dynamic_bitset
sf-gssjoin
BI-dynamic
OBS
HI
SIMD
std::set_intersection
boost::dynamic_bitset
X
zipf-109-106</p>
      <p>Dataset
zipf-108-106</p>
      <p>TWITTER
Dataset</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Naiyong</given-names>
            <surname>Ao</surname>
          </string-name>
          , Fan Zhang, Di Wu, Douglas S Stones, Gang Wang, Xiaoguang Liu, Jing Liu, and
          <string-name>
            <given-names>Sheng</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Eficient parallel lists intersection and index compression algorithms using graphics processing units</article-title>
          .
          <source>Proceedings of the VLDB Endowment 4</source>
          ,
          <issue>8</issue>
          (
          <year>2011</year>
          ),
          <fpage>470</fpage>
          -
          <lpage>481</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jérémy</given-names>
            <surname>Barbay</surname>
          </string-name>
          , Alejandro López-Ortiz, and
          <string-name>
            <given-names>Tyler</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Faster adaptive set intersections for text searching</article-title>
          .
          <source>In International Workshop on Experimental and Eficient Algorithms</source>
          . Springer,
          <fpage>146</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Christos</given-names>
            <surname>Bellas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anastasios</given-names>
            <surname>Gounaris</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>An empirical evaluation of exact set similarity join techniques using GPUs</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>89</volume>
          (
          <year>2020</year>
          ),
          <fpage>101485</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Mauro</given-names>
            <surname>Bisson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Massimiliano</given-names>
            <surname>Fatica</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>High performance exact triangle counting on gpus</article-title>
          .
          <source>IEEE Transactions on Parallel and Distributed Systems</source>
          <volume>28</volume>
          ,
          <issue>12</issue>
          (
          <year>2017</year>
          ),
          <fpage>3501</fpage>
          -
          <lpage>3510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Erik</surname>
            <given-names>D</given-names>
          </string-name>
          <string-name>
            <surname>Demaine</surname>
            , Alejandro López-Ortiz, and
            <given-names>J Ian</given-names>
          </string-name>
          <string-name>
            <surname>Munro</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Experiments on adaptive set intersections for text retrieval systems</article-title>
          .
          <source>In Workshop on Algorithm Engineering and Experimentation</source>
          . Springer,
          <fpage>91</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Shuai</given-names>
            <surname>Ding</surname>
          </string-name>
          , Jinru He,
          <string-name>
            <surname>Hao Yan</surname>
            , and
            <given-names>Torsten</given-names>
          </string-name>
          <string-name>
            <surname>Suel</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Using graphics processors for high performance IR query processing</article-title>
          .
          <source>In Proceedings of the 18th international conference on World wide web. 421-430.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>James</given-names>
            <surname>Fox</surname>
          </string-name>
          , Oded Green, Kasimir Gabert,
          <source>Xiaojing An, and David A Bader</source>
          .
          <year>2018</year>
          .
          <article-title>Fast and adaptive list intersections on the gpu</article-title>
          .
          <source>In 2018 IEEE High Performance extreme Computing Conference (HPEC)</source>
          .
          <source>IEEE</source>
          , 1-
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Oded</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pavan</given-names>
            <surname>Yalamanchili</surname>
          </string-name>
          , and
          <string-name>
            <surname>Lluís-Miquel Munguía</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Fast triangle counting on the GPU</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms</source>
          . 1-
          <fpage>8</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jiawei</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Micheline</given-names>
            <surname>Kamber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Pei</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Data Mining: Concepts and Techniques, 3rd edition</article-title>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Yang</surname>
            <given-names>Hu</given-names>
          </string-name>
          , Hang Liu, and
          <string-name>
            <given-names>H Howie</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Tricore: Parallel triangle counting on gpus</article-title>
          . In SC18:
          <article-title>International Conference for High Performance Computing, Networking, Storage and Analysis</article-title>
          . IEEE,
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Xin</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Hong Cheng, Lu Qin, Wentao Tian, and Jefrey Xu Yu.
          <year>2014</year>
          .
          <article-title>Querying k-truss community in large and dynamic graphs</article-title>
          .
          <source>In Proceedings of the 2014 ACM SIGMOD Int. Conf. on Management of data. 1311-1322.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Lemire</surname>
          </string-name>
          , Leonid Boytsov, and
          <string-name>
            <given-names>Nathan</given-names>
            <surname>Kurz</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>SIMD compression and the intersection of sorted integers</article-title>
          .
          <source>Software: Practice and Experience</source>
          <volume>46</volume>
          ,
          <issue>6</issue>
          (
          <year>2016</year>
          ),
          <fpage>723</fpage>
          -
          <lpage>749</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Santosh</surname>
            <given-names>Pandey</given-names>
          </string-name>
          , Xiaoye Sherry Li,
          <string-name>
            <given-names>Aydin</given-names>
            <surname>Buluc</surname>
          </string-name>
          , Jiejun Xu, and Hang Liu.
          <year>2019</year>
          . H-INDEX:
          <article-title>Hash-Indexing for Parallel Triangle Counting on GPUs</article-title>
          .
          <source>In 2019 IEEE High Performance Extreme Computing Conference (HPEC)</source>
          .
          <source>IEEE</source>
          , 1-
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Polak</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Counting triangles in large graphs on GPU</article-title>
          .
          <source>In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)</source>
          . IEEE,
          <fpage>740</fpage>
          -
          <lpage>746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Vijayshankar</surname>
            <given-names>Raman</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>Qiao</given-names>
          </string-name>
          , Wei Han, Inderpal Narang,
          <string-name>
            <surname>Ying-Lin</surname>
            <given-names>Chen</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kou-Horng Yang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Fen-Ling Ling</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Lazy, adaptive rid-list intersection, and its application to index anding</article-title>
          .
          <source>In Proceedings of the 2007 ACM SIGMOD international conference on Management of data. 773-784.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Sidney</given-names>
            <surname>Ribeiro-Junior</surname>
          </string-name>
          , Rafael David Quirino, Leonardo Andrade Ribeiro, and Wellington Santos Martins.
          <year>2017</year>
          .
          <article-title>Fast parallel set similarity joins on manycore architectures</article-title>
          .
          <source>Journal of Information and Data Management</source>
          <volume>8</volume>
          ,
          <issue>3</issue>
          (
          <year>2017</year>
          ),
          <fpage>255</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Schank</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dorothea</given-names>
            <surname>Wagner</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Finding, counting and listing all triangles in large graphs, an experimental study</article-title>
          .
          <source>In International workshop on experimental and eficient algorithms</source>
          . Springer,
          <fpage>606</fpage>
          -
          <lpage>609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Leyuan</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yangzihao</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carl Yang</surname>
            ,
            <given-names>and John D</given-names>
          </string-name>
          <string-name>
            <surname>Owens</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A comparative study on exact triangle counting algorithms on the gpu</article-title>
          .
          <source>In Proceedings of the ACM Workshop on High Performance Graph Processing. 1-8.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Nan</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Jingbo Zhang,
          <string-name>
            <surname>Kian-Lee Tan</surname>
          </string-name>
          , and
          <source>Anthony KH Tung</source>
          .
          <year>2010</year>
          .
          <article-title>On triangulation-based dense neighborhood graph discovery</article-title>
          .
          <source>Proceedings of the VLDB Endowment 4</source>
          ,
          <issue>2</issue>
          (
          <year>2010</year>
          ),
          <fpage>58</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Di</surname>
            <given-names>Wu</given-names>
          </string-name>
          , Fan Zhang, Naiyong Ao, Fang Wang, Xiaoguang Liu, and
          <string-name>
            <given-names>Gang</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>A batched gpu algorithm for set intersection</article-title>
          .
          <source>In 2009 10th International Symposium on Pervasive Systems</source>
          , Algorithms, and Networks. IEEE,
          <fpage>752</fpage>
          -
          <lpage>756</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Di</surname>
            <given-names>Wu</given-names>
          </string-name>
          , Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, and Jing Liu.
          <year>2010</year>
          .
          <article-title>Eficient lists intersection by cpu-gpu cooperative computing</article-title>
          .
          <source>In 2010 IEEE International Symposium on Parallel &amp; Distributed Processing, Workshops and Phd Forum (IPDPSW)</source>
          .
          <source>IEEE</source>
          , 1-
          <fpage>8</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>