=Paper=
{{Paper
|id=Vol-2369/short14
|storemode=property
|title=Dynamic Pipelining of Multidimensional Range Queries
|pdfUrl=https://ceur-ws.org/Vol-2369/short14.pdf
|volume=Vol-2369
|authors=Amalia Duch,Daniel Lugosi,Edelmira Pasarella,Cristina Zoltan
|dblpUrl=https://dblp.org/rec/conf/amw/DuchLPZ19
}}
==Dynamic Pipelining of Multidimensional Range Queries==
<pdf width="1500px">https://ceur-ws.org/Vol-2369/short14.pdf</pdf>
<pre>
 Dynamic Pipelining of Multidimensional Range
                   Queries?

      Amalia Duch, Daniel Lugosi, Edelmira Pasarella, and Cristina Zoltan

                       Universitat Politècnica de Catalunya, Spain
                         {duch,edelmira,zoltan}@cs.upc.edu


       Abstract. The problem of evaluating orthogonal range queries efficiently
       has been studied widely in the data structures community. It has been
       common wisdom for several years that for queries containing more than
       20% of the elements of the dataset a linear scanning of the data was
       the most efficient solution. In recent experimental works using modern
       hardware–with main memory and parallelism– the conclusion is that lin-
       ear scan is preferable for almost every query configuration (even contain-
       ing a 1% of the data). In this work we propose an alternative approach to
       evaluate multidimensional range queries based on the dynamic pipeline
       paradigm –using main memory and concurrency. Our aim is to prove that
       under this framework, it is possible to beat the performance of linear
       scanning by the one of hierarchical multidimensional data structures–
       such as kd trees, quad trees, R trees or similar.


1     Introduction

It is a common task nowadays to ask Google Maps for the closest gas station
or TripAdvisor for good restaurants around a specific area. These requests are
examples of a computing task –frequent in a wide range of applications– that is
formally called the Associative Retrieval problem [2, 4]. In associative retrieval
we consider a collection F of n records. Each record (or key) is an ordered k-tuple
(k ≥ 2) x = (x1 , . . . , xk ) of         Q values (the attributes or coordinates of the record)
drawn from domain D = 1≤j≤k Dj , where each Dj is totally ordered.
     A range query over F is the retrieval of those of its records that fall inside
a given region. Specifically, we consider orthogonal range queries Q, in which
the region is specified by a sequence of s unidimensional ranges, this is, Q =
(i1 , l1 , u1 ), . . . , (is , ls , us ), where 1 ≤ s ≤ k, ij 6= ij 0 ∀j 6= j 0 (with 1 ≤ j ≤ s
and 1 ≤ j 0 ≤ s) and every triplet (ij , lj , uj ) fix the lower (lj ) and upper (uj )
boundaries of the unidimensional range for coordinate ij (1 ≤ ij ≤ k). If s = k
then we say that Q is a complete range query. Otherwise, we say that it is partial.
     In order to efficiently deal with orthogonal range queries the storage of the
records in F should be crucial. Extensive collections of general purpose mul-
tidimensional data structures –such as kd-trees, quad-trees or R-trees– have
been proposed theoretically as adequate storage methods [2, 4] to support range
?
    Work supported by grant GRAMM (TIN2017-86727-C2-1-R) and EU FEDER funds.
queries. However, in practice, the usefulness of this approach heavily relies on the
selectivity and configuration of the sequence of range queries and, unfortunately
common wisdom told that a simple scan beats multidimensional data structures
for queries accessing more than 15%-20% of a data collection [5].
    Recently, multidimensional range queries as well as the efficiency of hier-
archical multidimensional data structures to support them have been revisited
under a modern hardware perspective [5]. Moreover, in [5] the authors state that
in current machines –using main memory and parallelisation– data structures
are useless even for very selective range queries an thus, they conclude that in
current machines scanning should be favoured over parallel versions of such data
structures.
    In this work, we propose a new way to parallelise the multidimensional range
query problem using the dynamic pipeline model [1, 3]. Our aim is to prove
that, with our algorithm, the use of hierarchical multidimensional data struc-
tures would be preferable over scanning for range queries containing a sub-linear
number of elements of the collection.


2   Dynamic Pipeline Algorithms

We propose an algorithm based on a dynamic pipeline [1, 3] of processes via
an asynchronous model of computation, synchronised by means of channels. In
general, a dynamic pipeline is a data-driven unidimensional and unidirectional
chain of stages connected by means of data channels. This computational struc-
ture stretches and shrinks depending on the spawning and the lifetime of its
stages. A dynamic pipeline is similar to an ordinary pipeline, except that the
number of stages that it contains is not fixed but dynamically generated at
runtime. In fact, it is self-adaptive to the characteristics of a specific query.
    Algorithms under this paradigm must specify four kind of stages: input, out-
put, generator and filter stages as well as the number and the type of the I/O
unidirectional channels. The input and output stages are the interface of the
pipeline, managing the input and output data respectively. Input data is fed
to the input stage and the output stage will produce results. The generator is
responsible to create the (parameterised) filter stages.
    We now describe two algorithms to solve range queries based on the dynamic
pipeline model: a naı̈ve algorithm equivalent to a linear scan of the whole dataset
followed by the algorithm that we propose, based on a preprocessing of the
dataset by means of data structures such as kd trees, quad trees or R trees. We
will describe both algorithms for a single range query since the same process is
applicable to a sequence of queries iterating on the process for a single one.

Naı̈ve Algorithm. We start by describing a naı̈ve algorithm equivalent to a
concurrent approach of the complete scan of the data set.
   To answer Q using the pipeline approach it is necessary to have a recursive
process that constructs a sequence of filters (processes). Every filter stands for
one of the s unidimensional ranges of Q, let us say j (1 ≤ j ≤ s), and it discards
from further consideration all the points of the data set that are outside range
(ij , lj , uj ), that is, all those points with ij -th coordinate smaller than lj or greater
than uj . Since the query has s ranges, the pipe will end up with s processes acting
concurrently.
     This naı̈ve algorithm starts by setting an initial pipeline consisting of 3 stages
–the input, the generator and the output stages, in this sequential order– and
two channels –the first carry the sequence of triplets of Q and the second the
sequence of points in F.
     The process starts by feeding (in sequential order) the input stage with the
triplets of Q carried by the first channel. The configuration of the pipe evolves
(stretches) as follows. The input stage passes the data from the first channel (a
triplet of the query at a time) to its successor neighbour. At the very beginning
the triplet (i1 , l1 , u1 ) is passed from the input stage to the generator stage.
Every time a triplet arrives to the generator a filter stage, standing for this
triplet, is created as the stage immediately previous to the generator stage. So,
at the very beginning, a filter f1 for the first range of Q is created. The pipeline
consists now of four stages: input, f1 , generator and output, in this order. When
triplet (ij , lij , uij ) passes through the input stage, it passes also through f1 , since
ij 6= i1 , it passes through f2 , . . . , fj , up to arrive to the generator where filter
fj+1 is created. The process continues until the elements carried by the first
channel are all treated and the channel is empty. The pipe now, regarding the
initial pipe, has s additional stages, one per each triplet of the query.
     The next step is to treat the data carried by the second channel, the points.
Every point of the data set passes through the pipe. At every filter fi , as we
already mentioned, if the i-th coordinate of the point is outside the range stored
at fi the point is discarded, otherwise it is passed to next stage. Therefore, all
the points that arrive to the output stage should be reported as part of Q.
     This naı̈ve algorithm will force to read and check every point in the orig-
inal set, having therefore complexity proportional to n, independently of the
configuration of Q.

Proposed Algorithm. To improve the efficiency of the naı̈ve algorithm we
propose a preprocessing of the dataset by means of a hierarchical multidimen-
sional data structure. The data structure can be any of the classical ones [2,
4] (such as kd trees, quad trees, R trees, etc.) with the unique requirement
that it divides the domain D of the points into a partition of m k-dimensional
hyperrectagles that are called bounding boxes, where m ≥ 1 is the number of
elements in the partition and depends on the kind of tree used and the num-
ber of levels of that tree. Each bounding box BB is defined by a sequence of
k ranges, this is, BB = (1, l1 , u1 ), . . . , (k, lk , uk ). In our preliminar experiments
we use quad tries [2, 4] to preprocess the data points and to end up with a
sequence S = BB1 , . . . , BBm of bounding boxes. It is worth noting that the
dynamic pipeline algorithm works identically with any other multidimensional
data structure that fits the previous requirement.
    Now, instead of directly filtering points, the pipe will filter, first, the sequence
S of bounding boxes produced by the data structure (it will have then 3 channels
instead of two, as before). The bounding boxes will pass through the pipe by
their specification and not by the points that they contain. Every filter stage,
checking for the i-th range of the query, will discard from further consideration
all the bounding boxes whose i-th range does not intersect the i-th range of the
query and will pass the intersecting bounding boxes to next stage. Additionally,
the filtering of bounding boxes divides them into two groups: BBC (the group
of bounding boxes which are completely contained into Q) and BBP (the group
of bounding boxes that intersect Q but are not contained in it).
    The algorithm will output all points that are inside of BBC bounding boxes
(since they are all in Q) and it will look and filter all the points in BBP bounding
boxes to decide whether they are in Q.
    The total number of treated points corresponds to the number of points con-
tained in the BBC and BBP bounding boxes, which can be considerably less than
the total number of points in F (but this highly depends on the configuration of
the queries, data points, and chosen data structure). Additionally, the algorithm
incurs in the cost of checking the m bounding boxes of S. Our proposal is to
maintain this cost negligible compared to the number of points that have to be
checked by choosing correctly the number of levels of the tree data structure
during the preprocessing of the data.


3   Ongoing Work

We have implemented quad tries in the C++ programming language produc-
ing with this program the sequence of m files that containing the points inside
the corresponding bounding box in sequence S. Besides, we have implemented
the pipeline in the Go programming language (because of its mechanisms of
go-routines and channels). Our preliminar experiments show that our proposed
algorithm beats the naı̈ve one treating systematically half the points of the data
set for queries containing up to a 25% of the points of F. We plan to conduct
further experiments according to the following guidelines: (a) Considering huge
datasets and allocating their corresponding (tree-like) hierarchical representa-
tions in the RAM, we envision that the performance of our algorithm overcomes
the results presented in [5] under similar conditions and thus, it overcomes the
complexity of linear scanning. We will study, then, the incidence of stressing the
population of the memory in order to find insights regarding the percentage of
memory that can be used for storage purposes without affecting the performance
of the schedule and the memory management of the Go system; (b) Under the
premise that the set of points is uniformly distributed, we plan to measure the
incidence of the chosen level of the data structure (and thus of the number m
of bounding boxes to be filtered) on the performance of our algorithm; (c) The
dimensionality of the data increases the parallelism of our algorithm –which
depends on the query– so we are interested in studying how, eventually, our al-
gorithm is more suitable than other proposals in high dimensional settings; (d)
Finally, in order to study the scalability and real applicability of our model, we
plan to conduct our experiments with big real datasets and benchmarks.
References
1. J. Aráoz and C. Zoltan. Parallel triangles counting using pipelining. CoRR,
   abs/1510.03354, 2015.
2. V. Gaede and O. Günther. Multidimensional access methods. ACM Computing
   Surveys, 30(2):170–231, 1998.
3. E. Pasarella, M-E. Vidal, and C. Zoltan. Comparing mapreduce and pipeline im-
   plementations for counting triangles. Electronic proceedings in theoretical computer
   science, 237:20–33, 2017.
4. H. Samet. Foundations of Multidimensional and Metric Data Structures.
5. S. Sprenger, P. Schäfer, and U. Leser. Multidimensional range queries on modern
   hardware. In Proceedings of the 30th International Conference on Scientific and
   Statistical Database Management, SSDBM 2018, Bozen-Bolzano, Italy, July 09-11,
   2018, pages 4:1–4:12, 2018.

</pre>