An Adaptive Approximate Algorithm For Join

                                                    Oxana Dolmatova
                                             Saint Petersburg State University
                                           oxana.dolmatova@gmail.com
                                                 Advisor: Boris Novikov


                        Abstract                                which gives an advantage in the sense that we can
                                                                pipeline a result, even without getting all the data.
This paper describes the initial state of                            Also a new algorithm will be approximate; it means
development adaptive approximate algorithm for                  that the quality of the final result depends on the
join combining three modifications of the                       amount of allocated resources. Or rather, in this case,
traditional techniques                                               the term approximate means incomplete result that
    The main result of the research will be the new             is some of relevant output tuples may be missed in the
algorithm suitable for pipelined inputs, a cost                 result and all pairs that will be included definitely will
model for this algorithm as well as experimental                satisfy the join predicate.
evaluation to estimate the accuracy of cost model                    The exact definition and method of measuring the
and identification the various properties of the                quality of result we are planning for the future work.
algorithm.                                                      Considering quality very tentatively, it looks like the
                                                                precision. We use the term precision in the usual sense:
1 Introduction                                                  it is the ratio between obtained tuples and all the pairs
                                                                that fit the predicate.
In this paper we consider query evaluation in the                    The abstract concept of term “resource” may include
distributed heterogeneous systems, real-time systems,
                                                                several different performance measures of a query
mobile systems or environment with data 1 of varying
                                                                execution. Usually the most important is the execution
intensity. In all these cases we would like to perform
                                                                time (either CPU or elapsed), amount of I/O, or, in a
queries with in predictable response time and perhaps
                                                                distributed mobile environment, the battery energy.
not exactly because of lack of resources or time                     That is, with limited resources, the answer will be
constraints.                                                    incomplete. However, those properties provide control
    In contrast with traditional databases this context
                                                                trade-off between quality and cost.
creates additional challenges.
                                                                     The following section overviews the different
    As a consequence it is clear that the approximate
                                                                variation of join algorithms such as adaptive and
algorithms for computationally intensive operations are
                                                                similarity schemes. Modifications of the three standard
unavoidable in modern environments.                             algorithms for join you can find in part 3. The section 4
    In the context of complex queries, join is one of the       introduces a small description and main idea of the new
most important and intensively used operations in the
                                                                join technique. The last part of this paper offers our
users’ search operations.
                                                                conclusions and certain ideas for future work.
    There are three types of algorithms in traditional
databases for join and all three algorithms are designed
to obtain the exact answer. Adaptive execution of join          2 ReleatedWork
will combine all the benefits of the existing standard          The inspiration of this work was the work of [5]. It
algorithms, whether sorted inputs, the presence of              formed the basis of the main idea of the new algorithm
permanent indexes for one or two threads or a different         for the approximate join.
input sizes of the unsorted pipelines.                              Author is building an effective technique for
    It is significant that we call the input data - pipelined   adaptive query, based on the standard algorithms, and
data, there is one difference between stream and                taking the best of all of them. In contrast to this paper
pipelined data. To use our adaptive algorithm we have           the algorithm is not designed to work with the streams
to know the right and left streams’ sizes to choose the         and quality management through resources.
right strategy. Nevertheless the processing of incoming             An adaptive algorithm for similarity join evaluation
objects will be the same as work with streaming data,           can be found in [6].
                                                                    The proposed technique is designed similar to those
                                                                and suitable to the pipelined architecture. It means that
Proceedings of the Ninth Spring Researcher's Colloquium         the algorithm is incremental and begins returning results
On Database and Information Systems SYRCoDIS,                   even before streams have completely been consumed.
Kazan, Russia, 2013
    This method is an aggregation of the best from             similar behavior to the standard hash join with some
SHJoin [9] and SSJoin [2].                                     peculiarities.
    An adaptive nested loop-based algorithm for stream             First of all the normal join uses only one hash table
input was presented in [1]. The main advantage over            to calculate its results, while the symmetric hash join
traditional join techniques is that this algorithm can start   uses two tables, one for each input. Second, the normal
producing join results as soon as the first input object       join is a blocking operator and as a consequence not
are available, thus improving pipelining by smoothing          suitable for pipelined architectures.
join result production and by masking source or                     To begin yielding results it has to entirely build a
network delays.                                                hash table over one of the inputs, meaning that no result
    Then the authors present a sophisticated version of        is returned before at least one of the tables has been
the algorithm suitable for multiple input streams.             completely read. The symmetric hash join on the other
Thereby this research exploits temporary delays when           hand is a non blocking operator. The two hash tables are
new data is not available for processing data obtained.        built in parallel while reading the objects from both
    A stream input for join algorithm with emphasis on         inputs.
the limited internal memory was discussed in [3].                  The algorithm is able to return result as soon as first
    The authors consider the task of sliding windows           objects are coming from both streams. It does not need
join. In case of resource shortage, tuples have to be          to wait for complete input.
dropped before they expire thus this algorithm can be
called approximate. The same authors propose a way to          3.2 Nonblocking merge-based join
estimate the error of the executed operation.                  That is probably the easiest algorithm to modify to
    The join algorithm suitable for integration was            approximate mode. There are sorted input pipelines, and
introduced in [7]. The authors propose the trade-offs          their size. For the approximate performance we stop the
between the completeness of a join result, and its             execution when the resources allocated to the operation
execution efficiency: users can choose a faster                (for example, time) running out.
execution, at the price of missing more matches,                   It should be noted that if the input data are not
resulting in a lower result completeness; or a more            sorted, this algorithm cannot be used because it is not
complete join result, at the cost of lower performance.        possible to sort the pipelined data.
     That proposal looks like one of our future goals. At
the same time, the term adaptive means no variation            3.3 Nonblocked nested loop-based join
between the three standard algorithms for join, and
aggregation exact join and similarity join, what               Since we have a normal, two-input, join, respectively,
distinguishes this paper from our own.                         there are also two cycles in nested loop. Here it is
    The [10] introduces the adaptive hash join algorithm       necessary to consider the following cases.
in connection with a problem of multiuser environment.             1. There are the pipelines without any index and
Authors present a modified join technique that is                      the one of the threads is much smaller than the
designed to work with dynamic changes in the amount                    other and we have the sufficient amount of
of available memory.                                                   resources. Then, for each obtained object in the
    A general aim of this work is to regulate resource                 big pipeline algorithm takes a loop with already
usage and to provide the way that allows query                         read data from small input, until allocated time
execution to run concurrently with other applications.                 runs out.
    Therefore there is a wide range of very different              2. There are two small pipelines and we have
algorithms for the join operation. However almost all of               sufficient amount of resources to read and
them are tailored for either similarity join or to join for            process it. Hence, it will be exact query
stream input. As well as some of algorithms represent                  execution with exact standard nested loop
an adaptive technique for the exact query execution.               3. There are two pipelines and its sizes are not
    While the ultimate goal of this work is the detailed               allowed to read pipeline completely because of
approximate adaptive algorithm for join.                               lack of resources. Then the algorithm turn scans
                                                                       each obtained object by comparing it with those
                                                                       of the objects from another pipeline, which we
3 Traditional Basic Algoritms                                          have already obtained. When the resources run
Detailed description of the following standard                         out, the algorithm will stop, and we will get an
algorithms with history, options and optimization and                  approximate result.
cost models can be found in [5].
   Here we will describe the properties of modified            4 Description of the new algorithm
algorithms that will be the basis of a new adaptive
algorithm.                                                         The choice of the strategy depends on the properties
                                                               of the input data: how much different in size input
3.1 Hash-based join                                            pipelines and how many pipelines are sorted, and how
                                                               many resources were allocated for the join execution.
We choose the symmetric hash join modification for                 If both pipelines are sorted by the join attribute, then
this section. As introduced in [9] this method has a           the new algorithm is similar to the non blocked merge
                                                               join, referred to in 3.2.
    If only one of the pipelined inputs is sorted, the           And at the end we are going to conduct the
algorithm behaves like an aggregation from the merge         experiments for both estimate the accuracy of the
join and the nested loop. In this case an algorithm has      constructed cost model and to compare the performance
different behaviors depending on the input size and          of different implementations approximate join with our
resources available for the operation:                       algorithm.
          The sorted input fits into the memory                 For example, it will be interesting to see how the
             which allocated for join execution. And we      number of found pairs of answer is depending on the
             have sufficient (that term will be specify in   initial data: sorted and unsorted input pipelines, small
             cost model in the future work) amount of        and large amount of input data, the large difference in
             processing resources which allows read all      the amount of input data or the comparable size of the
             this small input. Then algorithm reads this     right and left thread.
             input and after that reads as many objects          Such experimental results can show that sensible
             from other inputs as can be located in the      savings in join execution time can be achieved in
             remaining memory. The algorithm sorts           practice; at the expense of a modest reduction in result
             read objects from second input and makes        completeness. Those experiments will be assigning to
             sort-merge join of those two sets of objects.   determine the characteristics and properties of the new
             After that if we still have the resources the   algorithm.
             algorithm reads other block of objects from
             second and sorts it etc.                        References
          The sorted inputs can fit into the memory
             but we haven’t sufficient amount of                 [1] Mihaela A. Bornea, Vasilis Vassalos, Yannis
             processing resources. The algorithm                      Kotidis, Antonios Deligiannakis: Adaptive
             receives part of first input and part of                 Join Operators for Result Rate Optimization on
             second input, sorts it and performs sort-                Streaming Inputs.IEEE Trans. Knowl. Data
             merge join of those two sets.                            Eng. (TKDE) 22(8):1110-1125 (2010)
          Any of inputs cannot fit into the memory              [2] S. Chaudhuri, V. Ganti, and R. Kaushik. A
             and we have the sufficient amount of                     primitive operator for similarityjoins in data
             resources. Then the algorithm reads several              cleaning. In L. Liu, A. Reuter, K.-Y. Whang,
             blocks from sorted input and block from                  and J. Zhang,editors, ICDE, page 5. IEEE
             unsorted input sorts this block and makes                Computer Society, 2006.
             sort-merge join. As long as there are               [3] Abhinandan Das, Johannes Gehrke, Mirek
             resources the technique obtains the blocks               Riedewald: Approximate Join Processing Over
             of both inputs and makes merge sort.                     Data Streams. SIGMOD 2003:40-51
     If any of pipelines are not sorted the algorithm’s          [4] Oxana Dolmatova, Anna Yarygina, Boris
    behavior looks like in the previous case. The                     Novikov: Cost Models for Approximate Query
    difference is in the number of objects and loops                  Evaluation               Algorithms. DB&Local
    which algorithms makes and reads in given amount                  Proceedings 2012:20-28
    of resources.                                                [5] Goetz Graefe: New algorithms for join and
                                                                      grouping operations. Computer Science - R&D
                                                                      27(1): 3-27 (2012)
5 Conclusion and future work                                     [6] Roald Lengu, Paolo Missier, Alvaro A. A.
    We are going to implement our algorithm with                      Fernandes, Giovanna Guerrini, Marco Mesiti:
assumption that the algorithm of join will be used in the             Symmetric set hash join: A pipelined
context of [8].                                                       approximate             join         algorithm,
    The optimization is essential for any high-                       http://unina.stidue.net/Universita'%20di%20Ge
performance querying system. Actually, the task of a                  nova/GuerriniG/reports/sshjoinTR.pdf
query optimizer is to choose an algebraic expression of          [7] Paolo Missier, Alvaro A. A. Fernandes, Roald
minimal execution cost among several equivalent                       Lengu, Giovanna Guerrini, Marco Mesiti: Data
expressions. Because of any high-quality optimizer is                 Quality support to on-the-fly data integration
inevitably a cost-based one and, hence, the cost model                using Adaptive Query Processing. SEBD
is one of the critical core components of the optimizer               2009: 213-220
we are going to develop a cost model for our new                 [8] Boris Novikov, Natalia Vassilieva, Anna
algorithm.                                                            Yarygina: Querying big data. CompSysTech
    As the query evaluation is approximate, the quality               2012: 1-10
of the output can be lower than the exact. Hence it is           [9] A. N. Wilschut and P. M. G. Apers. Dataflow
necessary to estimate an adequate quality of the result               query execution in a parallelmain-memory
which will determine the accuracy of the obtained                     environment. In PDIS, pages 68{77. IEEE
answer.                                                               Computer Society,1991
    As a result it will be possible to control trade-off         [10] Hansjörg Zeller, Jim Gray: An Adaptive Hash
between resource and quality.                                         Join Algorithm for Multiuser Environments.
                                                                      VLDB 1990: 186-197