<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Adaptive Approximate Algorithm For Join</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oxana Dolmatova</string-name>
          <email>oxana.dolmatova@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the Ninth Spring Researcher's Colloquium On Database and Information Systems SYRCoDIS</institution>
          ,
          <addr-line>Kazan, Russia, 2013</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saint Petersburg State University Advisor: Boris Novikov</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the initial state of development adaptive approximate algorithm for join combining three modifications of the traditional techniques The main result of the research will be the new algorithm suitable for pipelined inputs, a cost model for this algorithm as well as experimental evaluation to estimate the accuracy of cost model and identification the various properties of the algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In this paper we consider query evaluation in the
distributed heterogeneous systems, real-time systems,
mobile systems or environment with data 1of varying
intensity. In all these cases we would like to perform
queries with in predictable response time and perhaps
not exactly because of lack of resources or time
constraints.</p>
      <p>In contrast with traditional databases this context
creates additional challenges.</p>
      <p>As a consequence it is clear that the approximate
algorithms for computationally intensive operations are
unavoidable in modern environments.</p>
      <p>In the context of complex queries, join is one of the
most important and intensively used operations in the
users’ search operations.</p>
      <p>There are three types of algorithms in traditional
databases for join and all three algorithms are designed
to obtain the exact answer. Adaptive execution of join
will combine all the benefits of the existing standard
algorithms, whether sorted inputs, the presence of
permanent indexes for one or two threads or a different
input sizes of the unsorted pipelines.</p>
      <p>It is significant that we call the input data - pipelined
data, there is one difference between stream and
pipelined data. To use our adaptive algorithm we have
to know the right and left streams’ sizes to choose the
right strategy. Nevertheless the processing of incoming
objects will be the same as work with streaming data,
which gives an advantage in the sense that we can
pipeline a result, even without getting all the data.</p>
      <p>Also a new algorithm will be approximate; it means
that the quality of the final result depends on the
amount of allocated resources. Or rather, in this case,
the term approximate means incomplete result that
is some of relevant output tuples may be missed in the
result and all pairs that will be included definitely will
satisfy the join predicate.</p>
      <p>The exact definition and method of measuring the
quality of result we are planning for the future work.
Considering quality very tentatively, it looks like the
precision. We use the term precision in the usual sense:
it is the ratio between obtained tuples and all the pairs
that fit the predicate.</p>
      <p>The abstract concept of term “resource” may include
several different performance measures of a query
execution. Usually the most important is the execution
time (either CPU or elapsed), amount of I/O, or, in a
distributed mobile environment, the battery energy.</p>
      <p>That is, with limited resources, the answer will be
incomplete. However, those properties provide control
trade-off between quality and cost.</p>
      <p>The following section overviews the different
variation of join algorithms such as adaptive and
similarity schemes. Modifications of the three standard
algorithms for join you can find in part 3. The section 4
introduces a small description and main idea of the new
join technique. The last part of this paper offers our
conclusions and certain ideas for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 ReleatedWork</title>
      <p>
        The inspiration of this work was the work of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It
formed the basis of the main idea of the new algorithm
for the approximate join.
      </p>
      <p>Author is building an effective technique for
adaptive query, based on the standard algorithms, and
taking the best of all of them. In contrast to this paper
the algorithm is not designed to work with the streams
and quality management through resources.</p>
      <p>
        An adaptive algorithm for similarity join evaluation
can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The proposed technique is designed similar to those
and suitable to the pipelined architecture. It means that
the algorithm is incremental and begins returning results
even before streams have completely been consumed.</p>
      <p>
        This method is an aggregation of the best from
SHJoin [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and SSJoin [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        An adaptive nested loop-based algorithm for stream
input was presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The main advantage over
traditional join techniques is that this algorithm can start
producing join results as soon as the first input object
are available, thus improving pipelining by smoothing
join result production and by masking source or
network delays.
      </p>
      <p>Then the authors present a sophisticated version of
the algorithm suitable for multiple input streams.
Thereby this research exploits temporary delays when
new data is not available for processing data obtained.</p>
      <p>
        A stream input for join algorithm with emphasis on
the limited internal memory was discussed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The authors consider the task of sliding windows
join. In case of resource shortage, tuples have to be
dropped before they expire thus this algorithm can be
called approximate. The same authors propose a way to
estimate the error of the executed operation.</p>
      <p>
        The join algorithm suitable for integration was
introduced in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The authors propose the trade-offs
between the completeness of a join result, and its
execution efficiency: users can choose a faster
execution, at the price of missing more matches,
resulting in a lower result completeness; or a more
complete join result, at the cost of lower performance.
      </p>
      <p>That proposal looks like one of our future goals. At
the same time, the term adaptive means no variation
between the three standard algorithms for join, and
aggregation exact join and similarity join, what
distinguishes this paper from our own.</p>
      <p>
        The [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] introduces the adaptive hash join algorithm
in connection with a problem of multiuser environment.
Authors present a modified join technique that is
designed to work with dynamic changes in the amount
of available memory.
      </p>
      <p>A general aim of this work is to regulate resource
usage and to provide the way that allows query
execution to run concurrently with other applications.</p>
      <p>Therefore there is a wide range of very different
algorithms for the join operation. However almost all of
them are tailored for either similarity join or to join for
stream input. As well as some of algorithms represent
an adaptive technique for the exact query execution.</p>
      <p>While the ultimate goal of this work is the detailed
approximate adaptive algorithm for join.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Traditional Basic Algoritms</title>
      <p>
        Detailed description of the following standard
algorithms with history, options and optimization and
cost models can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Here we will describe the properties of modified
algorithms that will be the basis of a new adaptive
algorithm.</p>
      <sec id="sec-3-1">
        <title>3.1 Hash-based join</title>
        <p>
          We choose the symmetric hash join modification for
this section. As introduced in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] this method has a
similar behavior to the standard hash join with some
peculiarities.
        </p>
        <p>First of all the normal join uses only one hash table
to calculate its results, while the symmetric hash join
uses two tables, one for each input. Second, the normal
join is a blocking operator and as a consequence not
suitable for pipelined architectures.</p>
        <p>To begin yielding results it has to entirely build a
hash table over one of the inputs, meaning that no result
is returned before at least one of the tables has been
completely read. The symmetric hash join on the other
hand is a non blocking operator. The two hash tables are
built in parallel while reading the objects from both
inputs.</p>
        <p>The algorithm is able to return result as soon as first
objects are coming from both streams. It does not need
to wait for complete input.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Nonblocking merge-based join</title>
        <p>That is probably the easiest algorithm to modify to
approximate mode. There are sorted input pipelines, and
their size. For the approximate performance we stop the
execution when the resources allocated to the operation
(for example, time) running out.</p>
        <p>It should be noted that if the input data are not
sorted, this algorithm cannot be used because it is not
possible to sort the pipelined data.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Nonblocked nested loop-based join</title>
        <p>Since we have a normal, two-input, join, respectively,
there are also two cycles in nested loop. Here it is
necessary to consider the following cases.</p>
        <p>1. There are the pipelines without any index and
the one of the threads is much smaller than the
other and we have the sufficient amount of
resources. Then, for each obtained object in the
big pipeline algorithm takes a loop with already
read data from small input, until allocated time
runs out.
2. There are two small pipelines and we have
sufficient amount of resources to read and
process it. Hence, it will be exact query
execution with exact standard nested loop
3. There are two pipelines and its sizes are not
allowed to read pipeline completely because of
lack of resources. Then the algorithm turn scans
each obtained object by comparing it with those
of the objects from another pipeline, which we
have already obtained. When the resources run
out, the algorithm will stop, and we will get an
approximate result.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Description of the new algorithm</title>
      <p>The choice of the strategy depends on the properties
of the input data: how much different in size input
pipelines and how many pipelines are sorted, and how
many resources were allocated for the join execution.</p>
      <p>If both pipelines are sorted by the join attribute, then
the new algorithm is similar to the non blocked merge
join, referred to in 3.2.</p>
      <p>If only one of the pipelined inputs is sorted, the
algorithm behaves like an aggregation from the merge
join and the nested loop. In this case an algorithm has
different behaviors depending on the input size and
resources available for the operation:
 The sorted input fits into the memory
which allocated for join execution. And we
have sufficient (that term will be specify in
cost model in the future work) amount of
processing resources which allows read all
this small input. Then algorithm reads this
input and after that reads as many objects
from other inputs as can be located in the
remaining memory. The algorithm sorts
read objects from second input and makes
sort-merge join of those two sets of objects.</p>
      <p>After that if we still have the resources the
algorithm reads other block of objects from
second and sorts it etc.
 The sorted inputs can fit into the memory
but we haven’t sufficient amount of
processing resources. The algorithm
receives part of first input and part of
second input, sorts it and performs
sortmerge join of those two sets.
 Any of inputs cannot fit into the memory
and we have the sufficient amount of
resources. Then the algorithm reads several
blocks from sorted input and block from
unsorted input sorts this block and makes
sort-merge join. As long as there are
resources the technique obtains the blocks
of both inputs and makes merge sort.</p>
      <p>If any of pipelines are not sorted the algorithm’s
behavior looks like in the previous case. The
difference is in the number of objects and loops
which algorithms makes and reads in given amount
of resources.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and future work</title>
      <p>
        We are going to implement our algorithm with
assumption that the algorithm of join will be used in the
context of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The optimization is essential for any
highperformance querying system. Actually, the task of a
query optimizer is to choose an algebraic expression of
minimal execution cost among several equivalent
expressions. Because of any high-quality optimizer is
inevitably a cost-based one and, hence, the cost model
is one of the critical core components of the optimizer
we are going to develop a cost model for our new
algorithm.</p>
      <p>As the query evaluation is approximate, the quality
of the output can be lower than the exact. Hence it is
necessary to estimate an adequate quality of the result
which will determine the accuracy of the obtained
answer.</p>
      <p>As a result it will be possible to control trade-off
between resource and quality.</p>
      <p>And at the end we are going to conduct the
experiments for both estimate the accuracy of the
constructed cost model and to compare the performance
of different implementations approximate join with our
algorithm.</p>
      <p>For example, it will be interesting to see how the
number of found pairs of answer is depending on the
initial data: sorted and unsorted input pipelines, small
and large amount of input data, the large difference in
the amount of input data or the comparable size of the
right and left thread.</p>
      <p>Such experimental results can show that sensible
savings in join execution time can be achieved in
practice; at the expense of a modest reduction in result
completeness. Those experiments will be assigning to
determine the characteristics and properties of the new
algorithm.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Mihaela</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bornea</surname>
          </string-name>
          , Vasilis Vassalos, Yannis Kotidis, Antonios Deligiannakis:
          <article-title>Adaptive Join Operators for Result Rate Optimization on Streaming Inputs</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng. (TKDE) 22</source>
          (
          <issue>8</issue>
          ):
          <fpage>1110</fpage>
          -
          <lpage>1125</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          .
          <article-title>A primitive operator for similarityjoins in data cleaning</article-title>
          . In L. Liu,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reuter</surname>
          </string-name>
          , K.-Y. Whang, and J. Zhang,editors,
          <source>ICDE, page 5. IEEE Computer Society</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Abhinandan</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>Johannes Gehrke</surname>
          </string-name>
          ,
          <source>Mirek Riedewald: Approximate Join Processing Over Data Streams. SIGMOD</source>
          <year>2003</year>
          :
          <fpage>40</fpage>
          -
          <lpage>51</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Oxana</given-names>
            <surname>Dolmatova</surname>
          </string-name>
          , Anna Yarygina, Boris Novikov:
          <article-title>Cost Models for Approximate Query Evaluation Algorithms</article-title>
          .
          <source>DB&amp;Local Proceedings</source>
          <year>2012</year>
          :
          <fpage>20</fpage>
          -
          <lpage>28</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Goetz</given-names>
            <surname>Graefe</surname>
          </string-name>
          :
          <article-title>New algorithms for join and grouping operations</article-title>
          .
          <source>Computer Science - R&amp;D</source>
          <volume>27</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>27</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Roald</given-names>
            <surname>Lengu</surname>
          </string-name>
          , Paolo Missier,
          <string-name>
            <given-names>Alvaro A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          , Giovanna Guerrini, Marco Mesiti:
          <article-title>Symmetric set hash join: A pipelined approximate join algorithm</article-title>
          , http://unina.stidue.net/Universita'%20di%20Ge nova/GuerriniG/reports/sshjoinTR.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alvaro A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          , Roald Lengu, Giovanna Guerrini, Marco Mesiti:
          <article-title>Data Quality support to on-the-fly data integration using Adaptive Query Processing</article-title>
          .
          <source>SEBD</source>
          <year>2009</year>
          :
          <fpage>213</fpage>
          -
          <lpage>220</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Boris</given-names>
            <surname>Novikov</surname>
          </string-name>
          , Natalia Vassilieva, Anna Yarygina:
          <article-title>Querying big data</article-title>
          .
          <source>CompSysTech</source>
          <year>2012</year>
          :
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Wilschut</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. M. G.</given-names>
            <surname>Apers</surname>
          </string-name>
          .
          <article-title>Dataflow query execution in a parallelmain-memory environment</article-title>
          .
          <source>In PDIS</source>
          , pages
          <volume>68</volume>
          {
          <fpage>77</fpage>
          . IEEE Computer Society,1991
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hansjörg</surname>
            <given-names>Zeller</given-names>
          </string-name>
          , Jim Gray:
          <article-title>An Adaptive Hash Join Algorithm for Multiuser Environments</article-title>
          .
          <source>VLDB</source>
          <year>1990</year>
          :
          <fpage>186</fpage>
          -
          <lpage>197</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>