An Adaptive Approximate Algorithm For Join Oxana Dolmatova Saint Petersburg State University oxana.dolmatova@gmail.com Advisor: Boris Novikov Abstract which gives an advantage in the sense that we can pipeline a result, even without getting all the data. This paper describes the initial state of Also a new algorithm will be approximate; it means development adaptive approximate algorithm for that the quality of the final result depends on the join combining three modifications of the amount of allocated resources. Or rather, in this case, traditional techniques the term approximate means incomplete result that The main result of the research will be the new is some of relevant output tuples may be missed in the algorithm suitable for pipelined inputs, a cost result and all pairs that will be included definitely will model for this algorithm as well as experimental satisfy the join predicate. evaluation to estimate the accuracy of cost model The exact definition and method of measuring the and identification the various properties of the quality of result we are planning for the future work. algorithm. Considering quality very tentatively, it looks like the precision. We use the term precision in the usual sense: 1 Introduction it is the ratio between obtained tuples and all the pairs that fit the predicate. In this paper we consider query evaluation in the The abstract concept of term “resource” may include distributed heterogeneous systems, real-time systems, several different performance measures of a query mobile systems or environment with data 1 of varying execution. Usually the most important is the execution intensity. In all these cases we would like to perform time (either CPU or elapsed), amount of I/O, or, in a queries with in predictable response time and perhaps distributed mobile environment, the battery energy. not exactly because of lack of resources or time That is, with limited resources, the answer will be constraints. incomplete. However, those properties provide control In contrast with traditional databases this context trade-off between quality and cost. creates additional challenges. The following section overviews the different As a consequence it is clear that the approximate variation of join algorithms such as adaptive and algorithms for computationally intensive operations are similarity schemes. Modifications of the three standard unavoidable in modern environments. algorithms for join you can find in part 3. The section 4 In the context of complex queries, join is one of the introduces a small description and main idea of the new most important and intensively used operations in the join technique. The last part of this paper offers our users’ search operations. conclusions and certain ideas for future work. There are three types of algorithms in traditional databases for join and all three algorithms are designed to obtain the exact answer. Adaptive execution of join 2 ReleatedWork will combine all the benefits of the existing standard The inspiration of this work was the work of [5]. It algorithms, whether sorted inputs, the presence of formed the basis of the main idea of the new algorithm permanent indexes for one or two threads or a different for the approximate join. input sizes of the unsorted pipelines. Author is building an effective technique for It is significant that we call the input data - pipelined adaptive query, based on the standard algorithms, and data, there is one difference between stream and taking the best of all of them. In contrast to this paper pipelined data. To use our adaptive algorithm we have the algorithm is not designed to work with the streams to know the right and left streams’ sizes to choose the and quality management through resources. right strategy. Nevertheless the processing of incoming An adaptive algorithm for similarity join evaluation objects will be the same as work with streaming data, can be found in [6]. The proposed technique is designed similar to those and suitable to the pipelined architecture. It means that Proceedings of the Ninth Spring Researcher's Colloquium the algorithm is incremental and begins returning results On Database and Information Systems SYRCoDIS, even before streams have completely been consumed. Kazan, Russia, 2013 This method is an aggregation of the best from similar behavior to the standard hash join with some SHJoin [9] and SSJoin [2]. peculiarities. An adaptive nested loop-based algorithm for stream First of all the normal join uses only one hash table input was presented in [1]. The main advantage over to calculate its results, while the symmetric hash join traditional join techniques is that this algorithm can start uses two tables, one for each input. Second, the normal producing join results as soon as the first input object join is a blocking operator and as a consequence not are available, thus improving pipelining by smoothing suitable for pipelined architectures. join result production and by masking source or To begin yielding results it has to entirely build a network delays. hash table over one of the inputs, meaning that no result Then the authors present a sophisticated version of is returned before at least one of the tables has been the algorithm suitable for multiple input streams. completely read. The symmetric hash join on the other Thereby this research exploits temporary delays when hand is a non blocking operator. The two hash tables are new data is not available for processing data obtained. built in parallel while reading the objects from both A stream input for join algorithm with emphasis on inputs. the limited internal memory was discussed in [3]. The algorithm is able to return result as soon as first The authors consider the task of sliding windows objects are coming from both streams. It does not need join. In case of resource shortage, tuples have to be to wait for complete input. dropped before they expire thus this algorithm can be called approximate. The same authors propose a way to 3.2 Nonblocking merge-based join estimate the error of the executed operation. That is probably the easiest algorithm to modify to The join algorithm suitable for integration was approximate mode. There are sorted input pipelines, and introduced in [7]. The authors propose the trade-offs their size. For the approximate performance we stop the between the completeness of a join result, and its execution when the resources allocated to the operation execution efficiency: users can choose a faster (for example, time) running out. execution, at the price of missing more matches, It should be noted that if the input data are not resulting in a lower result completeness; or a more sorted, this algorithm cannot be used because it is not complete join result, at the cost of lower performance. possible to sort the pipelined data. That proposal looks like one of our future goals. At the same time, the term adaptive means no variation 3.3 Nonblocked nested loop-based join between the three standard algorithms for join, and aggregation exact join and similarity join, what Since we have a normal, two-input, join, respectively, distinguishes this paper from our own. there are also two cycles in nested loop. Here it is The [10] introduces the adaptive hash join algorithm necessary to consider the following cases. in connection with a problem of multiuser environment. 1. There are the pipelines without any index and Authors present a modified join technique that is the one of the threads is much smaller than the designed to work with dynamic changes in the amount other and we have the sufficient amount of of available memory. resources. Then, for each obtained object in the A general aim of this work is to regulate resource big pipeline algorithm takes a loop with already usage and to provide the way that allows query read data from small input, until allocated time execution to run concurrently with other applications. runs out. Therefore there is a wide range of very different 2. There are two small pipelines and we have algorithms for the join operation. However almost all of sufficient amount of resources to read and them are tailored for either similarity join or to join for process it. Hence, it will be exact query stream input. As well as some of algorithms represent execution with exact standard nested loop an adaptive technique for the exact query execution. 3. There are two pipelines and its sizes are not While the ultimate goal of this work is the detailed allowed to read pipeline completely because of approximate adaptive algorithm for join. lack of resources. Then the algorithm turn scans each obtained object by comparing it with those of the objects from another pipeline, which we 3 Traditional Basic Algoritms have already obtained. When the resources run Detailed description of the following standard out, the algorithm will stop, and we will get an algorithms with history, options and optimization and approximate result. cost models can be found in [5]. Here we will describe the properties of modified 4 Description of the new algorithm algorithms that will be the basis of a new adaptive algorithm. The choice of the strategy depends on the properties of the input data: how much different in size input 3.1 Hash-based join pipelines and how many pipelines are sorted, and how many resources were allocated for the join execution. We choose the symmetric hash join modification for If both pipelines are sorted by the join attribute, then this section. As introduced in [9] this method has a the new algorithm is similar to the non blocked merge join, referred to in 3.2. If only one of the pipelined inputs is sorted, the And at the end we are going to conduct the algorithm behaves like an aggregation from the merge experiments for both estimate the accuracy of the join and the nested loop. In this case an algorithm has constructed cost model and to compare the performance different behaviors depending on the input size and of different implementations approximate join with our resources available for the operation: algorithm.  The sorted input fits into the memory For example, it will be interesting to see how the which allocated for join execution. And we number of found pairs of answer is depending on the have sufficient (that term will be specify in initial data: sorted and unsorted input pipelines, small cost model in the future work) amount of and large amount of input data, the large difference in processing resources which allows read all the amount of input data or the comparable size of the this small input. Then algorithm reads this right and left thread. input and after that reads as many objects Such experimental results can show that sensible from other inputs as can be located in the savings in join execution time can be achieved in remaining memory. The algorithm sorts practice; at the expense of a modest reduction in result read objects from second input and makes completeness. Those experiments will be assigning to sort-merge join of those two sets of objects. determine the characteristics and properties of the new After that if we still have the resources the algorithm. algorithm reads other block of objects from second and sorts it etc. References  The sorted inputs can fit into the memory but we haven’t sufficient amount of [1] Mihaela A. Bornea, Vasilis Vassalos, Yannis processing resources. The algorithm Kotidis, Antonios Deligiannakis: Adaptive receives part of first input and part of Join Operators for Result Rate Optimization on second input, sorts it and performs sort- Streaming Inputs.IEEE Trans. Knowl. Data merge join of those two sets. Eng. (TKDE) 22(8):1110-1125 (2010)  Any of inputs cannot fit into the memory [2] S. Chaudhuri, V. Ganti, and R. Kaushik. A and we have the sufficient amount of primitive operator for similarityjoins in data resources. Then the algorithm reads several cleaning. In L. Liu, A. Reuter, K.-Y. Whang, blocks from sorted input and block from and J. Zhang,editors, ICDE, page 5. IEEE unsorted input sorts this block and makes Computer Society, 2006. sort-merge join. As long as there are [3] Abhinandan Das, Johannes Gehrke, Mirek resources the technique obtains the blocks Riedewald: Approximate Join Processing Over of both inputs and makes merge sort. Data Streams. SIGMOD 2003:40-51 If any of pipelines are not sorted the algorithm’s [4] Oxana Dolmatova, Anna Yarygina, Boris behavior looks like in the previous case. The Novikov: Cost Models for Approximate Query difference is in the number of objects and loops Evaluation Algorithms. DB&Local which algorithms makes and reads in given amount Proceedings 2012:20-28 of resources. [5] Goetz Graefe: New algorithms for join and grouping operations. Computer Science - R&D 27(1): 3-27 (2012) 5 Conclusion and future work [6] Roald Lengu, Paolo Missier, Alvaro A. A. We are going to implement our algorithm with Fernandes, Giovanna Guerrini, Marco Mesiti: assumption that the algorithm of join will be used in the Symmetric set hash join: A pipelined context of [8]. approximate join algorithm, The optimization is essential for any high- http://unina.stidue.net/Universita'%20di%20Ge performance querying system. Actually, the task of a nova/GuerriniG/reports/sshjoinTR.pdf query optimizer is to choose an algebraic expression of [7] Paolo Missier, Alvaro A. A. Fernandes, Roald minimal execution cost among several equivalent Lengu, Giovanna Guerrini, Marco Mesiti: Data expressions. Because of any high-quality optimizer is Quality support to on-the-fly data integration inevitably a cost-based one and, hence, the cost model using Adaptive Query Processing. SEBD is one of the critical core components of the optimizer 2009: 213-220 we are going to develop a cost model for our new [8] Boris Novikov, Natalia Vassilieva, Anna algorithm. Yarygina: Querying big data. CompSysTech As the query evaluation is approximate, the quality 2012: 1-10 of the output can be lower than the exact. Hence it is [9] A. N. Wilschut and P. M. G. Apers. Dataflow necessary to estimate an adequate quality of the result query execution in a parallelmain-memory which will determine the accuracy of the obtained environment. In PDIS, pages 68{77. IEEE answer. Computer Society,1991 As a result it will be possible to control trade-off [10] Hansjörg Zeller, Jim Gray: An Adaptive Hash between resource and quality. Join Algorithm for Multiuser Environments. VLDB 1990: 186-197