A Balanced Solution for the Partition-based Spatial Merge Join in MapReduce Sara Migliorini Alberto Belussi Dept. of Computer Science, University of Verona Dept. of Computer Science, University of Verona sara.migliorini@univr.it alberto.belussi@univr.it ABSTRACT of bytes as splitting criteria, without considering the content of Several MapReduce frameworks have been developed in recent such data. In other words, given the original dataset, its records years in order to cope with the need to process an increasing are placed inside the current partition (or split) until a given size amount of data. Moreover, some extensions of them have been threshold is reached, then another partition is initialized, and so proposed to deal with particular kind of information, like the on. Such an approach can be suitable in the general case when spatial one. In this paper we will refer to SpatialHadoop, a spatial all the dataset content has to be processed. However, in other extension of Apache Hadoop which provides a rich set of spatial cases when data are analyzed using selective queries based on data types and operations. In the geo-spatial domain, spatial some attributes, such as time intervals or spatial regions, this join is considered a fundamental operation for performing data approach may be very inefficient, since the correlation between analysis. However, the join operation is generally classified as data instances is completely ignored. For this reason, in the spatial a critical task to be performed in MapReduce, since it requires context, some global indexing (or partitioning) techniques have to process two datasets at time. Several different solutions have been developed, which try to place inside the same partition data been proposed in literature for efficiently performing a spatial which are spatially correlated in some way (i.e., placing inside join which may or may not require the presence of a spatial the same partition nearby objects). index computed on both datasets or only one of them. As already In geo-spatial applications, spatial join is considered an es- discussed in literature, the efficiency of such operation depends sential operation for data analysis [4, 5], since it allows the data on the ability to both prune unnecessary data as soon as possible analyst to discover important correlations between data or enrich and to provide a balanced amount of work to be done by each the available information. A spatial join is a multi-dimensional parallelly executed task. join specifically tailored for spatial data, in particular given two In this paper, we take a step forward in this direction by propos- datasets A and B each one characterized by a geometric attribute, ing an evolution of the Partition-based Spatial Merge Join algo- called д and f respectively, a spatial join A ◃▹ B returns the set of rithm which tries to completely exploit the benefit of the paral- pairs (a, b) such that a ∈ A and b ∈ B and the geometric intersec- lelism induced by the MapReduce framework. In particular, we tion between the attributes a.f and b.д is not empty. For instance, concentrate on the partition phase which has to produce filtered we can consider the case in which a geographer needs to compute balanced and meaningful subdivisions of the original datasets. the spatial join between two huge datasets, one containing the main roads and one the water areas, in order to predict the main KEYWORDS roads of her country which could be subject to flooding risks. Similarly, a geographer may need to perform a join between a Big Data, Partitioning, Spatial join, Balanced tasks, MapReduce dataset containing the main roads and another one representing the administrative subdivisions of her country, in order to study 1 INTRODUCTION the density of the road network in each state. The MapReduce paradigm has been specifically developed for However the join operation has been traditionally considered processing huge amount of data in an efficient way. In particular, a critical operation in MapReduce for two main reasons: (1) the it requires to subdivide the desired analysis operation into two need to process two distinct datasets (files) at the same time, (2) subsequent phases: the first one is called map and it performs in the need to perform some pruning or filtering operation in order parallel the same operation on independent chunk of data, pro- to reduce the amount of unnecessary comparisons. In order to ducing a set of partial results. These partial results are (possibly) solve such problems, many efforts have been devoted in recent combined by the second phase which is called reduce and may years leading to the development of different MapReduce imple- be parallelized as well. Nowadays there are several frameworks mentation of the join [6, 10] each one applicable to a particular that implement the MapReduce paradigm, undoubtedly the most context. This also holds in the spatial context, where several famous ones are Apache Hadoop [15] and Apache Spark [17]. algorithms have been defined and implemented, which essen- Moreover, in order to cope with the specific needs of particular tially differ for the use of a spatial index and in the way this kinds of information, such as the spatial and temporal one, some index is built and used. At this regard SpatialHadoop is one of extensions have been developed which provides the implemen- the available framework which provides an implementation for tation of the necessary data types and operations. As regards to all these algorithms [9] that can also be combined with different the spatial context, notable systems are SpatialHadoop [8] and kinds of indexes (partitioning techniques) [7]. In general, none GeoSpark [16]. of these algorithms can be considered better than the others, but The basic partitioning approach traditionally implemented by the choice might depend on the characteristics of the involved a MapReduce framework essentially uses a predefined amount datasets [2]. As we will describe in Sect. 2, the application of a spatial par- © 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed- ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, titioning technique before the execution of a join operation is Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- particularly useful in order to both discard entire partitions from tribution 4.0 International (CC BY 4.0) the analysis, or to balance the amount of work to be done in parallel (typically by each map task). Relatively to this aspect, in intersection, potentially reducing the number of map tasks and literature has been extensively studied the importance of balanc- the number of comparisons to be performed inside them. ing the amount of work to be done by each map task in order to Fig. 1 illustrates the application of Djni and Djgi to the same completely exploit the parallelism induced by the MapReduce pair of datasets containing rectangles: dataset D i contains the paradigm [1, 3]. However, the preliminary application of a spatial blue rectangles, while D j contains the orange ones, for both partitioning technique comes with its cost and sometimes it is datasets the corresponding MBR is also depicted. As you can no- justified only if such new organization of data can be reused tice for both algorithms, the input datasets have been subdivided several time, absorbing such initial cost. Moreover, the right par- into a certain number of partitions (or splits), but while for Djni titioning technique to be applied may depends not only on the each split can contain data coming from any zone of the space, dataset characteristics, such as its distribution [3], but also from for Djgi each split contains only data that are located nearby, the type of analysis will be performed [12]. reducing both the number of split combinations to be considered From these considerations, since the join operation requires and the amount of unnecessary comparisons to be done. This to process two distinct datasets together, the choice of the right is a consequence of the fact that the splits of Djni cover almost partitioning technique has necessarily to consider both datasets the whole MBR of the dataset (see spliti and split j rectangles in together in order to balance the amount of work to be done by Fig. 1), while the splits (cells) of Djgi cover a small part of the each map task. As we will better explain in Sect. 3, given two reference space (see c i and k j cells in Fig. 1), thus reducing the datasets A and B, each one individually partitioned with the most number of geometric intersections to compute. appropriate partitioning technique, it may happen that the splits obtained by combining them for the join will not necessarily &]o ] &]o i •‰o]ší produce the overall better setting w.r.t. the balancing criterion. •‰o]š] •‰o]š o i In this paper we consider the implementation of the Partition •‰o]š] Based Spatial Merge Join [13] provided by SpatialHadoop, de- noted as Sjmr, which is the only spatial join algorithm that does íÆu •‰o]šu ] not rely on the preliminary application of individual spatial par- ~ • •‰o]š ‰ i tition techniques on the input datasets, but it defines a partition grid based on both datasets together. Given such algorithm we enhance the definition of a common partitioning grid in order vÆu to both preliminary prune unnecessary data and promote the i /v Æ] /v Æi ] ] balancing of the obtained splits. Such partitioning technique has to consider the spatial characteristics of both dataset together, lZ such as the spatial distribution of its objects as well as the global ] l] covered area. li ll ~ • lZ i íÆð 2 BACKGROUND Figure 1: Example of execution of Djni and Djgi on the SpatialHadoop provides four different spatial join algorithms same pair of datasets. which essentially differ for the use of spatial indexes or for the way data are repartitioned on the fly. In particular, the simplest As you can notice, when the Djgi algorithm is applied, the solution is represented by the Djni algorithm. Djni stands for dimension of each combined split can be very different from Distributed Join with No Index, it uses the default random parti- each other, depending on the size and shape of the intersection tioning technique, which is provided by any MapReduce frame- between the original splits. In order to promote the definition of work and is based only on the size constraint. In other words, it balanced map tasks, the Djre algorithm has been developed. Djre does not involve any preliminary repartition of the data based stands for Distributed Join with Repartition, in this case only one on their spatial properties (no index or global repartitioning is of the two input datasets has been previously partitioned by applied). Since Djni cannot rely on any spatial property, given using the most appropriate spatial index, while the other one is two datasets D i and D j subdivided into n and m partitions, re- repartitioned by using the subdivision (or grid) induced by the spectively, the number of map tasks to be instantiated is equal to first one. In this case, the number of map tasks to be instantiated the Cartesian product n × m. This case represents the worst case is equal to the number of partitions of the previous dataset that scenario for both the number of map tasks to be instantiated and intersect also the other one. Moreover, while the shape of the the amount of data to be processed. Moreover, since each pair of obtained splits is uniform, there could be great differences in the input partitions can contain data from any region of the space, number of objects contained in each split, particularly if the two there could be some map tasks that produce a huge number of datasets covers only a partial overlapping space, or they cover intersecting pairs, while others could have to compare only very the same space with different distributions. far geometries that do not partecipate to the result. Sjmr is the only spatial join algorithm which does not assume A first improvement of Djni is represented by the Djgi algo- that the data have been preliminary partitioned with respect rithm which starts from indexed data. Djgi stands for Distributed to spatial criteria, but it performs itself the best subdivision by Join with Grid Index, in this case both input datasets are assumed considering both datasets together. Sjmr stands for Spatial Join to be previously repartitioned by using one of the available spa- MapReduce and it is the MapReduce implementation of the Parti- tial indexes: the most suitable one given the dataset characteris- tion Based Spatial Merge Join [13]. It uses a common global grid tics [3]. In this case, each partition is identified by a global MBR for partitioning data before executing the required comparisons. (Minimum Bounding Rectangle) and map tasks are instantiated As illustrated in Fig. 2, this grid is regular, namely the shape and only for those pairs of partitions that have a not empty MBR dimension of its cells is uniform and is automatically determined based only on the input file size. The grid definition does not take in bytes of the dataset A ∪ B, while sp is the size in bytes of the care of neither the dataset distribution, nor the reference space default split size. individually covered by the two input datasets. In this paper, we extend the global partitioning performed by the Sjmr at the beginning in order to take care of also these aspects. Grid ixj Di cij cij Dj Join cij Figure 3: (a) Example of application of Sjmr to two com- Figure 2: Example of execution of the Sjmr algorithm. In pletely overlapping datasets. (b) Example of unbalanced this case a global partitioning grid I∪ is build which in- grid produced by Djgi: the four left upper cells have been cludes the union of the MBRs of the two datasets. Each highlighted for showing their different shape and dimen- cell c i j is separately processed considering the geometries sion. coming from both datasets. This subdivision may be preferable to the application of indi- Tab. 1 provides a brief comparison between the available spa- vidual indexes w.r.t. the balancing criteria, because it considers tial join algorithms. The first column Op reports the name of both datasets together. Fig. 3.b illustrates an example in which the spatial join algorithm. Column BR indicates if the algorithm the splits produced by using separate individual indexes (one in requires the use of a modified binary reader in order to load two green and one in red) can generate not only a greater number dataset at time. In particular, a tick indicates that the algorithm of combinations of intersecting splits, but also very unbalanced considers each dataset individually and contemporarily loads a combinations, like the ones highlighted in the upper left corner partition from each of them. Conversely, in the Sjmr case, the of Fig. 3.b where one green cell is combined with four red cells. algorithm uses the default Hadoop reader and is able to load data The default construction of the grid in Sjmr considers the coming from the two inputs by simply merging the original files. union of the two datasets. However, if their original reference Notice that the modified binary reader induces some additional spaces are not completely overlapping, this grid can contain cells problems w.r.t. the locality principle: the system guarantees that outside the join reference space that include data coming only at least one of the two inputs is locally loaded by the computation from one dataset. Some exemplifying situations are illustrated in node, but the second one can be read both locally or remotely, Fig. 4: in the first case the reference space of one dataset is con- inducing some overhead. Conversely, with the Sjmr each map tained inside the other one, while in the second case the reference task is always able to locally read its input, with great advan- spaces of the two datasets are shifted. These situations reveal tage in terms of performance of the I/O operations. Column In that it is not necessary to compute the global grid starting from reports the number of datasets that are assumed to be indexed the union of the MBRs of the two datasets, but in order to reduce before the spatial join execution, while column Rep indicates the amount of unnecessary work, we can instead consider the if a repartition of one dataset is applied before the join. Finally, intersection between the MBRs of the two input datasets during column Ref reports a reference to the original algorithm. the grid definition. Moreover, in case of uniformly distributed datasets, the definition of the grid starting from the intersection Table 1: Summary of the various spatial join operators. of the MBRs ensures that the obtained splits are also more likely to be balanced. Op BR In Rep Ref Djni X 0 ✗ Block Nested Loop Join Djgi X 2 ✗ Grid File Spatial Join algorithm [11] Djre X 1 X Bulk-Index Join [14] Sjmr ✗ 0 ✗ Partition Based Spatial Merge Join [13] 3 MOTIVATING EXAMPLES In this section we illustrates some example of situations in which the application of the classical partitioning technique provided by the Sjmr algorithm can produce unbalanced situations. Figure 4: Example of reference spaces that are not com- We will start by considering the best case regarding the join pletely overlapping: (a) one reference space inside the between two spatial datasets A and B whose geometries are uni- other, (b) two shifted reference spaces. formly distributed around the same reference space. This situa- tion is illustrated in Fig. 3.a. The global uniform grid is computed Finally, we consider the case in which the two datasets are not starting from the MBR of the union of the two datasets (i.e., uniformly distributed. In this case the use of cells with the same MBR(A ∪ B)) and by uniformly p subdividing the space both hori- shape and dimension can induce unbalanced split dimensions, as zontally and vertically into ⌈ ds/sp⌉ cells, where ds is the size illustrate in Fig. 5. Definition 4.5 (Balanced Partitioning). A partitioning P for a dataset D is said to be balanced if and only if: ∀pi , p j ∈ P : abs(|pi | − |p j |) < ε (2) where |pi | denotes the cardinality of partition pi . The aim of this paper is to obtain balanced spatially-enhanced partitions which contains data from both input datasets involved in the join operation. Figure 5: Unbalanced situation: the cells located closer to 5 PROPOSED SOLUTION the boundary are more populated then the central ones. This section presents the proposed solution, called Esjmr (En- hanced Sjmr), together with a detailed description of the differ- ences introduced w.r.t. the original Sjmr algorithm provided by 4 PROBLEM FORMULATION SpatialHadoop. This section formalizes the problem of obtaining balanced parti- Sjmr is composed of three MapReduce jobs: the first two are tions in the context of the spatial join execution. responsible for computing the MBR of the two involved datasets separately. The union of these two MBRs is then used by the Definition 4.1 (Spatial Dataset). A spatial dataset D = {r 1 , . . . , r n } third job for computing the global uniform grid and performing is a collection of records r i each one characterized by a spatial the spatial join. In particular, as regards to the third job, during attribute д. the map phase, each mapper assigns its input geometries to one or more cells of the uniform grid, then the reducers (potentially In the following, we will use the notation r i .д in order to one for each grid cell) receives the geometries contained in a denote the spatial attribute д contained in the record r i . cell and computes the join on them by executing a plane-sweep Definition 4.2 (Partitioning). Given a dataset D = {r 1 , . . . , r n }, algorithm. Some expedients are used to avoid the production of a partitioning P is a collection of subsets of D: duplicated tuples in the final output. In Esjmr we propose a unique job for computing the MBR of P = {p1 , . . . , ph } such that ∀pi ∈ P (pi ⊆ D ∧ D = ∪i pi ) (1) both datasets together, as illustrated in Algorithm 1. In particular, the produced output is already the intersection of the dataset The general notion of partitioning does not require that any MBRs. Notice that, in order to read the two input dataset, the job other specific property holds between the contained objects. Con- uses a strategy already known in literature [6, 15] that combines versely, the notion of spatially-enhanced partitioning is defined the two inputs into a unique file by keeping a reference to the in order to place nearby objects inside the same partition. source. In the algorithms this additional reference is denoted Given a subdivision of the space covered by a dataset D and by representing the value of a tuple as ⟨r i , f ⟩ where r i is the represented by a set of cells G = {c 1 , . . . , ch }, each cell c k will original record contained in the input file, while f with f = contain only the records of D whose spatial attribute д has a not {1, 2} denotes the fact that the record r i originally belongs to the empty intersection with c k . Partitions are then defined starting first or second input file. Moreover, together with the MBR of from this grid subdivision, so that each partition corresponds to the intersection, we produce also an estimation of the number a cell of G. of geometries contained in such intersection. This estimation Definition 4.3 (Minimum Bounding Rectangle). Given a geom- is useful in order to properly instantiate the partitioning grid, etry д defined by a not empty set of 2D coordinates, i.e. д = indeed the intersection should contain less geometries than the {(x 1 , y1 ), . . . (x n , yn )}, the Minimum Bounding Rectangle (MBR) union of the two original datasets, particularly in the case their of д is the rectangle representing the maximum extension of reference spaces are not completely overlapping. д. In other words, the MBR of д is defined by the coordinates In Algorithm 1 each mapper is responsible for updating the min({x 1 , . . . , x n }), min({y1 , . . . , yn }), max({x 1 , . . . , x n }), max( dataset MBRs based on the geometries contained in its split, in {y1 , . . . , yn }). The definition of MBR can be easily extended to a the psedo-code r i .д stands for the geometric attribute д contained set of geometries as well. in the record r i . Notice that instead of building only two sep- arate MBRs, each mapper maintains and updates two ordered In the following we will use MBR(д), MBR(D) or MBR(pi ) to lists of partial MBRs (one for each input file). When a new ge- denote the MBR of a generic geometry д, of a dataset D or a ometry is processed, the first overlapping MBR intersecting it partition pi , respectively. is updated accordingly, or a new partial MBR is added to the corresponding list. A counter is also maintained for each partial Definition 4.4 (Spatially-enhanced partitioning). Given a dataset MBR which represents the number of geometries intersecting it. D and a grid G = {c 1 , . . . , ch } covering MBR(D), a spatially- This counter will be used for estimating the number of geome- enhanced partitioning P of D is a collection of subsets of D such tries belonging to the dataset intersection. Clearly, this may be that: an overestimation, but it is more indicative than considering the P = {p1 , . . . , ph } such that ∀pi ∈ P sum of the cardinality of the two datasets. At the end of each (pi ⊆ D ∧ D = ∪i pi ∧ ∀r j ∈ pi (r j .д ∩ MBR(pi ) , ∅)) map, the Cleanup procedure will perform some aggregations of adjacent overlapping MBRs, so that the unique reducer will re- Independently from the kind of considered partitions (i.e., spa- ceive a limited amount of MBRs. The reducer can easily compute tially-enhanced or not), we can define the concept of balanced both the intersection of the dataset MBRs and an estimation of partitioning as follows. the number of geometries in the intersection. Notice that in all Algorithm 1: MBR computation Algorithm 2: Partitioning computation 1 class Mapper 1 class Mapper 2 mbr i1 ←− mbr i2 ←− ∅ 2 G ←− grid computed using the previous job 3 method Map( _ , ⟨r i , f ⟩) 3 th ←− cell occupation threshold 4 if f = 1 then 4 method Map( _ , ⟨r i , f ⟩) 5 if ∃x ∈ mbr i1 (x .mbr ∩ Mbr(r i .д)) then 5 cl ←− IntersectingCells(G, r i .д) 6 x .mbr ←− Extend(x .mbr, Mbr(r i .д)) 6 for c ∈ cl do 7 x .count ←− x .count + 1 7 Write(c, ⟨r i , f ⟩) 8 else 9 x .mbr ←− Mbr(r i , д) 8 class Reducer 10 x .count ←− 1 9 method Reduce(c, l = {⟨r i , f ) . . . } if |l | > th then mbr i1 .SortedAdd(x) 10 11 11 ll ←− ∅ 12 else 12 while ll = ∅ ∧ BigSplits(ll) do 13 if ∃y ∈ mbr i2 (y.mbr ∩ Mbr(r i .д)) then 13 ll ←− Split(c, ll) 14 y.mbr ←− Extend(y.mbr, Mbr(r i .д)) 14 for l ∈ ll do 15 y.count ←− y.count + 1 15 WriteInSplit(l) 16 else 16 else 17 y.mbr ←− Mbr(r i , д) 17 WriteInSplit(l) 18 y.count ←− 1 19 mbr i2 .SortedAdd(y) method Cleanup() uniform grid G which has an extension equal to the MBR of 20 21 Compact(mbr i1 ) the datasets intersection and cells with uniform size computed 22 for x ∈ mbr i1 do using the dimension (number of objects) of the intersection and 23 Write( _ , ⟨x .mbr, x .count, 1⟩) the given split size, (2) a threshold value th representing the 24 Compact(mbr i2 ) maximum number of objects to be included in each split. 25 for y ∈ mbr i2 do More specifically, given the dataset D ∩ obtained from the inter- 26 Write( _ , ⟨y.mbr, y.count, 2⟩) section of the two input datasets, whose size in bytes is denoted as size(D ∩ ), and the desired size in bytes of a split, denoted as 27 class Reducer size(split), the estimated initial number of cells of the grid G is 28 mbr 1 ←− mbr 2 ←− mbr ←− EmptyMbr computed as #cells = ⌈size(D ∩ )/size(split)⌉. √ Given such √ estima- 29 l 1 ←− l 2 ←− ∅ tion, the grid G will have the dimension ⌈ #cells⌉ × ⌈ √#cells⌉, 30 method Reduce( _, ⟨mbr i , counti , f ⟩) while the cell weight will be equal to width(MBR(D ∩ ))/⌈√ #cells⌉ 31 if f = 1 then and the cell height will be equal to heiдht(MBR(D ∩ ))/⌈ #cells⌉. 32 mbr 1 ←− ⟨Extend(mbr 1 , mbr i ) Function IntersectingCells(G, д) returns the set of cells of 33 l 1 ←− l 1 ∪ {(mbr i , counti )} G which have a not empty intersection with д. This set can be ef- ficiently obtained by considering the MBR of д and thanks to the 34 else fixed dimension of the cells. In other words, the index of the first 35 mbr 2 ←− ⟨Extend(mbr 2 , mbr i ) intersected cell is obtained by dividing the minimum x and the 36 l 2 ←− l 2 ∪ {(mbr i , counti )} minumum y of Mbr(д) for the cell width and height, respectively. 37 method Cleanup() In this way the map tasks produce an initial uniform subdivision 38 mbr ←− mbr 1 ∩ mbr 2 of the geometries which can be enough if geometries are uni- 39 for x ∈ l 1 ∪ l 2 do formly distributed. Otherwise, some recursive subdivisions of an 40 if x .mbr ∩ mbr then overcrowded cell can be necessary. 41 count ←− count + x .count The reducers, potentially one for each not empty cell produced by the mappers, are responsible for checking the degree of occu- 42 Write( _ , ⟨mbr, count⟩) pancy of such cells. In particular, if the degree of occupancy is less than the given threshold, the geometries contained in the cell can be directly written in a split by the function WriteInSplit(). Otherwise, the geometries inside the cell will be partitioned again algorithms, the symbol “_” denotes a dummy serial identifier for by recursively subdividing the cell into four splits (like in a quad- a MapReduce input for which we do not take care of. tree index). Function Split is responsible for doing such recursive Given the MBR of the dataset intersection which represents subdivision, while function BigSplit returns true if the degree the grid extension, the second task is responsible for performing of occupancy of c overcomes the threshold th. the balanced partitioning of the two input datasets. It is very The last job is responsible for performing the spatial join be- similar to the map phase of the second Sjmr task, but it also uses tween the geometries that belong to the intersection, namely the a reduce phase for refining the obtained partitions and producing geometries of D ∩ . This is done during a map phase, where each more balanced splits. This second job is described in Algorithm 2, mapper receives a split, namely the geometries of both datasets where it is assumed that each map task knows: (1) the initial that belong to a particular cell. The first operation to be done is subdividing the geometries into two lists (one for each dataset) 7 CONCLUSION and then the plain sweep algorithm is performed to compute In recent years the amount of spatial data to be processed is the intersecting pairs. Some expedients can be applied in order continuously increasing, thanks also to the spread of IoT and to avoid the production of duplicated pairs during the writing mobile devices with geo-spatial capabilities. For this reason, some performed by the mappers, as already done by SpatialHadoop. MapReduce frameworks have been developed which are specifi- cally tailored for modeling and processing spatial data. Algorithm 3: Spatial Join Among the possible spatial operation, spatial join is certainly 1 class Mapper considered a fundamental one for performing meaningful geo- 2 l 1 ←− l 2 ←− {} spatial analysis. However, the join is traditionally considered a 3 method Map( _ , ⟨r i , f ⟩) critical operation to be performed in the MapReduce context, 4 if f = 1 then since it requires to process two distinct datasets (files) at time. 5 l 1 ←− l 1 .SortedAdd(r i ) For this reason, several join solutions have been developed both in the general case and in the spatial one. As already discussed in 6 else literature, the performances of such algorithms greatly depends 7 l 2 ←− l 2 .SortedAdd(r i ) on the ability to prune unnecessary data as soon as possibile, and 8 method Cleanup() to produce balanced partitions. Having balanced partitions means 9 pl ←− PlaneSweep(l 1 , l 2 ) that the parallel map tasks have essentially the same amount of 10 for ⟨r 1 , r 2 ⟩ ∈ pl do work to do. In this way, we can maximize the benefits induced 11 Write(r 1 , r 2 ) by the parallelism. This paper takes inspiration from the Sjmr algorithm provided by SpatialHadoop which is the only one that does not require a preliminary construction of a spatial index on both input datasets and also considers both datasets together during the repartition. 6 EXPERIMENTS AND VALIDATION Starting from this algorithm, we propose an enhanced version This section presents some preliminary results obtained by apply- of Sjmr, called Esjmr, with the aim to produce both more bal- ing the proposed Esjmr technique, together with a comparison anced and filtered partitions. Some preliminar experiments have with the partitioning used by Sjmr. In particular, we are inter- been performed in order to check the benefits of the partitioning ested in analysing the degree of balancing obtained by the two induced by Esjmr w.r.t. the one produce by Sjmr. Such initial partitioning techniques. results encourages further development in this direction and will Such results are illustrate in Table 2 where two real world cases guide the definition of future works. are considered: (1) the join between the water area (WA) and the primary roads of USA (PR), and (2) the join between the roads REFERENCES (RD) and the administrative subdivisions (AS) of Australia. In the [1] A. Belussi, D. Carra, S. Migliorini, M. Negri, and G. Pelagatti. 2018. What table column |Mbr∪ | reports the number of geometries contained Makes Spatial Data Big? A Discussion on How to Partition Spatial Data. In in the union of the two MBRs, while |Mbr∩ | the number of 10th International Conference on Geographic Information Science. 1–15. https: //doi.org/10.1109/ICDE.2015.7113382 geometries in the intersection between the two MBRs. As you [2] A. Belussi, S. Migliorini, and A. Eldawy. 2018. A Cost Model for Spatial Join can notice, in both cases this number is quite different: the use of Operations in SpatialHadoop. Technical Report RR108/2018. Dept. of Computer Science, University of Verona. https://iris.univr.it/handle/11562/981957 the intersection allow to prune unnecessary geometries as soon [3] A. Belussi, S. Migliorini, and A. Eldawy. 2018. Detecting Skewness of Big as possible. Column #splits reports the number of not empty cells Spatial Data in SpatialHadoop. In Proc. of the 26th ACM SIGSPATIAL Interna- produced by the two techniques, while column %RDS reports the tional Conference on Advances in Geographic Information Systems. 432–435. https://doi.org/10.1145/3274895.3274923 relative standard deviation between the size of the splits. This [4] A. Belussi, S. Migliorini, M. Negri, and G. Pelagatti. 2015. Validation of Spatial last measure is intended as a measure of the balancing degree Integrity Constraints in City Models. In Proc. of the 4th ACM SIGSPATIAL Int. between the split sizes. Workshop on Mobile Geographic Information Systems (MobiGIS âĂŹ15). 70–79. https://doi.org/10.1145/2834126.2834137 As regards to the number of splits, while Sjmr is able to pro- [5] A. Belussi, M. Negri, and G. Pelagatti. 2006. Modelling Spatial Whole-Part duce less splits (smaller number of join mappers), Esjmr guar- Relationships using an ISO-TC211 Conformant Approach. Inf. Softw. Technol. 48, 11 (2006), 1095–1103. https://doi.org/10.1016/j.infsof.2006.02.002 antees more balanced parallel tasks. Consider for example the [6] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. 2010. A first case: given the original uniform grid, six splits contains a Comparison of Join Algorithms for Log Processing in MapReduce. In Proc. of number of geometries less than the desired threshold, while the the 2010 ACM SIGMOD Int. Conf. on Management of Data. 975–986. https: //doi.org/10.1145/1807167.1807273 remaining two contain more than half of the geometries, so they [7] A. Eldawy, L. Alarabi, and M. F. Mokbel. 2015. Spatial Partitioning Techniques are recursively subdivided by Esjmr. in SpatialHadoop. Proc. VLDB Endow. 8, 12 (2015), 1602–1605. https://doi.org/ 10.14778/2824032.2824057 [8] A. Eldawy and M. F. Mokbel. 2015. SpatialHadoop: A MapReduce framework Table 2: Comparison of the experimental results obtained for spatial data. In 2015 IEEE 31st International Conference on Data Engineering. by applying Sjmr and Esjmr algorihtms. 1352–1363. https://doi.org/10.1109/ICDE.2015.7113382 [9] A. Eldawy and M. F. Mokbel. 2017. Spatial Join with Hadoop. In Encyclopedia of GIS. Springer, 2032–2036. https://doi.org/10.1007/978-3-319-17885-1_1570 Datasets Sjmr Esjmr [10] J. Gu, S. Peng, X. S. Wang, W. Rao, M. Yang, and Y. Cao. 2014. Cost-Based Join |Mbr∪ | #splits %RDS |Mbr∩ | #splits %RDS Algorithm Selection in Hadoop. In 15th Int. Conf. on Web Information Systems WA ◃▹ PR 2,305,162 8 181% 2,007,414 22 50% Engineering. 246–261. https://doi.org/10.1007/978-3-319-11746-1_18 [11] L. Harada, M. Nakano, M. Kitsuregawa, and M. Takagi. 1990. Query Processing AS ◃▹ PR 1,245,200 5 188% 1,244,800 14 65% for Multi-Attribute Clustered Records. In Proceedings of 16th Int. Conf. on Very Large Data Bases. 59–70. [12] S. Migliorini, A. Belussi, E. Quintarelli, and D. Carra. 2020. A Context-based The preliminary experiments encourage the investigation in Approach for Partitioning Big Data. In Proceedings 23rd International Confer- this direction, since the proposed technique ensures the construc- ence on Extending Database Technology (EDBT). 1–4. https://doi.org/10.1109/ tion or more balanced splits. ICDE.1999.754937 [13] J. M. Patel and D. J. DeWitt. 1996. Partition Based Spatial-merge Join. SIGMOD Rec. 25, 2 (June 1996), 259–270. https://doi.org/10.1145/235968.233338 [14] J. van den Bercken, B. Seeger, and P. Widmayer. 1999. The bulk index join: a generic approach to processing non-equijoins. In Proceedings 15th Int. Conf. on Data Engineering. 257–. https://doi.org/10.1109/ICDE.1999.754937 [15] T. White. 2015. Hadoop: The Definitive Guide (4th ed.). O’Reilly Media, Inc. [16] J. Yu, J. Wu, and M. Sarwat. 2015. GeoSpark: a cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems. 70:1– 70:4. https://doi.org/10.1145/2820783.2820860 [17] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, and et al. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (2016), 56–65. https://doi.org/10.1145/2934664