Merging Frequent Summaries

M. Cafaro, M. Pulimeno
University of Salento, Italy
{massimo.cafaro, marco.pulimeno}@unisalento.it

Abstract. Recently, an algorithm for merging counter-based data summaries which are the output of the Frequent algorithm (Frequent summaries) has been proposed by Agarwal et al. In this paper, we present a new algorithm for merging Frequent summaries. Our algorithm is fast and simple to implement, and retains the same computational complexity as the algorithm presented by Agarwal et al. while providing better frequency estimation.

1 Introduction

In 2011, we presented an algorithm [1] for merging in parallel counter-based data summaries which are the output of the Frequent [2] algorithm. Recently, we also designed a parallel algorithm for merging Space Saving summaries [3] and an algorithm for mining frequent items in the time fading model [4]. In 2012, a new algorithm for merging counter-based data summaries which are the output of the Frequent algorithm was proposed by Agarwal et al. [5].

Given a data set A of n items t_1, t_2, ..., t_n, the frequency of an item i is f_i = |{j : t_j = i}|. Let f̃_i be the frequency reported by the algorithm for item i. The absolute error of item i is defined as the difference |f_i − f̃_i|. The (absolute) total error is then the sum of the absolute errors related to the items reported by an algorithm.

In this paper, we present a new algorithm for merging Frequent summaries (based on our previous algorithm) which is fast and simple to implement, and retains the same computational complexity as the algorithm presented in [5] while providing better frequency estimation. We briefly recall the notation and definitions used in the sequel.

Definition 1. Given a multiset N, with |N| = n, and 2 ≤ k ≤ n, a frequent item (or k-majority element) is an element x ∈ N whose frequency f_N(x) is such that f_N(x) ≥ ⌊n/k⌋ + 1.
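As a concrete illustration of Definition 1 (our own sketch, not part of the paper's algorithms), the following Python fragment computes the k-majority elements of a small multiset by exact counting; the function name k_majority is our choice:

```python
from collections import Counter

def k_majority(items, k):
    """Return the k-majority elements of a multiset: those items whose
    frequency is at least floor(n/k) + 1, where n = len(items)."""
    n = len(items)
    threshold = n // k + 1
    return {x for x, f in Counter(items).items() if f >= threshold}

# Example: n = 10 and k = 3, so the threshold is 10 // 3 + 1 = 4;
# only 'a' reaches 4 occurrences.
stream = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd']
print(k_majority(stream, 3))
```

Exact counting requires space proportional to the number of distinct items; the point of Frequent-style summaries is to answer the same question with only k − 1 counters, at the price of approximate frequencies.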
The frequent items (or k-majority) problem takes as input an array N of n numbers (a multiset), and requires as output the set

S = {x ∈ N : f_N(x) ≥ ⌊n/k⌋ + 1}.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. V. Biló, A. Caruso (Eds.): ICTCS 2016, Proceedings of the 17th Italian Conference on Theoretical Computer Science, 73100 Lecce, Italy, September 7–9 2016, pp. 280–285, published in CEUR Workshop Proceedings Vol-1720 at http://ceur-ws.org/Vol-1720

Definition 2. Merged summary. Given k, the k-majority parameter, let A1 and A2 be, respectively, the data sets from which the data summaries S1 and S2 are derived by an application of the Frequent algorithm, and let n = |A1| + |A2|. The merged summary M is the multiset which contains all of the k-majority elements, i.e., all of the elements whose frequency in A1 ⊎ A2 is greater than or equal to ⌊n/k⌋ + 1. Moreover, all of the guarantees assured by Frequent on its output continue to hold for the summary M with reference to the input A1 ⊎ A2.

Definition 3. 2-way merging problem. Input: k, the k-majority parameter; two summaries S1 and S2 derived by an application of the Frequent algorithm. Output: the merged summary M.

The paper is organized as follows. We present our algorithm in Section 3. In Section 4, the proposed algorithm is analyzed in terms of correctness, computational complexity and total error committed. Full details and proofs will appear in a forthcoming extended version. We draw our conclusions in Section 5.

2 Related Work

In [5], Agarwal et al. introduced an algorithm for merging two data summaries S1 and S2 output by the Frequent algorithm. In the following, given a counter C_i, the notation C_i^e refers to the item monitored by the i-th counter, whilst C_i^f refers to its estimated frequency.

Algorithm 1 Merging Algorithm by Agarwal et al.
Require: S1, an array of counters; S2, an array of counters; k, the k-majority parameter (the number of counters is k − 1)
Ensure: an array containing k-majority candidate elements
1: procedure MERGE(S1, S2, k)    ▷ a merged summary of S1 and S2
2:   S ← COMBINE(S1, S2)
3:   if S.nz > k − 1 then    ▷ prune counters in S
4:     for i = k to 2k − 2 do
5:       C_i^f ← C_i^f − C_{k−1}^f
6:     end for
7:   end if
8:   return S[k ... (2k − 2)]    ▷ return the last k − 1 counters
9: end procedure

The algorithm works as follows. It starts by combining the two data summaries as usual, adding the frequencies of counters monitoring the same item. This could entail, for Frequent summaries, the use of up to 2k − 2 counters in the worst case, when S1 and S2 share no item. Let S be the combined summary, and S.nz the number of nonzero counters. Moreover, assume, without loss of generality, that the total number of counters in S, denoted by S.length, is exactly 2k − 2 and that they are stored in ascending sorted order. Indeed, it is always possible to pad the first S.length − S.nz positions in S with dummy counters whose frequency is zero.

If S.nz ≤ k − 1, the algorithm returns the last k − 1 counters of S. Otherwise, a pruning operation is required: the algorithm subtracts from the last k − 1 counters the frequency of the counter C_{k−1} and returns the pruned counters. The algorithm requires, in the worst case, time linear in the total number of counters, i.e., O(k) if implemented as described in [5] using a hash table.

We now analyze the total error committed by this algorithm. Clearly, combining the two data summaries can be done without any additional error. However, the pruning operation occurring when the size of S is greater than k − 1 induces a total error E_T = (k − 1) C_{k−1}^f, i.e., k − 1 times the frequency of the counter C_{k−1} in S. The authors proved that the additional error introduced by the merge is within the error bound guaranteed by Frequent.
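The pseudocode above can be rendered in Python as follows; this is our own illustrative sketch, not the authors' implementation. We model a summary as a dict mapping item to estimated frequency, and the helper names combine and merge_agarwal are our assumptions:

```python
from collections import defaultdict

def combine(s1, s2, k):
    """Combine two Frequent summaries (dicts item -> estimated frequency)
    by adding the frequencies of counters monitoring the same item, then
    pad with dummy zero-frequency counters so that exactly 2k - 2 counters
    are stored in ascending frequency order, as assumed in the paper."""
    merged = defaultdict(int)
    for summary in (s1, s2):
        for item, f in summary.items():
            merged[item] += f
    counters = sorted(merged.items(), key=lambda c: c[1])
    pad = (2 * k - 2) - len(counters)
    return [(None, 0)] * pad + counters  # dummies fill the first positions

def merge_agarwal(s1, s2, k):
    """Algorithm 1: if more than k - 1 nonzero counters survive the combine
    step, subtract the frequency of counter C_{k-1} from each of the last
    k - 1 counters before returning them."""
    s = combine(s1, s2, k)
    nz = sum(1 for _, f in s if f > 0)
    if nz > k - 1:
        f_pivot = s[k - 2][1]  # C_{k-1}^f in the paper's 1-based indexing
        return [(e, f - f_pivot) for e, f in s[k - 1:]]
    return s[k - 1:]  # the last k - 1 counters
```

For instance, with k = 3, S1 = {a: 5, b: 3} and S2 = {a: 2, c: 4}, the padded combined summary is [(dummy, 0), (b, 3), (c, 4), (a, 7)]; since three counters are nonzero, the last two frequencies are reduced by C_2^f = 3, yielding [(c, 1), (a, 4)] with total error (k − 1) C_{k−1}^f = 6.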
3 New Merging Algorithm

In this Section we present our algorithm for merging two Frequent summaries, shown in pseudo-code as Algorithm 2.

Algorithm 2 Merging Algorithm for Frequent summaries.
Require: S1, an array of counters; S2, an array of counters; k, the k-majority parameter (the number of counters is k − 1)
Ensure: an array containing k-majority candidate elements
1: procedure MERGE(S1, S2, k)    ▷ a merged summary of S1 and S2
2:   S ← COMBINE(S1, S2)
3:   if S.nz ≤ k − 1 then
4:     return S[k ... (2k − 2)]    ▷ return the last k − 1 counters
5:   else    ▷ build the merged summary M, consisting of counters monitoring item e_i with frequency f_i, i = 1, ..., k − 1, as follows:
6:     e_1 ← C_k^e
7:     f_1 ← C_k^f − C_{k−1}^f
8:     M[1] ← (e_1, f_1)
9:     for i = 2 to k − 1 do
10:      e_i ← C_{k−1+i}^e
11:      f_i ← C_{k−1+i}^f − C_{k−1}^f + C_{i−1}^f
12:      M[i] ← (e_i, f_i)
13:    end for
14:    return M
15:  end if
16: end procedure

Algorithm 2 starts by combining the two input summaries into a combined summary S. Then, if the number of nonzero counters in S is less than or equal to k − 1, the algorithm returns as merged summary the last k − 1 counters of S. Otherwise, the last k − 1 counters are first updated using exact closed-form equations and then reported as output. These determining equations produce the same merged summary that we would obtain by applying the Frequent algorithm to the combined summary S, a procedure we described and proved to be correct in [1]. Indeed, in [1] a slightly modified version of Frequent is used on S, in which the update step is carefully modified so that each update still requires O(1) time in the worst case. These modifications simply consist in one-shot updates: for each item in S to be processed, we increment in one shot the counter in charge of monitoring it by a number of occurrences equal to the item's counter in S.
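Under the same illustrative Python model used for Algorithm 1 (dict-based summaries; the combine helper is our own assumption, repeated here so the sketch is self-contained), the closed-form update of Algorithm 2 can be sketched as:

```python
def combine(s1, s2, k):
    """Combine two Frequent summaries (dicts item -> estimated frequency),
    pad with zero-frequency dummies up to 2k - 2 counters, and sort by
    ascending frequency, as assumed in the paper."""
    merged = dict(s1)
    for item, f in s2.items():
        merged[item] = merged.get(item, 0) + f
    counters = sorted(merged.items(), key=lambda c: c[1])
    return [(None, 0)] * ((2 * k - 2) - len(counters)) + counters

def merge_new(s1, s2, k):
    """Algorithm 2: build the merged summary from the last k - 1 counters
    via the closed-form equations, instead of a uniform subtraction."""
    s = combine(s1, s2, k)
    nz = sum(1 for _, f in s if f > 0)
    if nz <= k - 1:
        return s[k - 1:]  # the last k - 1 counters
    # e_1 = C_k^e and f_1 = C_k^f - C_{k-1}^f (0-based: C_j is s[j - 1])
    m = [(s[k - 1][0], s[k - 1][1] - s[k - 2][1])]
    for i in range(2, k):
        e_i = s[k - 2 + i][0]  # C_{k-1+i}^e
        # f_i = C_{k-1+i}^f - C_{k-1}^f + C_{i-1}^f
        f_i = s[k - 2 + i][1] - s[k - 2][1] + s[i - 2][1]
        m.append((e_i, f_i))
    return m
```

On an instance where the two summaries share no item, e.g. k = 3, S1 = {a: 5, b: 3}, S2 = {c: 4, d: 2}, the combined summary sorted by frequency is [(d, 2), (b, 3), (c, 4), (a, 5)] and merge_new returns [(c, 1), (a, 4)]: a total error of (4 − 1) + (5 − 4) = 4 relative to S, whereas Algorithm 1 returns [(c, 1), (a, 2)] with error (k − 1) C_{k−1}^f = 6, in line with the comparison made in Section 4.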
In the next Section, we show the determining equations, state the correctness of the algorithm, and analyze its worst-case complexity and the total error committed. The main result of the paper is the proof that the following properties hold for our algorithm: (i) it retains the same complexity as the algorithm proposed by Agarwal et al. [5], and (ii) its total error committed is smaller than or equal to that of [5].

4 Analysis

4.1 Complexity Analysis

Lemma 1. The computational complexity of our Algorithm 2 is O(k) in the worst case.

4.2 Correctness of Algorithm 2

By construction, the combine step producing S preserves the frequent items in S1 ⊎ S2, since no item is discarded and no occurrences are lost. Therefore, it suffices to show that our closed-form equations produce the same merged summary which would be output by an application of Frequent (the one-shot update version) to the combined summary.

Let S.length = 2k − 2 and assume k ≤ S.nz ≤ 2k − 2. We denote by C_j the j-th counter in S, j = 1, ..., 2k − 2, and by e_j^i and m_j^i, respectively, the item monitored by the j-th counter of Frequent (denoted as M_j) and its value at the end of the i-th update step, i = 0, ..., k − 1 and j = 1, ..., k − 1. We define e_j^0 = C_j^e and m_j^0 = C_j^f, j = 1, ..., k − 1. Indeed, step zero reflects the situation in which we have already filled the first k − 1 counters in the Frequent data structure with the corresponding initial k − 1 counters in S. This is correct owing to the following facts: (i) the counters in S are stored in ascending sorted order with respect to the frequencies, (ii) the items in S are distinct, and (iii) Frequent works by assigning an item which is not currently monitored to a new counter, if available, maintaining the ascending sorted order with respect to the frequencies.

Theorem 1. For each update step i = 1, ..., k − 1 and position j = 1, ..., k − 1, the values e_j^i and m_j^i can be defined as follows:

e_j^i = C_{i+j}^e,   j = 1, ..., k − 1   (1)

m_j^i = C_{i+j}^f − C_i^f                  for j = 1, ..., k − i,
m_j^i = C_{i+j}^f − C_i^f + C_{i+j−k}^f    for j = k − i + 1, ..., k − 1.   (2)

4.3 Total Error Committed By Algorithm 2

In what follows, we assume that after the combine step we are left with a data summary S consisting of more than k − 1 nonzero counters. Otherwise, neither algorithm commits any additional error, owing to the fact that the combine step obviously does not incur any error. Therefore, assuming that S consists of more than k − 1 nonzero counters, the total error committed by our algorithm is the total error committed by Frequent when applied to S. The counters' frequencies at the end of the (k − 1)-th update step are m_j^{k−1}, j = 1, ..., k − 1. Consequently, since Frequent underestimates the frequencies, the total error committed is

E_T = Σ_{j=1}^{k−1} (C_{k−1+j}^f − m_j^{k−1}).   (3)

We claim that the total error committed by Algorithm 2 is less than or equal to the total error committed by the algorithm of [5].

Theorem 2. The following inequality holds:

Σ_{j=1}^{k−1} (C_{k−1+j}^f − m_j^{k−1}) ≤ (k − 1) C_{k−1}^f.   (4)

5 Conclusions

In this paper we have introduced a new algorithm for merging Frequent summaries and compared it to the algorithm proposed by Agarwal et al. from a theoretical perspective. Our algorithm uses exact closed-form equations for determining the outputs; we have shown that it retains the same computational complexity, whilst providing better frequency estimation. Future work includes designing and carrying out several numerical experiments in order to compare the two algorithms we have discussed from a quantitative perspective.

References

1. Cafaro, M., Tempesta, P.: Finding frequent items in parallel. Concurr. Comput.: Pract. Exper. 23 (2011) 1774–1788
2. Demaine, E.D., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: ESA. (2002) 348–360
3. Cafaro, M., Pulimeno, M., Tempesta, P.: A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution.
Information Sciences 329 (2016) 1–19
4. Cafaro, M., Pulimeno, M., Epicoco, I., Aloisio, G.: Mining frequent items in the time fading model. Information Sciences 370–371 (2016) 221–238
5. Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J., Wei, Z., Yi, K.: Mergeable summaries. In: Proceedings of the 31st Symposium on Principles of Database Systems. PODS '12, New York, NY, USA, ACM (2012) 23–34