Summary Extraction on Data Streams in Embedded Systems

Sebastian Buschjäger, Katharina Morik, and Maik Schmidt

TU Dortmund University, Computer Science VIII: Artificial Intelligence Unit
{sebastian.buschjaeger, katharina.morik}@tu-dortmund.de
http://www-ai.cs.uni-dortmund.de/

Abstract. More and more data is created by humans and by cyber-physical systems with sensing, acting and networking capabilities. Together, these systems form the Internet of Things (IoT). The realtime analysis of its data may provide us with valuable insights about the complex inner processes of the IoT. Moreover, these insights offer new opportunities ranging from sensor monitoring to actuator control. The volume and velocity of the data at the distributed nodes challenge human as well as machine monitoring of the IoT. Broadcasting all measurements to a central node might exceed the network capacity as well as the resources at the central node or the human attention span. Hence, data should be reduced already at the local nodes such that the transmitted information can be used for efficient monitoring. Several methods aim at data summarization, ranging from clustering and aggregation to compression. Whereas most of these approaches transform the representation, we want to select unchanged data items from the data stream, already while they are generated by the cyber-physical system and directly at the cyber-physical system. The observations are selected independently of their frequencies and are meant to be transmitted efficiently. In the ideal case, no important measurement is missing from the selection and no redundant items are transmitted. The data summary is easily interpreted and is available in realtime. We focus on submodular function maximization due to its strong theoretical background. We investigate its use for data summarization and enhance the Sieve-Streaming algorithm for data summarization on data streams such that it delivers smaller sets with high recall.

Keywords: Data summarization, Embedded systems, Streaming Data

1 Introduction

With increasing processing capabilities, more and more data is gathered every day at virtually any place on earth. Most of this data is produced by small embedded electronics with sensing, acting and networking capabilities [11]. To capitalize on the vast amount of data produced by these IoT devices, machine learning has proven a valuable tool, but it is usually limited to large server systems due to resource constraints. In order to speed up the analysis for monitoring, we wish to move parts of the analysis to the measuring device by means of data summarization. For numerical data streams, moving averages have been used. For categorical values, data summarization describes the extraction of representative items from a data set and has been used for texts [7,8], speech [14], and images [13]. Informally, a summary should contain a few but informative items, where each item reflects one hidden state or class of the underlying system. Given such a summary, we can use it to

– perform certain actions, once a new item is added to the summary
– compare summaries of different devices and points in time
– provide a quick overview for human experts

Fig. 1: Use cases of data summarization in IoT streaming settings. Each device gathers a data summary on its own and communicates it to a central server or to other IoT devices. Human experts can examine summaries and perform actions if needed.
Data summarization has been viewed as a submodular function maximization problem, which offers an appealing theoretical framework with strong quality bounds. In this paper, we focus on data summarization on data streams by means of submodular function maximization, namely the Sieve-Streaming algorithm [1]. The main question when computing a data summary is which items are "novel" enough to be included in the summary and which are too similar to items already seen. If this novelty threshold is chosen too small, the summary will contain redundant items, whereas if the required novelty is too high, we will not add any item to the summary. Sieve-Streaming deals with this challenge by managing multiple summaries in parallel, each with its own novelty threshold. We will exploit the ideas of Sieve-Streaming for the monitoring of distributed measuring devices. Hence, we first investigate whether Sieve-Streaming is appropriate for our setting. Additionally, we extend the algorithm for use in embedded systems:

– We reduce the number of summaries by tighter bounds on the utility threshold used by Sieve-Streaming without sacrificing its theoretical guarantees.
– By dynamic thresholding, we further improve the quality of Sieve-Streaming without increasing its computational costs.
– We evaluate the enhancements empirically.

The paper is organized as follows. The next section gives a brief overview of the Sieve-Streaming algorithm. Section 3 presents our enhancements of the Sieve-Streaming algorithm in detail. Section 4 presents experiments on three different data sets to evaluate data summarization in general and our enhancements of Sieve-Streaming in particular. We conclude the paper in Section 5.

2 Sieve-Streaming

In this section, we briefly present the Sieve-Streaming algorithm and introduce the notation used in this paper. A more detailed overview of submodular function maximization can be found in [4]. Submodular function maximization considers the problem of selecting a set S ⊆ V with fixed size |S| = K from a ground set V, such that a set function f : 2^V → ℝ is maximized, which assigns a utility score to each possible subset S ⊆ V. For set functions, the marginal gain is given by the increase in f(S) when adding an element e ∈ V to S:

∆f(e|S) = f(S ∪ {e}) − f(S)

The function f is called monotone iff for all e ∈ V and for all S ⊆ V it holds that ∆f(e|S) ≥ 0. Furthermore, we call f submodular iff for all A ⊆ B ⊆ V and e ∈ V \ B it holds that

∆f(e|A) ≥ ∆f(e|B)

Submodular functions incorporate a notion of diminishing returns: adding a new element to a smaller set will increase the utility value more than adding the same element to a larger set. This property intuitively captures one aspect of summarization, in which we seek small but expressive summaries, and thus adding elements to a summary should be done with care. Now, let us apply monotone submodular function maximization in a streaming setting, so that data items e arrive one at a time and we have to decide immediately whether we want to add e to S or not. A number of different approaches for streaming settings have been proposed in the literature with different constraints and assumptions about the data (see [9] for a very recent overview). The approach of Badanidiyuru and colleagues fits our setting best [1]. Submodularity and monotonicity allow us to greedily increase the utility value by simply adding new items to the summary.
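To make these definitions concrete, the following small, self-contained sketch (our illustration, not part of the paper) evaluates marginal gains for a toy coverage function and demonstrates the diminishing-returns property; the function and the example sets are purely illustrative.

# Toy illustration of marginal gains and diminishing returns (submodularity)
# using a coverage function f(S) = |union of the sets in S|. Illustrative only.

def coverage(S):
    # Utility f(S): number of distinct elements covered by the sets in S.
    covered = set()
    for s in S:
        covered |= s
    return len(covered)

def marginal_gain(f, e, S):
    # Delta f(e | S) = f(S united with {e}) - f(S).
    return f(S + [e]) - f(S)

e = frozenset({1, 2, 3})                     # candidate item
A = [frozenset({1, 4})]                      # small summary A
B = [frozenset({1, 4}), frozenset({2, 5})]   # larger summary B, with A "contained in" B

# Diminishing returns: the gain of e w.r.t. the smaller summary A is at least
# as large as the gain w.r.t. the larger summary B.
print(marginal_gain(coverage, e, A))   # 2 (elements 2 and 3 are new)
print(marginal_gain(coverage, e, B))   # 1 (only element 3 is new)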
Since the maximum summary size is restricted to K, such a greedy approach would simply pick the first K elements without considering other items in the stream. Thus, we should add items to the summary only if they offer a reasonable increase in the summary's utility value, i.e. add enough novelty. Let OPT = max_{S⊆V, |S|=K} f(S) denote the maximum possible function value. Assume we know OPT beforehand and we have already selected some points in S. In this situation we could expect the next item e to add at least (OPT − f(S)) / (K − |S|) to the utility, so that we can reach OPT with the remaining K − |S| elements. In a sense, this approach sieves out elements with marginal gains below the given threshold. However, this approach does not work if the optimal summary contains many elements with a gain just below the threshold and one element with a large marginal gain. In this case, we could miss the items with smaller marginal gain while waiting for an item with large gain. To counter this behavior, the authors of [1] propose to lower the threshold by 1/2, adding e to S if the following holds:

∆f(e|S) ≥ (OPT/2 − f(S)) / (K − |S|)   (1)

Knowing the optimal utility value OPT is part of the problem we would like to solve. Usually, this value is unknown beforehand and needs to be estimated. In order to guarantee a good estimate, we can generate multiple estimates for OPT before running the algorithm and manage multiple sieves, each with its own threshold value, in parallel. We can narrow down the value of OPT by using the properties of submodularity. To do so, let m = max_{e∈V} f({e}) denote the maximum value of a singleton item. Given that e occurs at least once in the data stream, the worst optimal summary would contain e at least once. If e, however, occurs at least K times in the data stream, then the best summary would contain e K times, so that:

m ≤ OPT ≤ K · m   (2)

Therefore, knowing the maximum singleton value f({e}) already gives a rough estimate of the range of values we can expect from f. This knowledge can now be used to sample different threshold values from the interval [m, Km] such that one of these thresholds will be close to OPT. More formally, we manage different summaries in parallel, each using one threshold from the set O = {(1 + ε)^i | i ∈ ℤ, m ≤ (1 + ε)^i ≤ K · m}, so that for at least one v ∈ O it holds that (1 − ε)OPT ≤ v ≤ OPT. Algorithm 1 summarizes the proposed method. It can be shown that Algorithm 1 extracts a summary with:

f(S) ≥ (1/2 − ε) · OPT   (3)

Algorithm 1 Sieve-Streaming with known singleton maximum value m.
1: O = {(1 + ε)^i | i ∈ ℤ, m ≤ (1 + ε)^i ≤ K · m}
2: for v ∈ O do
3:   S_v = ∅
4: end for
5: for next item e do
6:   for v ∈ O do
7:     if ∆f(e|S_v) ≥ (v/2 − f(S_v)) / (K − |S_v|) and |S_v| < K then
8:       S_v = S_v ∪ {e}
9:     end if
10:   end for
11: end for

2.1 Utility function

Given a summary S = {e_1, . . . , e_K}, we can express the similarity between elements e_i and e_j using a kernel function k(e_i, e_j), which gives rise to the kernel matrix Σ = [k(e_i, e_j)]_{ij} containing all similarity pairs. This matrix has the largest values on the diagonal, as items are most similar to themselves, whereas values on the off-diagonal indicate the similarity between distinct elements and are thus usually smaller. Intuitively, we seek an expressive summary, so that Σ contains many values near 0 on its off-diagonal. This intuition is formally captured by the Informative Vector Machine [5], which proposes the function

f(S) = 1/2 · logdet(I + σΣ)

where I denotes the K × K identity matrix and σ > 0 denotes some scaling parameter. In [10] it was shown that this function is monotone and submodular.
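As an aside, a compact sketch of Algorithm 1 combined with this log-det utility could look as follows. This is our own illustration under stated assumptions (RBF kernel with bandwidth gamma, default σ = 1, ε as a parameter); it is not the authors' reference implementation, and for brevity it recomputes the utility from scratch instead of using the incremental update discussed later in Section 3.

import numpy as np

def rbf(x, y, gamma=0.5):
    # RBF kernel k(x, y) = exp(-gamma * ||x - y||^2); values lie in (0, 1].
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def utility(S, sigma=1.0):
    # f(S) = 1/2 * logdet(I + sigma * Sigma), with Sigma the kernel matrix of S.
    if not S:
        return 0.0
    Sigma = np.array([[rbf(a, b) for b in S] for a in S])
    _, ld = np.linalg.slogdet(np.eye(len(S)) + sigma * Sigma)
    return 0.5 * ld

def sieve_streaming(stream, K, m, eps=0.1):
    # Thresholds O = {(1 + eps)^i : m <= (1 + eps)^i <= K * m}.
    i_lo = int(np.ceil(np.log(m) / np.log(1.0 + eps)))
    i_hi = int(np.floor(np.log(K * m) / np.log(1.0 + eps)))
    sieves = {(1.0 + eps) ** i: [] for i in range(i_lo, i_hi + 1)}
    for e in stream:
        for v, S in sieves.items():
            if len(S) >= K:
                continue                        # this sieve is already full
            f_S = utility(S)
            gain = utility(S + [e]) - f_S       # marginal gain of e w.r.t. S
            if gain >= (v / 2.0 - f_S) / (K - len(S)):
                S.append(e)                     # e is novel enough for this sieve
    # Return the summary with the highest utility across all sieves.
    return max(sieves.values(), key=utility)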
3 Embedded Summary Extraction

In the previous section, we presented the Sieve-Streaming algorithm. Now, we will introduce the changes that make it better suited for summarization in embedded systems. First, we show how to reduce the number of sieves without sacrificing performance guarantees. Second, we introduce dynamic thresholding, which further increases the utility value. Last, we consider the trade-off between memory consumption and runtime for an efficient implementation.

Reduce the number of sieves: Sieve-Streaming needs to keep track of multiple sieves in parallel with corresponding thresholds chosen from the interval [m, Km]. By taking the special structure of f into account, we can reduce the size of this interval and thus the overall memory requirements. To do so, we assume that the kernel function k(·, ·) and its corresponding kernel matrix are positive definite. The upper bound Km of the threshold v is sharp and thus cannot be improved. Intuitively, a summary reaches its maximum function value when all entries on the off-diagonal are 0. In this case, logdet(I + σΣ) equals log((1 + σ)^K) = K · log(1 + σ) = K · m, which is exactly what we derived using submodularity. The lower bound m, however, can be improved substantially: a summary reaches its minimum utility when all entries on the off-diagonal equal the largest possible kernel value C, that is, when we picked the same element K times. W.l.o.g. we assume that C = 1 in the following, e.g. we use the RBF kernel k(x, y) = exp(−(1/2)‖x − y‖²). Note that it is possible to scale every positive definite kernel to the interval [0, 1] without damaging its positive definiteness (cf. [3]). By using Sylvester's determinant theorem, we see that logdet(I + σΣ) = logdet(I + σ11ᵀ) = log(1 + σK) > m = log(1 + σ), where 1 denotes the vector in which all entries are 1. In order to show that this intuition really yields a lower bound, we use the characteristic polynomial det(λI − (I + σΣ)) = ∑_{i=0}^{K} (−1)^i a_i λ^{K−i}. The determinant can be computed in terms of the characteristic coefficients a_i, that is,

det(I + σΣ) = ∑_{i=0}^{K} a_i = 1 + det(σΣ) + tr(σΣ) + ∑_{i=2}^{K−1} a_i.

Since I + σΣ is positive definite, each coefficient a_i can be expressed in terms of the eigenvalues of I + σΣ, which are all positive [2]. Since σΣ is also positive definite, we see that det(σΣ) ≥ 0. This then leads to

det(I + σΣ) = 1 + det(σΣ) + tr(σΣ) + ∑_{i=2}^{K−1} a_i ≥ 1 + tr(σΣ) = 1 + σK

and thus logdet(I + σΣ) ≥ log(1 + σK).

Dynamic Thresholding: Sieve-Streaming creates a fixed number of sieves |O| at the beginning of the algorithm. Due to the thresholding approach, sieves with smaller thresholds will accept more items and thus fill up more quickly, whereas sieves with large thresholds are more picky. Once a summary is full, the corresponding sieve is not used anymore. This allows us to create another sieve in its place with a different threshold while keeping the overall number of sieves constant. In turn, we can use this new threshold to refine the estimate of OPT. Let v_c denote the largest threshold corresponding to a full summary and let v_o denote the smallest threshold corresponding to a non-full summary. Note that by construction v_o > v_c and that, given the current summaries, OPT appears to lie in the interval [v_c, v_o]. Therefore, we can refine the estimate of OPT if we create a new sieve in [v_c, v_o]. With this approach, "older" sieves with larger thresholds may contain more items than "newer" sieves.
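The re-sieving step just described could be sketched as follows. This is our reading of the idea; the helper name refine_sieves is hypothetical and the exact placement of the new threshold inside [v_c, v_o] is an assumption, as the paper does not fix a placement rule.

def refine_sieves(sieves, K):
    # `sieves` maps a threshold v to its current summary (a list of items).
    full = [v for v, S in sieves.items() if len(S) >= K]
    open_ = [v for v, S in sieves.items() if len(S) < K]
    if not full or not open_:
        return sieves                     # nothing to refine yet
    v_c = max(full)                       # largest threshold with a full summary
    v_o = min(open_)                      # smallest threshold with a non-full summary
    # Given the current summaries, OPT appears to lie in [v_c, v_o]; spend the
    # freed budget on a new, empty sieve inside that interval. Placing it at
    # the midpoint is our own choice for this sketch.
    v_new = 0.5 * (v_c + v_o)
    if v_new not in sieves:
        sieves[v_new] = []
    return sieves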
In turn, one of these "older" sieves may become full at some point, whereas the newly created sieves with smaller thresholds are still accepting elements. Since the thresholds of these sieves are smaller than that of the sieve that just became full, we know that these sieves will not offer any better utility values. Therefore, we can close these sieves once a sieve with a larger threshold closes. In practice, one rather unintuitive effect may occur, in which the summaries of all sieves might be full at some point in time. In pure Sieve-Streaming, sieves are created in the interval [m, Km], where the upper bound Km represents the largest utility possible. This bound is independent of the data and therefore we expect that at least one sieve will remain open all the time, aiming for an extreme case with utility Km. Recall, however, that only half of each threshold (see Eq. 1) is used when the marginal gain of a new item is evaluated. Thus, Sieve-Streaming effectively uses (1/2)Km as upper bound. This is enough to ensure a (1/2 − ε) approximation, but might be sub-optimal in some cases. To counter this, we gradually increase the interval of thresholds to 2Km once we notice that all sieves are closed.

Implementation: In a naive approach, each sieve needs to store K elements and compute logdet(I + σΣ) on the fly as required. Computation of the determinant is generally performed in cubic time, whereas the computation of Σ takes O(K² · d), leading to a total of O(K³ + K² · d) per element. To improve the overall speed, we can trade some runtime for memory if we store Σ permanently. Since k(·, ·) is positive definite, I + σΣ is also positive definite. Thus, we may use a Cholesky decomposition I + σΣ = LLᵀ, where L denotes a lower triangular matrix, to compute the determinant. Adding a new row / column to a Cholesky decomposition can be performed in O(K²) time. Thus, when adding a new element to the summary, we compute all necessary kernel values and update the Cholesky decomposition accordingly, leading to O(K² + K · d) computations per element. However, with this method the memory requirements increase, since Σ and L need to be stored continuously, leading to O(2K² + K) memory instead of O(K) memory.

4 Experiments

In this section we experimentally evaluate the enhancements introduced to Sieve-Streaming. More specifically, we want to answer the following questions:

1. Are extracted summaries meaningful in the sense that they represent the hidden states of a system?
2. How does the runtime decrease when a reduced number of sieves is used?
3. How does the utility value increase when dynamic thresholding is used and how does this affect the quality of the summary?

Answering the first question is difficult, because we need to know the data generation process in detail to detect the hidden states of a system. For real-world data this is nearly impossible. Hence, we use data sets D = {(x_i, y_i) | i = 1, . . . , N} usually found in classification tasks. Here, each observation x_i ∈ ℝ^d is also associated with a label y_i. We assume that each label y_i represents one hidden state of the system in the data generation process. Thus, we compute a summary of the training data x_i without considering the labels y_i and then report the number of different labels represented by the best summary found across all sieves. With respect to the notation used in classification tasks, we denote this measure as recall.
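A minimal sketch of this recall measure, under our reading of the description above (function name and data layout are illustrative):

def state_recall(summary_idx, y):
    # Fraction of distinct labels (hidden states) covered by the summary.
    # summary_idx: indices of the selected items, y: label of every item.
    covered = {y[i] for i in summary_idx}
    return len(covered) / len(set(y))

# Hypothetical example: a summary of five items covers 3 of 4 states.
y = [0, 0, 1, 1, 2, 2, 3, 3]
print(state_recall([0, 2, 3, 4, 5], y))   # 0.75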
To answer the second and third question, we measure the computation time per data item and the best utility value of each algorithm (pure, reduced, and dynamic). All experiments were performed on an Intel i7-6700 machine with 3.4 GHz and 32 GB RAM. We repeated each experiment 10 times with shuffled data sets and always report average results. Our implementation is available at https://bitbucket.org/sbuschjaeger/iotstreaming2017. Figure 3 contains all experimental results. Each column represents one data set, whereas each row represents one measure. We use one synthetic data set and two real-world data sets.

Synthetic data: We composed a synthetic data set containing eight four-dimensional Gaussian distributions as a Gaussian mixture model. Each Gaussian mean value µ = (µ_i)_i is generated uniformly at random with −2 ≤ µ_i ≤ 3 for i = 1, 2, 3, 4. Variance values are chosen uniformly at random from (0, 0.8]. For the data generation process, we first randomly pick one of the Gaussian distributions and then sample a new data item x ∈ ℝ⁴ from this distribution. We repeat this process 10000 times. To simulate infrequent states, e.g. failures, we introduced different probabilities for each Gaussian. Two Gaussians are rarely selected, with a probability of 0.001 each, whereas the remaining six Gaussians receive the remaining probability mass equally, that is, each has a probability of roughly 0.166 to be used for data generation. We use the RBF kernel k(x_i, x_j) = exp(−‖x_i − x_j‖² / 10) and set σ = 1 and ε = 0.1. Since there are eight hidden states in total, a summary of size K = 8 should in theory be enough to detect all states of the system. To account for some error, however, we vary K from 10 to 24.

The first image in the first row of Figure 3 shows the recall of all three algorithms on the synthetic data set. Sieve-Streaming and its tuned counterpart achieve roughly 60% recall for K = 10 and then steadily increase their recall with rising K, reaching nearly 100% at K = 20. Dynamic Sieve-Streaming shows a similar performance, but behaves slightly better for smaller K. Interestingly, for K = 16 all variants produce the same output, and for K = 18 Dynamic Sieve-Streaming is outperformed by pure Sieve-Streaming. For larger K, however, Dynamic Sieve-Streaming detects all states with 100% recall at K = 24. Note that a potential "failure" state has a probability of 0.001, that is, roughly every 1000 measurements such a "failure" may occur. Usually, a human operator would need to examine all 1000 measurements in this case to determine if there is a failure or not. Using a summary with K = 24, however, a human operator only needs to examine 24 different measurements, reducing the overall workload for the human operator by 99.976%. Looking at the second row in Figure 3, we see that Sieve-Streaming and its implementation on a smaller interval offer the same utility, which is to be expected. Dynamic Sieve-Streaming consistently achieves higher utility values, explaining the higher recall. The last row of plots depicts the runtime results. Sieve-Streaming needs around 0.45 ms per element, whereas its reduced variant is faster for all K. Dynamic Sieve-Streaming, on the other hand, needs up to 0.52 ms per element due to its dynamic sieve management.
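For concreteness, the synthetic stream described above could be generated roughly as follows. This is a sketch of our reading of the setup; seeding and other details of the released implementation may differ.

import numpy as np

def synthetic_stream(n=10000, d=4, n_states=8, n_rare=2, seed=0):
    # Gaussian mixture stream: two rare ("failure") states with probability
    # 0.001 each, the remaining probability mass split equally over the
    # frequent states. Means and variances chosen as described above.
    rng = np.random.default_rng(seed)
    means = rng.uniform(-2.0, 3.0, size=(n_states, d))
    variances = rng.uniform(1e-6, 0.8, size=n_states)        # roughly (0, 0.8]
    probs = np.full(n_states, (1.0 - 0.001 * n_rare) / (n_states - n_rare))
    probs[:n_rare] = 0.001
    for _ in range(n):
        s = rng.choice(n_states, p=probs)                    # pick a hidden state
        x = rng.normal(means[s], np.sqrt(variances[s]))      # sample one item
        yield x, s                                           # item and its state label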
UJIndoor Location: As a second data set, we use the UJIndoorLoc data set [12]. This data set contains the GPS position and RSSI fingerprints of nearby WiFi hotspots, as well as the corresponding building, floor and room number, of 17 smartphone users throughout their daily routine at Universitat Jaume I in Spain. The data set contains 19937 measurements in total and is usually used for prediction tasks in which one tries to infer the GPS position based on the RSSI fingerprints. We discard the RSSI fingerprints and try to predict the semantic location, that is, the building, floor and room number, directly based on the current GPS position. We normalize the GPS data and use the RBF kernel k(x_i, x_j) = exp(−‖x_i − x_j‖² / 0.005). We set σ = 1 and ε = 0.1. On average, each user visits 79 different locations, thus we try different K from 80 to 130.

The second column in Figure 3 shows the results for this data set. The recall is displayed in row one. Standard Sieve-Streaming and its reduced counterpart achieve between 50% and 60% recall, whereas Dynamic Sieve-Streaming manages to detect roughly 10% more items, offering a recall of 60% to 70%. Again, this increase in recall can be explained by a higher utility value: Dynamic Sieve-Streaming increases the utility value consistently by roughly 10 points compared to Sieve-Streaming. The runtime of the three algorithms (second column and third row of Figure 3) paints a different picture than before. Again, pure Sieve-Streaming is the slowest algorithm, whereas Sieve-Streaming on the smaller interval reduces the overall computational costs by around 0.2 ms per element. For smaller K, Dynamic Sieve-Streaming lies between the two algorithms. For larger K, however, Dynamic Sieve-Streaming seems to become faster than the reduced version of Sieve-Streaming. The reason for this is that Sieve-Streaming may exit a data set early if all sieves are full. Since Dynamic Sieve-Streaming reopens new sieves, it runs through the complete data set. This, in turn, enables cache locality and branch prediction to take full effect, so that the runtime per element decreases.

MNIST: As a third data set, we use the well-known MNIST data set [6]. This data set contains 60000 grayscale images of size 28 × 28 showing the handwritten digits 0 to 9 and is usually used as a baseline for image classification. Much in line with the usual classification task, we try to extract a summary containing one representative image for each digit. However, we do this in an unsupervised manner. Again, we normalize the data and use the RBF kernel k(x_i, x_j) = exp(−‖x_i − x_j‖² / 784). We set σ = 1 and ε = 0.1. Since there are 10 classes in total, we vary K from 8 to 16.

The results are shown in the third column of Figure 3. All algorithms perform nearly the same, with Dynamic Sieve-Streaming being slightly better for all values of K, reaching a recall between 60% and 80%. Again, the utility value for Dynamic Sieve-Streaming is consistently larger than for pure and reduced Sieve-Streaming, explaining the increased recall. Figure 2 displays two summaries for K = 8, computed by pure Sieve-Streaming and Dynamic Sieve-Streaming, respectively. Both algorithms produce a similar summary, but pure Sieve-Streaming selects multiple ones and fives. Dynamic Sieve-Streaming also selects multiple ones, but does not include two fives in the summary and thus can add a six, a seven and an eight, increasing the recall significantly.
Looking at the runtime, standard Sieve-Streaming is again the slowest algorithm variant, reaching a computation time of 1 ms per element, whereas reduced and dynamic Sieve-Streaming are both faster, with a maximum of 0.65 ms per element. Note that we again observe the effect that Dynamic Sieve-Streaming seems to be faster than its reduced counterpart due to cache locality and branch prediction.

5 Conclusion

In this paper we tackled the problem of data summarization on data streams with small, embedded systems. More specifically, we tried to answer the following questions:

– Is submodular function maximization a suitable framework for data summarization, and can summarization help us to detect the current state of a system?
– How can we make data summarization on data streams more suitable for small embedded systems frequently found in IoT settings?

In order to answer these questions, we first presented the Sieve-Streaming algorithm from [1]. Sieve-Streaming works on the assumption of submodularity and uses it to sieve out unwanted items. The main advantage of Sieve-Streaming is its strong and universal theoretical foundation, offering an approximation guarantee of at least (1/2 − ε). This theoretical foundation, however, leaves room for a more specialized version of Sieve-Streaming tailored towards data summarization. Therefore, we introduced a more refined version of Sieve-Streaming for data summarization by reducing the overall number of sieves needed. Additionally, we introduced a dynamic version of Sieve-Streaming, independent of the summarization task, which is able to reuse sieves with different thresholds. We note that both enhancements do not jeopardize the theoretical analysis of Sieve-Streaming and still maintain the (1/2 − ε) approximation guarantee. We evaluated our algorithmic changes on three data sets. The experiments have shown that we are able to detect the states of systems using the presented data summarization approach and that an increased utility value usually comes with an increased recall of system states. Thus, Dynamic Sieve-Streaming is a valuable enhancement of the standard Sieve-Streaming algorithm, consistently increasing the recall and leading to a 99.976% reduction of workload for human operators in our experiments. Dynamic Sieve-Streaming has its limitations. We did not always detect all hidden states in the experiments. Future work will investigate other kernel functions, whose parameters could have been tuned more carefully.

Fig. 2: Extracted summaries for MNIST using (a) pure Sieve-Streaming and (b) Dynamic Sieve-Streaming for K = 8.

Fig. 3: Experiments on synthetic data (left column), UJIndoorLoc data (middle column) and MNIST (right column). The first row depicts recall (higher is better), the second row shows the maximum utility value (higher is better) and the third row displays the runtime in milliseconds per element (smaller is better) on the data sets.

Additionally, we observe that Sieve-Streaming and its newer enhancements do not deal with concept drift.
Here, further work is also needed. In sum, Dynamic Sieve-Streaming is a helpful tool for exploring and monitoring data streams with a solid theoretical foundation. At the same time, we also see potential for new research directions, including concept drift and specialized kernel functions.

Notes and Comments. Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 (http://sfb876.tu-dortmund.de) "Providing Information by Resource-Constrained Analysis", project A1.

References

1. Badanidiyuru, A., Mirzasoleiman, B., Karbasi, A., Krause, A.: Streaming submodular maximization: Massive data summarization on the fly. In: ACM SIGKDD (2014)
2. Brooks, B.P.: The coefficients of the characteristic polynomial in terms of the eigenvalues and the elements of an n × n matrix. Applied Mathematics Letters (2006)
3. Graf, A.B., Borer, S.: Normalization in support vector machines. In: DAGM Symposium of Pattern Recognition (2001)
4. Krause, A., Golovin, D.: Submodular function maximization. In: Tractability: Practical Approaches to Hard Problems. Cambridge University Press (2014)
5. Lawrence, N., Seeger, M., Herbrich, R., et al.: Fast sparse Gaussian process methods: The informative vector machine. In: NIPS (2003)
6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
7. Lin, H., Bilmes, J.: A class of submodular functions for document summarization. In: Meeting of the Association for Computational Linguistics (2011)
8. Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A.: Fast constrained submodular maximization: Personalized data summarization. In: ICML (2016)
9. Mirzasoleiman, B., Karbasi, A., Krause, A.: Deletion-robust submodular maximization: Data summarization with "the right to be forgotten". In: ICML (2017)
10. Seeger, M.: Greedy forward selection in the informative vector machine. Tech. rep., University of California at Berkeley (2004)
11. Stolpe, M.: The internet of things: Opportunities and challenges for distributed data analysis. ACM SIGKDD Explorations Newsletter (2016)
12. Torres-Sospedra, J., Montoliu, R., Martínez-Usó, A., Avariento, J.P., Arnau, T.J., Benedito-Bordonau, M., Huerta, J.: UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In: IPIN (2014)
13. Tschiatschek, S., Iyer, R.K., Wei, H., Bilmes, J.A.: Learning mixtures of submodular functions for image collection summarization. In: NIPS (2014)
14. Wei, K., Liu, Y., Kirchhoff, K., Bilmes, J.A.: Using document summarization techniques for speech data subset selection. In: HLT-NAACL (2013)