Augmented Business Intelligence Matteo Francia Matteo Golfarelli Stefano Rizzi DISI - University of Bologna DISI - University of Bologna DISI - University of Bologna m.francia@unibo.it matteo.golfarelli@unibo.it stefano.rizzi@unibo.it ABSTRACT OLAP queries on the enterprise multidimensional cubes. The Augmented reality allows users to superimpose digital informa- quantity of data returned must be limited in size and focused tion (typically, of operational type) upon real world entities. The on the context to meet performance constraints and be easily synergy of analytical frameworks and augmented reality opens interpretable by the user; furthermore, the intrinsic dynamics of the door to a new wave of situated OLAP, in which users within AR applications asks for right-time (reasonably a few seconds) a physical environment are provided with immersive analyses of responsiveness of A-BI. local contextual data. In this paper we propose an approach that, As shown in Figure 1, given a user situated in an environment based on the sensed augmented context (provided by wearable and equipped with a smart device capable of perceiving relevant and smart devices), proposes a set of relevant analytical queries elements (i.e., the context) of such environment by means of to the user. This is done by relying on a mapping between the sensors, A-BI provides real-time generation and visualization entities that can be recognized by the devices and the elements of multidimensional analyses out of contextual features. The of the enterprise data, and also taking into account the queries context is modeled as a set of recognizable environment elements preferred by users during previous interactions that occurred (e.g., a package) plus a set of user/environmental information in similar contexts. A set of experimental tests evaluates the (e.g., the user role and the room temperature); we assume the proposed approach in terms of efficiency and effectiveness. pattern recognition capabilities necessary to recognize them are provided by the AR smart device (specifically, through the Context generation component in Figure 1). 1 INTRODUCTION A-BI keeps track of user feedbacks on queries, and adopts col- With the disruptive advances in pervasive computing and indus- laborative filtering to provide a more focused experience. With try 4.0, business intelligence is shifting its focus to the integration reference to Figure 1, this task is carried out by the Context in- of (internal) enterprise and (external) contextual data. In this con- terpretation component by relying on the query log. Although text, the synergy of analytical frameworks and augmented reality A-BI supports the possibility to learn meaningful queries from opens the door to situated OLAP, in which users within a physi- the log, its capability of returning the right information primarily cal environment are provided with immersive analyses of local comes from some a-priori knowledge provided by domain ex- contextual data. Indeed, Augmented Reality (AR), a variation of perts. This choice is not simply a solution to the well-known cold virtual reality, allows users to superimpose digital information start problem (i.e., the problem of providing significant recom- upon real world entities [1], thus determining an augmented mendations when user feedbacks are still very few); it is rather environment. Nowadays digital data returned to users are typi- a design choice aimed at enabling the system to give a useful cally operative, meaning that they either provide information on answer in complex context scenarios, where learning from the the current state of visualized objects (e.g., the temperature of a log would require too many examples. The a-priori knowledge is machine) or on the current operations to be carried out (e.g., the modeled through a set of weighted mappings between the poten- instruction to use the machinery). Conversely, limited attention tially recognizable environment entities (stored in the dictionary) has been devoted to provide analytical reports that can be useful and the cube multidimensional ones. Instead of proposing the to let the user compare the current behavior of the visualized most relevant query only (called maximal query), A-BI proposes objects with their historical behavior. a set of alternative queries to the user; all of them are related This new goal opens relevant research challenges and revamps to the current context but they are different enough to offer to many issues related to business intelligence and recommenda- the user different flavors of the same information. This phase is tion systems [2]. Indeed, when working with high-dimensional implemented by the Diversification component. contextual data (the multi-dimensional nature of the context is A-BI can be applied to different application domains ranging well understood [3, 4]), identifying insightful queries and visual- from healthcare to factories; for this reason, the main model- izations is not trivial [5]: How is the perceived context mapped ing choices underlying our approach (e.g., how to define the to enterprise data? Which data is salient with respect to the user relevance of an object in a context) have been formalized in a analysis? How can data be retrieved in real time? How do users domain-independent fashion, while domain-dependent examples interact with the retrieved information? are provided in the context of AR in a factory where a smart To the best of our knowledge, none of the context-aware rec- device is coupled with a sensor data warehouse [6]. To better ommender systems proposed in literature addresses the above understand the user/system interaction suppose that Bob, a ma- questions with reference to situated analytics in general, and to chinery inspector, is situated in a factory to investigate a malfunc- augmented reality in particular. In this paper we envision and tioning packaging machine that —possibly due to overheating— formalize a foundation for Augmented Business Intelligence (A-BI), produces dented packages. Environmental data is gathered and a framework empowering AR users with analytical information integrated from wearable and smart devices into a single context under visualization and time constraints. The analytical informa- representation. So, when looking at the machine, the AR hel- tion returned comes in the form of reports obtained by running met Bob is wearing recognizes the machinery and, knowing the role of Bob, suggests the set of analytical queries that are more © 2019 Copyright held by the author(s). Published in the Workshop Proceedings relevant according both to a-priori knowledge and to previous of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org. feedbacks given by users in similar contexts. Relevant figures Analytical Report Log diversified queries Data mart Context generation Year DeviceType Diversification Month Device Date Context maximal queries MaintenanceActivity dist = 0.5m Duration Maint.Type dist = 0.8m Context dist = 1m interpretation DeviceType Building dist = 0m Time Device Room Measurement Dictionary Value Meas.Type Figure 1: Overview of the A-BI framework; engaged entities are enclosed in dotted rectangles, while grayed elements are outside the paper scope could be the number of defective packages along time and the results [10] or databases [11], user interests [12], and social data dates and types of previous maintenance activities for that ma- [13]. Given such heterogeneity, other contributions address the chinery. Bob can either execute one of the proposed queries or integration of contextual data (e.g., [13–15]) to provide a common express a new query to obtain a different report. Finally, Bob ground for further analyses and recommendations. gives his feedback on the proposed queries, which is stored in The previous context types have been widely adopted in sev- the log. eral recommender system applications where the recommenda- To sum up, the main contributions of this paper are: tion process is activated by an explicit user-defined input state- (1) We envision an A-BI framework, its functional architec- ment (e.g., query or keywords). Examples of applications are ture, and the user/system interaction process; web query categorization [16, 17], recommendation [12, 18], and (2) We provide a method to model the a-priori system knowl- diversification [19]; query completion [8, 9]; localized web key- edge by mapping context entities to relevant multidimen- words suggestion [7]; and interactive exploration of databases sional elements; [20]. The two main differences between A-BI and these work are (3) We provide an efficient algorithm to generate relevant and (1) the multidimensional nature of the data handled and returned, diverse queries to be returned to the user; and (2) the nature of the context as well as the type of user/system (4) We propose a collaborative filtering approach to let the interaction that triggers the recommendation. As to (1), multi- system learn from user feedbacks. dimensional and hierarchical data support recommendations at different granularities, which intrinsically makes finding the best The remainder of the paper is organized as follows. Section 2 recommendation more complex; as to (2), physical contexts re- describes the related work in the field of context-aware recom- quire ad hoc solutions to choose the relevant context elements mendation systems. Section 3 formalizes the A-BI framework. due to application specificities (e.g., engagement) and to the pos- Section 4 explains how the context is used to derive a starting sibility of having in the context elements that are perceived but query for the diversification process by also considering the user that are not relevant for the user. feedbacks stored in the log, while Section 5 describes how rele- Recommender systems in business intelligence applications vant and diverse queries are generated from that query. Section 6 are well surveyed in [21]. Recommendations typically involve includes the results of experimental tests that measure the per- multidimensional queries [22] or sessions [17] (i.e., query se- formance of A-BI. Finally, Section 7 sums up our contribution quences) using query logs as contexts. These approaches are and gives future research directions. based on collaborative filtering techniques that do not synthesize new queries from existing ones, but pick queries from the log 2 RELATED WORK depending on their similarity score. Conversely, A-BI allows the A-BI can be classified as a recommender system in the area of generation of queries not already present in the log by combin- of business intelligence, based on a context made of augmented ing similar queries from the log and contextual information into entities. Although the huge amount of work in each area, to the a set of diverse queries. This assumption collocates A-BI as a best of our knowledge no approach lies at their intersection. hybrid approach to recommendation [21], differentiating A-BI Over the years, scholars have highlighted the importance of from the above-mentioned contributions in multidimensional exploiting contextual information to provide focused recommen- recommendation systems. Note that diversification and multiple dations with the nature of contexts being quite heterogeneous, recommendations are used to better meet user interests [23]. A for instance space and time [7], query logs [8, 9], statistics on further advantage of A-BI over pure collaborative filtering ap- of queries that was first studied in [28]. A GPSJ query is composed proaches is that A-BI does not suffer from the so-called cold start of joins, selection predicates, and aggregations. problem, since it is able to return an appropriate recommendation Definition 3.4 (Query). A query q on cube C = (H, M, ω) is a even when the log is empty [24]. triple q = (Gq , Pq , Mq ) where Gq is the query group-by set, i.e., In the area of AR, contexts play an even more central role. a subset of levels in the hierarchies of H ; Pq is a set of Boolean There, a context is the set of entities recognized in the environ- clauses whose conjunction defines the selection predicate for q; ment that acts as situated stimulus (i.e., object properties) to be Mq ∈ M is the set of measures whose values are returned by q. translated into inputs for a search query, is augmented with vir- tual information, and is returned to the user [25]. Scholars focus Example 3.5. With reference to the MaintenanceActivity cube, on finding proper visualization of relevant information [25, 26], the query asking for the average duration of maintenance ac- with visualization typically embedded in physical objects. While tivities for each month of 2018 and each device type is defined [26] considers multidimensional data, it is not specified how by the process of information retrieval and analysis of data at mul- Gq ={Month, DeviceType} tiple level of details is carried out. Also, these approaches do not include collaborative filtering to discover potentially useful Pq ={(Year = 2018)} information. Interestingly, although [26] does not consider ana- Mq ={Duration} lytical data, it introduces a mantra for situated analytics: “details  first, analysis, then context-on-demand” which contradicts the well-known mantra “overview first, zoom and filter, then details- Enterprise cubes are the data source for the analytical reports on-demand” [27] of classical visualization systems. Indeed, when to be returned to users according to the environment as perceived it comes to pure augmented visualizations, information is directly by the AR device. The set of data possibly perceived are listed attached to single entities [25, 26], assigning higher priority to in a dictionary that, intuitively, defines the device capabilities. details than to generic information. These data are not limited to physical objects recognized in the environment through a pattern recognition process, but may 3 BASICS include user-related information such as the user role as well as We start this section by introducing a formal setting to manipulate environmental properties such as the room temperature. multidimensional data; for simplicity we will work on a single Definition 3.6 (Dictionary). A dictionary D is a set of keys, cube and consider linear hierarchies. each coupled with a domain of values (all domains include a Definition 3.1 (Hierarchy). A hierarchy is defined as a triple NULL value). Each pair d = ⟨key, value⟩ such that value belongs h = (Lh , ≽h , ≥h ) where: to the domain of key is called an entry of D. (i) Lh is a set of categorical levels; each level l ∈ Lh is cou- NULL values are used whenever the device through which the pled with a domain Dom(l) including a set of members (all user perceives the environment successfully classifies an object domains are disjoint); but is not capable of labelling the specific instance sensed (e.g., a (ii) ≽h is a roll-up total order of Lh ; and package is recognized but the package code cannot be read). (iii) ≥h is a part-of partial order of l ∈Lh Dom(l). Ð The power of the A-BI framework comes from the ability to Exactly one level dim(h) ∈ Lh , called dimension, is such that bind the perceived information to the cube md-elements. This dim(h) ≽h l for each other l ∈ Lh . The part-of partial order is capability is rooted in a-priori knowledge that specifies which such that, for each couple of levels l and l ′ such that l ≽h l ′ , md-elements can be interesting for the user when a given dic- for each member u ∈ Dom(l) there is exactly one member u ′ ∈ tionary entry is perceived. This knowledge is defined through a Dom(l ′ ) such that u ≥h u ′ . dictionary-to-cube mapping established by a domain expert at setup time. Definition 3.2 (Cube). A cube is defined as a triple C = (H, M, ω) where: Definition 3.7 (Mapping). A mapping from dictionary D to (i) H is a set of hierarchies; cube C is a (partial) multivalued function µ that maps each entry (ii) M is a set of numerical measures, with each measure d of D to a set of md-elements of C. Each md-element e ∈ µ(d) m ∈ M coupled with one aggregation operator op(m) ∈ has a mapping weight, w(d, e) ∈ (0, 1]. {sum, avg, . . .}; and Although a discussion about how mappings can be established (iii) ω is a partial function that maps the tuples of members for is out of the scope of this paper, we remark that this does not the dimensions of H to a numerical value for each measure necessarily have to be done manually for all the cube members, m ∈ M. which would be tedious for attributes with large domains, but it The levels, members, and measures of the cube C are given can be largely automated (for instance, by providing universally- the generic name of md-elements of C. quantified rules such as µ(⟨Device, value⟩) = {value} ∀value ∈ Dom(Device)). The mapping function is multivalued since many Example 3.3. Let us consider cube MaintenanceActivity in our md-elements may be of interest for each dictionary entry; this working example which, as sketched in Figure 1, includes hierar- typically happens for hierarchy levels, which can be all inter- chies Date, Device, and MaintenanceType. Maintenance activities esting —even if with different values of w. For example, when are described by measure Duration (coupled with the sum oper- some device is perceived, besides showing data for that specific ator). A member of the Device attribute is PackagingMachine, device, also showing aggregated data for the device type could while a member of MaintenanceType is Oiling.  be interesting. Through mapping weights, domain experts give In the A-BI framework, cubes are queried through GPSJ (Gen- an a-priori quantification of the interest of each md-element for eralized Projection / Selection / Join) queries, a well-knows class analyses when a given entry is part of the context. Example 3.8. The dictionary for our example includes, among The image includes the set of md-elements relevant to a con- the others, entries related to machines and their components text according to the mapping, but it does not specify how they (key = Device, value = ConveyorBelt, PackagingMachine, etc.), will be used to generate the queries to be proposed to the user and user roles (key = Role, value = Inspector, Operator, etc.). when that context is sensed. Indeed, given an image, several The mapping from this dictionary to the cube of Example 3.3 queries can be generated, each including a subset of the md- may look like this (see Figure 1): elements in the image. This is ruled by the definition of compatible query. µ(⟨Device, ConveyorBelt⟩) = {ConveyorBelt}, µ(⟨Role, Inspector⟩) = {Duration, Definition 4.4 (Compatible and Equivalent Query). A query q = ⟨Gq , Pq , Mq ⟩ is compatible with context T if it refers to at MaintenanceType}, least one of the md-elements in the image of T ; specifically, µ(⟨Date, 16/10/2018⟩) = {Date}, • a level l is referred by q if l ∈ Gq ; w(⟨Device, ConveyorBelt⟩, ConveyorBelt) = 0.5 • a measure m is referred by q if m ∈ Mq ; • a member u is referred to by q if (lev(u) = u) ∈ Pq and lev(u) ∈ Gq , being lev(u) the level whose domain includes where the first line refers to a member, the second one to a mea- u. sure, the third and fourth ones to a level. The mapping weight With a slight abuse of notation, we will write e ∈ q to state that of member ConveyorBelt when the conveyor belt device is per- q refers to md-element e. Let QT be the set of queries compatible ceived is 0.5.  with T ; the query equivalent to T is the query qeq (T ) ∈ QT that refers to all md-elements in I µ (T ). 4 CONTEXT INTERPRETATION In this section we show how, given a set of perceived entities, 4.2 ...add the log... A-BI produces the most relevant query, i.e., the one whose results The a-priori knowledge expressed through a mapping does not would be more useful to the user. enable the system to learn by considering how user interests evolve, which instead could lead to picking different md-elements 4.1 Take the context... when proposing queries or to choosing one of them more/less The A-BI starting point is the context: a list of dictionary entries frequently. To this end, A-BI exploits the history of previous corresponding to the currently perceived environment entities. interactions, stored in the query log, by means of a collaborative More formally: filtering approach. The log stores, for each context, all the queries proposed to the user and the specific one chosen for execution. Definition 4.1 (Context). A context T over dictionary D is a set of entries of D; each entry d ∈ T is coupled with a weight Definition 4.5 (Log). A log L is an ordered list of triples ⟨T , q, f ⟩ w(T , d) ∈ (0, 1]. where T is a context, q is a query, and f (feedback) is 1 if the user accepted the query, −1 if she rejected the query. The value of the weight for each entry may depend on different factors, depending on the application domain. Non-perceived Example 4.6. With reference to the context T proposed in entities (i.e., for which it would be w(T , d) = 0) are not included Example 4.3, a possible log is L = (⟨T , q 1 , −1⟩, ⟨T , q 2 , 1⟩), where in the context. In our case study we assume that a subset of entries q 1 = ⟨{Date, Device}, are engaged, meaning that they have explicitly been indicated by the user as being part of her current focus of interest; for these {(Device = ConveyorBelt)}, entries, the weight is always 1. For the other entries, the weight {Duration}⟩ is inversely proportional to the distance between the user and q 2 = ⟨{Month, Device}, the specific object being observed. {(Device = ConveyorBelt)}, Given a context, the mapping function identifies the relevant md-elements, i.e., those that will be involved in the queries to be {Duration}⟩ issued against the cube. Definition 4.2 (Image). Given context T over dictionary D, While q 1 has been rejected, q 2 (which is a roll-up of q 1 on the cube C, and mapping µ from D to C, the image of T through µ Date hierarchy) has been accepted.  is I µ (T ) = d ∈T µ(d), i.e., the set of md-elements of C that are Ð A log entry related to context T ′ should impact the recom- mapped through µ to at least one entry in T . mendations related to the current context T only if the two con- Example 4.3. A possible context is the one depicted in Figure texts are similar, since it is reasonable to assume that the user 1, where Bob is checking a conveyor belt placed in room A.1: will have similar behaviors in similar contexts. Context simi- larity is computed as the similarity between their two equiva- T = {⟨Device, ConveyorBelt, 1⟩, lent queries, sim(qeq (T ), qeq (T ′ )), which in turn is computed as ⟨Role, Inspector, 0.5⟩, in [29]. The similarity function sim(q, q ′ ) between two queries ⟨Date, 16/10/2018, 0.5⟩} q = ⟨Gq , Pq , Mq ⟩, q ′ = ⟨Gq ′ , Pq ′ , Mq ′ ⟩ combines three compo- nents, related to group-by sets, selection predicates, and mea- The ConveyorBelt device is engaged. The image of T through µ sures, respectively. In particular, the group-by set similarity con- is siders the distance between the levels involved in the query group-by sets; the selection similarity takes into account both I µ (T ) = {ConveyorBelt, Duration, MaintenanceType, Date} the levels and the members that form the selection predicates;  the measure similarity is based on the Jaccard index. Finally, the similarity between two queries is defined as the weighted average 4.3 ...get the queries of the three similarity components. Following [29], we assume In principle, given a context T , its equivalent query qeq (T ) the three components to be equally significant. might be directly proposed to the user. Unfortunately, equiv- Given log L, the image I µ (T ) of context T is extended to take alent queries are often monster queries, i.e., quite complex queries previous user interactions into account as follows. Let LT ⊆ L be with very high cardinalities. A monster query would be obtained the subset of triples whose context is similar to T : when the number of md-elements in the image is high because LT = {⟨T ′, q, f ⟩ | sim(qeq (T ), qeq (T ′ )) ≥ ϵ } several entities were sensed in the environment so the context in- cludes a large number of entries. Unfortunately, monster queries where ϵ is the similarity threshold. Then, I µ (T ) is extended with are particularly undesirable in AR applications since: all the md-elements referred to by the queries in LT : • High-cardinality queries take a long time to be computed, I µ∗ (T ) = I µ (T ) ∪ {e : ∃⟨T ′, q, f ⟩ ∈ LT | e ∈ q} transferred to the user smart device, and visualized. • While working on the field, users must be quick and reac- In this way, I µ∗ (T ) includes all the md-elements that are relevant tive, while the results of monster queries are hard to be to context T either according to the mapping or to the previous interpreted. user experience. We define the log relevance to T of md-element In the A-BI framework, monster queries are avoided in two ways: e ∈ I µ∗ (T ) as the weighted number of times e has been accepted by (i) by posing an upper bound γ to the query cardinality, and (ii) the user (f = 1) over the number of times it has been proposed; by considering only the most relevant md-elements in the image weighting is based on the similarity between the current context when generating the queries to be proposed to the user. T and the considered log context T ′ : As to (i), given query q = ⟨Gq , Pq , Mq ⟩, the expected cardinal- 1 + ⟨T ′,q, f ⟩ ∈LT (e),f =1 sim(qeq (T ), qeq (T ′ )) ity of its result, denoted card(q), can be estimated as follows: Í ρT (L, e) = 2 + ⟨T ′,q,f ⟩ ∈LT (e) sim(qeq (T ), qeq (T ′ )) Í Ö card(q) = card(Gq ) × selectivity(p) p ∈Pq where LT (e) is the subset of tuples in LT such that q refers to e. To avoid relevance to be 0 when e has never been accepted, a where card(Gq ) is the cardinality of the query group-by set (es- Laplace smoothing is applied in the formula above. Noticeably, timated for instance using the Cardenas formula [30, 31]) and the impact of Laplace smoothing decreases as the cardinality of sel(p) is the selectivity of each simple predicate p belonging to LT (e) increases, that is, the weight tends to 0 if several queries Pq . Note that we can safely use this formula to estimate card(q) referring e have been proposed but never accepted by the user. because, as a consequence of the way we create queries in our Conversely, it tends to 0.5 if only few queries referring e have approach, all predicates in Pq are always external, i.e., they are been proposed. expressed on levels that are less or equal to a level in Gq [32] in It is now possible to define the relevance to T of each md- the roll-up order. element e ∈ I µ∗ (T ) by taking into account, for each context entry d As to (ii), before generating the maximal query for a context that maps to e, not only the entry weight w(T , d) and the mapping we remove from the context image I µ∗ (T ) all the md-elements e weight w(d, e), but also the log relevance ρT (L, e): whose relevance relT (e) is below a given threshold η. (Í w(T , d) · w(d, e), if LT (e) = ∅ Definition 4.8 (Query Relevance and Maximal Query). Given relT (e) = Íd ∈T |e ∈µ(d ) context T , let QT be the set of queries that refer to any subset of d ∈T |e ∈µ(d ) w(T , d) · w(d, e) · ρT (L, e), otherwise md-elements in the image of T through mapping µ. The relevance (note that log relevance is not considered —i.e., it does not affect to T of a query q ∈ QT is defined as the overall relevance— if e is never referred in a log query). e ∈q relT (e) Í relT (q) = Í Example 4.7. With reference to the image I µ (T ) from Exam- e ∈I µ∗ (T ) relT (e) ple 4.3 and to the log entries in Example 4.6, the extended image The maximal query for T is the query qmax (T ) ∈ QT that has is maximum relevance among all those in QT such that card(q) ≤ γ . I µ∗ (T ) = {ConveyorBelt, Clearly, the number of queries in QT increases exponentially Duration, MaintenanceType, with |T |. Finding the maximal query can be formalized as a Knap- Date, Month} sack problem on the md-elements in T , considering γ as the knap- sack capacity. The Knapsack problem is an NP-hard optimization Month has been added to I µ (T ) as part of a query with a posi- problem, so in Algorithm 1 we propose a greedy approach to tive feedback. As to relT (Date) and relT (Month), assuming that compute the maximal query in real time. Starting from the image, the mapping weight w(d, Duration) = 1, and that the mapping we first include all the predicates to produce the smallest (i.e., weight for the remaining levels and members is set to 0.5, it is most focused) result set and then, while the result set cardinality relT (Device) = 0.25, is below the threshold, we incrementally extend the group-by set to produce progressively larger result sets. relT (ConveyorBelt) = 0.25, Given the extended image I µ∗ of context T , the cardinality relT (Duration) = 0.25, threshold γ , and the relevance threshold η, the algorithms works relT (MaintenanceType) = 0.17, as follows. At first, in Line 1 we remove from I µ∗ the md-elements relT (Month) = 0.13, whose relevance is below η. Then, we initialize the maximal query by selecting all the members in I µ∗ to create the conjunc- relT (Date) = 0.08 tive predicate Pq (intuitively, this is the “most focused” predicate),  by adding all the corresponding levels to the group-by set, and Algorithm 1 Maximal query generation Require: I µ∗ : extended image of context T , γ : cardinality threshold, η: relevance threshold Ensure: qmax (T ): maximal query 1: I µ∗ ← I µ∗ \ {e ∈ I µ∗ , relT (e) < η} ◃ Drop low-relevance md-elements from the extended image 2: Pq ← CreatePred({e ∈ I µ∗ , e is a member})◃ Create a predicate as the conjunction of IN clauses, each including all the members of the same level in the extended image 3: G q ← {lev(e), e ∈ I µ∗ , e is a member} ◃ Create a group-by set including the levels of all members in the extended image 4: Mq ← {e ∈ I µ∗ , e is a measure} ◃ Create a set of measures including all those in the extended image 5: q ← ⟨G q , Pq , Mq ⟩ ◃ Initialize the current maximal query 6: A ← {e ∈ I µ∗ \ G q , e is a level} ◃ Initialize a set of candidate levels to extend Gq 7: q ′ ← q ◃ New candidate maximal query 8: while card(q ′ ) ≤ γ do ◃ While the candidate query cardinality is below threshold... 9: q ← q′ ◃ ...update the current maximal query 10: if A = ∅ then ◃ If the search space is not exhausted... 11: break 12: l ← arдmax e ∈A (relT (e)) ◃ ...pick the most relevant candidate level, 13: A ← A \ {l } ◃ remove from the set of candidate levels, 14: Gq ← Gq ∪ {l } ◃ add it to the group-by set, 15: q ′ ← ⟨Gq , Pq , Mq ⟩ ◃ and update the candidate query 16: end while 17: return q by adding all the measures in I µ∗ (Lines 2–5). Now we iterate to  progressively add to the group-by set new relevant levels taken from the image, but only as long as the cardinality constraint is 5 QUERY DIVERSIFICATION met (Lines 8–16). Finally, the last query with a cardinality below Given a context T , the maximal query qmax (T ) is the more rel- γ is returned to the user (Line 17). Note that, if the candidate evant query that meets the cardinality constraint. Nonetheless, query defined in line 5 violates the cardinality constraint γ , it is to better meet the user’s desiderata, it may be useful to return immediately returned since all the transformations that follow a set of alternative queries that, though being still related to T , would further increase its cardinality. In this specific case, con- are different enough from each other and from qmax (T ) to offer straint enforcement will be ensured by the diversification phase different flavors of the same information to the user. This gen- (see Algorithm 2). eral idea is often practiced in the literature and referred to as diversification [23]. Example 4.9. With reference to the context T from Exam- Given the set of queries QT compatible with T , the problem ple 4.3 and to its extended image from Example 4.7, we ap- of diversification consists in finding the set of top-N queries ply Algorithm 1 to generate the maximal query for T with R ⊆ QT that maximizes diversity and relevance with respect γ = 20 and η = 0.1. At first, Date is pruned from I µ∗ since to user analyses [23]. The diversity div(q, R) of a query q with its relevance is below η. Then Pq is initialized to {(Device = respect to R can be defined as in [33] ConveyorBelt)}, Gq to {Device}, Mq to {Duration}; levels Main- tenanceType and Month are stored in A for possible further inclu- 1 Õ div(q, R) = (1 − sim(q, q ′ )) sion into Gq . The maximal query is initially set to ⟨Gq , Pq , Mq ⟩, |R| ′ q ∈R with 1 being its cardinality. At this time, the algorithm at- tempts to add MaintenanceType to the query group-by set. with div(q, R) = 1 when R is empty. Let the cardinality of ⟨{Device, MaintenanceType}, {(Device = Finding the optimal set of diverse queries is NP-hard with ConveyorBelt)}, {Duration}⟩ be 5. Since the cardinality of the reference to the cardinality of QT which, in turn, is exponen- candidate maximal query is below threshold, the current maxi- tial in the image cardinality |I µ∗ (T )|. Considering the real-time mal query is updated. The algorithm attemps to add Month to constraint related to every AR application, we propose in the the query group-by set. Since this increases the query cardinality following a greedy approach that starting from qmax (T ) gener- by a factor equal to the number of months being monitored in ates progressively different queries. Diversification is obtained the MaintenanceActivity cube, the cardinality of the new can- through three OLAP-based generative primitives, each applied to didate query ⟨{Device, MaintenanceType, Month}, {(Device = the previous query q = ⟨G, P, M⟩ to return a set of new queries: ConveyorBelt)}, {Duration}⟩ will supposedly be larger than 20, • roll(q) returns a set of queries whose group-by set is so the previous query ⟨{Device, MaintenanceType}, {(Device = coarser than G; each of these queries is obtained by re- ConveyorBelt)}, {Duration}⟩ is returned. The SQL representa- placing in G one level l with its immediate successor tion of the maximal query is in the roll-up partial order. For each q ′ ∈ roll(q), it is card(q ′ ) ≤ card(q). Only queries whose predicates are all SELECT Device, MaintenanceType, sum(Duration) external are returned. FROM MaintenanceActivity • drill(q) returns a set of queries whose group-by set is finer WHERE Device = ConveyorBelt than G; each of these queries is obtained by replacing in G one level l with its immediate predecessor in the roll-up GROUP BY Device, MaintenanceType partial order. For each q ′ ∈ roll(q), it is card(q ′ ) ≥ card(q). Algorithm 2 Diversification with the queries generated by Extend(q) (Line 11). Lines from Require: qmax (T ): maximal query for context T , γ : cardinal- 4 to 11 are repeated until either the cardinality of the result set ity threshold, θ : diversity threshold, N : number of diverse overcomes N or the search space Q is empty (Line 3). queries Intuitively, Algorithm 2 explores the search space QT around Ensure: R: diverse queries qmax (T ) assuming that the lower the similarity between q ′ and 1: R ← ∅ ◃ Result set qmax (T ), the lower the relevance of q ′ . The queries added to the 2: Q ← {qmax (T )} ◃ Search space result set R are no longer considered for differentiation aimed 3: while (|R| < N ) ∧ (Q , ∅) do at maximizing the probability of finding a a new different query 4: q ← arдmaxq ′ ∈Q (div(q ′, R)) ◃ Most diverse query with high relevance: indeed, a query q ′′ obtained by extending 5: Q ← Q \ {q} q ′ ∈ R will probably be very similar to q ′ and less relevant. This 6: if (card(q) ≤ γ ) ∧ (div(q, R) ≥ θ ) then rule does not apply to qmax (T ), which is the starting point for 7: R ← R ∪ {q} ◃ q is added to the result exploration and must be always expanded and added to R (except 8: if q = qmax (T ) then the specific case in which it violates the cardinality constraint). 9: Q ← Q ∪ Extend(q) ◃ Apply primitives Example 5.2. Let qmax (T ) = ⟨{Device, Month}, {Device = 10: else ◃ Continue the search ConveryorBelt}, {Duration}⟩ be the maximal query, with 11: Q ← Q ∪ Extend(q) ◃ Apply primitives card(qmax (T )) ≤ γ . At the first iteration, qmax (T ) (the only 12: end while query in the search space Q) is picked. Since div(qmax (T ), R) = 1 13: return R (R is empty), R is extended with qmax (T ), and Q is extended, among the others, with qr oll , qdr ill , and qslice (see Example 5.1). At the second iteration, let qdr ill be the most diverse query. • slice(q) returns a set of queries whose selection predicate Since qdr ill is quite similar to the maximal query, it is likely is less selective than P; each of these queries is obtained that div(qdr ill , R) is below θ . In this case, qdr ill is not added to by replacing one of the IN clauses in P, defined on a set of R, and Q is extended by applying Extend(qdr ill ). The iteration members u 1 , . . . , un , with a clause on the member(s) that continues until N diverse queries are generated, or the search are the immediate successors of u 1 , . . . , un in the part-of space Q is exhausted.  order. For each q ′ ∈ roll(q), it is card(q ′ ) ≥ card(q). The queries returned by all these operators are progressively less 6 EXPERIMENTAL TESTS similar to the maximal query, so their relevance is lower than the In this section we evaluate the A-BI framework in terms of both one of q. Besides, only queries compatible with the context are effectiveness (to what extent the proposed queries meet the user’s returned. interests) and efficiency. Tests were carried out on a synthetic benchmark since, in this work, we assume the problem of context Example 5.1. Given q = ⟨{Device, Month}, {Device = generation to be addressed by the smart device and, to the best ConveryorBelt}, {Duration}⟩, examples of queries returned by of our knowledge, no AR open dataset exists. our three primitives are, respectively, The user-system interaction is as follows. The user is moving qr oll = ⟨{Device, Year}, in a factory of 10 rooms, and, while moving, she collects one view {(Device = ConveyorBelt)}, of each room. In each view the smart device recognizes a set of entities that belong to the dictionary and lists them into a context. {Duration}⟩ Starting from each context, the A-BI framework proposes a set qdr ill = ⟨{Device, Date}, of queries to the user. Finally, the user either chooses one of {(Device = ConveyorBelt)}, the proposed queries or formulates an additional query that is {Duration}⟩ slightly different from the ones proposed. After some time, the user ends her exploration of the factory and moves back through qsl ice = ⟨{Device, Month}, the same rooms. Knowing the precedent behavior of the user, the {(DeviceType = PackagingMachine)}, A-BI framework exploits collaborative filtering to propose a new {Duration}⟩ set of queries (generated by taking the query log into account) that better suit her interests. We denote with α the number of  times the user has visited the factory (i.e., how many times each The diversification process is described in Algorithm 2. Given context has already been sensed by the user’s devices before the the maximal query qmax , the cardinality threshold γ , a diversity current visit). threshold θ , and the number of desired diverse queries N , the This interaction is simulated by randomly generating 10 seed set of diverse queries R is initialized to the empty set (Line 1) contexts, each corresponding to a different room, in such a way and the search space Q is initialized to qmax (Line 2). The most that these contexts differ significantly from each other. Then, to diverse query, q, is picked from the search space (Line 4); if its simulate multiple visits of each room, small context variations are cardinality is below γ and its diversity from the queries in R is generated starting from each seed (specifically, one for each visit; higher than θ , then q is added to R (Lines 6–7). Note that at the the idea is that each room is perceived with slight differences first step, when the search space contains only qmax (T ), function in each visit). The number of entities recognized in each room Extend(q) is invoked to apply our three primitives to qmax (T ) (i.e., the context cardinality) ranges between 5 and 15. Each test (Lines 8–9); specifically, roll(q) is alway applied since it decreases is repeated 10 times and the average behavior is considered. the query cardinality, while drill(q) and slice(q) are applied only We executed our tests against a cube including 5 linear hi- if card(q) ≤ γ . Otherwise, if either card(q) is too high or q is not erarchies with 5 levels of details each. Each dimension has 64 diverse enough from the queries in R, the search space is extended members, determining a maximum cube cardinality of about 109 . Table 1: Notation summary 6.1 Effectiveness Figure 3a evaluates the benefit of diversification and of collab- Notation Meaning orative filtering by showing how similar qmax , qdiv , and qloд T Context (corresponds to a view of a factory are to qu for different values of β and θ . Clearly, the lower β, the room) less similar qu from qmax . Diversification comes to the rescue, qmax Maximal query providing at least one diverse query qdiv that is closer to the qu User query user’s interests. However, even if a high diversity threshold is set qdiv Diverse query most similar to qu (e.g., θ = 0.2), the diverse queries are not sensibly closer to qu , qloд Log query most similar to qu as they deviate too much from queries with high relevance. The α ∈ [0, 8] Number of times the user has already seen closest query in the log, qloд , is always more similar to qu than a context qdiv . β ∈ [0.6, 1] Similarity between qu and qmax In Figure 3b we focus on diversification by showing that the γ = 100 Query cardinality threshold higher the number of diverse queries N , the higher the values ϵ = 0.5 Context similarity threshold of sim(qu , qdiv ), hence, the higher the effectiveness of diversi- η = 0.5 Relevance threshold fication. Finally, Figure 3c shows the capability of collaborative θ ∈ [0.05, 0.2] Diversity threshold filtering to meet the user’s desiderata even when the user query N ∈ [2, 8] Number of diverse queries generated qu is quite different from the maximal one. Indeed, the closest query in the log converges towards the user one after a few user feedbacks are collected. Already after one user feedback, the sim- qu ilarity between qloд and qu sensibly rises, being very close to 1 qdiv when 4 feedbacks are collected. β θ 6.2 Efficiency qmax We ran the tests on a machine equipped with Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz CPU and 16GB RAM, with the A-BI framework being implemented in Scala. The efficiency results are depicted in Figure 4, with the overall Figure 2: The user query qu , the maximal query qmax , and execution time being in the order of 10−1 seconds at most. The the set of diverse queries (in blue); among them, qdiv is the execution of Algorithm 2 accounts for most computational time. (a) one most similar to qu Indeed, the process of diversification is computationally heavy, qu requiring to find, among the search space, the query with the qlog highest diversity with reference to the current result set (with O(|Q |) being the complexity of Line 4 in Algorithm 2 in case Q qmax The dictionary includes qone key for each md-element (i.e., we is not sorted). Note that the set of diverse queries is increasingly div assume the smart device can recognize each single element of the built, thus the time to first result is lower since, once a query is cube); each dictionary entry d is one-to-one mapped to the corre- added to result set, it can be directly returned. We also varied sponding md-element e with mapping weight w(d, e) randomly the number of entries in the context (|T | ∈ [5, 15]), but since this ranging in [0.2, 1]. Additionally, for each context, we call does not significantly affect performances we do not show the • qmax the maximal query generated by Algorithm 1. results. • qu the query chosen by the user. (b) We denote with β the We finally emphasize that the execution time corresponds to similarity between the user query and the maximal query the time necessary to define the recommended queries, and not (β = sim(qu , qmax )). The lower the value of β, the higher to the time to actually execute them. Indeed, the execution of the the difference between the user and maximal queries; if query is demanded to and strictly depends on the performance β = 1, the user exactly chooses the maximal query. of the enterprise data warehouse system. • qdiv the query most similar to qu among those generated by the diversification process, when the log is ignored. • qloд the query most similar to qu among those generated 7 CONCLUSION by the diversification process, when the log is taken into The A-BI framework is a first result in the direction of establishing account. At each visit, 10 contexts are shown to the user; a tight connection between analytical reporting and augmented so, after each iteration, 10 contexts are added to the log reality applications. Besides proposing a reference functional with the corresponding user choices. architecture and interaction process, in this paper we have shown At the first visit (α = 0), the log is empty so qloд ≡ qdiv . In theory, that recommendations can be given in real-time. qdiv should be more similar to qu than qmax since it results from A-BI can be improved from many points of view, in particu- diversification, and qloд should further improve over qdiv since lar we are working towards producing recommendations that it also uses the log. are based on patterns of recognized context entities rather than A notation summary is provided in Table 1. To help the reader on single entities (e.g., md-element e is relevant if the context understand the role of each query, a visual representation of includes d and d ′ but not d ′′ ). Furthermore, in its current im- qmax , qdiv , and qloд is depicted in Figure 2, where the query plementation, a recommendation involves all the elements in space is represented as a Cartesian plane with Euclidean distances. the context while it would be useful to provide separate recom- mendations related to subsets of elements. Finally, it would be = 0.05 = 0.1 = 0.2 1.00 0.95 0.90 sim 0.85 0.80 qmax qdiv 0.75 qlog 0.70 1.0 0.9 0.8 0.7 0.6 1.0 0.9 0.8 0.7 0.6 1.0 0.9 0.8 0.7 0.6 (a) Average similarity of qmax , qd iv , and ql oд with qu for decreasing β and increasing θ ( |T | = 10, N = 4, α = 1) qmax 1.00 0.78 qdiv 0.95 sim(qu, qlog) 0.76 0.90 sim 0.74 0.85 = 0.6 = 0.7 0.72 0.80 = 0.8 2 4 6 8 0 1 2 3 4 5 6 7 8 N (b) Average similarity of qmax and qd iv with qu for increasing (c) Average similarity of qu and ql oд at increasing iterations for N (β = 0.6, θ = 0.1, |T | = 10) decreasing β (θ = 0.1, N = 4, |T | = 10) Figure 3: Effectiveness |T| = 10 Maximal query [8] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. Diversification Snipsuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22–33, 2010. [9] Raphaël Thollot, Nicolas Kuchmann-Beauger, and Marie-Aude Aufaure. Se- Time (s) 10 2 mantics and usage statistics for multi-dimensional query expansion. In Proc. DASFAA, pages 250–260, Busan, South Korea, 2012. [10] Marie Le Guilly, Jean-Marc Petit, and Vasile-Marian Scuturici. SQL query completion for data exploration. CoRR, abs/1802.02872, 2018. [11] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya G. Parameswaran, and Neoklis Polyzotis. SEEDB: efficient data-driven visualization recommen- 10 3 dations to support visual analytics. PVLDB, 8(13):2182–2193, 2015. [12] Houssem Jerbi, Franck Ravat, Olivier Teste, and Gilles Zurfluh. Preference- 0.05 0.10 0.15 0.20 based recommendations for OLAP analysis. In Proc. DaWaK, pages 467–478, Linz, Austria, 2009. [13] Rafael Berlanga Llavori and Victoria Nebot. Context-Aware Business Intelli- gence, pages 87–110. 2015. Figure 4: Efficiency for increasing values of θ [14] Kaijian Xu, Manli Zhu, Daqing Zhang, and Tao Gu. Context-aware content filtering & presentation for pervasive & mobile information systems. In Proc. ICST, page 20, Quebec, Canada, 2008. [15] Wenwei Xue, Hung Keng Pung, Paulito P. Palmes, and Tao Gu. Schema interesting to investigate on how to turn A-BI into a purely sta- matching for context-aware computing. In Proc. UbiComp, pages 292–301, tistical framework where all weights are expressed in terms of Seoul, Korea, 2008. probabilities and reasoning is probabilistic too. [16] Minmin Chen, Jian-Tao Sun, Xiaochuan Ni, and Yixin Chen. Improving context-aware query classification via adaptive self-training. In Proce. CIKM, pages 115–124, Glasgow, United Kingdom, 2011. REFERENCES [17] Julien Aligon, Enrico Gallinucci, Matteo Golfarelli, Patrick Marcel, and Stefano [1] Ronald Azuma. A survey of augmented reality. Presence, 6(4):355–385, 1997. Rizzi. A collaborative filtering approach for recommending OLAP sessions. [2] Angelo Croatti and Alessandro Ricci. Towards the web of augmented things. Decision Support Systems, 69:20–30, 2015. In Proc. ICSA, pages 80–87, Gothenburg, Sweden, 2017. [18] Xiaohui Yan, Jiafeng Guo, and Xueqi Cheng. Context-aware query recommen- [3] Gediminas Adomavicius, Ramesh Sankaranarayanan, Shahana Sen, and dation by learning high-order relation in query logs. In Proc. CIKM, pages Alexander Tuzhilin. Incorporating contextual information in recommender 2073–2076, Glasgow, United Kingdom, 2011. systems using a multidimensional approach. ACM Trans. Inf. Syst., 23(1):103– [19] Di Jiang, Kenneth Wai-Ting Leung, Jan Vosecky, and Wilfred Ng. Personalized 145, 2005. query suggestion with diversity awareness. In Proc. ICDE, pages 400–411, [4] Kostas Stefanidis, Evaggelia Pitoura, and Panos Vassiliadis. A context-aware Chicago, USA, 2014. preference database system. Int. J. Pervasive Computing and Communications, [20] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. AIDE: an active 3(4):439–460, 2007. learning-based approach for interactive data exploration. IEEE Trans. Knowl. [5] Ibrahim A. Ibrahim, Abdullah M. Albarrak, and Xue Li. Constrained recom- Data Eng., 28(11):2842–2856, 2016. mendations for query visualizations. Knowl. Inf. Syst., 51(2):499–529, 2017. [21] Patrick Marcel and Elsa Negre. A survey of query recommendation techniques [6] S. Dobson, M. Golfarelli, S. Graziani, and S. Rizzi. A reference architecture and for data warehouse exploration. In Proc. EDA, pages 119–134, Clermont- model for sensor data warehousing. IEEE Sensors Journal, 18(18):7659–7670, Ferrand, France, 2011. 2018. [22] Arnaud Giacometti, Patrick Marcel, Elsa Negre, and Arnaud Soulet. Query [7] Shuyao Qi, Dingming Wu, and Nikos Mamoulis. Location aware keyword recommendations for OLAP discovery-driven analysis. IJDWM, 7(2):1–25, query suggestion based on document proximity. IEEE Trans. Knowl. Data Eng., 2011. 28(1):82–97, 2016. [23] Kaiping Zheng, Hongzhi Wang, Zhixin Qi, Jianzhong Li, and Hong Gao. A survey of query result diversification. Knowl. Inf. Syst., 51(1):1–36, 2017. [24] Elsa Negre, Franck Ravat, Olivier Teste, and Ronan Tournier. Cold-start recommender system problem within a multidimensional data warehouse. In Proc. RCIS, pages 1–8, 2013. [25] Wolfgang Büschel, Annett Mitschick, and Raimund Dachselt. Here and now: Reality-based information retrieval: Perspective paper. In Proc. CHIIR, pages 171–180, New Brunswick, USA, 2018. [26] Neven A. M. ElSayed, Bruce H. Thomas, Kim Marriott, Julia Piantadosi, and Ross T. Smith. Situated analytics: Demonstrating immersive analytical tools with augmented reality. J. Vis. Lang. Comput., 36:13–23, 2016. [27] Ben Shneiderman. The eyes have it: A task by data type taxonomy for infor- mation visualizations. In Proc. IEEE Symposium on Visual Languages, pages 336–343, Boulder, USA, 1996. [28] Ashish Gupta, Venky Harinarayan, and Dallan Quass. Aggregate-query pro- cessing in data warehousing environments. In Proc. VLDB, pages 358–369, 1995. [29] Julien Aligon, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, and Elisa Turric- chia. Similarity measures for OLAP sessions. Knowl. Inf. Syst., 39(2):463–489, 2014. [30] Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. Access path selection in a relational database management system. In Proc. SIGMOD, pages 23–34, Boston, USA, 1979. [31] Matteo Golfarelli and Ettore Saltarelli. The workload you have, the workload you would like. In Proc. DOLAP, pages 79–85, New Orleans, USA, 2003. [32] Stefano Rizzi and Ettore Saltarelli. View materialization vs. indexing: Balancing space constraints in data warehouse design. In Proc. CAiSE, pages 502–519, 2003. [33] Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. SIGIR, pages 335–336, Melbourne, Australia, 1998.