<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Modeling and Predicting Query Behavior in OLAP Systems</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Carsten Sapia FORWISS - Bavarian Research Center for Knowledge-Based Systems Orleansstr.</institution>
          <addr-line>34, D-81667 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>3</lpage>
      <abstract>
        <p>Interactive multidimensional data analysis tools (mostly OLAP systems) are the predominant frontend tools for end users in data warehouse environments. Thus, the design of these systems is an important part of the data warehouse design itself. This paper contributes to the important design step by discussing the modeling of user query behavior and its benefits. We present a mathematical model and a graphical notation for capturing knowledge about typical multidimensional interaction patterns in OLAP systems, taking into account the session oriented, interactive and navigational nature of the user query behavior. To exemplarily show that the modeling of user behavior is not only beneficial during the logical and physical design phase, we also present an architecture to speed up OLAP systems at runtime by using speculative execution techniques based on a prediction of the user query behavior.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>of the data warehouse design itself. As the design of the
multidimensional schema has a direct impact on the
functionality of the system, the design process should be
driven by the user analysis requirements. That means that
knowledge about the anticipated behavior of the user is
needed throughout the design and implementation
process. Surprisingly enough, very little work has been
devoted to models and notations to capture and formalize
this knowledge. Many design methodologies are mainly
driven by the structure of the operational data sources. We
argue that it is desirable for the quality of the design to
enrich these approaches with more user centric
techniques.</p>
      <p>Therefore, we start by demonstrating the benefits of
modeling knowledge about the anticipated user behavior in
OLAP systems (section 1.1) and examining the special
characteristics of interactions (section 1.2). Section 2
surveys related work to show that the existing approaches
do not fully match these characteristics. From the
observations about the characteristics of OLAP systems, we
derive a formal description technique for
multidimensional query sequences and user/task specific interaction
patterns and profiles (section 4). This formalism is based
on a formal notion of a multidimensional schema as
presented in section 3. To emphasize the benefits of our user
centric modeling approach, we show how the user profiles
can be used for improving the OLAP system performance
at runtime by predicting the users next query. An
architecture for this is shown in section 5. Finally, section 6
draws conclusions and presents directions for future work.
1.1</p>
    </sec>
    <sec id="sec-2">
      <title>Why should we model user behavior?</title>
      <p>The purpose of an OLAP system is to satisfy the user’s
analysis requirements. This should make clear that
knowledge about the user query behavior is most important
during the whole development cycle of the system (see
figure 1).</p>
      <p>The main goal of the ‘Conceptual Design’ is to
consolidate the requirements identified during ‘Requirement
Analysis’ to a conceptual MD schema. A user centered,
use-case driven approach to schema design would
increases the quality of the conceptual design. As the static
data model (or schema) of an OLAP system is closely
related to the dynamic aspect of the system (query
behavior of the user), the schema determines which queries the
user can execute. Therefore, the design of the conceptual
DW schema should be driven by a model of the user
query behavior. Another reason is that our experiences
from several real world projects show that end users
usually can state very well which analytical task they want to
perform, but have difficulties understanding a static data
model or schema that is necessary to answer their queries.</p>
      <p>Operation
(Querying and
Data Maintenance)
Implementation
Requirement</p>
      <p>Analysis</p>
      <p>Conceptual Design
(Implementation</p>
      <p>Independent)</p>
      <p>Physical Design
(Implementation decisions)
Knowledge about the anticipated user behavior is also a
key input for the subsequent physical data warehouse
design. During this phase, the conceptual model has to be
mapped to a specific implementation (e.g. MOLAP or
ROLAP). This includes decisions about physical
optimization techniques (e.g. indexing schemes, physical
clustering, precomputation strategies, denormalization).
Comprehensive research of physical optimization strategies for
Data Warehouses and OLAP systems has been done for
specific implementation strategies (relational e.g.
[GHR+97] or multidimensional e.g. [FB99]). Irrespective
of the approach, all of these solutions require the
definition of a typical workload as a prerequisite. The view
selection problem (e.g. [TS98], [SDN98]) is a prominent
example that has attracted a lot of research, lately. To our
knowledge, no scientific work has so far been devoted on
how to find such a workload in an OLAP environment. If
the user behavior was modeled during the conceptual
design using a formal modeling technique, typical
workloads could be generated from this model, automatically.
Knowledge about the user query behavior can also be
used at run time (during the Implementation and
Operation phase). User and task profiles that contain
information about typical query sequences can be used for
runtime system optimization. Due to the interactive and
navigational nature of the user’s analysis it is promising to
predict the next steps of the user. This information can be
used by an intelligent OLAP database for optimizing
caching and scheduling purposes. An interesting option is
the usage of prediction information to speculatively
execute possible next queries (this idea is further elaborated
in section 5). If the user actually asks such a predicted
query the systems answering time is greatly reduced which
is one of the main factors of success for a data warehouse.
Summing up, knowledge about anticipated user behavior
has an impact on all phases of OLAP system design.
Therefore it should be fully incorporated into an OLAP
design methodology. In combination with techniques to
model the static part of the system (the multidimensional
conceptual schema (e.g. [GMR98], [SBHD99]) a
technique for modeling user behavior (the dynamic part of an
OLAP system) can contribute to a comprehensive
modeling technique for multidimensional information systems.
1.2</p>
      <p>What is so characteristic about user interaction
in OLAP systems?
As an example let us consider an extract of a real project
we performed together with an industrial partner. A car
manufacturer wants to analyze vehicle repairs to improve
the technical quality of his products, to evaluate the
warranty policy and to assess the quality of different garages.
According to the basic philosophy of OLAP systems the
user interactively analyzes the data by applying
multidimensional operations. Thus, the typical interaction
between the user and the system consists of a sequence of
queries. We call all the interactions necessary to answer a
business question (e.g. “Which garages offer a potential
for cutting down repair costs?”) or to verify a hypothesis
(e.g. “Cars in southern countries are repaired less often.”)
a session.</p>
      <p>The user normally uses a predefined entry point (e.g. a
business report). From there the user navigates through
the multidimensional data space by performing operations
on the previous query result. These results are displayed
using an MD view (for an example see figure 2). This
view (mostly a table or a graph) visualizes a subspace of
the original MD cube. Some dimensions are restricted to a
single value (called a slicing operation). We call these
dimensions the selection dimensions of the query. In the
example the values to which the dimensions are restricted
are shown in the left hand corner of the table (customers,
year). The members of the remaining dimensions are
shown in the visualization. In our example they are
grouped along the axes (geogr. region, vehicle model).
We call these dimensions the result dimensions of the
query. The data displayed in the MD view are the values
of a subset of all measures called the result measures of
the query (e.g. # of repairs). The next query is formulated
by manipulating the structure of the MD view, for
example exchanging result and selection dimensions or altering
the selection for a selection dimension.</p>
      <p>To identify saving potentials a controller may start by
accessing a predefined report (i.e. a query) ranking the
geographical regions by number of repairs during the year
1998. He then picks a region in which enough repairs
occurred to make it worthwhile further investigation. This
is a cognitive process which takes some time and does not
involve any system interaction. Therefore, from a systems
point of view the process is a ‘black box’ that is not
deterministic but predictable under certain circumstances.
Result Dimensions (Vehicle and Repair Location)
The only thing the system can perceive (and use to guess
the user’s intention) is the users next query.</p>
      <p>In our example the analyst, may perform a drill-down
operation on the chosen region (e.g. by performing a
double click on the region name) resulting in a list of garages
in this region. Afterwards, he might be interested in the
way how the number of repairs changed over the year. In
this case, he executes a drill down on the year resulting in
a two-dimensional view containing garages and month.
Next, he might identify an interesting garage and might
want to see how the number of repairs varies by car model
(he performs a rotate operation of the data cube) and so
on.</p>
      <p>We call the consecutive operations executed during a
session the query sequence of this session. As the OLAP
systems do not restrict the user to a certain workflow,
different sessions answering the same business query may
have different query sequences as the user can skip steps
or include additional steps. Nevertheless, each analytical
task (e.g. “find garages which should participate in a
promotion”) possesses a characteristic interaction pattern.
Furthermore, as different user have varying analysis
habits, the patterns are also user specific.
2</p>
      <sec id="sec-2-1">
        <title>Related Work</title>
        <p>Several approaches have been published to formalize the
multidimensional data model (e.g. [CT98a], [Vas98]).
The papers present a description of the MD data structure
and propose formalizations of single queries using
algebraic operations (e.g. [Vas98]), a logical calculus (e.g.
[CT98a]) or a graphical formalism ([CT98b]). Our static
data model (section 3) is based on this work (esp.
[Vas98]). However, the query descriptions in these papers
are very operationally oriented as the aim is to provide a
description that allows the efficient processing of single
queries. Our approach supplements this work by giving a
higher level abstraction of queries (needed for conceptual
modeling in earlier phases and query pattern recognition)
and taking into account the navigational nature of OLAP
sessions.</p>
        <p>Some work has been done in the field of design
methodologies for data warehouses and OLAP systems.
[GMR98] proposes a graphical representation of
multidimensional schemas (DF model) and a methodology to
build such a schema from the E/R models of the
operational sources. “Query patterns“ can be graphically
represented in the model by marking dimension attributes that
are of special interest to the user. The authors suggest that
these markers can be used to determine sensible levels of
preaggregation. The approach does not take into account
the correlation between successive queries due to the
navigational nature of OLAP applications which is the
basic assumption of our work. Furthermore, the approach
fundamentally differ from our approach by proposing a
‘bottom up’ data warehouse design, that means starting
from the schemas of the operational sources and deriving
a warehouse schema. Our approach is more user centered
as we propose to start with a model of the users’ analysis
requirements independently from the structure of the
operational systems.
[GR98] describes a methodology for data warehouse
design that uses the DF model ([GMR98]) as its core. The
description technique used for single queries in this paper
has some similarity to our approach. However, it does not
model the distinction between selection and result
dimensions, the relationship between consecutive queries,
typical query patterns and does not contain a graphical
representation for query profiles.
[ABD+99] describes the design of an aggregate cache for
OLAP systems. One of the parameters used to decide if a
cached aggregate is replaced is its similarity to the last
query of the user. Like in our approach, the similarity of
queries is measured by the number of operations
necessary to navigate between the queries. The presented
measurements show that this is a sensible assumption. Our
approach extends this work by not only taking into
account the last query issued but the whole history of the
user’s session combined with domain knowledge captured
by profiles.</p>
        <p>To our knowledge no work exists on the prediction of
queries in the domain of OLAP systems and Data
Warehouses. Although, this idea has been researched for
WWW based information systems. [ZAN99] presents a
theoretical framework for using knowledge about user
navigation for speculatively presending documents. A
practical evaluation shows that significant speed-ups are
possible. OLAP and WWW systems share the common
characteristic of session oriented navigational data access.
To a certain degree, the document structure of the WWW
is also present in OLAP systems. Nevertheless, the
operations available to the user are more complex in OLAP
systems. Furthermore, in an OLAP system more
information about the user and its task are available, therefore
user/task specific profiles can be built.
3</p>
        <p>Multidimensional Data Model
To formulate a formal conceptual model for the user
behavior in OLAP systems, we first need a formal
conceptual description of the underlying multidimensional
paradigm. As mentioned in section 2, several data models
have been published recently. Following an evaluation of
MD data models ([SBH99]) our formalism is inspired by
([Vas98]).</p>
        <p>Definition 3.1 (level): A dimension level (or short level) l
is a finite set of elements from a domain dom(l). Ψ
denotes the set of all levels for an MD schema w
Definition 3.2 (dimension): A dimension d is defined as
tuple d := (H, class) with
n H ⊆ Ψ being the set of levels that structure this
dimension. Each dimension contains a special level
&lt;dimension_name&gt;.all which contains only one
member and is used to aggregate all the basic values
to a single value.
n class ⊆ H×H defining the classification relationship
between levels. (l1,l2) ∈ class* reads “l1 can be
classified according to l2.” The relation class is minimal in
the sense that the following property is fulfilled: (l1,l2)
∈ class ∧ (l2,l3) ∈ class ⇒ (l1,l3) ∉class. The
transitive, reflexive closure class* of class must fulfill the
following property: (l1,l2) ∈ class* ⇒ (l2,l1) ∉ class*.</p>
        <p>That means that class* defines a partial order on H. w
Thus, the tuple (H,class*) defines a lattice over the levels.
Definition 3.3 (MD schema): A multidimensional
schema is a finite set of dimensions Ω ={d1,..dk, M},
where M is a special dimension containing the measures
of the schema. w
The function dim(l): Ψ→Ω maps each level to its
corresponding dimension, that means the following condition is
fulfilled: dim(l) = d ⇔ ∃d=(Hd,classd) ∈ Ω such that l∈
Hd
Definition 3.4 (dimension path): A dimension path dp of
a dimension D = (H, class) is defined as a path in the
lattice (H, class*), beginning from its least upper bound
and ending in its greatest lower bound. Each dimension
path is a linearly ordered list of dimension levels
containing the special dimension level &lt;dimension_ name&gt;.all as
the least upper bound. w
We use DPD to denote the set of all dimension pathes of
the dimension D.</p>
        <p>For conceptual modeling purposes, multidimensional
schemas can be represented graphically (e.g. by using the
ME/R model [SBHD99]).</p>
        <p>Example: A possible multidimensional schema for our
example scenario (figure 3) looks like this:
Ω = {dtime, dcustomer, dvehicle, dgarage, M}
dtime= ( { day, month, year, time.all },</p>
        <p>{ (day,month), (month, year), (year, time.all)})
dcustomer= ( { customer, customer.all },</p>
        <p>{ (customer, customer.all)})
dvehicle= ( { vehicle, model, brand, vehicle.all},
{ (vehicle, model), (model, brand),</p>
        <p>(brand, vehicle.all) } )
dgarage= ( { garage, geogr. region, type of garage,
country, garage.all},
{ ( garage, geogr. region),
(geogr. region, country), (country, garage.all),
(garage, type of garage),
(type of garage, garage.all)})
M = {measures, ∅} w
4</p>
      </sec>
      <sec id="sec-2-2">
        <title>Modeling the User Behavior</title>
        <p>The basic idea when modeling query behavior is that the
user’s queries follow certain patterns. E.g. when
evaluating the number of repairs for a given period a user always
uses an MD view that shows the number of repairs by
geographic regions on one axis and vehicle models on the
other axis. The individual query contains different values
for the timespan covered (e.g. 1998, or 1999) but the
structure of the MD view is unchanged. Thus, we use the
structure of the MD view to define the notion of a query
prototype (section 4.1).</p>
        <p>As we are interested in similar queries to define patterns,
we need a formal notion for this similarity. The basic idea
is to define the distance between two queries q1 and q2 as
the number of operations that is needed to transform the
query prototype of q1 to the prototype of q2 (section 4.2).
costs (part)
costs (wages)
costs (total)
# of persons</p>
        <p>duration
vehicle
model
year
month
day
vehicle
repair
customer
customer.all
type of
garage
vehicle.all
brand
vehicle
Based on this, we can define a user/task profile that
describes the typical interaction pattern for this task that may
be user specific (section 4.3). These user/task profiles are
usually created by a system designer together with domain
experts during the requirement engineering and
conceptual system design phase (Another possibility is the
creation of profiles from database logs. This idea is further
elaborated in section 5 and 6). The indispensable
involvement of domain experts makes a graphical notation
of profiles desirable (section 4.4).
4.1</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Sessions, Queries and Query Prototypes</title>
      <p>As already stated, the user interaction in OLAP systems is
strongly session oriented. A session 6 consists of a
sequence of queries 6 := q1,...,qn where qi+1 is formulated
applying interactive manipulation operations (like drilling,
or rotation) to the results of qi. The results of a single
query q ∈ 6 are displayed by the OLAP frontend using an
MD view (table, graph etc.). Dimensions being restricted
to one element are called selection dimensions of q while
all other dimensions are called result dimensions of q.
Let us assume that dk∈Ω is a selection dimension of q
limited to the single value val. The level l∈Ψ to which val
belongs is called the selection granularity level (or short
selection level) of dk for q, i.e. val∈dom(l). For each result
dimension dr of a query q a level of granularity is given
that is called the result granularity level (or short result
level) for dimension dr.</p>
      <p>We are interested in modeling typical interaction patterns
in query behavior. These patterns describe the similarities
between different sessions that are similar with respect to
the intention/task the user had in mind when executing the
session. From a system’s point of view, the current
interest (or intention) of the user is mirrored in the structure of
the query he issues. The distinction between selection and
result dimensions in the current query contains
information about the user’s intention. It shows which correlation
in the data the user is interested in. For example if the
result dimensions are geographic region and time, the user
is likely to be analyzing the correlation of these factors
irrespective of e.g. the vehicle model.</p>
      <p>Thus, we assume that the structure of the MD view (for
the query q) contains the relevant information about the
state of the user’s current task at the moment he executes
the query q. This information is described by a query
prototype.</p>
      <p>Definition 4.1 (Query Prototype): Let us assume a query
q that has the result dimensions Dr and the selection
dimensions Ds. The query prototype T for this given query
q is defined as a triple T := (Mr, Lr, Ls) such that
n</p>
      <p>Mr ⊆ M is the set of measures referenced in query q
as result measures
n Lr is the set of result granularity levels of the result
dimensions Dr,
n Ls is the set of selection granularity levels of the
selection dimensions Ds
The following conditions have to be fulfilled in order for
to be a valid query pattern:
(1) Udim(l) ∪ Udim(l) = Ω
l∈Lr</p>
      <p>l∈Ls
(2) dim(l1 ) ≠ dim(l2 ) ∀l1, l2 ∈ (Lr ∪ Ls ) l1 ≠ l2 w
Conditions (1) and (2) ensure that exactly one level of
each dimension is either a result or a selection level.
Example: The query q1 = “Give me a ranking of garages
according to the number of repairs summed up to the
geographic region for 1998” has the following prototype
T = ( Mq1= {#of repairs}, Dr = {geogr.region},
Ds = {year, vehicle.all, customer.all} ) w
Definition 4.2 (Query Space): The set ΘΩ of all valid
query prototypes for a given MD schema Ω is called the
query space for this schema.</p>
    </sec>
    <sec id="sec-4">
      <title>4.2 Similarity of Queries</title>
      <p>We are interested in recognizing and formulating patterns
of access. Two sessions have a common pattern if they
contain similar queries. Thus, to define a pattern we first
define the notion of similarity of queries. We approximate
the similarity of two queries q1 and q2 by defining a
distance measure that corresponds to the number of user
interactions minimally needed to transform the query
prototype of query q1 to the prototype of q2. A short
distance means a great degree of similarity.</p>
      <p>Assuming a query q with prototype q = (Mq, Lr=
{r1,…,rk}, Ls= {s1,…,sm}), the following basic frontend
interactions result in a query q’ with prototype q’:
drill down/roll up: The user can increase (drill down) or
decrease (roll up) the level of detail for a selection or
result dimensions. For a selection dimension q’ = (Mq, Lr,
Ls’) with Ls’ = (s1, …, si’,…,sm). Where si’ has to fulfill the
following condition:</p>
      <p>∃d=(H,classd)∈Ω: (si,si’)∈classd ∨ (si’,si)∈classd
In case of a result dimension: q’ = (Mq, Lr’, Ls) with Lr’ =
(r1, …, ri’,…,,rk) and</p>
      <p>∃d=(H,classd)∈Ω: (ri,ri’)∈classd ∨ (ri’,ri)∈classd
rotation: the user can convert a result dimension to a
selection dimension and vice versa (e.g. by a drag and
drop mechanism). For result to selection dimension q’ =
(Mq, Lr’, Ls’) with Ls’ = Ls ∪ l and Lr’ = Lr - l; l∈ L .
r
For a selection to result dimension q’ = (Mq, Lr’, Ls’)
with Lr’ = Lr ∪ l and Ls’ = Ls - l; l∈ Ls
focus change: the user can add or delete a measure m
from the MD view. If a measure is added q’ = (Mq’, Lr,
Ls) with Mq’= Mq ∪ {m}; m ∈ M. If a measure is deleted
q’ = (Mq’, Lr, Ls) with Mq’= Mq - {m}; m ∈ M
Similar queries show that the user’s analysis intention
when executing them is similar. The change of intention
differs for different types of operations. For example a
rotate shows that the user is now interested in another
combination of parameters what might indicate that the
analysis task he is working on has evolved. Therefore, it
seems sensible to assign a higher weight to rotate
operations than to drilling operations when computing the
distance. Thus, the weighted distance between two queries is
informally defined as the weighted number of drilling
operations plus the weighted number of rotations plus the
weighted number of focus changes.</p>
      <p>To determine how many drilling/rolling operations are
necessary to navigate from one level to another level
(drilling distance), we define the drilling graph as follows.
Definition 4.3 (drilling graph): The drilling graph for a
given MD schema Ω with a set of corresponding levels Ψ
is defined as a tuple =(Ψ, E) where E⊆ Ψ×Ψ is
constructed using the following condition:
(l1,l2)∈E :⇔ ∃d=(H,classd)∈Ω: (l1,l2) ∈ classd ∨
(l2,l1) ∈ classd
Intuitively, the drilling graph contains all the levels of the
schema as nodes and the classification relationships of the
different dimensions as edges. The edges of the drilling
graph are undirected as drilling and rolling operations
should not be distinguished.</p>
      <p>Definition 4.4 (drilling distance of two levels):The
drilling distance│ l1-l2│ between two levels l1 and l2 is
defined as the minimum length of the path in the drilling
graph .</p>
      <p>Definition 4.5 (distance between two query
prototypes): We extend the definition of the function dim:
dim:2Ψ→2Ω maps a set of levels L⊆Ψ to their
corresponding dimensions.</p>
      <p>dim(L) := U dim(l)
l∈L
Furthermore for two sets of levels A,B ⊆Ψ let χ(A,B)
denote the corresponding level pairs that share the same
dimension.</p>
      <p>χ ( A, B) := {(a, b) ∈ A × B dim(a) = dim(b) }
The distance │ q - q’│ between two query prototypes q
= (Mq, Lr, Ls) and q’ = (Mq, Lr, Ls) is defined as follows:
℘q - ℘q′ := Δdrill + Δrotate + Δfocus
with
Δdrill =</p>
      <p>∑ l - l ′
(l,l′)∈χ (Lr∪Ls,Lr′∪Ls′)
Δrotate = (dim(Lr ’) ∪ dim(Lr ))- (dim(Lr ’) ∩ dim(Lr ))
Δfocus = M ′ ∪ M - (M ′ ∩ M )
Assuming that wd,wr, wf ∈ are the weights for the
different kinds of operations (e.g. drilling) The weighted
distance │ q - q’│ w between two query prototypes is
defined as:
℘q - ℘q′ w := wd ⋅ Δdrill + wr Δrotate + w f Δfocus .
w
Definition 4.6 (distance between two queries): The
(weighted) distance |q1-q2| between two query q1 and q2 is
defined as the (weighted) distance of their prototypes
│ q1 - q2│ . w
Example: Let us compare query q1 “Give me a ranking of
garages according to the number of repairs summed up to
the geographic region for 1998” with prototype
T = ( {#of repairs}, {geogr.region},</p>
      <p>{year, vehicle.all, customer.all})
and the query q2 = “Give me the number of repairs
detailed by vehicle model and month for the garage ‘XYZ’“
with</p>
      <p>T = ( {#of repairs}, {vehicle model, month},
2-6
{garage, customer.all} ).</p>
      <p>The following numbers of operation have to be
performed: Δdrill=2, Δrotate=3, Δfocus=0. Thus, assuming
the following weights wr=2,wd = w f= 1 theses two queries
have a weighted distance of</p>
      <p>|q1-q2|w = 2⋅3+1⋅2+1⋅0 = 8.</p>
      <p>The query q3 = “Give me the number of repairs for
different garages during the year 1998“ has the protoype
T = ( {#of repairs}, {garage},</p>
      <p>{year, vehicle.all, customer.all} ).</p>
      <p>Its weighted distance to query1 is |q1-q3|w=2⋅0+1⋅1+1⋅0=1
and its weighted distance to query2 is |q2-q3|w
=2⋅3+1⋅1+1⋅0 =7. Therefore, query3 is more similar to
query1 that to query2. w
4.3</p>
    </sec>
    <sec id="sec-5">
      <title>User/Task Profiles</title>
      <p>The previous section showed how to model the structure
of individual sessions. If a user solves an analysis task,
each of his sessions may be different as the user is free to
skip steps or include additional steps. Nevertheless, each
analysis task (e.g. “find garages which should participate
in a promotion”) possesses a characteristic interaction
pattern.</p>
      <p>Sessions
Session 1
Session 2
(1)
(4)
(5)</p>
      <p>Profile
(2)</p>
      <p>(3)
Query in Session
Step in Profile
This section deals with the definition of these typical
patterns (so-called user/task profiles). A session is a linear
sequence of queries and two subsequent queries have a
distance of 1. A user/task profile allows alternative
navigation pathes and does only contain prototypes of queries
that correspond to significant steps of the analysis process.
Thus, a single profile is an abstraction of a set of sessions.
The basic idea for the distinction between sessions and
profiles is shown in figure 4. On the left two distinct
sessions are visualized using an abstract two dimensional
projection of the query space. A profile is shown on the
right containing only queries that are significant to the
analysis task.</p>
      <p>Definition 4.5 (user/task profile): A user/task profile for
a query space ΘΩ is a labeled graph (V, E), where
n</p>
      <p>V ⊆ ΘΩ is a finite set of query prototypes (called
steps of the profile).
n</p>
      <p>E ⊆ V×V is the set of directed edges connecting query
prototypes. An edge from query prototype q1 to q2
means that sessions typically contain a query q1 (with
prototype q1) before a query q2 (with prototype q2).</p>
      <p>Notably, q1 and q2 can have an arbitrary distance.
To determine if a complete session ‘matches’ a given
profile, we define the notion of instances of profiles. w
Definition 4.6 (Instance of a user/task profile): A
session 6 := q1,...,qn is called an instance of a profile P =
(V,E) if there exists a profile path v1,...,vm in P and a
mapping I: → such that
(1) (q I(i)) = vi ∀1≤ i ≤ m
(2) i &gt; j ⇔ I(i) &gt; I(j) ∀1≤ i,j ≤ m w
Condition (1) formalizes the fact that for all of the steps of
the profile path vi the session 6 contains a query q with
index I(i) that has the prototype vi. Condition (2) ensures
that the queries in 6 occur in same order as in the profile
path.</p>
      <p>The sessions shown in figure 4 are instances of the profile
on the right. Session 2 contains the steps (1), (2), (3) of
the profile and session 1 contains the steps (1), (4), (3).
Other instances might also contain the sequence (1), (4),
(5)
4.4</p>
    </sec>
    <sec id="sec-6">
      <title>Graphical representation</title>
      <p>As user and task profiles are modeled during the
conceptual modeling phase together with the users of the system,
an intuitive graphical representation is necessary. For this
purpose we use an interaction graph which is a graphical
representation of a task/user profile. Each node of this
graph corresponds to a query prototype which is a mirror
of the user current state of intention.</p>
      <p>measures (Mq)
Selection
Levels (Ls)</p>
      <p>Result
Levels (Lr)</p>
      <sec id="sec-6-1">
        <title>Query Protoype</title>
        <p>It is depicted by an oval which contains three sections (see
figure 5). The set of result measures is contained in the
top part of the oval; selection levels are enumerated to left
and result levels are shown on the right. Furthermore each
section of the oval has an icon that depicts the
corresponding category. The icon shows an abstraction of a
table (as an example for an MD view) with the
corresponding area being gray.</p>
        <p>The edges of the profile correspond to the navigation of
the user. Albeit, following one edge in the profile might
include several individual queries steps for a session that
is an instance of the profile.
#of repairs
year
vehicle.all geogr. region
customer.all
Figure 6 shows a sample profile that models the analysis
task “find candidates for a new promotion” described in
section 1.2. The node (1) shows the predefined report
which serves as a starting point for this task. The
navigation path (1)-(4) corresponds to the workflow described in
the example. The alternative path from (2) to (5) models
the possibility that the user decides to find the most
promising garages for a promotion on the basis of the
customer base of this garage. Therefore, he analyses the
correlation between garages and customers for the current
year. As shown with nodes (1) and (2) ‘backward
navigation’ is also an option. This models the fact that the user
first analyses the figures for regions then drilling to see
the garages for one region but returning to the region view
if the results are not satisfactory. A cycle occurs if the user
executes queries that have the same prototype, i.e. queries
of the same structure with different selection criteria (e.g.
first analyzing 1997 figures the analyzing 1998 figures).
5</p>
        <p>An Architecture for the Prediction
of User Behavior
If typical interaction patterns (task and/or user specific)
are known, this information can be used to build query
profiles for users or user groups. This information can be
used to optimize the performance of an OLAP system at
runtime. The idea is to use the profiles together with the
know prefix of the query session to predict which queries
the user is likely to ask during the rest of the session. Thus
these queries or parts of the results can already be
computed while the user is busy formulating his next query. In
this section we sketch an architecture that allows for
extending an existing OLAP system with prediction
functionality.</p>
        <p>The component architecture of such an environment is
show in figure 7. This abstract architecture applies to all
OLAP systems irrespective of the storage strategy
(MOLAP, ROLAP etc.). The white elements are part of the
traditional OLAP system: an MD user interface which is
used by the analyst to formulate the query and which
visualizes the results. Via an interface (e.g. OLEDB for
OLAP, see [Mic98]), the queries are passed to an MD
query processor that performs optimization and maps the
query to the physical storage format of the MD data store
that contains a persistent image of the data. The query
processor stores information about the executed queries in
a query log. This abstract architecture applies to all OLAP
systems irrespective of the storage strategy (MOLAP,
ROLAP etc.).</p>
        <p>The gray components are added to the system in order to
provide the prediction functionality. The profile
information stores the user/task specific profiles (using the
formalism presented in section 4). These patterns are initially
constructed during the conceptual user modeling process
as described earlier in this paper. Another interesting
alternative is to build, validate and adapt these patterns by
analyzing the query logs. This task is carried out by the
profile builder which uses data mining/pattern recognition
techniques to derive new profiles or adapt existing
profiles. The core of the architecture is a prediction unit,
which is located between the query processor and the
frontend tool. It has the same interface to the user
inter2-8
user domain
system domain
speculative</p>
        <p>cache
‘standard’ OLAP</p>
        <p>component
additional component
the execution is finished. The results of all queries are
either passed to the user or stored in the speculative result
cache. The cache uses an appropriate replacement strategy
(e.g. a time stamp method).
6</p>
        <p>Conclusions and Future Work
The starting point of our research work was the
observation that capturing knowledge about the query behavior of
OLAP users is beneficial for the conceptual and physical
design of such a system but also can improve the systems
runtime performance. An analysis of the characteristics of
OLAP query behavior showed that user access is typically
multidimensional, navigational, explorative, session
oriented with task and user specific patterns.</p>
        <p>Therefore, we proposed a formal framework to model
knowledge about user behavior taking into account these
characteristics. We also presented a graphical notation to
visualize the models and make them usable during
conceptual design. To show the benefits of such an approach
we exemplarily focused on the deployment of the modeled
user/task profiles for increasing the system performance
using prediction techniques.</p>
        <p>The topic of our current and future work will be the
integration of the presented notation into our conceptual
modeling methodology and the validation of this modeling
technique in several real world projects. For this purpose
a modeling tool will extended to support the notation.
topics are the relationship between static and dynamic
model.</p>
        <p>The architecture proposed in the previous section raises a
lot of interesting research problems. It is necessary to
develop a formal prediction model in order to implement
the prediction algorithm. Another interesting point is the
face as the query processor (e.g. MDX) and predicts the
possible next queries of the user from the status of the
current session (the queries asked so far) and the profile
information. Following a predefined strategy it passed
predicted queries to the query processor for speculative
execution. The results are stored in a speculative result
cache.</p>
        <p>In figure 8 the dataflow and the processes inside the
prediction unit are shown (the parts where the formalism
presented in this paper is used are marked gray). The
frontend passes a new query to the request handler. The
handler checks if this query can be answered from the
speculative result cache, i.e. has been correctly predicted.
If not, the query is passed to the query scheduler which is
responsible for passing the query on to the OLAP query
processor. In any case the session state for the current
session is updated (i.e. the current query prototype is
added to the session prefix). This information together
with the profile information is used by the ‘prediction
model’-process which generates queries that are most
likely to be asked next and estimates their probability. In
this process it makes use of the distance measure defined
in section 4 as an approximation of similarity.</p>
        <p>A cost model process passes the estimated database
execution costs of the queries to a decision model process
which decides if the query should be executed
speculatively. In this case it passes the query to the query
scheduler with a flag that the results of this query should be
stored in the cache instead of being sent back to the user
immediately. When the query scheduler receives a new
query for execution, it first checks if such a query is
already being executed at the moment. In this case the query
is not passed on but answered using the results of the
running query. This case can occur if a query that is being
speculatively executed is actually asked by the user before
2-9
component
process
database
storage
covered in this paper
speculative
result
cache</p>
        <p>precomputed
results
results
of speculative
query
new
query
new
query
request
handler
query
scheduler
session state
possible next</p>
        <p>queries
cost
model
speculative query
prediction</p>
        <p>model
estimated
costs
decision
model
probability
prediction unit</p>
      </sec>
      <sec id="sec-6-2">
        <title>MD query processor</title>
        <p>derivation of user/task profiles from the saved interaction
logs using data mining and pattern recognition techniques.
The question of how to combine the manually modeled
user profiles and the automatically generated ones
immediately arises. Physical implementation issues are certainly
worth further studies (e.g. how to organize the speculative
result cache and which caching strategies to use).
Our plans are to implement the above architecture on top
of a commercial OLAP system and to evaluate the
prediction algorithm using real world interaction logs.
Until now we only focused on using the profile
information outside the kernel of the OLAP system. Further
potential optimization become possible when integrating the
prediction functionality into the kernel of the OLAP query
processing system (e.g. special caching techniques can
make use of the prediction results).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>I want to express my gratitude to my colleagues Markus
Blaschka, Gabriele Höfling and Wolfgang Wohner for
reading through several versions of this paper and
discussing this work with me. Their most valuable comments
played a central role during the development of this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [ABD+99]
          <string-name>
            <given-names>J.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Deyerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Günzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hümmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          <article-title>Schlesinger: Management of Multidimensional Aggregates for Efficient Online Analytical Processing</article-title>
          .
          <source>Proc. of the IDEAS</source>
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [CT98a]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cabibbo</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <article-title>Torlone: A Logical Approach to Multidimensional Databases</article-title>
          .
          <source>Proceedings of the EDBT</source>
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [CT98b]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cabibbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          .
          <article-title>From a Procedural to a Visual Query Language for OLAP</article-title>
          .
          <source>In 10th IEEE International Conference on Scientific and Statistical Database Management (SSDBM-98)</source>
          , Capri, Italy
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [DSBH98]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dinter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sapia</surname>
          </string-name>
          , G. Höfling,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Blaschka: The OLAP Market: State of the Art and Research Issues</article-title>
          ,
          <source>Proc.of First International Workshop on Data Warehousing and OLAP (DOLAP)</source>
          ,
          <year>1998</year>
          , Washington,
          <string-name>
            <surname>D.C.</surname>
          </string-name>
          , USA..
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [FB99]
          <string-name>
            <given-names>P.</given-names>
            <surname>Furtado</surname>
          </string-name>
          , P. Baumann:
          <article-title>Storage of Multidimensional Arrays based on Arbitrary Tiling</article-title>
          ,
          <source>Proc. of the 15th International Conference on Data Engineering (ICDE)</source>
          ,
          <article-title>Sydney 1999</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [GHR+97]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Harinarayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.D.</surname>
          </string-name>
          <article-title>Ullman: Index Selection for OLAP</article-title>
          .
          <source>Proceedings of the International Conference on Data Engineering (ICDE)1997</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>[GMR98] M. Golfarelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <article-title>Conceptual design of data warehouses from E/R schemes</article-title>
          ,
          <source>Proc. 31st Hawaii Intl. Conf. on System Sciences</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>[GR98] M. Golfarelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <article-title>A Methodological Framework for Data Warehouse Design</article-title>
          , ACM DOLAP Workshop, Washington 1998
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Mic98]
          <string-name>
            <given-names>Microsoft</given-names>
            <surname>Corp</surname>
          </string-name>
          .,
          <source>OLEDB for OLAP Specification 1</source>
          .
          <fpage>0</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [SBHD99]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sapia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blaschka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Höfling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dinter</surname>
          </string-name>
          , Extending the
          <string-name>
            <surname>E</surname>
          </string-name>
          /
          <article-title>R Model for the Multidimensional Paradigm</article-title>
          ,
          <source>in "Advances in Database Technologies"</source>
          , Springer LNCS 1552,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [SBH99]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sapia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blaschka</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Höfling, An Overview of Multidimensional Data Models</article-title>
          ,
          <source>FORWISS Report FR-1999- 001</source>
          , http://www.forwiss.tu-muenchen.de/~system42.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [SDN98]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shukla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          <article-title>: Materialized View Selection for Multidimensional Datasets</article-title>
          ,
          <source>Proceedings of the VLDB</source>
          <year>1998</year>
          , Athens, Greece.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [TS98]
          <string-name>
            <given-names>D.</given-names>
            <surname>Theodoratos</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <article-title>Sellis: Data Warehouse Schema and instance Design</article-title>
          ,
          <source>in Conceptual Modeling - ER'98</source>
          , Springer LNCS Vol.
          <volume>1507</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Vas98]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis: Modeling Multidimensional</surname>
          </string-name>
          <string-name>
            <surname>Databases</surname>
          </string-name>
          , Cubes and
          <string-name>
            <given-names>Cube</given-names>
            <surname>Operations</surname>
          </string-name>
          .
          <source>Conference on Statistical and Scientific Databases (SSDBM)</source>
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [ZAN99]
          <string-name>
            <surname>Zukerman</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albrecht</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Nicholson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Predicting</surname>
            <given-names>Users</given-names>
          </string-name>
          '
          <article-title>Requests on the WWW</article-title>
          .
          <source>In Proc. of the 7th International Conference on User Modeling</source>
          , Springer-Verlag
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>