                                             Towards Conversational OLAP
                  Matteo Francia                                          Enrico Gallinucci                             Matteo Golfarelli
          DISI - University of Bologna                               DISI - University of Bologna                  DISI - University of Bologna
              m.francia@unibo.it                                      enrico.gallinucci@unibo.it                    matteo.golfarelli@unibo.it

ABSTRACT

The democratization of data access and the adoption of OLAP in scenarios requiring hands-free interfaces push towards the creation of smart OLAP interfaces. In this paper, we envisage a conversational framework specifically devised for OLAP applications. The system converts natural language text into GPSJ (Generalized Projection, Selection and Join) queries. The approach relies on an ad-hoc grammar and on a knowledge base storing multidimensional metadata and cube values. In case of an ambiguous or incomplete query description, the system is able to obtain the correct query either through automatic inference or through interactions with the user to disambiguate the text. Our tests show very promising results, both in terms of effectiveness and efficiency.

1 INTRODUCTION

Nowadays, one of the most popular research trends in computer science is the democratization of data access, analysis and visualization, which means opening them to end users lacking the required vertical skills on the services themselves. Smart personal assistants [10] (Alexa, Siri, etc.) and automated machine learning services [20] are examples of such research efforts, which are now on corporate agendas [1].

In particular, interfacing natural language processing (either written or spoken) with database systems opens new opportunities for data exploration and querying [13]. Actually, in the area of data warehousing, OLAP (On-Line Analytical Processing) itself is an "ante litteram" smart interface, since it supports users with a "point-and-click" metaphor that avoids writing well-formed SQL queries. Nonetheless, the possibility of having a conversation with a smart assistant to run an OLAP session (i.e., a set of related OLAP queries) opens new scenarios and applications. It is not just a matter of further reducing the complexity of posing a query: a conversational OLAP system must also provide feedback to refine and correct wrong queries, and it must have memory to relate subsequent requests. A reference application scenario for this kind of framework is augmented business intelligence [9], where hands-free interfaces are mandatory.

In this paper, we envisage a conversational OLAP framework able to convert a natural language text into a GPSJ query. GPSJ [11] is the main class of queries used in OLAP, since it enables Generalized Projection, Selection and Join operations over a set of tables. Although some natural language interfaces to databases have already been proposed, to the best of our knowledge this is the first proposal specific to OLAP systems.

In our vision, the desiderata for an OLAP smart interface are:

#1 it must be automated and portable: it must exploit cube metadata (e.g. hierarchy structures; the role of measures, attributes, and aggregation operators) to increase its understanding capabilities and to simplify the user-machine interaction process;
#2 it must handle OLAP sequences rather than single queries: in an OLAP sequence the first query is fully described by the text, while the following ones are implicitly/partially described by an OLAP operator and require the system to have memory of the previous ones;
#3 it must be robust with respect to user inaccuracies in using syntax, OLAP terms, and attribute values, as well as in the presence of implicit information;
#4 it must be easy to configure on a DW, without a heavy manual definition of the lexicon.

More technically, our text-to-SQL approach is based on a grammar parsing natural language descriptions of GPSJ queries. The recognized entities include a set of typical query lexicons (e.g. group by, select, filter) and the domain-specific terms and values automatically extracted from the DW (see desiderata #1 and #4). Robustness (desideratum #3) is one of the main goals of our approach and is pursued in all the translation phases: lexicon identification is based on a string similarity function, multi-word lexicons are handled through n-grams, and alternative query interpretations are scored and ranked. The grammar proposed in Section 4.3 recognizes full queries only, thus desideratum #2 is not covered by the current work. To sum up, the main contributions of this paper are:

(1) a list of features and desiderata for an effective conversational OLAP system;
(2) an architectural view of the whole framework;
(3) an original approach to translate the natural language description of an OLAP query into a well-formed GPSJ query (the Full query module in Figure 1);
(4) a set of tests to verify the effectiveness of our approach.

The remainder of the paper is organized as follows: in Section 2 the functional architecture and the modules of a conversational OLAP framework are sketched; Section 3 discusses related works; Section 4 describes in detail the query generation process; in Section 5 a large set of effectiveness and efficiency tests is reported; finally, Section 6 draws the conclusions and discusses the system evolution.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 SYSTEM OVERVIEW

Figure 1 sketches a functional view of the architecture. Given a DW, that is, a set of multidimensional cubes together with their metadata, the offline phase is aimed at extracting the DW-specific terms used by users to express queries. Such information is stored in the system knowledge base (KB). Noticeably, this phase runs only once for each DW, unless the DW undergoes modifications. More in detail, the Automatic KB feeding process extracts categorical attribute values and metadata (e.g. attribute and measure names, hierarchy structures, aggregation operators) from the cubes. With reference to the cube represented through a DFM schema in Figure 2, the KB will store the names of the measures (e.g. StoreSales, StoreCost) together with their default aggregation operators (e.g. sum, avg). Additional synonyms can be automatically extracted from open data ontologies (WordNet [16] in our implementation); for example, "client" is a synonym for Customer. The larger the number of synonyms, the wider the
language understood by the system. Besides the domain-specific terminology, the KB includes the set of standard OLAP terms that are domain independent and do not require any feeding (e.g. group by, equal to, select). Further enrichment can optionally be carried out manually (i.e., by the KB enrichment module) when the application domain involves a non-standard vocabulary (i.e., when the physical names of tables and columns do not match the words of a standard vocabulary).
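As an illustration of the synonym-feeding step, the minimal sketch below pulls synonyms from WordNet through the NLTK interface; the helper name and the post-processing are ours, not the framework's API, and the snippet assumes nltk and its wordnet corpus are installed.

    # Minimal sketch of the Automatic KB feeding synonym step via WordNet.
    # Assumes: pip install nltk, then nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    def synonyms(term: str) -> set[str]:
        """Collect WordNet lemma names for a DW element name."""
        return {
            lemma.name().replace("_", " ")
            for synset in wn.synsets(term)
            for lemma in synset.lemmas()
        } - {term}

    # synonyms("customer") includes "client", so the KB can map the spoken
    # word "client" to the Customer attribute of Figure 2.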
The online phase runs every time a query is issued to the system. The spoken query is initially translated to text by the Speech-to-Text module. This task is out of the scope of our research, and we exploited the Google API in our implementation. The uninterpreted text is then analyzed by the Interpretation module, which actually consists of two sub-modules: Full query is in charge of interpreting the texts describing full queries, which typically happens when an OLAP session starts. Conversely, OLAP operator modifies the latest query when the user states an OLAP operator along an OLAP session. On the one hand, understanding a single OLAP operator is simpler, since it involves fewer elements. On the other hand, it requires having memory of the previous queries (stored in the Log) and understanding which part of the previous query must be modified.

Due to natural language ambiguity, user errors and system inaccuracies, part of the text can be misunderstood. The role of the Disambiguation module is to solve ambiguities by asking appropriate questions to the user. The reasons behind the misunderstandings are manifold, including (but not limited to): (1) ambiguities in the aggregation operator to be used; (2) inconsistency between attribute and value in a selection predicate; (3) identification of relevant elements in the text without full comprehension.

The output of the previous steps is a data structure (i.e., a parse tree) that models the query and that can be automatically turned into an SQL query by exploiting the DW structure stored in the KB. Finally, the obtained query is run on the DW and the results are reported to the user by the Execution & Visualization module. This module could exploit a standard OLAP visualization tool, or it could implement voice-based approaches [21] to create an end-to-end conversational solution.
3 RELATED WORKS

Conversational business intelligence can be classified as a natural language interface (NLI) to business intelligence systems used to drive analytic sessions. Despite the plethora of contributions in each area, to the best of our knowledge, no approach lies at their intersection.

NLIs to operational databases enable users to specify complex queries without previous training on formal programming languages (such as SQL) and software; a recent and comprehensive survey is provided in [2]. Overall, NLIs are divided into two categories: question answering (QA) and dialog (D). While the former are designed to operate on single queries, only the latter are capable of supporting sequences of related queries as needed in OLAP analytic sessions. However, to the best of our knowledge, no dialog-based system for OLAP sessions has been provided so far. The only contribution in the dialog-based direction is [14], where the authors provide an architecture for querying relational databases; with respect to this contribution, we rely on the formal foundations of the multidimensional model to drive analytic sessions (e.g., according to the multidimensional model it is not possible to group by a measure, compute aggregations of categorical attributes, aggregate by descriptive attributes, or violate drill-across validity). Also differently from [14], the results we provide are supported by an extensive effectiveness and efficiency evaluation, which is completely lacking in [14]. Finally, existing dialog systems, such as [17], address the exploration of linked data; hence, they are not suitable for analytics on the multidimensional model. As to question answering, existing systems are well understood and differ in the knowledge required to formulate the query and in the generative approach. Domain-agnostic approaches solely rely on the database schema. NaLIR [13] translates natural language queries into dependency trees [15] and brute-forcefully transforms promising trees until a valid query can be generated. In our approach, we rely on n-grams instead of dependency trees [15], since the latter cannot be directly mapped to entities in the knowledge base (i.e., they require tree manipulation) and are sensitive to the query syntax (e.g., "sum unit sales" and "sum the unit sales" produce two different trees with the same meaning). SQLizer [22] generates templates over the issued query and applies a "repair" loop until it generates queries that can be obtained using at most a given number of changes from the initial template. Domain-specific approaches add semantics to the translation process by means of domain-specific ontologies and ontology-to-database mappings. SODA [5] uses a simple but limited keyword-based approach that generates a reasonable and executable SQL query based on the matches between the input query and the database metadata, enriched with domain-specific ontologies. ATHENA [18] and its recent extension [19] map natural language into an ontology representation and exploit mappings crafted by the relational schema designer to resolve SQL queries. Analyza [6] integrates the domain-specific ontology into a "semantic grammar" (i.e., a grammar with placeholders for typed concepts such as measures, dimensions, etc.) to annotate and finally parse the user query. Additionally, Analyza provides an intuitive interface facilitating user-system interaction in spreadsheets. Unfortunately, by relying on the definition of domain-specific knowledge and mappings, the adoption of these approaches is not plug-and-play, as an ad-hoc ontology is rarely available and is burdensome to create.

In the area of business intelligence, the road to conversation-driven OLAP is not paved yet. The recommendation of OLAP sessions to improve data exploration is well understood [3], also in unconventional contexts [9] where hands-free interfaces are mandatory. Recommendation systems focus on integrating (previous) user experience with external knowledge to suggest queries or sessions, rather than on providing smart interfaces to BI tools. To this end, personal assistants and conversational interfaces can help users unfamiliar with such tools and with the SQL language to perform data exploration. However, end-to-end frameworks are not provided in the domain of analytic sessions over multidimensional data. QUASL [12] introduces a QA approach over the multidimensional model that supports analytical queries but lacks both the formalization of the disambiguation process (i.e., how ambiguous results are addressed) and the support of OLAP sessions (with respect to QA, handling OLAP sessions requires managing previous knowledge from the Log and understanding whether the issued sentence refines the previous query or is a new one). Complementarily to our framework, [21] recently formalized the vocalization of OLAP results.

To summarize, the main differences of our approach with respect to previous works are the following.

(1) The envisioning of an end-to-end general-purpose dialog-driven framework supporting OLAP sessions.
(2) The definition of the framework's functional architecture and the formalization of its modules.
(3) The enforcement of multidimensional constraints by means of a formal grammar supporting GPSJ queries.
(4) A plug-and-play implementation of the core modules on top of existing data warehouses: our framework does not impact existing schemata, and the integration with external knowledge is supported but not mandatory.
[Figure 1: A functional architecture for a conversational OLAP framework. Online phase: a spoken query (e.g., "Sales by Customer and Month") goes through Speech-to-Text; the raw text enters the Interpretation module (Full query and OLAP operator, the latter backed by the Log); the annotated parse forest is resolved by Disambiguation into a parse tree, translated by SQL generation into SQL, and executed by Execution & Visualization against the DW. Offline phase: Automatic KB feeding loads metadata and values from the DW into the KB, with synonyms coming from an external ontology and from the manual KB enrichment module.]

[Figure 2: Simplified DFM representation of the Foodmart Sales cube. Measures: StoreSales, StoreCost, UnitSales. Dimensions: Store (Name, S.City, S.Region), Customer (Gender, C.City, C.Region), Product (Name, Type, Category, Family), Date (Month, Quarter, Year).]
4 QUERY GENERATION

In the next subsections, we describe in detail the interpretation process for a single full query. Figure 3 shows the necessary steps of the online phase. We remind the reader that, with reference to Figure 1, the Speech-to-Text, OLAP operator, and Execution & Visualization modules are out of scope here. Conversely, the offline phase and the related modules are implemented but not reported.
4.1 The Knowledge Base

The system Knowledge Base (KB in Figure 1) stores all the information extracted from the DW that is necessary to support the translation phases. Information can be classified into DW structure and DW entities. More in detail, the DW structure enables consistency checks on parse trees and, given a parse tree, allows the SQL generation. It includes:

• Hierarchy structure: the roll-up relationships between attributes.
• Aggregation operators: for each measure, the applicable operators and the default one.
• DB tables: the structure of the database implementing the DW, including table and attribute names, and primary and foreign key relationships.

The set of DW entities includes:

• DW element names: measure, dimensional attribute, and fact names.
• DW element values: for each categorical attribute, all the values are stored.

Several synonyms can be stored for each entity, enabling the system to cope with slang and different shades of the text. Both DW entities and DW structure are automatically collected by querying the DW and its data dictionary, and are stored in a QB4OLAP [8] compatible repository. Besides DW entities, which are domain specific, the knowledge base includes the keywords and patterns that are typically used to express a query:

• Intention keywords: express the role of the subsequent part of the text. Examples of intention keywords are group by, select and filter.
• Operators: include logic (e.g. and, or), comparison (e.g. greater, equal) and aggregation (e.g. sum, average) operators.
• Patterns of dates and numbers: used to automatically recognize dates and numbers in the raw text.

Overall, the entities defined in the system are denoted as E = {E1, E2, ..., Em}; note that each entity can be a multi-word, modeled as a sequence of tokens E = ⟨t1, ..., tr⟩.
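To fix ideas, the following is a minimal sketch of how such KB entries could be represented in memory; the record fields and names are ours, not the paper's schema, while the sample content mirrors the Foodmart examples above.

    # Illustrative in-memory representation of KB entities (not the paper's schema).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Entity:
        tokens: tuple[str, ...]     # entities may be multi-word: E = <t1, ..., tr>
        category: str               # syntactic category, e.g. "Mea", "Attr", "Val", "Gby"
        synonyms: frozenset[str] = frozenset()

    KB = [
        Entity(("UnitSales",), "Mea"),
        Entity(("Region",), "Attr"),
        Entity(("group", "by"), "Gby", frozenset({"by", "for each", "per"})),
        Entity(("avg",), "Agg", frozenset({"medium", "average"})),
    ]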
[Figure 3: Query generation steps for a text describing a complete query (e.g., "Sales by Customer and Month"): Tokenization & Mapping produces the mappings, Parsing builds the parse forest, Checking & Enhancement annotates it, Disambiguation reduces it to a parse tree, and SQL generation emits the SQL. The KB is involved in all the steps.]

4.2 Tokenization and Mapping

A raw text T can be modeled as a sequence of tokens (i.e., single words) T = ⟨t1, t2, ..., tz⟩. The goal of this phase is to identify in T the known entities, which are the only elements involved in the Parsing phase.

Turning a text into a sequence of entities means finding a mapping between the tokens in T and E. More formally:

Definition 4.1 (Mapping & Mapping function). A mapping function M(T) is a partial function that associates sub-sequences¹ of T to entities in E such that:

• sub-sequences of T have length n at most;
• the mapping function determines a partitioning of T;
• a sub-sequence T′ = ⟨ti, ..., tl⟩ of T (with l ≤ n) is associated to an entity E if and only if E is among the top-N entities of E by similarity with T′ and Sim(T′, E) > α.

¹The term n-gram is used as a synonym of sub-sequence in the area of text mining.


The output of a mapping function is a sequence M = ⟨E1, ..., El⟩ on E that we call a mapping. Given the mapping M, the similarities between each entity Ei and the corresponding tokens Sim(T′, Ei) are also retained and denoted with SimM(Ei). A mapping is said to be valid if the fraction of mapped tokens in T is higher than a given threshold β. We call ℳ the set of valid mappings.

Several mapping functions (and thus several mappings) may exist between T and E, since Definition 4.1 admits sub-sequences of variable lengths and retains, for each sub-sequence, the top similar entities. This increases interpretation robustness, but it can lead to an increase in computation time. Generating several different mappings increases robustness since it allows choosing, in the next steps, the best text interpretation out of a higher number of candidates. Generated mappings differ both in the number of entities involved and in the specific entities mapped to a token. In the simple case where multi-token mappings are not possible (i.e., n = 1 in Definition 4.1), the number of generated mappings for a raw text T, such that |T| = z, is:
\[ \sum_{i=\lceil z \cdot \beta \rceil}^{z} \binom{z}{i} \cdot N^{i} \]

The formula counts the possible configurations of sufficient length (i.e., higher than or equal to ⌈z · β⌉) and, for each length, counts the number of mappings determined by the top similar entities. Since the number of candidate mappings is exponential, we consider only the most significant ones through α, β, and N: α imposes that sub-sequences of tokens be very similar to an entity; N further imposes that only the N entities with the highest similarity be considered; finally, β imposes that a sufficient portion of the text be mapped.
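The sketch below illustrates the candidate-generation step under the α and N thresholds, assuming the illustrative Entity records of Section 4.1 and a token-sequence similarity sim_fn (the Sim() defined next); all function names are ours.

    # Sketch of candidate generation: every [1..n_max]-gram of the text is
    # matched against the KB, keeping the top-N entities above alpha.
    def ngrams(tokens: list[str], n_max: int = 4) -> list[tuple[int, int]]:
        """All (start, end) index pairs of sub-sequences of length 1..n_max."""
        return [(i, i + n) for n in range(1, n_max + 1)
                for i in range(len(tokens) - n + 1)]

    def candidates(tokens, kb, sim_fn, alpha=0.4, top_n=4):
        """For each n-gram, the top-N entities whose similarity exceeds alpha."""
        out = {}
        for i, j in ngrams(tokens):
            scored = sorted(((sim_fn(tokens[i:j], list(e.tokens)), e) for e in kb),
                            key=lambda pair: pair[0], reverse=True)
            matches = [(s, e) for s, e in scored[:top_n] if s > alpha]
            if matches:
                out[(i, j)] = matches
        return out

Valid mappings are then built by combining these per-n-gram candidates into partitions of the text covering at least a fraction β of its tokens.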
The similarity function Sim() is based on the Levenshtein distance and takes token permutations into account, so that similarity is robust to token reordering (i.e., the sub-sequences ⟨P., Edgar⟩ and ⟨Edgar, Allan, Poe⟩ must result similar). Given two token sequences T and W with |T| = l and |W| = m, such that l ≤ m, it is:

\[ Sim(\langle t_1, ..., t_l \rangle, \langle w_1, ..., w_m \rangle) = \max_{D \in Disp(l,m)} \frac{\sum_{i=1}^{l} sim(t_i, w_{D(i)}) \cdot \max(|t_i|, |w_{D(i)}|)}{\sum_{i=1}^{l} \max(|t_i|, |w_{D(i)}|) + \sum_{i \in \hat{D}} |w_i|} \]

where D ∈ Disp(l, m) is an l-disposition of {1, ..., m} and D̂ is the subset of values in {1, ..., m} that are not present in D. The function Sim() weights token similarities by token lengths (i.e., max(|ti|, |w_D(i)|)) and penalizes similarities between sequences of different lengths, which imply unmatched tokens (i.e., the term Σ_{i∈D̂} |wi|).
Example 4.2 (Token similarity). Figure 4 shows some of the possible token dispositions for the two token sequences T = ⟨P., Edgar⟩ and W = ⟨Edgar, Allan, Poe⟩. The disposition determining the highest similarity is surrounded by a dashed rectangle; the similarity is 0.46 and it is calculated as

\[ Sim(T, W) = \frac{sim(P., Poe) \cdot |Poe| + sim(Edgar, Edgar) \cdot |Edgar|}{|Poe| + |Edgar| + |Allan|} \] □

[Figure 4: Token dispositions between ⟨P., Edgar⟩ and ⟨Edgar, Allan, Poe⟩: arcs denote the correspondence of tokens for a specific disposition (the bolder the line, the higher the similarity). The disposition determining the maximum similarity is shown in a dashed rectangle.]

We assume the best interpretation of the input text to be the one where (1) all the entities discovered in the text are included in the query (i.e., all the entities are parsed through the grammar) and (2) each entity discovered in the text is perfectly mapped to one sub-sequence of tokens (i.e., SimM(Ei) = 1). The two previous statements are modeled through the following score function. Given a mapping M = ⟨E1, ..., Em⟩, we define its score as

\[ Score(M) = \sum_{i=1}^{m} Sim_M(E_i) \qquad (1) \]

The score is higher when M includes several entities with high values of SimM. Although at this stage it is not possible to predict whether a mapping will be fully parsed, it is apparent that the higher the mapping score, the higher its probability to determine an optimal interpretation. As will be explained in Section 4.4, ordering the mappings by descending score also enables pruning strategies to be applied.

Example 4.3 (Tokenization and mapping). Given the set of entities E and a tokenized text T = ⟨medium, sales, in, 2019, by, the, region⟩, examples of mappings M1 and M2 are:

M1 = ⟨avg, UnitSales, where, 2019, group by, Region⟩
M2 = ⟨avg, UnitSales, where, 2019, group by, Regin⟩

where the token the is not mapped, and the token region is mapped to the attribute Region in M1 and to the value Regin in M2 (where Regin is a value of the attribute Customer that holds a sufficient similarity). □
Table 1: Generic and domain-specific syntactic categories.

    Category   Entity                   Synonym samples
    Int        select                   return, show, get
    Whr        where                    in, such that
    Gby        group by                 by, for each, per
    Cop        =, <>, >, <, ≥, ≤        equal to, greater than
    Agg        sum, avg                 total, medium
    Cnt        count, count distinct    number, amount
    Fct        Facts                    Domain specific
    Mea        Measures                 Domain specific
    Attr       Attributes               Domain specific
    Val        Categorical values       Domain specific
               Dates and numbers        –

Figure 5: Backus-Naur representation of the grammar.

    ⟨GPSJ⟩ ::= ⟨MC⟩⟨GC⟩⟨SC⟩ | ⟨MC⟩⟨SC⟩⟨GC⟩ | ⟨SC⟩⟨GC⟩⟨MC⟩
             | ⟨SC⟩⟨MC⟩⟨GC⟩ | ⟨GC⟩⟨SC⟩⟨MC⟩ | ⟨GC⟩⟨MC⟩⟨SC⟩
             | ⟨MC⟩⟨SC⟩ | ⟨MC⟩⟨GC⟩ | ⟨SC⟩⟨MC⟩ | ⟨GC⟩⟨MC⟩
             | ⟨MC⟩
    ⟨MC⟩   ::= (⟨Agg⟩⟨Mea⟩ | ⟨Mea⟩⟨Agg⟩ | ⟨Mea⟩ | ⟨Cnt⟩⟨Fct⟩ | ⟨Cnt⟩⟨Attr⟩)+
    ⟨GC⟩   ::= ⟨Gby⟩⟨Attr⟩+
    ⟨SC⟩   ::= ⟨Whr⟩⟨SCO⟩
    ⟨SCO⟩  ::= ⟨SCA⟩ "or" ⟨SCO⟩ | ⟨SCA⟩
    ⟨SCA⟩  ::= ⟨SCN⟩ "and" ⟨SCA⟩ | ⟨SCN⟩
    ⟨SCN⟩  ::= "not" ⟨SSC⟩ | ⟨SSC⟩
    ⟨SSC⟩  ::= ⟨Attr⟩⟨Cop⟩⟨Val⟩ | ⟨Attr⟩⟨Val⟩ | ⟨Val⟩⟨Cop⟩⟨Attr⟩ | ⟨Val⟩⟨Attr⟩ | ⟨Val⟩
4.3 Query Parsing

Parsing a valid mapping means searching the text for the complex syntactic structures (i.e., clauses) that build up the query. Given a mapping M, the output of a parser is a parse tree PTM, i.e., an ordered, rooted tree that represents the syntactic structure of a string according to a given grammar. For the purpose of parsing, entities (i.e., terminal elements in the grammar) are grouped into syntactic categories (see Table 1). The whole grammar is described in Figure 5. As a GPSJ query consists of three clauses (measure, group by and selection), in our grammar we identify four types of derivations²:

• Measure clause ⟨MC⟩: this derivation consists of a list of measure/aggregation operator pairs. If the operator is omitted (i.e., ⟨MC⟩ ::= ⟨Mea⟩ applies), the default one specified in the KB will be applied to the measure during the Checking & Enhancement step.
• Group by clause ⟨GC⟩: this derivation consists of a sequence of attribute names preceded by an entity of type ⟨Gby⟩.
• Selection clause ⟨SC⟩: this derivation consists of a Boolean expression of simple selection predicates ⟨SSC⟩; it follows standard SQL operator priority (see the ⟨SCO⟩, ⟨SCA⟩, and ⟨SCN⟩ derivations and the parsing sketch below).
• GPSJ query ⟨GPSJ⟩: this derivation assembles the final query. Only the measure clause is mandatory, since a GPSJ query could aggregate a single measure with no selections. The order of clauses is irrelevant; this implies the proliferation of derivations due to clause permutations.

The GPSJ grammar is LL(1)³ [4], is not ambiguous (i.e., each mapping admits a single parse tree PTM) and can be parsed by an LL(1) parser with linear complexity [4]. If the input mapping M can be fully parsed, PTM will include all the entities as leaves. Conversely, if only a portion of the input belongs to the grammar, an LL(1) parser will produce a partial parsing, meaning that it will return a parse tree including the portion of the input mapping that belongs to the grammar. The remaining entities can be either singletons or complex clauses that could not be connected to the main parse tree. We call parse forest PFM the union of the parse tree with the residual clauses. Obviously, if all the entities are parsed, it is PFM = PTM. Considering the whole forest rather than the simple parse tree enables disambiguation and error recovery during the Disambiguation phase.

Example 4.4 (Parsing). Figure 6 reports the parsing outcome for the two mappings in Example 4.3. M1 is fully parsed, thus its parse forest corresponds to the parse tree (i.e., PTM1 = PFM1). Conversely, in M2 the last token is wrongly mapped to the attribute value Regin rather than to the attribute name Region. This prevents the full parsing, and the parse tree PTM2 does not include all the entities in M2.

[Figure 6: Parsing forests from Example 4.3: (a) the entire mapping M1 is parsed (i.e., PFM1 = PTM1); (b) in M2, ⟨Gby⟩⟨Val⟩ cannot be parsed.]

²A derivation in the form ⟨X⟩ ::= e represents a substitution for the non-terminal symbol ⟨X⟩ with the given expression e. Symbols that never appear on the left side of ::= are named terminals. Non-terminal symbols are enclosed between ⟨⟩.
³The rules presented in Figure 5 do not satisfy LL(1) constraints for readability reasons. It is easy to turn such rules into an LL(1)-compliant version, but the resulting rules are much harder to read and understand.
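To illustrate how the ⟨SCO⟩/⟨SCA⟩/⟨SCN⟩ derivations give "not", "and" and "or" the standard SQL priority, the toy recursive-descent fragment below parses an already-mapped token sequence; it is a simplified sketch, not the system's LL(1) parser, and ⟨SSC⟩ is reduced to a single leaf predicate.

    def parse_sco(toks, i):        # <SCO> ::= <SCA> "or" <SCO> | <SCA>
        left, i = parse_sca(toks, i)
        if i < len(toks) and toks[i] == "or":
            right, i = parse_sco(toks, i + 1)
            return ("or", left, right), i
        return left, i

    def parse_sca(toks, i):        # <SCA> ::= <SCN> "and" <SCA> | <SCN>
        left, i = parse_scn(toks, i)
        if i < len(toks) and toks[i] == "and":
            right, i = parse_sca(toks, i + 1)
            return ("and", left, right), i
        return left, i

    def parse_scn(toks, i):        # <SCN> ::= "not" <SSC> | <SSC>
        if toks[i] == "not":
            node, i = parse_ssc(toks, i + 1)
            return ("not", node), i
        return parse_ssc(toks, i)

    def parse_ssc(toks, i):        # <SSC> reduced to a leaf predicate here
        return ("pred", toks[i]), i + 1

    tree, _ = parse_sco(["A", "and", "not", "B", "or", "C"], 0)
    # tree == ("or", ("and", ("pred","A"), ("not", ("pred","B"))), ("pred","C"))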
    Annotation type            Gen. derivation sample        Example
    Ambiguous Attribute        ⟨SSC⟩ ::= ⟨Val⟩               "sum unit sales for Salem", but Salem is a member of both City and StoreCity
    Ambiguous Agg. Operator    ⟨MC⟩ ::= ⟨Mea⟩                "unit sales by product", but sum and avg are both valid aggregations
    Attribute-Value Mismatch   ⟨SSC⟩ ::= ⟨Attr⟩⟨Cop⟩⟨Val⟩    "sum unit sales for product New York", but New York is not a Product
    MD-Meas Violation          ⟨MC⟩ ::= ⟨Agg⟩⟨Mea⟩           "sum prices per store", but Price is not additive
    MD-GBY Violation           ⟨GC⟩ ::= ⟨Gby⟩⟨Attr⟩+         "average prices by name", but Product is not in ⟨GC⟩
    Unparsed clause            –                             "average unit sales by Regin", but Regin is not an attribute

Table 2: Annotation types.


4.4 Knowledge-based Parse Forest Checking and Enhancement

The Parsing phase verifies the adherence of the mapping to the GPSJ grammar, but this does not guarantee the executability of the corresponding SQL query, due to implicit elements, ambiguities and errors. The following list enumerates the cases that require further processing.

• Ambiguous attribute: the ⟨SSC⟩ clause has an implicit attribute and the parsed value belongs to multiple attribute domains.
• Ambiguous aggregation operator: the ⟨MC⟩ clause has an implicit aggregation operator, but the measure is associated with multiple aggregation operators.
• Attribute-value mismatch: the ⟨SSC⟩ clause includes a value that is not valid for the specified attribute, i.e., it is either outside or incompatible with the attribute domain.
• Violation of a multidimensional constraint on a measure: the ⟨MC⟩ clause contains an aggregation operator that is not allowed for the specified measure.
• Violation of a multidimensional constraint on an attribute: the ⟨GC⟩ clause contains a descriptive attribute without the corresponding dimensional attribute⁴.
• Implicit aggregation operator: the ⟨MC⟩ clause has an implicit aggregation operator, and the measure is associated with only one aggregation operator or a default operator is defined.
• Implicit attribute: the ⟨SSC⟩ clause has an implicit attribute and the parsed value belongs to only one attribute domain.
• Unparsed clause: the parser was not able to fully understand the mapping, and the returned parse forest includes one or more dangling clauses.

Each bullet above determines a possible annotation of the nodes in the parse forest. In case of implicit information, the KB entities represented by the implicit nodes are automatically added to the parse tree; thus, no user action is required. All the other cases cannot be automatically solved, and the nodes are annotated with a specific error type that will trigger a user-system interaction during the Disambiguation phase. Table 2 reports annotation examples, and the sketch below spells out two of the corresponding checks.

⁴According to the DFM, a descriptive attribute is an attribute that further describes a dimensional level (i.e., it is related one-to-one with the level), but that can be used for aggregation only in combination with the corresponding level.
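In the sketch below, the KB lookups attributes_of and operators_of are hypothetical names for the metadata queries described in Section 4.1; the rule logic follows the case list above.

    # Sketch of two Checking & Enhancement rules over an annotated parse forest.
    def check_implicit_attribute(value, kb):
        """<SSC> ::= <Val>: infer the implicit attribute or annotate the node."""
        attrs = kb.attributes_of(value)      # attributes whose domain holds value
        if len(attrs) == 1:
            return ("resolved", attrs[0])    # Implicit attribute: auto-added
        return ("annotate", "Ambiguous Attribute", attrs)

    def check_aggregation(measure, operator, kb):
        """<MC>: validate the operator against multidimensional constraints."""
        allowed, default = kb.operators_of(measure)
        if operator is None:                 # implicit aggregation operator
            if default is not None or len(allowed) == 1:
                return ("resolved", default or allowed[0])
            return ("annotate", "Ambiguous Agg. Operator", allowed)
        if operator not in allowed:
            return ("annotate", "MD-Meas Violation", allowed)
        return ("resolved", operator)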
A textual query generates several parse forests, one for each mapping. In our approach, only the most promising one is proposed to the user in the Disambiguation phase. This choice comes from two main motivations:

• Proposing more than one alternative query to the user can be confusing and makes it very difficult to contextualize the disambiguation questions.
• Proposing only the most promising choice may miss the optimal derivation, but it makes it easier to create a baseline query. This one can be improved by adding or removing clauses through further interactions enabled by the OLAP operator module.

Definition 4.5 (Parsing Forest Score). Given a mapping M and the corresponding parse forest PFM, we define its score as Score(PFM) = Score(M′), where M′ is the sub-sequence of M belonging to the parse tree PTM.

The parse forest holding the highest score is the one proposed to the user. This ranking criterion is based on an optimistic-pessimistic forecast of the outcome of the Disambiguation phase. On the one hand, we optimistically assume that the annotations belonging to PTM will be positively solved in the Disambiguation phase, and the corresponding clauses and entities will be kept. On the other hand, we pessimistically assume that the non-parsed clauses belonging to PFM will be dropped. The rationale of our choice is that an annotated clause included in the parse tree is more likely to be a proper interpretation of the text. As shown in Figure 7, a totally pessimistic criterion (i.e., excluding from the score all the annotated clauses and entities) would carry forward a too simple, but non-ambiguous, forest; conversely, a totally optimistic criterion (i.e., considering the score of all the entities in PFM) would make preferable a large but largely non-parsed forest. Please note that the bare score of the mapping (i.e., the one available before parsing) corresponds to a totally optimistic choice, since it sums up the scores of all the entities in the mapping.

[Figure 7: Portion of the parsing forest contributing to its score depending on the adopted scoring criterion, for the mapping M = ⟨avg, UnitSales, where, Product, =, New York, group by, Regin⟩: the pessimistic span covers only the unannotated ⟨MC⟩, the optimistic-pessimistic span adds the annotated ⟨SC⟩ in the parse tree, and the optimistic span also covers the unparsed ⟨SC⟩.]

The ranking criterion defined above enables pruning the mappings to be parsed, as shown in Algorithm 1, which reports the pseudo-code of the Full query module. Recalling that mappings are parsed in descending score order, let us assume that, at some step, the best parse forest is PFM′ with score Score(PFM′). If the next mapping to be parsed, M′′, has score Score(M′′) < Score(PFM′), we can stop the algorithm and return PFM′, since the optimistic score of M′′ is an upper bound to the score of the corresponding parse forest.
Algorithm 1 Parsing forest selection

Require: ℳ: set of valid mappings
Ensure: PF*: best parse forest
 1: ℳ ← sort(ℳ)                                  ▷ sort mappings by descending score
 2: PF* ← ∅
 3: while ℳ ≠ ∅ do                               ▷ while the mapping space is not exhausted
 4:     M ← head(ℳ)                              ▷ get the mapping with the highest score
 5:     ℳ ← ℳ \ {M}                              ▷ remove it from ℳ
 6:     PFM ← parse(M)                           ▷ parse the mapping
 7:     if score(PFM) > score(PF*) then          ▷ if the current score is higher than the stored one
 8:         PF* ← PFM                            ▷ store the new parse forest
 9:         ℳ ← ℳ \ {M′ ∈ ℳ : score(M′) ≤ score(PF*)}   ▷ remove mappings with lower scores from ℳ
10: return PF*

Algorithm 1 works as follows. At first, the mappings are sorted by their score (Line 1), the best parse forest is initialized, and the iteration begins. While the set of candidate mappings is not exhausted (Line 3), the best mapping is picked, removed from the set of candidates, and its parse forest is generated (Lines 4–6). If the score of the current forest is higher than the score of the stored one (Line 7), then the current forest is stored (Line 8) and all the mappings with a lower score are removed from the search space (Line 9), as the pruned mappings cannot produce parse forests with a score greater than what has already been parsed. A runnable sketch of this loop follows.
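In the sketch, parse and the two score functions are assumed to be provided by the previous steps (Equation (1) for mappings, Definition 4.5 for forests); the function name is ours.

    # Minimal sketch of Algorithm 1: mappings are visited by descending score,
    # and Score(M) upper-bounding Score(PF_M) justifies the pruning.
    def best_parse_forest(mappings, parse, score_mapping, score_forest):
        todo = sorted(mappings, key=score_mapping, reverse=True)   # Line 1
        best, best_score = None, float("-inf")                     # Line 2
        while todo:                                                # Line 3
            m = todo.pop(0)                                        # Lines 4-5
            forest = parse(m)                                      # Line 6
            if score_forest(forest) > best_score:                  # Line 7
                best, best_score = forest, score_forest(forest)    # Line 8
                # Line 9: drop every mapping whose optimistic score cannot win
                todo = [m2 for m2 in todo if score_mapping(m2) > best_score]
        return best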
                                                                      mark includes 110 queries. We consider token sub-sequences of
                                                                      maximum length n = 4 (i.e., the [1..4]-grams) as no entity in the
4.5     Parse Tree Disambiguation
Each annotation in the parse forest requires a user choice to restore query correctness. User-system interaction takes place through a set of questions, one for each annotation. Obviously, each answer must be parsed following the same step sequence discussed so far. Table 3 reports the question templates for each annotation type. Each question allows the user either to provide the missing information or to drop the clause. Templates are standardized and user choices are limited to keep the interaction easy; this also allows unskilled users to obtain a baseline query. Additional clauses can be added through the OLAP operator module that implements OLAP navigation.
    At the end of this step, the parse forest is reduced to a parse tree, since ambiguous clauses are either dropped or added to the parse tree.
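A minimal sketch of this interaction loop, assuming a simplified Annotation record and a template dictionary in the spirit of Table 3 (the wording, type names, and the ask callback are ours, not the framework's API):

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Annotation:
    kind: str                 # e.g., "AmbiguousAttribute" (cf. Table 3)
    entity: str               # the token/entity the question refers to
    options: List[str]        # admissible resolutions offered to the user

# One question template per annotation type; phrasing loosely follows Table 3.
TEMPLATES = {
    "AmbiguousAttribute": "{entity} is a member of these attributes: {options}.",
    "MDMeasViolation": "{entity} does not allow this operator; possible operators are: {options}.",
}

def disambiguate(annotations: List[Annotation],
                 ask: Callable[[str, List[str]], Optional[str]]
                 ) -> List[Tuple[Annotation, Optional[str]]]:
    """Ask one templated question per annotation; `ask` returns the chosen
    option, or None when the user prefers to drop the clause."""
    resolved = []
    for a in annotations:
        question = TEMPLATES[a.kind].format(entity=a.entity,
                                            options=", ".join(a.options))
        resolved.append((a, ask(question, a.options)))
    return resolved

# E.g., a policy that always picks the first proposed option:
disambiguate([Annotation("AmbiguousAttribute", "'Smith'",
                         ["customer_name", "employee_name"])],
             ask=lambda question, options: options[0])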
4.6     SQL Generation
Given a parse tree PT_M, the generation of its corresponding SQL requires filling in the SELECT, WHERE, GROUP BY and FROM statements. The SQL generation is applicable to both star and snowflake schemata and is done as follows (a minimal sketch follows the list):
      • SELECT: the measures and aggregation operators from ⟨MC⟩ are added to the query selection clause, together with the attributes in the group-by clause ⟨GC⟩;
      • WHERE: the predicate from the selection clause ⟨SC⟩ (i.e., the values and their respective attributes) is added to the query predicate;
      • GROUP BY: the attributes from the group-by clause ⟨GC⟩ are added to the query group-by set;
      • FROM: the fact table and the dimension tables involved in the previous clauses are added, joined along the foreign keys of the star/snowflake schema.
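As an illustration, the sketch below assembles such a query for a Foodmart-like schema; the clause contents, table and column names are illustrative, and all selection values are quoted uniformly for simplicity.

def gpsj_to_sql(mc, gc, sc, fact, joins):
    """Assemble a GPSJ query from its clauses.
    mc: [(agg, measure)], gc: [attribute], sc: [(attribute, value)],
    joins: [(table, fk, pk)] along the star/snowflake foreign keys."""
    select = ", ".join(gc + [f"{agg}({mea})" for agg, mea in mc])
    sql = f"SELECT {select}\nFROM {fact}"
    for table, fk, pk in joins:
        sql += f"\n  JOIN {table} ON {fk} = {pk}"
    if sc:
        sql += "\nWHERE " + " AND ".join(f"{attr} = '{val}'" for attr, val in sc)
    if gc:
        sql += "\nGROUP BY " + ", ".join(gc)
    return sql

# "sum of store sales by product family in 1997" could yield:
print(gpsj_to_sql(
    mc=[("SUM", "sales_fact.store_sales")],
    gc=["product_class.product_family"],
    sc=[("time_by_day.the_year", "1997")],
    fact="sales_fact",
    joins=[("product", "sales_fact.product_id", "product.product_id"),
           ("product_class", "product.product_class_id",
            "product_class.product_class_id"),
           ("time_by_day", "sales_fact.time_id", "time_by_day.time_id")]))

Note how the product_class hop illustrates the snowflake case, where a dimension attribute is reached through a chain of joins rather than a single one.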
Annotation type            Description                                                     Action
Ambiguous Attribute        ⟨Val⟩ is a member of these attributes [...]                     Pick attribute / drop clause
Ambiguous Agg. Operator    ⟨Mea⟩ allows these operators [...]                              Pick operator / drop clause
Attribute-Value Mismatch   ⟨Attr⟩ and ⟨Val⟩ domains mismatch; possible values are [...]    Pick value / drop clause
MD-Meas Violation          ⟨Mea⟩ does not allow ⟨Agg⟩; possible operators are [...]        Pick operator / drop clause
MD-GBY Violation           It is not allowed to group by ⟨Attr⟩ without ⟨Attr⟩             Add attribute / drop clause
Unparsed ⟨GC⟩ clause       There is a dangling grouping clause ⟨GC⟩                        Add clause to ⟨GPSJ⟩ / drop clause
Unparsed ⟨MC⟩ clause       There is a dangling measure clause ⟨MC⟩                         Add clause to ⟨GPSJ⟩ / drop clause
Unparsed ⟨SC⟩ clause       There is a dangling predicate clause ⟨SC⟩                       Add clause to ⟨GPSJ⟩ / drop clause
Table 3: Templates and actions required to solve misunderstandings in textual queries.



Our evaluation is based on a benchmark of analytics queries [7]. Since the queries from [7] refer to private datasets, we mapped the natural language queries to the Foodmart schema^5. The automatic KB feeding populated the KB with 1 fact, 39 attributes, 12337 values, and 12449 synonyms. Additionally, only 50 synonyms were manually added in the KB enrichment step (e.g., "for each", "for every", and "per" are synonyms of the group-by statement). While mapping the natural language queries to the Foodmart domain, we preserved the structure of the original queries (e.g., word order, typos, etc.). Overall, 75% of the queries in the dataset are valid GPSJ queries, confirming how general and standard GPSJ queries are. The filtered benchmark includes 110 queries. We consider token sub-sequences of maximum length n = 4 (i.e., the [1..4]-grams), since no entity in the KB is longer than 4 words. Each sub-sequence is associated with the top N similar entities whose similarity is at least α, such that at least a percentage β of the tokens in T is covered. The value of β is fixed to 70% based on an empirical evaluation of the benchmark queries. Table 4 summarizes the parameters considered in our approach.

^5 A public dataset about food sales between 1997 and 1998 (https://github.com/julianhyde/foodmart-data-mysql).

Symbol    Range              Meaning
N         {2, 4, 6}          Number of top similar entities
α         {0.4, 0.5, 0.6}    Minimum token/entity similarity
β         70%                Sentence coverage threshold
n         4                  Maximum sub-sequence length
Table 4: Parameter values for testing.
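To make the retrieval step concrete, here is a small Python sketch of the [1..4]-gram extraction and top-N entity lookup; the string-similarity function (difflib's ratio) is a stand-in for the one actually used by the framework, and the KB is reduced to a flat list of entity names.

from difflib import SequenceMatcher

def ngrams(tokens, n_max=4):
    """All token sub-sequences of length 1..n_max (the [1..4]-grams)."""
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def candidate_entities(tokens, kb, N=6, alpha=0.4):
    """For each n-gram, the top-N KB entities with similarity above alpha."""
    sim = lambda a, b: SequenceMatcher(None, a.lower(), b.lower()).ratio()
    candidates = {}
    for gram in ngrams(tokens):
        text = " ".join(gram)
        ranked = sorted(((sim(text, e), e) for e in kb), reverse=True)
        top = [(s, e) for s, e in ranked[:N] if s > alpha]
        if top:
            candidates[gram] = top
    return candidates

# E.g., on a Foodmart-like KB:
candidate_entities("unit sales by product family".split(),
                   kb=["unit_sales", "store_sales", "product_family"])

The coverage check on β (at least 70% of the tokens must be covered by some candidate) would be applied on top of this map before the mappings are generated.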




5.1     Effectiveness
The effectiveness of our framework quantifies how well the translation of natural language queries meets the user's desiderata. Effectiveness is primarily evaluated as the parse tree similarity TSim(PT, PT*) between the parse tree PT produced by our system and the correct one PT*, which we manually wrote for each query in the benchmark. Parse tree similarity is based on the tree edit distance [23], which takes into account both the number of correctly parsed entities (i.e., the parse tree leaves) and the tree structure that encodes the query structure (e.g., the selection clauses "(A and B) or C" and "A and (B or C)" refer to the same parsed entities but underlie different structures). In this subsection, we evaluate how our framework behaves with and without disambiguation.
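As a minimal illustration, the following sketch computes a normalized tree similarity on top of the Zhang-Shasha edit distance [23], using the zss package; the normalization by the larger tree size is our assumption, since the exact formula is not given in this excerpt.

from zss import Node, simple_distance   # Zhang-Shasha tree edit distance [23]

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree.children)

def tsim(pt, pt_star):
    """Normalized similarity in [0, 1]; 1 means identical trees."""
    distance = simple_distance(pt, pt_star)       # unit edit costs
    return 1.0 - distance / max(size(pt), size(pt_star))

# "(A and B) or C" vs. "A and (B or C)": same leaves, different structure.
a = (Node("or")
     .addkid(Node("and").addkid(Node("A")).addkid(Node("B")))
     .addkid(Node("C")))
b = (Node("and")
     .addkid(Node("A"))
     .addkid(Node("or").addkid(Node("B")).addkid(Node("C"))))
print(tsim(a, b))   # < 1.0, reflecting the structural difference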
    Let us first consider the system behavior without disambiguation. Figure 8 depicts the performance of our approach with respect to variations in the number of retrieved top similar entities (i.e., N ∈ {2, 4, 6}). Values are reported for the top-k trees (i.e., the k trees with the highest score). We recall that only one parse forest is involved in the disambiguation phase; nonetheless, for testing purposes, it is interesting to see whether the best parse tree belongs to the top-k ranked ones. Effectiveness changes only slightly when N varies, ranging in [0.88, 0.91]; it is more affected by the entity/token similarity threshold α, ranging in [0.83, 0.91] (Figure 9). In both cases, the best results are obtained when more similar entities are admitted and more candidate mappings are generated. Independently of the chosen thresholds, the system is very stable (i.e., the effectiveness variations are limited), and even when only one query is returned its effectiveness is at the state of the art [13, 18, 22]. This confirms that (1) the choice of proposing only one query to the user does not negatively impact performance (while it positively impacts interaction complexity and efficiency), and (2) our scoring function properly ranks the parse trees by similarity to the correct interpretation of the query, since the best-ranked tree is in most cases the most similar to the correct solution.

Figure 8: Effectiveness varying the number of retrieved top similar entities N and the number of top-k queries returned (with α = 0.4).

Figure 9: Effectiveness varying the similarity threshold α and the number of top-k queries returned (with N = 6).

    As the previous tests do not include disambiguation, only 58 queries out of 110 are unambiguous and produce parse trees that can be fed as-is to the generation and execution phases. This means that 52 queries, despite being very similar to the correct tree as shown by the results above, are not directly executable without disambiguation (we recall the ambiguity/error types from Table 2). In detail, of these 52 queries, 38 contain one ambiguity annotation, 12 contain two, and 2 contain three or more. Figure 10 depicts the performance when the best parse tree undergoes iterative disambiguation (i.e., an increasing number of correcting actions is applied). Starting from the best configuration of the previous tests (N = 6 and α = 0.4), applying an increasing number of correcting actions raises effectiveness from 0.89 up to 0.94. The unsolved differences between PT and PT* are mainly due to entities missed in the mappings.

Figure 10: Effectiveness as a function of the number of disambiguation steps (i.e., of correcting actions applied; with k = 1, N = 6 and α = 0.4).

    Although our approach shows effectiveness comparable to state-of-the-art proposals [13, 18, 22], it was not possible to run a detailed comparison against them, because their implementations are not available and the provided descriptions are far from making them reproducible. Furthermore, despite the availability of some natural language datasets, the latter are hardly compatible with GPSJ queries and are not based on real analytic workloads.
5.2     Efficiency
We ran the tests on a machine equipped with an Intel Core i7-6700 CPU @ 3.40 GHz and 8 GB of RAM, with the framework implemented in Java.
    In Section 4.2 we have shown that the mapping search space increases exponentially with the text length. Figure 11 confirms this result, showing the number of generated mappings as a function of the number of entities |M| included in the optimal parse tree PT* (|M| is strictly related to |T|). Note that, with reference to the parameter values reported in Table 4, the configuration analyzed in Figure 11 is the worst case, since it determines the largest search space due to the high number of admitted similar entities. Noticeably, the pruning rules strongly limit the number of mappings that are actually parsed.

Figure 11: Number of "Generated" and "Parsed" mappings varying the number |M| of entities in the optimal tree (with N = 6 and α = 0.4).

    Figure 12 shows the average execution time when varying |M| and the number of allowed top similar entities N. The execution time increases with the number of entities included in the optimal parse tree; consequently, the number of top similar entities also impacts the overall execution time. We recall that effectiveness remains high also for N = 2, which corresponds to an execution time of about 1 second, rising to about 10 seconds in the worst case. We emphasize that this is the time needed for the translation, not the time to actually execute the queries: queries are executed against the enterprise data mart, and their performance clearly depends on the underlying multidimensional engine.

Figure 12: Execution time varying the number |M| of entities in the optimal tree and the number of top similar entities N (with α = 0.4).
6     CONCLUSIONS AND FUTURE WORKS
In this paper, we proposed a conversational OLAP framework and an original approach to translate a text describing an OLAP query into a well-formed GPSJ query. Tests on a real-case dataset have shown state-of-the-art performance. More in detail, we have shown that coupling a grammar-based text recognition approach with a disambiguation phase yields a 94% accuracy. We are now working in several directions to complete the framework: (a) supporting an OLAP navigation session rather than a single query, as this requires extending the grammar and keeping memory of previous queries in order to properly modify them; (b) designing a proper visual representation of the interpreted text to improve the system effectiveness, which can be obtained by defining a proper graphical metaphor of the query on top of a conceptual visual formalism such as the DFM; (c) testing the whole framework on real users to verify its effectiveness from points of view that go beyond accuracy, e.g. usability, immediacy, and memorability; (d) exploiting the log to improve the disambiguation effectiveness and to learn synonyms not available a priori.

REFERENCES
 [1] [n. d.]. Hype Cycle for Artificial Intelligence, 2018. http://www.gartner.com/en/documents/3883863/hype-cycle-for-artificial-intelligence-2018. Accessed: 2019-06-21.
 [2] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28, 5 (2019), 793–819.
 [3] Julien Aligon, Enrico Gallinucci, Matteo Golfarelli, Patrick Marcel, and Stefano Rizzi. 2015. A collaborative filtering approach for recommending OLAP sessions. Decision Support Syst. 69 (2015), 20–30.
 [4] John C. Beatty. 1982. On the relationship between LL(1) and LR(1) grammars. J. ACM 29, 4 (1982), 1007–1022.
 [5] Lukas Blunschi, Claudio Jossen, Donald Kossmann, Magdalini Mori, and Kurt Stockinger. 2012. SODA: Generating SQL for Business Users. PVLDB 5, 10 (2012), 932–943.
 [6] Kedar Dhamdhere, Kevin S. McCurley, Ralfi Nahmias, Mukund Sundararajan, and Qiqi Yan. 2017. Analyza: Exploring Data with Conversation. In IUI. ACM, 493–504.
 [7] Krista Drushku, Julien Aligon, Nicolas Labroche, Patrick Marcel, and Verónika Peralta. 2019. Interest-based recommendations for business intelligence users. Inf. Syst. 86 (2019), 79–93.
 [8] Lorena Etcheverry and Alejandro A. Vaisman. 2012. QB4OLAP: A Vocabulary for OLAP Cubes on the Semantic Web. In COLD (CEUR Workshop Proceedings), Vol. 905. CEUR-WS.org.
 [9] Matteo Francia, Matteo Golfarelli, and Stefano Rizzi. 2019. Augmented Business Intelligence. In DOLAP (CEUR Workshop Proceedings), Vol. 2324. CEUR-WS.org.
[10] Ramanathan V. Guha, Vineet Gupta, Vivek Raghunathan, and Ramakrishnan Srikant. 2015. User Modeling for a Personal Assistant. In WSDM. ACM, 275–284.
[11] Ashish Gupta, Venky Harinarayan, and Dallan Quass. 1995. Aggregate-Query Processing in Data Warehousing Environments. In VLDB. Morgan Kaufmann, 358–369.
[12] Nicolas Kuchmann-Beauger, Falk Brauer, and Marie-Aude Aufaure. 2013. QUASL: A framework for question answering and its application to business intelligence. In RCIS. IEEE, 1–12.
[13] Fei Li and H. V. Jagadish. 2016. Understanding Natural Language Queries over Relational Databases. SIGMOD Record 45, 1 (2016), 6–13.
[14] Gabriel Lyons, Vinh Tran, Carsten Binnig, Ugur Çetintemel, and Tim Kraska. 2016. Making the Case for Query-by-Voice with EchoQuery. In SIGMOD Conference. ACM, 2129–2132.
[15] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations). The Association for Computer Linguistics, 55–60.
[16] George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (1995), 39–41.
[17] Amrita Saha, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. Towards Building Large Scale Multimodal Domain-Aware Conversation Systems. In AAAI. AAAI Press, 696–704.
[18] Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R. Mittal, and Fatma Özcan. 2016. ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores. PVLDB 9, 12 (2016), 1209–1220.
[19] Jaydeep Sen, Fatma Ozcan, Abdul Quamar, Greg Stager, Ashish R. Mittal, Manasa Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan. 2019. Natural Language Querying of Complex Business Intelligence Queries. In SIGMOD Conference. ACM, 1997–2000.
[20] Radwa El Shawi, Mohamed Maher, and Sherif Sakr. 2019. Automated Machine Learning: State-of-The-Art and Open Challenges. CoRR abs/1906.02287 (2019).
[21] Immanuel Trummer, Yicheng Wang, and Saketh Mahankali. 2019. A Holistic Approach for Query Evaluation and Result Vocalization in Voice-Based OLAP. In SIGMOD Conference. ACM, 936–953.
[22] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: query synthesis from natural language. PACMPL 1, OOPSLA (2017), 63:1–63:26.
[23] Kaizhong Zhang and Dennis E. Shasha. 1989. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM J. Comput. 18, 6 (1989), 1245–1262.