Towards Conversational OLAP

Matteo Francia, Enrico Gallinucci, Matteo Golfarelli
DISI - University of Bologna
m.francia@unibo.it, enrico.gallinucci@unibo.it, matteo.golfarelli@unibo.it

ABSTRACT
The democratization of data access and the adoption of OLAP in scenarios requiring hands-free interfaces push towards the creation of smart OLAP interfaces. In this paper, we envisage a conversational framework specifically devised for OLAP applications. The system converts natural language text into GPSJ (Generalized Projection, Selection and Join) queries. The approach relies on an ad-hoc grammar and a knowledge base storing multidimensional metadata and cube values. In case of an ambiguous or incomplete query description, the system is able to obtain the correct query either through automatic inference or through interactions with the user to disambiguate the text. Our tests show very promising results, both in terms of effectiveness and efficiency.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Nowadays, one of the most popular research trends in computer science is the democratization of data access, analysis, and visualization, which means opening them to end users lacking the required vertical skills on the services themselves. Smart personal assistants [10] (Alexa, Siri, etc.) and automated machine learning services [20] are examples of such research efforts that are now on corporate agendas [1].

In particular, interfacing natural language processing (either written or spoken) to database systems opens new opportunities for data exploration and querying [13]. Actually, in the area of data warehousing, OLAP (On-Line Analytical Processing) itself is an "ante litteram" smart interface, since it supports users with a "point-and-click" metaphor that avoids writing well-formed SQL queries. Nonetheless, the possibility of having a conversation with a smart assistant to run an OLAP session (i.e., a set of related OLAP queries) opens new scenarios and applications. It is not just a matter of further reducing the complexity of posing a query: a conversational OLAP system must also provide feedback to refine and correct wrong queries, and it must have memory to relate subsequent requests. A reference application scenario for this kind of framework is augmented business intelligence [9], where hands-free interfaces are mandatory.

In this paper, we envisage a conversational OLAP framework able to convert a natural language text into a GPSJ query. GPSJ [11] is the main class of queries used in OLAP, since it enables Generalized Projection, Selection, and Join operations over a set of tables. Although some natural language interfaces to databases have already been proposed, to the best of our knowledge this is the first proposal specific to OLAP systems.

In our vision, the desiderata for an OLAP smart interface are:
#1 it must be automated and portable: it must exploit cube metadata (e.g., hierarchy structures, roles of measures, attributes, and aggregation operators) to increase its understanding capabilities and to simplify the user-machine interaction process;
#2 it must handle OLAP sequences rather than single queries: in an OLAP sequence the first query is fully described by the text, while the following ones are implicitly/partially described by an OLAP operator and require the system to have memory of the previous ones;
#3 it must be robust with respect to user inaccuracies in using syntax, OLAP terms, and attribute values, as well as in the presence of implicit information;
#4 it must be easy to configure on a DW without a heavy manual definition of the lexicon.

More technically, our text-to-SQL approach is based on a grammar parsing natural language descriptions of GPSJ queries. The recognized entities include a set of typical query lexicons (e.g., group by, select, filter) and the domain-specific terms and values automatically extracted from the DW (see desiderata #1 and #4). Robustness (desideratum #3) is one of the main goals of our approach and is pursued in all the translation phases: lexicon identification is based on a string similarity function, multi-word lexicons are handled through n-grams, and alternative query interpretations are scored and ranked. The grammar proposed in Section 4.3 recognizes full queries only, thus desideratum #2 is not covered by the current work. To sum up, the main contributions of this paper are:
(1) a list of features and desiderata for an effective conversational OLAP system;
(2) an architectural view of the whole framework;
(3) an original approach to translate the natural language description of an OLAP query into a well-formed GPSJ query (Full query module in Figure 1);
(4) a set of tests to verify the effectiveness of our approach.

The remainder of the paper is organized as follows: in Section 2 the functional architecture and the modules of a conversational OLAP framework are sketched; Section 3 discusses related works; Section 4 describes in detail the query generation process; in Section 5 a large set of effectiveness and efficiency tests is reported; finally, Section 6 draws the conclusions and discusses the system evolution.

2 SYSTEM OVERVIEW
Figure 1 sketches a functional view of the architecture. Given a DW, that is, a set of multidimensional cubes together with their metadata, the offline phase is aimed at extracting the DW-specific terms used by users to express queries. Such information is stored in the system knowledge base (KB). Noticeably, this phase runs only once for each DW, unless the DW undergoes modifications. More in detail, the Automatic KB feeding process extracts from the cubes the categorical attribute values and the metadata (e.g., attribute and measure names, hierarchy structures, aggregation operators). With reference to the cube represented through a DFM schema in Figure 2, the KB will store the names of the measures (e.g., StoreSales, StoreCost) together with their default aggregation operators (e.g., sum, avg). Additional synonyms can be automatically extracted from open data ontologies (WordNet [16] in our implementation); for example, "client" is a synonym for Customer. The larger the number of synonyms, the wider the language understood by the system. Besides the domain-specific terminology, the KB includes the set of standard OLAP terms that are domain independent and that do not require any feeding (e.g., group by, equal to, select).
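To make the Automatic KB feeding step concrete, the sketch below harvests entity names and categorical values from a JDBC-accessible DW through standard data-dictionary calls. It is a minimal sketch, not the paper's actual implementation: the connection URL, the VARCHAR-based test used to single out categorical attributes, and the console output in place of real KB persistence are our own illustrative assumptions, and the WordNet-based synonym expansion is omitted.

import java.sql.*;
import java.util.*;

public class KbFeeder {
    public static void main(String[] args) throws SQLException {
        // Illustrative connection string; any JDBC-accessible DW works the same way.
        try (Connection dw = DriverManager.getConnection("jdbc:mysql://localhost/foodmart")) {
            DatabaseMetaData meta = dw.getMetaData();
            Map<String, List<String>> categorical = new HashMap<>();
            // Harvest table/column names; keep only textual columns as a crude
            // stand-in for "categorical attribute" (assumption, not the paper's rule).
            try (ResultSet rs = meta.getColumns(null, null, "%", "%")) {
                while (rs.next())
                    if (rs.getInt("DATA_TYPE") == Types.VARCHAR)
                        categorical.computeIfAbsent(rs.getString("TABLE_NAME"),
                                t -> new ArrayList<>()).add(rs.getString("COLUMN_NAME"));
            }
            // Store every distinct value of each categorical attribute as a KB entity
            // (printed here; a real implementation would persist it in the KB).
            try (Statement st = dw.createStatement()) {
                for (Map.Entry<String, List<String>> e : categorical.entrySet())
                    for (String col : e.getValue())
                        try (ResultSet vals = st.executeQuery(
                                "SELECT DISTINCT " + col + " FROM " + e.getKey())) {
                            while (vals.next())
                                System.out.println(e.getKey() + "." + col + " = " + vals.getString(1));
                        }
            }
        }
    }
}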
Further enrichment can optionally be carried out manually (i.e., by the KB enrichment module) when the application domain involves a non-standard vocabulary (i.e., when the physical names of tables and columns do not match the words of a standard vocabulary).

The online phase runs every time a query is issued to the system. The spoken query is initially translated to text by the Speech-to-text module. This task is out of the scope of our research, and we exploited the Google API in our implementation. The uninterpreted text is then analyzed by the Interpretation module, which actually consists of two sub-modules: Full query is in charge of interpreting the texts describing full queries, which typically happens when an OLAP session starts. Conversely, OLAP operator modifies the latest query when the user states an OLAP operator along an OLAP session. On the one hand, understanding a single OLAP operator is simpler, since it involves fewer elements. On the other hand, it requires having memory of previous queries (stored in the Log) and understanding which part of the previous query must be modified.

Due to natural language ambiguity, user errors, and system inaccuracies, part of the text can be misunderstood. The role of the Disambiguation module is to solve ambiguities by asking appropriate questions to the user. The reasons behind the misunderstandings are manifold, including (but not limited to): ambiguities in the aggregation operator to be used; inconsistency between attribute and value in a selection predicate; identification of relevant elements in the text without full comprehension.

The output of the previous steps is a data structure (i.e., a parse tree) that models the query and that can be automatically turned into a SQL query exploiting the DW structure stored in the KB. Finally, the obtained query is run on the DW and the results are reported to the user by the Execution & Visualization module. Such a module could exploit a standard OLAP visualization tool, or it could implement voice-based approaches [21] to create an end-to-end conversational solution.

3 RELATED WORKS
Conversational business intelligence can be classified as a natural language interface (NLI) to business intelligence systems to drive analytic sessions. Despite the plethora of contributions in each area, to the best of our knowledge, no approach lies at their intersection.

NLIs to operational databases enable users to specify complex queries without previous training on formal programming languages (such as SQL) and software; a recent and comprehensive survey is provided in [2]. Overall, NLIs are divided into two categories: question answering (QA) and dialog (D). While the former are designed to operate on single queries, only the latter are capable of supporting sequences of related queries as needed in OLAP analytic sessions. However, to the best of our knowledge, no dialog-based system for OLAP sessions has been provided so far. The only contribution in the dialog-based direction is [14], where the authors provide an architecture for querying relational databases; with respect to this contribution, we rely on the formal foundations of the multidimensional model to drive analytic sessions (e.g., according to the multidimensional model it is not possible to group by a measure, compute aggregations of categorical attributes, aggregate by descriptive attributes, or ensure drill-across validity). Also differently from [14], the results we provide are supported by an extensive effectiveness and efficiency evaluation, which is completely lacking in [14]. Finally, existing dialog systems, such as [17], address the exploration of linked data; hence, they are not suitable for analytics on the multidimensional model.

As to question answering, existing systems are well understood and differ in the knowledge required to formulate the query and in the generative approach. Domain-agnostic approaches solely rely on the database schema. NaLIR [13] translates natural language queries into dependency trees [15] and brute-forcefully transforms promising trees until a valid query can be generated. In our approach we rely on n-grams instead of dependency trees [15], since the latter cannot be directly mapped to entities in the knowledge base (i.e., they require tree manipulation) and are sensitive to the query syntax (e.g., "sum unit sales" and "sum the unit sales" produce two different trees with the same meaning). SQLizer [22] generates templates over the issued query and applies a "repair" loop until it generates queries that can be obtained using at most a given number of changes from the initial template. Domain-specific approaches add semantics to the translation process by means of domain-specific ontologies and ontology-to-database mappings. SODA [5] uses a simple but limited keyword-based approach that generates a reasonable and executable SQL query based on the matches between the input query and the database metadata, enriched with domain-specific ontologies. ATHENA [18] and its recent extension [19] map natural language into an ontology representation and exploit mappings crafted by the relational schema designer to resolve SQL queries. Analyza [6] integrates the domain-specific ontology into a "semantic grammar" (i.e., a grammar with placeholders for typed concepts such as measures, dimensions, etc.) to annotate and finally parse the user query. Additionally, Analyza provides an intuitive interface facilitating user-system interaction in spreadsheets. Unfortunately, by relying on the definition of domain-specific knowledge and mappings, the adoption of these approaches is not plug-and-play, as an ad-hoc ontology is rarely available and is burdensome to create.

In the area of business intelligence, the road to conversation-driven OLAP is not paved yet. The recommendation of OLAP sessions to improve data exploration has been well understood [3], also in unconventional contexts [9] where hands-free interfaces are mandatory. Recommendation systems focus on integrating (previous) user experience with external knowledge to suggest queries or sessions, rather than providing smart interfaces to BI tools. To this end, personal assistants and conversational interfaces can help users unfamiliar with such tools and with the SQL language to perform data exploration. However, end-to-end frameworks are not provided in the domain of analytic sessions over multidimensional data. QUASL [12] introduces a QA approach over the multidimensional model that supports analytical queries, but it lacks both a formalization of the disambiguation process (i.e., how ambiguous results are addressed) and the support of OLAP sessions (with respect to QA, handling OLAP sessions requires managing previous knowledge from the Log and understanding whether the issued sentence refines the previous query or is a new one). Complementarily to our framework, [21] recently formalized the vocalization of OLAP results.

To summarize, the main differences of our approach with respect to previous works are the following.
(1) The envisioning of an end-to-end general-purpose dialog-driven framework supporting OLAP sessions.
(2) The definition of the framework's functional architecture and the formalization of its modules.
(3) The enforcement of multidimensional constraints by means of a formal grammar supporting GPSJ queries.
(4) A plug-and-play implementation of the core modules on top of existing data warehouses: our framework does not impact existing schemata, and the integration with external knowledge is supported but not mandatory.

Figure 1: A functional architecture for a conversational OLAP framework. [diagram: online phase: Speech-to-Text produces the raw text, analyzed by the Interpretation module (Full query / OLAP operator, backed by the Log) into a parse forest; Disambiguation yields the parse tree, SQL generation produces the SQL, and Execution & Visualization returns the results; offline phase: Automatic KB feeding (metadata & values from the DW) and KB enrichment (synonyms from an ontology) populate the KB]

Figure 2: Simplified DFM representation of the Foodmart Sales cube. [the Sales fact has measures StoreSales, StoreCost, and UnitSales; dimensions: Store (S.City, S.Region), Customer (Gender, Name, C.City, C.Region), Product (Type, Category, Family), Date (Month, Quarter, Year)]

4 QUERY GENERATION
In the next subsections, we describe in detail the interpretation process for a single full query. Figure 3 shows the necessary steps of the online phase. We remind that, with reference to Figure 1, the modules Speech-to-Text, OLAP operator, and Execution & Visualization are out of scope here. Conversely, the offline phase and the related modules are implemented but not reported.

Figure 3: Query generation steps for a text describing a complete query ("Sales by Customer and Month"). The KB is involved in all the steps. [pipeline: Tokenization & Mapping produces the mappings, Parsing the parse forest, Checking & Enhancement the annotated parse forest, Disambiguation the parse tree, and SQL generation the SQL]

4.1 The Knowledge Base
The system Knowledge Base (KB in Figure 1) stores all the information extracted from the DW that is necessary to support the translation phases. Information can be classified into DW structure and DW entities. More in detail, the DW structure enables consistency checks on parse trees and, given a parse tree, allows the SQL generation. It includes:
• Hierarchy structure: the roll-up relationships between attributes.
• Aggregation operators: for each measure, the applicable operators and the default one.
• DB tables: the structure of the database implementing the DW, including table and attribute names, primary keys, and foreign key relationships.
The set of DW entities includes:
• DW element names: measure, dimensional attribute, and fact names.
• DW element values: for each categorical attribute, all the values are stored.
Several synonyms can be stored for each entity, enabling the system to cope with slang and different shades of the text. Both DW entities and DW structure are automatically collected by querying the DW and its data dictionary, and are stored in a QB4OLAP [8] compatible repository. Besides DW entities, which are domain specific, the knowledge base includes those keywords and patterns that are typically used to express a query:
• Intention keywords: express the role of the subsequent part of the text. Examples of intention keywords are group by, select, and filter.
• Operators: include logical (e.g., and, or), comparison (e.g., greater, equal), and aggregation (e.g., sum, average) operators.
• Patterns of dates and numbers: used to automatically recognize dates and numbers in the raw text.
Overall, the entities defined in the system form the set $\mathcal{E} = \{E_1, E_2, ..., E_m\}$; note that each entity can be a multi-word, which can be modeled as a sequence of tokens $E = \langle t_1, ..., t_r \rangle$.

4.2 Tokenization and Mapping
A raw text T can be modeled as a sequence of tokens (i.e., single words) $T = \langle t_1, t_2, ..., t_z \rangle$. The goal of this phase is to identify in T the known entities, which are the only elements involved in the Parsing phase.

Turning a text into a sequence of entities means finding a mapping between the tokens in T and $\mathcal{E}$. More formally:

Definition 4.1 (Mapping & Mapping function). A mapping function M(T) is a partial function that associates sub-sequences from T (the term n-gram is used as a synonym of sub-sequence in the area of text mining) to entities in $\mathcal{E}$ such that:
• sub-sequences of T have length n at most;
• the mapping function determines a partitioning of T;
• a sub-sequence $T' = \langle t_i, ..., t_l \rangle$ of T (with $l \le n$) is associated to an entity E if and only if $E \in \mathrm{Top}N_{E \in \mathcal{E}}\, Sim(T', E)$ and $Sim(T', E) > \alpha$.

The output of a mapping function is a sequence $M = \langle E_1, ..., E_l \rangle$ over $\mathcal{E}$ that we call a mapping. Given the mapping M, the similarities between each entity $E_i$ and the corresponding tokens, $Sim(T', E_i)$, are also retained and denoted with $Sim_M(E_i)$. A mapping is said to be valid if the fraction of mapped tokens in T is higher than a given threshold β. We call $\mathcal{M}$ the set of valid mappings.

Several mapping functions (and thus several mappings) may exist between T and $\mathcal{E}$, since Definition 4.1 admits sub-sequences of variable length and retains, for each sub-sequence, the top similar entities. This increases interpretation robustness, but it can lead to an increase in computation time. Generating several different mappings increases robustness, since it allows choosing, in the next steps, the best text interpretation out of a higher number of candidates. Generated mappings differ both in the number of entities involved and in the specific entities mapped to a token. In the simple case where multi-token mappings are not possible (i.e., n = 1 in Definition 4.1), the number of generated mappings for a raw text T, such that |T| = z, is:

$$\sum_{i=\lceil z \cdot \beta \rceil}^{z} \binom{z}{i} \cdot N^i$$

The formula counts the possible configurations of sufficient length (i.e., higher than or equal to $\lceil z \cdot \beta \rceil$) and, for each length, counts the number of mappings determined by the top similar entities. Since the number of candidate mappings is exponential, we consider only the most significant ones through α, β, and N: α imposes sub-sequences of tokens to be very similar to an entity; N further imposes to consider only the N entities with the highest similarity; finally, β imposes that a sufficient portion of the text is mapped.
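The enumeration induced by Definition 4.1 can be sketched as a recursive partitioning of the token sequence, where each n-gram is either left unmapped or mapped to one of its top-N entities. The Java sketch below is our own rendition under stated assumptions: topN() is a stub for the KB lookup (at most N entities with similarity above α), and all class and method names are illustrative, not the paper's.

import java.util.*;

public class MappingEnumerator {
    static final int N = 6, MAX_LEN = 4;          // top-N entities, n-gram length bound
    static final double ALPHA = 0.4, BETA = 0.7;  // similarity and coverage thresholds

    record Match(String entity, double sim) {}

    // KB lookup stub: at most N entities with Sim > ALPHA for the given n-gram.
    static List<Match> topN(List<String> ngram) { return List.of(); }

    // Partition tokens[from..] into n-grams, branching on each n-gram's candidate
    // entities; a mapping is valid only if it covers at least BETA of the tokens.
    static void enumerate(List<String> tokens, int from, List<Match> acc,
                          int covered, List<List<Match>> out) {
        if (from == tokens.size()) {
            if (covered >= BETA * tokens.size()) out.add(List.copyOf(acc));
            return;
        }
        enumerate(tokens, from + 1, acc, covered, out);  // leave this token unmapped
        for (int len = 1; len <= MAX_LEN && from + len <= tokens.size(); len++)
            for (Match m : topN(tokens.subList(from, from + len))) {
                acc.add(m);                              // map the n-gram to entity m
                enumerate(tokens, from + len, acc, covered + len, out);
                acc.remove(acc.size() - 1);              // backtrack
            }
    }

    // Sum of the retained similarities, used to rank mappings (Equation (1) below).
    static double score(List<Match> mapping) {
        return mapping.stream().mapToDouble(Match::sim).sum();
    }
}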
The similarity function Sim() is based on the Levenshtein distance and takes token permutations into account, so that similarity is robust to token reorderings (e.g., the sub-sequences ⟨P., Edgar⟩ and ⟨Edgar, Allan, Poe⟩ must result similar). Given two token sequences T and W with |T| = l and |W| = m, such that l ≤ m, it is:

$$Sim(\langle t_1, ..., t_l \rangle, \langle w_1, ..., w_m \rangle) = \max_{D \in Disp(l,m)} \frac{\sum_{i=1}^{l} sim(t_i, w_{D(i)}) \cdot \max(|t_i|, |w_{D(i)}|)}{\sum_{i=1}^{l} \max(|t_i|, |w_{D(i)}|) + \sum_{i \in \hat{D}} |w_i|}$$

where $D \in Disp(l,m)$ is an l-disposition of $\{1, ..., m\}$ and $\hat{D}$ is the subset of values in $\{1, ..., m\}$ that are not present in D. Function Sim() weights token similarities based on the token lengths (i.e., $\max(|t_i|, |w_{D(i)}|)$) and penalizes similarities between sequences of different lengths, which imply unmatched tokens (i.e., $\sum_{i \in \hat{D}} |w_i|$).

Figure 4: Token dispositions: arcs denote the correspondence of tokens for a specific disposition (the bolder the line, the higher the similarity). The disposition determining the maximum similarity is shown in a dashed rectangle. [diagram: six dispositions between ⟨P., Edgar⟩ and ⟨Edgar, Allan, Poe⟩]

Example 4.2 (Token similarity). Figure 4 shows some of the possible token dispositions for the two token sequences T = ⟨P., Edgar⟩ and W = ⟨Edgar, Allan, Poe⟩. The disposition determining the highest similarity is surrounded by a dashed rectangle; the similarity is 0.46 and it is calculated as

$$Sim(T, W) = \frac{sim(P., Poe) \cdot |Poe| + sim(Edgar, Edgar) \cdot |Edgar|}{|Poe| + |Edgar| + |Allan|}$$ □

We assume the best interpretation of the input text to be the one where (1) all the entities discovered in the text are included in the query (i.e., all the entities are parsed through the grammar) and (2) each entity discovered in the text is perfectly mapped to one sub-sequence of tokens (i.e., $Sim_M(E_i) = 1$). The two previous statements are modeled through the following score function. Given a mapping $M = \langle E_1, ..., E_m \rangle$, we define its score as

$$Score(M) = \sum_{i=1}^{m} Sim_M(E_i) \qquad (1)$$

The score is higher when M includes several entities with high values of $Sim_M$. Although at this stage it is not possible to predict whether a mapping will be fully parsed, it is apparent that the higher the mapping score, the higher its probability to determine an optimal interpretation. As will be explained in Section 4.4, ordering the mappings by descending score also enables pruning strategies to be applied.

Example 4.3 (Tokenization and mapping). Given the set of entities $\mathcal{E}$ and a tokenized text T = ⟨medium, sales, in, 2019, by, the, region⟩, examples of mappings M1 and M2 are:

M1 = ⟨avg, UnitSales, where, 2019, group by, Region⟩
M2 = ⟨avg, UnitSales, where, 2019, group by, Regin⟩

where the token "the" is not mapped, and the token "region" is mapped to the attribute Region in M1 and to the value Regin in M2 (where Regin is a value of attribute Customer that holds a sufficient similarity). □
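To make the disposition-based similarity concrete, the following self-contained sketch reproduces Example 4.2. It is our own rendition: we assume that the token-level sim() is a Levenshtein similarity normalized by the longer string, which indeed yields Sim ≈ 0.46 on the example, although the paper does not commit to this exact normalization.

import java.util.*;

public class SequenceSimilarity {

    // Normalized Levenshtein similarity in [0,1] (assumed normalization).
    static double sim(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return 1.0 - (double) d[a.length()][b.length()] / Math.max(a.length(), b.length());
    }

    // Max over the l-dispositions D of {1..m}: weighted token similarity with a
    // penalty for the unmatched tokens of W (those outside D).
    static double Sim(List<String> t, List<String> w) {
        return best(t, w, 0, new boolean[w.size()], 0, 0);
    }

    private static double best(List<String> t, List<String> w, int i,
                               boolean[] used, double num, double den) {
        if (i == t.size()) {
            double unmatched = 0;                    // sum of |w_j| for j not in D
            for (int j = 0; j < w.size(); j++) if (!used[j]) unmatched += w.get(j).length();
            return num / (den + unmatched);
        }
        double max = 0;
        for (int j = 0; j < w.size(); j++) {
            if (used[j]) continue;                   // dispositions are injective
            used[j] = true;
            double weight = Math.max(t.get(i).length(), w.get(j).length());
            max = Math.max(max, best(t, w, i + 1, used,
                    num + sim(t.get(i), w.get(j)) * weight, den + weight));
            used[j] = false;
        }
        return max;
    }

    public static void main(String[] args) {         // Example 4.2: prints ~0.46
        System.out.println(Sim(List.of("P.", "Edgar"), List.of("Edgar", "Allan", "Poe")));
    }
}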
4.3 Query Parsing
Parsing a valid mapping means searching in the text for the complex syntactic structures (i.e., clauses) that build up the query. Given a mapping M, the output of a parser is a parse tree $PT_M$, i.e., an ordered, rooted tree that represents the syntactic structure of a string according to a given grammar. For the purpose of parsing, entities (i.e., terminal elements of the grammar) are grouped into syntactic categories (see Table 1). The whole grammar is described in Figure 5.

Table 1: Generic and domain-specific syntactic categories.
Category | Entity | Synonym samples
Int | select | return, show, get
Whr | where | in, such that
Gby | group by | by, for each, per
Cop | =, <>, >, <, ≥, ≤ | equal to, greater than
Agg | sum, avg | total, medium
Cnt | count, count distinct | number, amount
Fct | Facts | Domain specific
Mea | Measures | Domain specific
Att | Attributes | Domain specific
Val | Categorical values; dates and numbers | Domain specific

Figure 5: Backus-Naur representation of the grammar.
⟨GPSJ⟩ ::= ⟨MC⟩⟨GC⟩⟨SC⟩ | ⟨MC⟩⟨SC⟩⟨GC⟩ | ⟨SC⟩⟨GC⟩⟨MC⟩ | ⟨SC⟩⟨MC⟩⟨GC⟩ | ⟨GC⟩⟨SC⟩⟨MC⟩ | ⟨GC⟩⟨MC⟩⟨SC⟩ | ⟨MC⟩⟨SC⟩ | ⟨MC⟩⟨GC⟩ | ⟨SC⟩⟨MC⟩ | ⟨GC⟩⟨MC⟩ | ⟨MC⟩
⟨MC⟩ ::= (⟨Agg⟩⟨Mea⟩ | ⟨Mea⟩⟨Agg⟩ | ⟨Mea⟩ | ⟨Cnt⟩⟨Fct⟩ | ⟨Cnt⟩⟨Attr⟩)+
⟨GC⟩ ::= ⟨Gby⟩⟨Attr⟩+
⟨SC⟩ ::= ⟨Whr⟩⟨SCO⟩
⟨SCO⟩ ::= ⟨SCA⟩ "or" ⟨SCO⟩ | ⟨SCA⟩
⟨SCA⟩ ::= ⟨SCN⟩ "and" ⟨SCA⟩ | ⟨SCN⟩
⟨SCN⟩ ::= "not" ⟨SSC⟩ | ⟨SSC⟩
⟨SSC⟩ ::= ⟨Attr⟩⟨Cop⟩⟨Val⟩ | ⟨Attr⟩⟨Val⟩ | ⟨Val⟩⟨Cop⟩⟨Attr⟩ | ⟨Val⟩⟨Attr⟩ | ⟨Val⟩

As a GPSJ query consists of three clauses (measure, group by, and selection), in our grammar we identify four types of derivations (a derivation of the form ⟨X⟩ ::= e represents a substitution of the non-terminal symbol ⟨X⟩ with the expression e; symbols that never appear on the left side of ::= are named terminals; non-terminal symbols are enclosed between ⟨⟩):
• Measure clause ⟨MC⟩: this derivation consists of a list of measure/aggregation operator pairs. If the operator is omitted (i.e., ⟨MC⟩ ::= ⟨Mea⟩ applies), the default operator specified in the KB will be applied to the measure during the Checking & Enhancement step.
• Group-by clause ⟨GC⟩: this derivation consists of a sequence of attribute names preceded by an entity of type ⟨Gby⟩.
• Selection clause ⟨SC⟩: this derivation consists of a Boolean expression of simple selection predicates ⟨SSC⟩; it follows the standard SQL operator priority (see the ⟨SCO⟩, ⟨SCA⟩, and ⟨SCN⟩ derivations).
• GPSJ query ⟨GPSJ⟩: this derivation assembles the final query. Only the measure clause is mandatory, since a GPSJ query could aggregate a single measure with no selections. The order of the clauses is irrelevant; this implies the proliferation of derivations due to clause permutations.

The GPSJ grammar is LL(1) [4] (the rules presented in Figure 5 do not satisfy the LL(1) constraints for readability reasons; it is easy to turn them into an LL(1)-compliant version, but the resulting rules are much harder to read and understand), it is not ambiguous (i.e., each mapping admits a single parse tree $PT_M$), and it can be parsed by an LL(1) parser with linear complexity [4]. If the input mapping M can be fully parsed, $PT_M$ will include all the entities as leaves. Conversely, if only a portion of the input belongs to the grammar, an LL(1) parser will produce a partial parsing, meaning that it will return a parse tree including the portion of the input mapping that belongs to the grammar. The remaining entities can be either singletons or complex clauses that could not be connected to the main parse tree. We call parse forest $PF_M$ the union of the parse tree with the residual clauses. Obviously, if all the entities are parsed, then $PF_M = PT_M$. Considering the whole forest rather than the simple parse tree enables disambiguation and error recovery during the Disambiguation phase.

Example 4.4 (Parsing). Figure 6 reports the parsing outcome for the two mappings of Example 4.3. M1 is fully parsed, thus its parsing forest corresponds to the parse tree (i.e., $PT_{M1} = PF_{M1}$). Conversely, in M2 the last token is wrongly mapped to the attribute value Regin rather than to the attribute name Region. This prevents the full parsing, and the parse tree does not include all the entities in M2. □

Figure 6: Parsing forests from Example 4.3. [(a) M1 = ⟨avg, UnitSales, where, 2019, group by, Region⟩: the entire mapping is parsed (i.e., $PF_M = PT_M$); (b) M2 = ⟨avg, UnitSales, where, 2019, group by, Regin⟩: ⟨Gby⟩⟨Val⟩ cannot be parsed, so a residual ⟨SC⟩ clause remains outside the parse tree]
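For illustration, the following recursive-descent sketch covers two rules of Figure 5 (a simplified ⟨MC⟩ and ⟨GC⟩). It is our own sketch, not the system's parser: entity categories follow Table 1, the ⟨Cnt⟩ alternatives are omitted, and partial parsing is reduced to leaving unconsumed entities in the input.

import java.util.*;

public class MiniParser {
    record Entity(String cat, String value) {}   // e.g. ("Agg","avg"), ("Attr","Region")

    private final List<Entity> input;
    private int pos = 0;
    MiniParser(List<Entity> mapping) { this.input = mapping; }

    private boolean peek(String cat) {
        return pos < input.size() && input.get(pos).cat().equals(cat);
    }
    private Entity eat() { return input.get(pos++); }

    // <MC> ::= (<Agg><Mea> | <Mea><Agg> | <Mea>)+   (simplified)
    List<String> measureClause() {
        List<String> out = new ArrayList<>();
        while (peek("Agg") || peek("Mea")) {
            if (peek("Agg")) {
                String agg = eat().value();
                out.add(agg + "(" + expect("Mea") + ")");
            } else {
                String mea = eat().value();
                // trailing <Agg>, or implicit operator later resolved from the KB
                out.add(peek("Agg") ? eat().value() + "(" + mea + ")"
                                    : "default(" + mea + ")");
            }
        }
        return out;
    }

    // <GC> ::= <Gby><Attr>+
    List<String> groupByClause() {
        List<String> out = new ArrayList<>();
        if (!peek("Gby")) return out;   // clause absent: input left for other rules
        eat();
        while (peek("Attr")) out.add(eat().value());
        return out;
    }

    private String expect(String cat) {
        // A real partial parser would emit a residual (dangling) clause here
        // instead of failing, so that the forest PF_M retains the fragment.
        if (!peek(cat)) throw new IllegalStateException("dangling clause at " + pos);
        return eat().value();
    }
}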
4.4 Knowledge-based Parse Forest Checking and Enhancement
The Parsing phase verifies the adherence of the mapping to the GPSJ grammar, but this does not guarantee the executability of the corresponding SQL query, due to implicit elements, ambiguities, and errors. The following list enumerates the cases that require further processing.
• Ambiguous attribute: the ⟨SSC⟩ clause has an implicit attribute and the parsed value belongs to multiple attribute domains.
• Ambiguous aggregation operator: the ⟨MC⟩ clause has an implicit aggregation operator, but the measure is associated with multiple aggregation operators.
• Attribute-value mismatch: the ⟨SSC⟩ clause includes a value that is not valid for the specified attribute, i.e., it is either outside or incompatible with the attribute domain.
• Violation of a multidimensional constraint on a measure: the ⟨MC⟩ clause contains an aggregation operator that is not allowed for the specified measure.
• Violation of a multidimensional constraint on an attribute: the ⟨GC⟩ clause contains a descriptive attribute without the corresponding dimensional attribute (according to the DFM, a descriptive attribute further describes a dimensional level, i.e., it is related one-to-one with the level, but it can be used for aggregation only in combination with the corresponding level).
• Implicit aggregation operator: the ⟨MC⟩ clause has an implicit aggregation operator, and the measure is associated with only one aggregation operator or a default operator is defined.
• Implicit attribute: the ⟨SSC⟩ clause has an implicit attribute and the parsed value belongs to only one attribute domain.
• Unparsed clause: the parser was not able to fully understand the mapping, and the returned parse forest includes one or more dangling clauses.

Each bullet above determines a possible annotation of the nodes in the parse forest. In case of implicit information, the KB entities represented by the implicit nodes are automatically added to the parse tree; thus, no user action is required. All the other cases cannot be automatically solved, and the nodes are annotated with a specific error type that will trigger a user-system interaction during the Disambiguation phase. Table 2 reports annotation examples.

Table 2: Annotation types.
Annotation type | Gen. derivation sample | Example
Ambiguous Attribute | ⟨SSC⟩ ::= ⟨Val⟩ | "sum unit sales for Salem", but Salem is a member of both City and StoreCity
Ambiguous Agg. Operator | ⟨MC⟩ ::= ⟨Mea⟩ | "unit sales by product", but both sum and avg are valid aggregations
Attribute-Value Mismatch | ⟨SSC⟩ ::= ⟨Attr⟩⟨Cop⟩⟨Val⟩ | "sum unit sales for product New York", but New York is not a Product
MD-Meas Violation | ⟨MC⟩ ::= ⟨Agg⟩⟨Mea⟩ | "sum prices per store", but Price is not additive
MD-GBY Violation | ⟨GC⟩ ::= ⟨Gby⟩⟨Attr⟩+ | "average prices by name", but Product is not in ⟨GC⟩
Unparsed clause | – | "average unit sales by Regin", but Regin is not an attribute

A textual query generates several parsing forests, one for each mapping. In our approach, only the most promising one is proposed to the user in the Disambiguation phase. This choice comes from two main motivations:
• Proposing more than one alternative query to the user can be confusing and makes it very difficult to contextualize the disambiguation questions.
• Proposing only the most promising choice may miss the optimal derivation, but it makes it easier to create a baseline query. This query can then be improved by adding or removing clauses through further interactions enabled by the OLAP operator module.

Definition 4.5 (Parsing Forest Score). Given a mapping M and the corresponding parsing forest $PF_M$, we define its score as $Score(PF_M) = Score(M')$, where M′ is the sub-sequence of M belonging to the parse tree $PT_M$.

The parsing forest holding the highest score is the one proposed to the user. This ranking criterion is based on an optimistic-pessimistic forecast of the outcome of the Disambiguation phase. On the one hand, we optimistically assume that the annotations belonging to $PT_M$ will be positively solved in the Disambiguation phase, and that the corresponding clauses and entities will be kept. On the other hand, we pessimistically assume that the non-parsed clauses belonging to $PF_M$ will be dropped. The rationale of our choice is that an annotated clause included in the parse tree is more likely to be a proper interpretation of the text. As shown in Figure 7, a totally pessimistic criterion (i.e., excluding from the score all the annotated clauses and entities) would carry forward a too simple, but non-ambiguous, forest; conversely, a totally optimistic criterion (i.e., considering the score of all the entities in $PF_M$) would favor a large but largely non-parsed forest. Please note that the bare score of the mapping (i.e., the one available before parsing) corresponds to a totally optimistic choice, since it sums up the scores of all the entities in the mapping.

Figure 7: Portion of the parsing forest contributing to its score depending on the adopted scoring criterion. [diagram: for M = ⟨avg, UnitSales, where, Product, =, New York, group by, Regin⟩, with an attribute-value mismatch (AVM) annotation and an unparsed ⟨SC⟩ clause, the spans counted by the pessimistic, optimistic-pessimistic, and optimistic criteria are highlighted]

The ranking criterion defined above enables the pruning of the mappings to be parsed, as shown in Algorithm 1, which reports the pseudo-code of the Full query module. Recalling that mappings are parsed in descending score order, let us assume that, at some step, the best parse forest is $PF_{M'}$ with score $Score(PF_{M'})$. If the next mapping to be parsed, M′′, has score $Score(M'') < Score(PF_{M'})$, we can stop the algorithm and return $PF_{M'}$, since the optimistic score of M′′ is an upper bound to the score of the corresponding parse forest.

Algorithm 1: Parsing forest selection
Require: $\mathcal{M}$: set of valid mappings
Ensure: PF*: best parsing forest
1: $\mathcal{M}$ ← sort($\mathcal{M}$)  ▷ sort mappings by descending score
2: PF* ← ∅
3: while $\mathcal{M}$ ≠ ∅ do  ▷ while the mapping space is not exhausted
4:   M ← head($\mathcal{M}$)  ▷ get the mapping with the highest score
5:   $\mathcal{M}$ ← $\mathcal{M}$ \ {M}  ▷ remove it from $\mathcal{M}$
6:   $PF_M$ ← parse(M)  ▷ parse the mapping
7:   if score($PF_M$) > score(PF*) then  ▷ current score higher than the stored one
8:     PF* ← $PF_M$  ▷ store the new parsing forest
9:     $\mathcal{M}$ ← $\mathcal{M}$ \ {M′ ∈ $\mathcal{M}$ : score(M′) ≤ score(PF*)}  ▷ remove mappings with lower scores
10: return PF*

Algorithm 1 works as follows. At first, the mappings are sorted by their score (Line 1), the best parse forest is initialized, and the iteration begins. While the set of existing mappings is not exhausted (Line 3), the best mapping is picked, removed from the set of candidates, and its parsing forest is generated (Lines 4-6). If the score of the current forest is higher than the score of the stored one (Line 7), then the current forest is stored (Line 8) and all the mappings with a lower score are removed from the search space (Line 9), as the pruned mappings cannot produce parsing forests with a score greater than what has already been parsed.
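Since the mappings are processed by descending Score(M) and the mapping score bounds the forest score from above, the set difference of Line 9 is equivalent to stopping as soon as the head of the queue cannot beat the stored forest. A compact Java rendition of Algorithm 1 under this observation follows; the interfaces stub the structures of Sections 4.2-4.3, and all names are ours.

import java.util.*;

class ForestSelector {
    interface Mapping { double score(); ParseForest parse(); }
    interface ParseForest { double score(); }

    static ParseForest select(Collection<Mapping> mappings) {
        // Descending Score(M): the head is always an upper bound for the rest.
        PriorityQueue<Mapping> queue =
            new PriorityQueue<>(Comparator.comparingDouble(Mapping::score).reversed());
        queue.addAll(mappings);                             // Line 1
        ParseForest best = null;                            // Line 2
        while (!queue.isEmpty()) {                          // Line 3
            Mapping m = queue.poll();                       // Lines 4-5
            if (best != null && m.score() <= best.score())
                break;                                      // pruning (Line 9)
            ParseForest pf = m.parse();                     // Line 6
            if (best == null || pf.score() > best.score())  // Line 7
                best = pf;                                  // Line 8
        }
        return best;                                        // Line 10
    }
}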
4.5 Parse Tree Disambiguation
Each annotation in the parse forest requires a user choice to restore query correctness. User-system interaction takes place through a set of questions, one for each annotation. Obviously, each answer must be parsed following the same step sequence discussed so far. Table 3 reports the question templates for each annotation type. Each question allows the user to either provide the missing information or drop the clause. Templates are standardized and user choices are limited, to keep the interaction easy. This also allows unskilled users to obtain a baseline query. Additional clauses can be added through the OLAP operator module, which implements OLAP navigation.

At the end of this step, the parse forest is reduced to a parse tree, since ambiguous clauses are either dropped or added to the parse tree.

Table 3: Templates and actions required to solve misunderstandings in textual queries.
Annotation type | Description | Action
Ambiguous Attribute | ⟨Val⟩ is a member of these attributes [...] | Pick attribute / drop clause
Ambiguous Agg. Operator | ⟨Mea⟩ allows these operators [...] | Pick operator / drop clause
Attribute-Value Mismatch | ⟨Attr⟩ and ⟨Val⟩ domains mismatch, possible values are [...] | Pick value / drop clause
MD-Meas Violation | ⟨Mea⟩ does not allow ⟨Agg⟩, possible operators are [...] | Pick operator / drop clause
MD-GBY Violation | It is not allowed to group by ⟨Attr⟩ without ⟨Attr⟩ | Add attribute / drop clause
Unparsed ⟨GC⟩ clause | There is a dangling grouping clause ⟨GC⟩ | Add clause to ⟨GPSJ⟩ / drop clause
Unparsed ⟨MC⟩ clause | There is a dangling measure clause ⟨MC⟩ | Add clause to ⟨GPSJ⟩ / drop clause
Unparsed ⟨SC⟩ clause | There is a dangling predicate clause ⟨SC⟩ | Add clause to ⟨GPSJ⟩ / drop clause

4.6 SQL Generation
Given a parse tree $PT_M$, the generation of its corresponding SQL requires filling in the SELECT, FROM, WHERE, and GROUP BY statements. The SQL generation is applicable to both star and snowflake schemata and works as follows:
• SELECT: the measures and aggregation operators from ⟨MC⟩ are added to the SELECT clause, together with the attributes in the group-by clause ⟨GC⟩;
• WHERE: the predicate from the selection clause ⟨SC⟩ (i.e., the values and their respective attributes) is added to the query predicate;
• GROUP BY: the attributes from the group-by clause ⟨GC⟩ are added to the query group-by set;
• FROM: measures and attributes/values identify, respectively, the fact table and the dimension tables involved in the query. Given these tables, the join path is identified by following the referential integrity constraints (i.e., by following the foreign keys from the dimension tables imported in the fact table).

Example 4.6 (SQL generation). Given the GPSJ query "sum the unit sales by type in the month of July", its corresponding SQL is:

SELECT Type, sum(UnitSales)
FROM Sales s JOIN Product p ON (s.pid = p.id)
             JOIN Date d ON (s.did = d.id)
WHERE Month = "July"
GROUP BY Type □
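The assembly of the four statements can be sketched as simple string composition once the clauses and the join path are known. The sketch below is ours and deliberately leaves the join-path discovery (following foreign keys from the fact table, as described above) to the caller; the main method reproduces Example 4.6.

import java.util.*;

class SqlGenerator {
    static String toSql(List<String> selectItems,      // measures + group-by attributes
                        List<String> joinPath,         // fact table first, then dimensions
                        List<String> predicates,       // from the <SC> clause
                        List<String> groupBy) {        // from the <GC> clause
        StringBuilder sql = new StringBuilder("SELECT ")
            .append(String.join(", ", selectItems))
            .append(" FROM ").append(String.join(" ", joinPath)); // pre-built JOIN ... ON ...
        if (!predicates.isEmpty())
            sql.append(" WHERE ").append(String.join(" AND ", predicates));
        if (!groupBy.isEmpty())
            sql.append(" GROUP BY ").append(String.join(", ", groupBy));
        return sql.toString();
    }

    public static void main(String[] args) {           // reproduces Example 4.6
        System.out.println(toSql(
            List.of("Type", "sum(UnitSales)"),
            List.of("Sales s", "JOIN Product p ON (s.pid = p.id)",
                    "JOIN Date d ON (s.did = d.id)"),
            List.of("Month = 'July'"),
            List.of("Type")));
    }
}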
5 EXPERIMENTAL TESTS
In this section, we evaluate our framework in terms of effectiveness and efficiency. Tests are carried out on a real-world benchmark of analytic queries [7]. Since the queries from [7] refer to private datasets, we mapped the natural language queries to the Foodmart schema, a public dataset about food sales between 1997 and 1998 (https://github.com/julianhyde/foodmart-data-mysql). The Automatic KB feeding populated the KB with 1 fact, 39 attributes, 12337 values, and 12449 synonyms. Additionally, only 50 synonyms were manually added in the KB enrichment step (e.g., "for each", "for every", and "per" are synonyms of the group by statement). While mapping the natural language queries to the Foodmart domain, we preserved the structure of the original queries (e.g., word order, typos, etc.). Overall, 75% of the queries in the dataset are valid GPSJ queries, confirming how general and standard GPSJ queries are. The filtered benchmark includes 110 queries. We consider token sub-sequences of maximum length n = 4 (i.e., the [1..4]-grams), as no entity in the KB is longer than 4 words. Each sub-sequence is associated with the top-N similar entities with similarity higher than α, such that at least a percentage β of the tokens in T is covered. The value of β is fixed to 70% based on an empirical evaluation of the benchmark queries. Table 4 summarizes the parameters considered in our approach.

Table 4: Parameter values for testing.
Symbol | Range | Meaning
N | {2, 4, 6} | Num. of top similar entities
α | {0.4, 0.5, 0.6} | Token/entity minimum similarity
β | 70% | Sentence coverage threshold
n | 4 | Maximum sub-sequence length

5.1 Effectiveness
The effectiveness of our framework quantifies how well the translation of natural language queries meets the user desiderata. Effectiveness is primarily evaluated as the parse tree similarity $TSim(PT, PT^*)$ between the parse tree PT produced by our system and the correct one PT*, which was manually written by us for each query in the benchmark. Parse tree similarity is based on the tree edit distance [23], which takes into account both the number of correctly parsed entities (i.e., the parse tree leaves) and the tree structure that codes the query structure (e.g., the selection clauses "(A and B) or C" and "A and (B or C)" refer to the same parsed entities but underlie different structures). In this subsection, we evaluate how our framework behaves with and without disambiguation.

Let us first consider the system behavior without disambiguation. Figure 8 depicts the performance of our approach with respect to variations in the number of retrieved top similar entities (i.e., N ∈ {2, 4, 6}). Values are reported for the top-k trees (i.e., the k trees with the highest score). We remind that only one parse forest is involved in the disambiguation phase; nonetheless, for testing purposes, it is interesting to see whether the best parse tree belongs to the top-k ranked ones. Effectiveness changes only slightly when N varies, ranging in [0.88, 0.91]. Effectiveness is more affected by the entity/token similarity threshold α, ranging in [0.83, 0.91].
Figure 8: Effectiveness varying the number of retrieved top similar entities N and the number of top-k queries returned (with α = 0.4). [line chart: TSim ∈ [0, 1] vs. k = 1..5, one series per N ∈ {2, 4, 6}]

Figure 9: Effectiveness varying the similarity threshold α and the number of top-k queries returned (with N = 6). [line chart: TSim ∈ [0, 1] vs. k = 1..5, one series per α ∈ {0.4, 0.5, 0.6}]

In both cases, the best results are obtained when more similar entities are admitted and more candidate mappings are generated. Independently of the chosen thresholds, the system proves very stable (i.e., the effectiveness variations are limited), and even when only one query is returned its effectiveness is at the state of the art [13, 18, 22]. This confirms that (1) the choice of proposing only one query to the user does not negatively impact performance (while it positively impacts interaction complexity and efficiency), and (2) our scoring function properly ranks parse trees by similarity to the correct interpretation of the query, since the best-ranked tree is in most cases the most similar to the correct solution. As the previous tests do not include disambiguation, only 58 queries out of 110 are non-ambiguous and produce parse trees that can be fed as-is to the generation and execution phases. This means that 52 queries (despite being very similar to the correct tree, as shown by the aforementioned results) are not directly executable without disambiguation (we recall the ambiguity/error types from Table 2). Indeed, of these 52 queries, 38 contain one ambiguity annotation, 12 contain two, and 2 contain three or more. Figure 10 depicts the performance when the best parse tree undergoes iterative disambiguation (i.e., an increasing number of correcting actions is applied). Starting from the best configuration of the previous tests (N = 6 and α = 0.4), by applying an increasing number of correcting actions the effectiveness increases from 0.89 up to 0.94. The unsolved differences between PT and PT* are mainly due to entities missed in the mappings.

Figure 10: Effectiveness as a function of the number of disambiguation steps (i.e., applications of correcting actions; with k = 1, N = 6, and α = 0.4). [line chart: TSim ∈ [0, 1] vs. disambiguation steps 0..3]

Although our approach shows effectiveness comparable to state-of-the-art proposals [13, 18, 22], it was not possible to run a detailed comparison against them, because their implementations are not available and the provided descriptions are far from making them reproducible. Furthermore, despite the availability of some natural language datasets, the latter are hardly compatible with GPSJ queries and are not based on real analytic workloads.
5.2 Efficiency
We ran the tests on a machine equipped with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz and 8GB RAM, with the framework implemented in Java.

In Section 4.2 we have shown that the mapping search space increases exponentially with the text length. Figure 11 confirms this result, showing the number of generated mappings as a function of the number of entities |M| included in the optimal parse tree PT* (|M| is strictly related to |T|). Note that, with reference to the parameter values reported in Table 4, the configuration analyzed in Figure 11 is the worst case, since it determines the largest search space due to the high number of admitted similar entities. Noticeably, the pruning rules strongly limit the number of mappings that are actually parsed.

Figure 11: Number of "Generated" and "Parsed" mappings varying the number |M| of entities in the optimal tree (with N = 6 and α = 0.4). [log-scale chart: |M| = 1..7 on the x-axis, number of mappings from 10^1 to 10^5 on the y-axis]

Figure 12 shows the average execution time obtained by varying |M| and the number of allowed top similar entities N. The execution time increases with the number of entities included in the optimal parse tree, and the number of top similar entities also impacts the overall execution time. We recall that effectiveness remains high also for N = 2, which corresponds to an execution time of about 1 second, rising to roughly 10 seconds in the worst case. We emphasize that the execution time corresponds to the time necessary for the translation, not to the time needed to actually execute the queries. Queries are executed against the enterprise data mart, and their performance clearly depends on the underlying multidimensional engine.

Figure 12: Execution time varying the number |M| of entities in the optimal tree and the number of top similar entities N (with α = 0.4). [log-scale chart: |M| = 1..7 on the x-axis, time in seconds from 10^-1 to 10^1 on the y-axis, one series per N ∈ {2, 4, 6}]

6 CONCLUSIONS AND FUTURE WORKS
In this paper, we proposed a conversational OLAP framework and an original approach to translate a text describing an OLAP query into a well-formed GPSJ query. Tests on a real-case dataset have shown state-of-the-art performance. More in detail, we have shown that coupling a grammar-based text recognition approach with a disambiguation phase achieves a 94% accuracy. We are now working in several directions to complete the framework by: (a) supporting an OLAP navigation session rather than a single query, as this requires extending the grammar and keeping memory of previous queries in order to properly modify them; (b) designing a proper visual representation of the interpreted text to improve the system effectiveness, which can be obtained by defining a proper graphical metaphor of the query on top of a conceptual visual formalism such as the DFM; (c) testing the whole framework on real users to verify its effectiveness from points of view that go beyond accuracy, e.g., usability, immediacy, and memorability; (d) exploiting the Log to improve the disambiguation effectiveness and to learn synonyms not available a priori.

REFERENCES
[1] Hype Cycle for Artificial Intelligence, 2018. http://www.gartner.com/en/documents/3883863/hype-cycle-for-artificial-intelligence-2018. Accessed: 2019-06-21.
[2] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28, 5 (2019), 793–819.
[3] Julien Aligon, Enrico Gallinucci, Matteo Golfarelli, Patrick Marcel, and Stefano Rizzi. 2015. A collaborative filtering approach for recommending OLAP sessions. Decision Support Systems 69 (2015), 20–30.
[4] John C. Beatty. 1982. On the relationship between LL(1) and LR(1) grammars. J. ACM 29, 4 (1982), 1007–1022.
[5] Lukas Blunschi, Claudio Jossen, Donald Kossmann, Magdalini Mori, and Kurt Stockinger. 2012. SODA: Generating SQL for Business Users. PVLDB 5, 10 (2012), 932–943.
[6] Kedar Dhamdhere, Kevin S. McCurley, Ralfi Nahmias, Mukund Sundararajan, and Qiqi Yan. 2017. Analyza: Exploring Data with Conversation. In IUI. ACM, 493–504.
[7] Krista Drushku, Julien Aligon, Nicolas Labroche, Patrick Marcel, and Verónika Peralta. 2019. Interest-based recommendations for business intelligence users. Inf. Syst. 86 (2019), 79–93.
[8] Lorena Etcheverry and Alejandro A. Vaisman. 2012. QB4OLAP: A Vocabulary for OLAP Cubes on the Semantic Web. In COLD (CEUR Workshop Proceedings), Vol. 905. CEUR-WS.org.
[9] Matteo Francia, Matteo Golfarelli, and Stefano Rizzi. 2019. Augmented Business Intelligence. In DOLAP (CEUR Workshop Proceedings), Vol. 2324. CEUR-WS.org.
[10] Ramanathan V. Guha, Vineet Gupta, Vivek Raghunathan, and Ramakrishnan Srikant. 2015. User Modeling for a Personal Assistant. In WSDM. ACM, 275–284.
[11] Ashish Gupta, Venky Harinarayan, and Dallan Quass. 1995. Aggregate-Query Processing in Data Warehousing Environments. In VLDB. Morgan Kaufmann, 358–369.
[12] Nicolas Kuchmann-Beauger, Falk Brauer, and Marie-Aude Aufaure. 2013. QUASL: A framework for question answering and its application to business intelligence. In RCIS. IEEE, 1–12.
[13] Fei Li and H. V. Jagadish. 2016. Understanding Natural Language Queries over Relational Databases. SIGMOD Record 45, 1 (2016), 6–13.
[14] Gabriel Lyons, Vinh Tran, Carsten Binnig, Ugur Çetintemel, and Tim Kraska. 2016. Making the Case for Query-by-Voice with EchoQuery. In SIGMOD Conference. ACM, 2129–2132.
[15] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations). The Association for Computational Linguistics, 55–60.
[16] George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (1995), 39–41.
[17] Amrita Saha, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. Towards Building Large Scale Multimodal Domain-Aware Conversation Systems. In AAAI. AAAI Press, 696–704.
[18] Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R. Mittal, and Fatma Özcan. 2016. ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores. PVLDB 9, 12 (2016), 1209–1220.
[19] Jaydeep Sen, Fatma Özcan, Abdul Quamar, Greg Stager, Ashish R. Mittal, Manasa Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan. 2019. Natural Language Querying of Complex Business Intelligence Queries. In SIGMOD Conference. ACM, 1997–2000.
[20] Radwa El Shawi, Mohamed Maher, and Sherif Sakr. 2019. Automated Machine Learning: State-of-The-Art and Open Challenges. CoRR abs/1906.02287 (2019).
[21] Immanuel Trummer, Yicheng Wang, and Saketh Mahankali. 2019. A Holistic Approach for Query Evaluation and Result Vocalization in Voice-Based OLAP. In SIGMOD Conference. ACM, 936–953.
[22] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: query synthesis from natural language. PACMPL 1, OOPSLA (2017), 63:1–63:26.
[23] Kaizhong Zhang and Dennis E. Shasha. 1989. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM J. Comput. 18, 6 (1989), 1245–1262.