<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Francia</string-name>
          <email>m.francia@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Gallinucci</string-name>
          <email>enrico.gallinucci@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Golfarelli</string-name>
          <email>matteo.golfarelli@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISI - University of Bologna</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The democratization of data access and the adoption of OLAP in scenarios requiring hands-free interfaces push towards the creation of smart OLAP interfaces. In this paper, we envisage a conversational framework specifically devised for OLAP applications. The system converts natural language text into GPSJ (Generalized Projection, Selection and Join) queries. The approach relies on an ad-hoc grammar and a knowledge base storing multidimensional metadata and cube values. In case of ambiguous or incomplete query descriptions, the system is able to obtain the correct query either through automatic inference or through interactions with the user to disambiguate the text. Our tests show very promising results in terms of both effectiveness and efficiency.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Nowadays, one of the most popular research trends in computer science is the democratization of data access, analysis and visualization, which means opening them to end users lacking the required vertical skills on the services themselves. Smart personal assistants [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (Alexa, Siri, etc.) and automated machine learning services [20] are examples of such research efforts that are now on corporate agendas [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In particular, interfacing natural language processing (either written or spoken) with database systems opens up new opportunities for data exploration and querying [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Actually, in the area of data warehousing, OLAP (On-Line Analytical Processing) is itself an "ante litteram" smart interface, since it supports users with a "point-and-click" metaphor that avoids writing well-formed SQL queries. Nonetheless, the possibility of having a conversation with a smart assistant to run an OLAP session (i.e., a set of related OLAP queries) opens up new scenarios and applications. It is not just a matter of further reducing the complexity of posing a query: a conversational OLAP system must also provide feedback to refine and correct wrong queries, and it must have memory to relate subsequent requests. A reference application scenario for this kind of framework is augmented business intelligence [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where hands-free interfaces are mandatory.
      </p>
      <p>
        In this paper, we envisage a conversational OLAP framework able to convert a natural language text into a GPSJ query. GPSJ [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is the main class of queries used in OLAP, since it enables Generalized Projection, Selection and Join operations over a set of tables. Although some natural language interfaces to databases have already been proposed, to the best of our knowledge this is the first proposal specific to OLAP systems.
      </p>
      <p>In our vision, the desiderata for an OLAP smart interface are:
#1 it must be automated and portable: it must exploit cube metadata (e.g., hierarchy structures, the role of measures, attributes, and aggregation operators) to increase its understanding capabilities and to simplify the user-machine interaction process;
#2 it must handle OLAP sequences rather than single queries: in an OLAP sequence the first query is fully described by the text, while the following ones are implicitly/partially described by an OLAP operator and require the system to keep memory of the previous ones;
#3 it must be robust with respect to user inaccuracies in syntax, OLAP terms, and attribute values, as well as in the presence of implicit information;
#4 it must be easy to configure on a DW without a heavy manual definition of the lexicon.</p>
      <p>More technically, our text-to-SQL approach is based on a grammar parsing natural language descriptions of GPSJ queries. The recognized entities include a set of typical query lexicons (e.g., group by, select, filter) and the domain-specific terms and values automatically extracted from the DW (see desiderata #1 and #4). Robustness (desideratum #3) is one of the main goals of our approach and is pursued in all the translation phases: lexicon identification is based on a string similarity function, multi-word lexicons are handled through n-grams, and alternative query interpretations are scored and ranked. The grammar proposed in Section 4.3 recognizes full queries only, thus desideratum #2 is not covered by the current work. To sum up, the main contributions of this paper are:
(1) a list of features and desiderata for an effective conversational OLAP system;
(2) an architectural view of the whole framework;
(3) an original approach to translate the natural language description of an OLAP query into a well-formed GPSJ query (Full query module in Figure 1);
(4) a set of tests to verify the effectiveness of our approach.</p>
      <p>The remainder of the paper is organized as follows: Section 2 sketches the functional architecture and the modules of a conversational OLAP framework; Section 3 discusses related works; Section 4 describes the query generation process in detail; Section 5 reports a large set of effectiveness and efficiency tests; finally, Section 6 draws the conclusions and discusses the system evolution.</p>
    </sec>
    <sec id="sec-2">
      <title>SYSTEM OVERVIEW</title>
      <p>
        Figure 1 sketches a functional view of the architecture. Given a DW, that is, a set of multidimensional cubes together with their metadata, the offline phase is aimed at extracting the DW-specific terms used by users to express the queries. Such information is stored in the system knowledge base (KB). Noticeably, this phase runs only once for each DW, unless the DW undergoes modifications. More in detail, the Automatic KB feeding process extracts categorical attribute values and metadata from the cubes (e.g., attribute and measure names, hierarchy structures, aggregation operators). With reference to the cube represented through a DFM schema in Figure 2, the KB will store the names of the measures (e.g., StoreSales, StoreCost) together with their default aggregation operators (e.g., sum, avg). Additional synonyms can be automatically extracted from open data ontologies (WordNet [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] in our implementation); for example, "client" is a synonym for Customer. The larger the number of synonyms, the wider the language understood by the system. Besides the domain-specific terminology, the KB includes the set of standard OLAP terms that are domain independent and do not require any feeding (e.g., group by, equal to, select). Further enrichment can optionally be carried out manually (i.e., by the KB enrichment module) when the application domain involves a non-standard vocabulary (i.e., when the physical names of tables and columns do not match the words of a standard vocabulary).
      </p>
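      <p>To make the Automatic KB feeding step concrete, the following sketch (our own simplification: SQLite stands in for the DW, and all table, column and function names are illustrative) harvests attribute names, measure names and categorical attribute values from the data dictionary; a real implementation would also read hierarchy metadata and aggregation operators.</p>

```python
import sqlite3

def feed_kb(conn):
    """Automatic KB feeding (sketch): text columns become attributes,
    together with their distinct categorical values; numeric columns
    are treated as measures."""
    kb = {"entities": {}, "values": {}}
    cur = conn.cursor()
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, default, pk)
        for _, col, ctype, *_ in list(cur.execute(f"PRAGMA table_info({t})")):
            if ctype.upper() == "TEXT":
                kb["entities"][col] = "attribute"
                kb["values"][col] = {v for (v,) in cur.execute(
                    f"SELECT DISTINCT {col} FROM {t}")}
            else:
                kb["entities"][col] = "measure"
    return kb

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE customer (Customer TEXT, Region TEXT);
  CREATE TABLE sales (UnitSales REAL, StoreCost REAL);
  INSERT INTO customer VALUES ('Poe', 'Italy'), ('Doe', 'France');
""")
kb = feed_kb(conn)
```

      <p>On this toy schema, Region and Customer are collected as attributes with their values, while UnitSales and StoreCost are collected as measures.</p>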
      <p>The online phase runs every time a query is issued to the system. The spoken query is initially translated to text by the Speech-to-text module. This task is out of the scope of our research, and we exploited the Google API in our implementation. The uninterpreted text is then analyzed by the Interpretation module, which actually consists of two sub-modules: Full query is in charge of interpreting texts describing full queries, which typically happens when an OLAP session starts. Conversely, OLAP operator modifies the latest query when the user states an OLAP operator along an OLAP session. On the one hand, understanding a single OLAP operator is simpler, since it involves fewer elements. On the other hand, it requires memory of the previous queries (stored in the Log) and an understanding of which part of the previous query must be modified.</p>
      <p>Due to natural language ambiguity, user errors and system inaccuracies, part of the text can be misunderstood. The role of the Disambiguation module is to solve ambiguities by asking appropriate questions to the user. The reasons behind the misunderstandings are manifold, including (but not limited to): ambiguities in the aggregation operator to be used; inconsistency between attribute and value in a selection predicate; identification of relevant elements in the text without full comprehension.</p>
      <p>The output of the previous steps is a data structure (i.e., a parse tree) that models the query and that can be automatically turned into a SQL query by exploiting the DW structure stored in the KB. Finally, the obtained query is run on the DW, and the results are reported to the user by the Execution &amp; Visualization module. Such a module could exploit a standard OLAP visualization tool, or it could implement voice-based approaches [21] to create an end-to-end conversational solution.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORKS</title>
      <p>Conversational business intelligence can be classified as a natural
language interface (NLI) to business intelligence systems to drive
analytic sessions. Despite the plethora of contributions in each
area, to the best of our knowledge, no approach lies at their
intersection.</p>
      <p>
        NLIs to operational databases enable users to specify
complex queries without previous training on formal programming
languages (such as SQL) and software; a recent and
comprehensive survey is provided in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Overall, NLIs are divided into
two categories: question answering (QA) and dialog (D). While
the former are designed to operate on single queries, only the
latter are capable of supporting sequences of related queries as
needed in OLAP analytic sessions. However, to the best of our
knowledge, no dialog-based system for OLAP sessions has been
provided so far. The only contribution in the dialog-based
direction is [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], where the authors provide an architecture for
querying relational databases; with respect to this contribution
we rely on the formal foundations of the multidimensional model
to drive analytic sessions (e.g., according to the multidimensional
model it is not possible to group by a measure, compute
aggregations of categorical attributes, aggregate by descriptive attributes,
ensure drill-across validity). Also, differently from [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the results we provide are supported by an extensive effectiveness and efficiency evaluation, which is completely lacking in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Finally, existing dialog systems, such as [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], address the
exploration of linked data. Hence they are not suitable for analytics on
the multidimensional model. As to question answering, existing
systems are well understood and differ in the knowledge
required to formulate the query and in the generative approach.
Domain agnostic approaches solely rely on the database schema.
NaLIR [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] translates natural language queries into dependency
trees [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and transforms promising trees by brute force
until a valid query can be generated. In our approach we rely on
n-grams instead of dependency trees [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] since the latter
cannot be directly mapped to entities in the knowledge base (i.e.,
they require tree manipulation) and are sensible to the query
syntax (e.g., "sum unit sales" and "sum the unit sales" produce two
diferent trees with the same meaning). SQLizer [ 22] generates
templates over the issued query and applies a "repair" loop until
it generates queries that can be obtained using at most a given
number of changes from the initial template. Domain specific
approaches add semantics to the translation process by means of
domain-specific ontologies and ontology-to-database mappings.
SODA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses a simple but limited keyword-based approach
that generates a reasonable and executable SQL query based on
the matches between the input query and the database metadata,
enriched with domain-specific ontologies. ATHENA [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and its
recent extension [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] map natural language into an ontology
representation and exploit mappings crafted by the relational
schema designer to resolve SQL queries. Analyza [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] integrates
the domain-specific ontology into a "semantic grammar" (i.e., a
grammar with placeholders for the typed concepts such as
measures, dimensions, etc.) to annotate and finally parse the user
query. Additionally, Analyza provides an intuitive interface
facilitating user-system interaction in spreadsheets. Unfortunately,
by relying on the definition of domain specific knowledge and
mappings, the adoption of these approaches is not plug-and-play
as an ad-hoc ontology is rarely available and is burdensome to
create.
      </p>
      <p>
        In the area of business intelligence, the road to
conversation-driven OLAP is not paved yet. The recommendation of OLAP
sessions to improve data exploration has been well-understood
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] also in unconventional application contexts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where
hands-free interfaces are mandatory. Recommendation systems focus
on integrating (previous) user experience with external
knowledge to suggest queries or sessions, rather than providing smart
interfaces to BI tools. To this end, personal assistants and
conversational interfaces can help users unfamiliar with such tools
and SQL language to perform data exploration. However,
endto-end frameworks are not provided in the domain of analytic
sessions over multidimensional data. QUASL [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduces a
QA approach over the multidimensional model that supports
analytical queries but lacks both the formalization of the
disambiguation process (i.e., how ambiguous results are addressed)
and the support of OLAP sessions (with respect to QA, handling
OLAP sessions requires managing previous knowledge from the
Log and understanding whether the issued sentence refines the
previous query or is a new one). Complementarily to our framework,
[21] recently formalized the vocalization of OLAP results.
      </p>
      <p>To summarize, the main differences between our approach and the previous works are the following.</p>
      <p>(Figure 1: functional architecture of the framework, with an offline phase (Automatic KB feeding, KB enrichment) and an online phase (Speech-to-Text, Interpretation with its Full query and OLAP operator sub-modules, Disambiguation, Execution &amp; Visualization).)</p>
      <sec id="sec-3-4">
        <title>Full query</title>
        <p>In the next subsections, we describe in detail the interpretation process for a single full query. Figure 3 shows the necessary steps of the online phase. Recall that, with reference to Figure 1, the Speech-to-Text, OLAP operator, and Execution &amp; Visualization modules are out of scope here. Conversely, the offline phase process and the related modules are implemented but not reported.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The Knowledge Base</title>
      <p>The system Knowledge Base (KB in Figure 1) stores all the information extracted from the DW that is necessary to support the translation phases. Information can be classified into DW Entities and Structure. More in detail, the DW structure enables consistency checks on parse trees and, given a parse tree, allows the SQL generation. It includes:
• Hierarchy structure: the roll-up relationships between attributes.
• Aggregation operators: for each measure, the applicable operators and the default one.
• DB tables: the structure of the database implementing the DW, including table and attribute names, and primary and foreign key relationships.</p>
      <sec id="sec-4-1">
        <title>DW entities</title>
        <p>The set of DW entities includes:
• DW element names: measures, dimensional attributes and fact names.
• DW element values: for each categorical attribute, all the values are stored.</p>
        <p>
          Several synonyms can be stored for each entity, enabling the system to cope with slang and different shades of the text. Both the DW entities and the DW structure are automatically collected by querying the DW and its data dictionary, and they are stored in a QB4OLAP [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]-compatible repository. Besides the DW entities, which are domain specific, the knowledge base includes the keywords and patterns that are typically used to express a query:
• Intention keywords: express the role of the subsequent part of the text. Examples of intention keywords are group by, select and filter.
• Operators: include logical (e.g., and, or), comparison (e.g., greater, equal) and aggregation (e.g., sum, average) operators.
• Patterns of dates and numbers: used to automatically recognize dates and numbers in the raw text.
        </p>
        <p>Overall, the set of entities known to the system is defined as E = {E1, E2, ..., Em}; note that each entity can be a multi-word, modeled as a sequence of tokens E = ⟨t1, ..., tr⟩.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Tokenization and Mapping</title>
      <p>A raw text T can be modeled as a sequence of tokens (i.e., single
words) T = ⟨t1, t2, ..., tz ⟩. The goal of this phase is to identify in
T the known entities that are the only elements involved in the
Parsing phase.</p>
      <p>Turning a text into a sequence of entities means finding a
mapping between tokens in T and E. More formally:</p>
      <p>Definition 4.1 (Mapping &amp; Mapping function). A mapping function M(T) is a partial function that associates sub-sequences1 of T to entities in E such that:
• sub-sequences of T have length n at most;
• the mapping function determines a partitioning of T.
1The term n-gram is used as a synonym of sub-sequence in the area of text mining.</p>
      <p>(Figure 3: the Full query interpretation pipeline: Tokenization &amp; Mapping produces the Mappings, Parsing produces a Parse forest, which then undergoes Checking &amp; Enhancement.)</p>
      <p>The output of a mapping function is a sequence M = ⟨E1, ..., El⟩ over E that we call a mapping. Given a mapping M, the similarities between each entity Ei and the corresponding tokens, Sim(T′, Ei), are also retained and denoted with SimM(Ei). A mapping is said to be valid if the fraction of mapped tokens of T is higher than a given threshold β. We call M the set of valid mappings.</p>
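      <p>The validity test can be sketched as follows (the function and argument names are ours), where a mapping is summarized by the lengths of its mapped sub-sequences:</p>

```python
def is_valid(mapped_lengths, z, beta):
    """A mapping is valid when the fraction of the z tokens of T that
    is covered by mapped sub-sequences exceeds the threshold beta."""
    return sum(mapped_lengths) / z > beta
```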
      <p>Several mapping functions (and thus several mappings) may exist between T and E, since Definition 4.1 admits sub-sequences of variable length and retains, for each sub-sequence, the top similar entities. This increases interpretation robustness, but it can lead to an increase in computation time. Generating several different mappings increases robustness since it allows the system to choose, in the next steps, the best text interpretation out of a higher number of candidates. Generated mappings differ both in the number of entities involved and in the specific entities mapped to a token. In the simple case where multi-token mappings are not possible (i.e., n = 1 in Definition 4.1), the number of generated mappings for a raw text T, such that |T| = z, is:
Σ_{i=⌈z·β⌉}^{z} (z choose i) · N^i
The formula counts the possible configurations of sufficient length (i.e., higher than or equal to ⌈z·β⌉) and, for each length, counts the number of mappings determined by the top similar entities. Since the number of candidate mappings is exponential, we consider only the most significant ones through α, β, and N: α imposes sub-sequences of tokens to be very similar to an entity; N further imposes to consider only the N entities with the highest similarity; finally, β imposes a sufficient portion of the text to be mapped.</p>
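      <p>Assuming n = 1, the count above can be evaluated directly; in this sketch the function name is ours.</p>

```python
from math import ceil, comb

def n_mappings(z, beta, N):
    """Number of candidate mappings for a text of z tokens when
    multi-token entities are disabled (n = 1): choose i >= ceil(z*beta)
    tokens to map, times N candidate entities per mapped token."""
    return sum(comb(z, i) * N ** i for i in range(ceil(z * beta), z + 1))
```

      <p>For instance, with z = 4, β = 0.5 and N = 2 the sum evaluates to 6·4 + 4·8 + 1·16 = 72 candidate mappings, which illustrates how quickly the search space grows.</p>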
      <p>The similarity function Sim() is based on the Levenshtein distance and takes token permutations into account, so that the similarity is robust to token reorderings (e.g., the sub-sequences ⟨P., Edgar⟩ and ⟨Edgar, Allan, Poe⟩ must result similar). Given two token sequences T and W with |T| = l and |W| = m such that l ≤ m, it is:
Sim(⟨t1, ..., tl⟩, ⟨w1, ..., wm⟩) = max_{D∈Disp(l,m)} [Σ_{i=1}^{l} sim(ti, w_D(i)) · max(|ti|, |w_D(i)|)] / [Σ_{i=1}^{l} max(|ti|, |w_D(i)|) + Σ_{i∈D̂} |wi|]
where D ∈ Disp(l, m) is an l-disposition of {1, ..., m} and D̂ is the subset of values in {1, ..., m} that are not present in D. Function Sim() weights token similarity based on token lengths (i.e., max(|ti|, |w_D(i)|)) and penalizes similarities between sequences of different lengths, which imply unmatched tokens (i.e., Σ_{i∈D̂} |wi|).</p>
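      <p>The formula can be transcribed directly into Python (a sketch: lev, sim and Sim are our names, and the maximum is taken by explicitly enumerating every l-disposition, which is only viable for short sequences):</p>

```python
from itertools import permutations

def lev(a, b):
    """Plain Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim(t, w):
    """Token-level similarity in [0, 1]."""
    return 1 - lev(t, w) / max(len(t), len(w))

def Sim(T, W):
    """Sequence similarity: best l-disposition D of W's positions,
    weighted by token lengths and penalized by unmatched tokens."""
    l, m = len(T), len(W)
    best = 0.0
    for D in permutations(range(m), l):
        num = sum(sim(T[i], W[D[i]]) * max(len(T[i]), len(W[D[i]]))
                  for i in range(l))
        unmatched = sum(len(W[j]) for j in range(m) if j not in D)
        den = sum(max(len(T[i]), len(W[D[i]])) for i in range(l)) + unmatched
        best = max(best, num / den)
    return best
```

      <p>On ⟨Poe, Edgar, Allan⟩ versus ⟨Edgar, Allan, Poe⟩ the best disposition matches every token exactly, so the similarity is 1; a single token ⟨Edgar⟩ against ⟨Edgar, Allan, Poe⟩ is penalized by the lengths of the unmatched tokens.</p>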
      <p>Example 4.2 (Token similarity). Figure 4 shows some of the possible token dispositions for two token sequences over the tokens Edgar, Allan, and Poe.</p>
      <p>We assume the best interpretation of the input text to be the one where (1) all the entities discovered in the text are included in the query (i.e., all the entities are parsed through the grammar) and (2) each entity discovered in the text is perfectly mapped to one sub-sequence of tokens (i.e., SimM(Ei) = 1). The two previous statements are modeled through the following score function. Given a mapping M = ⟨E1, ..., Em⟩, we define its score as
Score(M) = Σ_{i=1}^{m} SimM(Ei)
The score is higher when M includes several entities with high values of SimM. Although at this stage it is not possible to predict whether a mapping will be fully parsed, it is apparent that the higher the mapping score, the higher its probability to determine an optimal interpretation. As will be explained in Section 4.4, ordering the mappings by descending score also enables pruning strategies to be applied.</p>
      <p>Example 4.3 (Tokenization and mapping). Given the set of entities E and a tokenized text T = ⟨medium, sales, in, 2019, by, the, region⟩, examples of mappings M1 and M2 are:</p>
      <p>M1 = ⟨avg, UnitSales, where, 2019, group by, Region⟩
M2 = ⟨avg, UnitSales, where, 2019, group by, Regin⟩
where the token "the" is not mapped, and the token "region" is mapped to the attribute Region in M1 and to the value Regin in M2 (where Regin is a value of attribute Customer that holds a sufficient similarity). □
⟨GPSJ⟩ ::= ⟨MC⟩⟨GC⟩⟨SC⟩ | ⟨MC⟩⟨SC⟩⟨GC⟩ | ⟨SC⟩⟨GC⟩⟨MC⟩ | ⟨SC⟩⟨MC⟩⟨GC⟩ | ⟨GC⟩⟨SC⟩⟨MC⟩ | ⟨GC⟩⟨MC⟩⟨SC⟩ | ⟨MC⟩⟨SC⟩ | ⟨MC⟩⟨GC⟩ | ⟨SC⟩⟨MC⟩ | ⟨GC⟩⟨MC⟩ | ⟨MC⟩
⟨MC⟩ ::= (⟨Agg⟩⟨Mea⟩ | ⟨Mea⟩⟨Agg⟩ | ⟨Mea⟩ | ⟨Cnt⟩⟨Fct⟩ | ⟨Cnt⟩⟨Atr⟩)+
⟨GC⟩ ::= ⟨Gby⟩⟨Atr⟩+
⟨SC⟩ ::= ⟨Whr⟩⟨SCO⟩
⟨SCO⟩ ::= ⟨SCA⟩ "or" ⟨SCO⟩ | ⟨SCA⟩
⟨SCA⟩ ::= ⟨SCN⟩ "and" ⟨SCA⟩ | ⟨SCN⟩
⟨SCN⟩ ::= "not" ⟨SSC⟩ | ⟨SSC⟩</p>
      <p>⟨SSC⟩ ::= ⟨Atr⟩⟨Cop⟩⟨Val⟩ | ⟨Atr⟩⟨Val⟩ | ⟨Val⟩⟨Cop⟩⟨Atr⟩ | ⟨Val⟩⟨Atr⟩ | ⟨Val⟩
Parsing a valid mapping means searching the text for the complex syntactic structures (i.e., clauses) that build up the query. Given a mapping M, the output of a parser is a parse tree PTM, i.e., an ordered, rooted tree that represents the syntactic structure of a string according to a given grammar. For parsing purposes, entities (i.e., terminal elements in the grammar) are grouped into syntactic categories (see Table 1). The whole grammar is described in Figure 5. As a GPSJ query consists of three clauses (measure, group by and selection), in our grammar we identify four types of derivations2:
• Measure clause ⟨MC⟩: this derivation consists of a list of measure/aggregation-operator pairs. If the operator is omitted (i.e., ⟨MC⟩ ::= ⟨Mea⟩ applies), the default one specified in the KB will be applied to the measure during the Checking &amp; Enhancement step.
• Group by clause ⟨GC⟩: this derivation consists of a sequence of attribute names preceded by an entity of type ⟨Gby⟩.
• Selection clause ⟨SC⟩: this derivation consists of a Boolean expression of simple selection predicates ⟨SSC⟩ that follows standard SQL operator priority (see the ⟨SCO⟩, ⟨SCA⟩, and ⟨SCN⟩ derivations).
• GPSJ query ⟨GPSJ⟩: this derivation assembles the final query. Only the measure clause is mandatory, since a GPSJ query could aggregate a single measure with no selections. The order of the clauses is irrelevant; this implies the proliferation of derivations due to clause permutations.</p>
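      <p>For illustration, the selection-clause derivations can be recognized by a hand-written recursive-descent parser in the LL(1) style. This is a sketch under our own conventions, not the authors' implementation: tokens are (category, text) pairs, and the ⟨Atr⟩⟨Val⟩ and ⟨Val⟩⟨Atr⟩ forms are given an assumed implicit "=" comparison.</p>

```python
def parse_sel(tokens):
    """Recursive-descent sketch of <SCO>, <SCA>, <SCN>, <SSC>.
    tokens: list of (category, text) pairs, category in
    {'Atr', 'Cop', 'Val', 'and', 'or', 'not'}; assumes well-formed input."""
    def cat(i):
        return tokens[i][0] if i < len(tokens) else None

    def sco(i):                       # <SCO> ::= <SCA> "or" <SCO> | <SCA>
        node, i = sca(i)
        if cat(i) == "or":
            right, i = sco(i + 1)
            return ("or", node, right), i
        return node, i

    def sca(i):                       # <SCA> ::= <SCN> "and" <SCA> | <SCN>
        node, i = scn(i)
        if cat(i) == "and":
            right, i = sca(i + 1)
            return ("and", node, right), i
        return node, i

    def scn(i):                       # <SCN> ::= "not" <SSC> | <SSC>
        if cat(i) == "not":
            node, i = ssc(i + 1)
            return ("not", node), i
        return ssc(i)

    def ssc(i):
        # <SSC> ::= Atr Cop Val | Atr Val | Val Cop Atr | Val Atr | Val
        if cat(i) == "Atr":
            if cat(i + 1) == "Cop":
                return ("pred", tokens[i][1], tokens[i + 1][1],
                        tokens[i + 2][1]), i + 3
            return ("pred", tokens[i][1], "=", tokens[i + 1][1]), i + 2
        if cat(i) == "Val":
            if cat(i + 1) == "Cop" and cat(i + 2) == "Atr":
                return ("pred", tokens[i + 2][1], tokens[i + 1][1],
                        tokens[i][1]), i + 3
            if cat(i + 1) == "Atr":
                return ("pred", tokens[i + 1][1], "=", tokens[i][1]), i + 2
            # attribute left implicit: to be resolved later
            return ("pred", None, "=", tokens[i][1]), i + 1
        raise SyntaxError(f"unexpected token at position {i}")

    tree, _ = sco(0)
    return tree
```

      <p>A predicate with a None attribute corresponds to the implicit-attribute case resolved during Checking &amp; Enhancement.</p>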
      <p>
        The GPSJ grammar is LL(1)3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is not ambiguous (i.e., each mapping admits a single parse tree PTM), and can be parsed by an LL(1) parser with linear complexity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. If the input mapping M can be fully parsed, PTM will include all the entities as leaves. Conversely, if only a portion of the input belongs to the grammar, an LL(1) parser will produce a partial parsing, meaning that it will return a parse tree including the portion of the input mapping that belongs to the grammar. The remaining entities can be either singletons or complex clauses that could not be connected to the main parse tree. We call parse forest PFM the union of the parse tree with the residual clauses. Obviously, if all the entities are parsed, then PFM = PTM. Considering the whole forest rather
2A derivation in the form ⟨X⟩ ::= e represents a substitution of the non-terminal symbol ⟨X⟩ with the given expression e. Symbols that never appear on the left side of ::= are named terminals. Non-terminal symbols are enclosed between ⟨⟩.
3The rules presented in Figure 5 do not satisfy the LL(1) constraints for readability reasons. It is easy to turn such rules into an LL(1)-compliant version, but the resulting rules are much harder to read and understand.
      </p>
      <p>(Figure 6(b): M2 = ⟨avg, UnitSales, where, 2019, group by, Regin⟩, typed as ⟨Agg, Mea, Whr, Val, Gby, Val⟩; the trailing ⟨Gby⟩ ⟨Val⟩ pair cannot be parsed.)</p>
      <p>than the simple parse tree enables disambiguation and errors to be recovered during the Disambiguation phase.</p>
      <p>Example 4.4 (Parsing). Figure 6 reports the parsing outcome for the two mappings in Example 4.3. M1 is fully parsed, thus its parse forest corresponds to its parse tree (i.e., PTM1 = PFM1). Conversely, in M2 the last token is wrongly mapped to the attribute value Regin rather than to the attribute name Region. This prevents a full parsing, and the parse tree PTM2 does not include all the entities in M2.</p>
      <p>(Figure 6: parse trees of M1 and M2, built from the SC, SCO, SCA, SCN and SSC nodes.)</p>
      <p>Table 2. Annotation types with a sample generating derivation: Ambiguous Attribute (⟨SSC⟩ ::= ⟨Val⟩); Ambiguous Agg. Operator (⟨MC⟩ ::= ⟨Mea⟩); Attribute-Value Mismatch (⟨SSC⟩ ::= ⟨Atr⟩⟨Cop⟩⟨Val⟩); MD-Meas Violation (⟨MC⟩ ::= ⟨Agg⟩⟨Mea⟩); MD-GBY Violation (⟨GC⟩ ::= ⟨Gby⟩⟨Atr⟩+); Unparsed clause (no derivation).</p>
      <p>The Parsing phase verifies the adherence of the mapping to the GPSJ grammar, but this does not guarantee the executability of the corresponding SQL query, due to implicit elements, ambiguities and errors. The following list enumerates the cases that require further processing.</p>
      <p>• Ambiguous attribute: the ⟨SSC⟩ clause has an implicit
attribute and the parsed value belongs to multiple attribute
domains.
• Ambiguous aggregation operator: the ⟨MC⟩ clause has
an implicit aggregation operator but the measure is
associated with multiple aggregation operators.
• Attribute-value mismatch: the ⟨SSC⟩ clause includes a
value that is not valid for the specified attribute, i.e. it is
either outside or incompatible with the attribute domain.
• Violation of a multidimensional constraint on a
measure: the ⟨MC⟩ clause contains an aggregation operator
that is not allowed for the specified measure.
• Violation of a multidimensional constraint on an
attribute: the ⟨GC⟩ clause contains a descriptive attribute
without the corresponding dimensional attribute4.
• Implicit aggregation operator: the ⟨MC⟩ clause has an
implicit aggregation operator and the measure is
associated with only one aggregation operator or a default
operator is defined.
• Implicit attribute: the ⟨SSC⟩ clause has an implicit
attribute and the parsed value belongs to only one attribute
domain.
• Unparsed clause: the parser was not able to fully understand the mapping, and the returned parse forest includes one or more dangling clauses.</p>
      <p>Each of the cases above determines a possible annotation of the nodes in the parse forest. In the case of implicit information, the KB entities represented by the implicit nodes are automatically added to the parse tree; thus, no user action is required. All the other cases cannot be automatically solved, and the nodes are annotated with a specific error type that will trigger a user-system interaction during the Disambiguation phase. Table 2 reports annotation examples.</p>
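      <p>As an example of how implicit information is resolved automatically while the truly ambiguous cases are deferred to the user, the aggregation-operator cases could be handled as follows (a sketch; the KB dictionaries and the function name are our assumptions):</p>

```python
def resolve_aggregation(measure, kb_ops, kb_default):
    """Checking & Enhancement (sketch) for a measure clause with an
    implicit aggregation operator.
    kb_ops: measure -> list of applicable operators
    kb_default: measure -> default operator (may be absent)."""
    if measure in kb_default:          # a default operator is defined
        return ("resolved", kb_default[measure])
    ops = kb_ops[measure]
    if len(ops) == 1:                  # only one operator is applicable
        return ("resolved", ops[0])
    return ("annotate", "Ambiguous Agg. Operator")
```

      <p>A "resolved" outcome silently enriches the parse tree, while an "annotate" outcome triggers a disambiguation question.</p>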
      <p>A textual query generates several parse forests, one for each mapping. In our approach, only the most promising one is proposed to the user in the Disambiguation phase. This choice comes from two main motivations:
• Proposing more than one alternative query to the user can be confusing and makes it very difficult to contextualize the disambiguation questions.
4According to the DFM, a descriptive attribute is an attribute that further describes a dimensional level (i.e., it is related one-to-one with the level), but it can be used for aggregation only in combination with the corresponding level.</p>
      <p>(Figure 7: parse forest for M = ⟨avg, UnitSales, where, Product, =, New York, group by, Regin⟩, typed as ⟨Agg, Mea, Whr, Atr, Cop, Val, Gby, Val⟩, with an Attribute-Value Mismatch annotation and an unparsed clause; the pessimistic, optimistic-pessimistic, and optimistic scoring scopes are highlighted.)</p>
      <p>• Proposing only the most promising choice may miss the optimal derivation, but it makes it easier to create a baseline query. This query can then be improved by adding or removing clauses through further interactions enabled by the OLAP operator module.</p>
      <p>Definition 4.5 (Parsing Forest Score). Given a mapping M and
the corresponding parsing forest P FM , we define its score as
Score(P FM ) = Score(M ′) where M ′ is the sub-sequence of M
belonging to the parsing tree PTM .</p>
      <p>The parsing forest holding the highest score is the one
proposed to the user. This ranking criterion is based on an
optimistic-pessimistic forecast of the outcome of the Disambiguation phase.
On the one hand, we optimistically assume that the annotations
belonging to PTM will be positively solved in the Disambiguation
phase and the corresponding clauses and entities will be kept.
On the other hand, we pessimistically assume that the non-parsed
clauses belonging to PFM will be dropped. The rationale of our
choice is that an annotated clause included in the parsing tree is
more likely to be a proper interpretation of the text. As shown
in Figure 7, a totally pessimistic criterion (i.e., excluding from the
score all the annotated clauses and entities) would carry forward
an overly simple, but non-ambiguous, forest; conversely, a totally
optimistic criterion (i.e., considering the score of all the entities in
PFM) would favor a large but largely non-parsed forest.
Please note that the bare score of the mapping (i.e., the one
available before parsing) corresponds to a totally optimistic choice,
since it sums up the scores of all the entities in the mapping.</p>
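<p>A minimal sketch of the three criteria on a toy forest; the entity scores and flags are illustrative assumptions, not the actual implementation:</p>

```python
# The three ranking criteria of Figure 7, sketched on a toy parsing forest.
# 'in_tree' marks entities belonging to the parse tree PT_M; 'annotated'
# marks entities carrying an ambiguity annotation. Scores are made up.

def forest_score(entities, criterion):
    if criterion == "optimistic":        # keep every entity in PF_M
        keep = entities
    elif criterion == "pessimistic":     # drop every annotated entity
        keep = [e for e in entities if not e["annotated"]]
    else:                                # "optimistic-pessimistic" (our choice):
        keep = [e for e in entities if e["in_tree"]]  # keep PT_M, drop the rest
    return sum(e["score"] for e in keep)

forest = [
    {"score": 1.0,  "in_tree": True,  "annotated": False},  # cleanly parsed clause
    {"score": 0.75, "in_tree": True,  "annotated": True},   # ambiguous, but in PT_M
    {"score": 0.5,  "in_tree": False, "annotated": True},   # non-parsed clause
]
print(forest_score(forest, "optimistic"))              # 2.25 (bare mapping score)
print(forest_score(forest, "pessimistic"))             # 1.0
print(forest_score(forest, "optimistic-pessimistic"))  # 1.75
```

<p>With these toy values, the optimistic score coincides with the bare mapping score, while the optimistic-pessimistic score drops only the non-parsed clause.</p>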
      <p>The ranking criterion defined above enables the pruning of the
mappings to be parsed, as shown by Algorithm 1, which reports the
pseudo-code for the Full query module. Recalling that mappings
are parsed in descending score order, let us assume that, at some
step, the best parsing forest is PFM′ with score Score(PFM′). If
the next mapping to be parsed, M′′, has score Score(M′′) &lt; Score(PFM′),
we can stop the algorithm and return PFM′, since the optimistic
score of M′′ is an upper bound to the score of the corresponding
parsing forest.</p>
      <sec id="sec-5-1">
        <title>Algorithm 1 Parsing forest selection</title>
        <p>Require: M: set of valid mappings
Ensure: PF∗: best parsing forest
1: M ← sort(M) ▷ sort mappings by score
2: PF∗ ← ∅
3: while M ≠ ∅ do ▷ while mapping space is not exhausted
4:   M ← head(M) ▷ get the mapping with highest score
5:   M ← M \ {M} ▷ remove it from M
6:   PFM ← parse(M) ▷ parse the mapping
7:   if score(PFM) &gt; score(PF∗) then ▷ if current score is higher than previous score
8:     PF∗ ← PFM ▷ store the new parsing forest
9:     M ← M \ {M′ ∈ M : score(M′) ≤ score(PF∗)} ▷ remove mappings with lower scores from M
10: return PF∗</p>
        <p>Algorithm 1 works as follows. At first, mappings are sorted
by their score (Line 1), the best parsing forest is initialized, and
the iteration begins. While the set of existing mappings is not
exhausted (Line 3), the best mapping is picked, removed from the
set of candidates, and its parsing forest is generated (Lines 4–6).
If the score of the current forest is higher than the score of the
stored one (Line 7), then the current forest is stored (Line 8) and
all the mappings with a lower score are removed from the search
space (Line 9), as the pruned mappings cannot produce parsing
forests with a score greater than what has already been parsed.</p>
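<p>The selection-with-pruning loop can be sketched as follows; the mapping and forest structures are toy stand-ins for the paper's data structures:</p>

```python
# Sketch of Algorithm 1: pick the best parsing forest while pruning mappings
# whose optimistic score cannot beat the best forest found so far. A mapping
# is a (name, score) pair, and `parse` returns (forest, forest_score), where
# forest_score never exceeds the mapping score (non-parsed clauses are dropped).

def select_best_forest(mappings, parse):
    todo = sorted(mappings, key=lambda m: m[1], reverse=True)  # line 1
    best, best_score = None, float("-inf")                     # line 2
    while todo:                                                # line 3
        mapping, _ = todo.pop(0)                               # lines 4-5
        forest, f_score = parse(mapping)                       # line 6
        if f_score > best_score:                               # line 7
            best, best_score = forest, f_score                 # line 8
            todo = [m for m in todo if m[1] > best_score]      # line 9: prune
    return best

mapping_scores = {"M1": 3.0, "M2": 2.5, "M3": 1.0}   # optimistic (bare) scores
forest_scores = {"M1": 2.0, "M2": 2.25, "M3": 0.75}  # scores after parsing
best = select_best_forest(mapping_scores.items(),
                          lambda m: ("PF_" + m, forest_scores[m]))
print(best)  # PF_M2: M3 is pruned, since its optimistic score 1.0 cannot beat 2.25
```

<p>The pruning at line 9 is sound precisely because the bare mapping score is an upper bound to the score of the corresponding parsing forest.</p>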
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Parse Tree Disambiguation</title>
      <p>Each annotation in the parse forest requires a user choice to
restore query correctness. User-system interaction takes place
through a set of questions, one for each annotation. Obviously,
each answer must be parsed following the same step sequence
discussed so far. Table 3 reports the question templates for each
annotation type. Each question allows the user either to provide the
missing information or to drop the clause. Templates are standardized
and user choices are limited to keep the interaction easy. This
also allows unskilled users to obtain a baseline query. Additional
clauses can be added through the OLAP operator module that
implements OLAP navigation.</p>
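<p>A possible shape of this interaction loop; the annotation types and question templates below are invented for the sketch and stand in for those of Tables 2 and 3:</p>

```python
# Illustrative disambiguation loop: one templated question per annotation,
# where the user either resolves the ambiguity or drops the clause. The
# annotation types and templates are hypothetical, not the paper's.

TEMPLATES = {
    "ambiguous_entity": "Which of {options} did you mean by '{text}'?",
    "missing_info": "Which value should complete the clause '{text}'?",
}

def disambiguate(forest, ask):
    """forest: list of (clause, annotation) pairs (annotation may be None);
    ask: callback returning the user's answer, or 'drop' to discard."""
    tree = []
    for clause, ann in forest:
        if ann is None:                 # clean clause: goes straight to the tree
            tree.append(clause)
            continue
        answer = ask(TEMPLATES[ann["type"]].format(**ann))
        if answer != "drop":            # resolved: keep the corrected clause
            tree.append(dict(clause, resolved=answer))
    return tree                         # every ambiguity is solved or dropped

forest = [
    ({"text": "avg UnitSales"}, None),
    ({"text": "Regin"}, {"type": "ambiguous_entity", "text": "Regin",
                         "options": ["Region", "RecyclablePackage"]}),
]
tree = disambiguate(forest, lambda question: "Region")
print([c.get("resolved", c["text"]) for c in tree])  # ['avg UnitSales', 'Region']
```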
      <p>At the end of this step, the parse forest is reduced to a parse
tree, since ambiguous clauses are either dropped or added to the
parsing tree.</p>
    </sec>
    <sec id="sec-7">
      <title>SQL Generation</title>
      <p>Given a parse tree PTM, the generation of its corresponding
SQL requires filling in the SELECT, WHERE, GROUP BY and FROM
clauses. The SQL generation is applicable to both star and
snowflake schemata and is done as follows:
• SELECT: measures and aggregation operators from ⟨MC⟩
are added to the query selection clause, together with the
attributes in the group-by clause ⟨GC⟩;
• WHERE: the predicate from the selection clause ⟨SC⟩ (i.e.,
values and their respective attributes) is added to the query
predicate;
• GROUP BY: attributes from the group-by clause ⟨GC⟩ are
added to the query group-by set;
• FROM: measures and attributes/values identify, respectively,
the fact and the dimension tables involved in the query.
Given these tables, the join path is identified by
following the referential integrity constraints (i.e., by following
the foreign keys of the dimension tables imported in the fact
table).</p>
      <p>Example 4.6 (SQL generation). Given the GPSJ query "sum the
unit sales by type in the month of July", its corresponding SQL is:
SELECT Type, sum(UnitSales)
FROM Sales s JOIN Product p ON (s.pid = p.id)
     JOIN Date d ON (s.did = d.id)
WHERE Month = "July"
GROUP BY Type</p>
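<p>The four filling steps above can be sketched as follows; the clause shapes and the join list (derived in the paper from referential integrity constraints in the KB) are illustrative stand-ins:</p>

```python
# Sketch of the SQL generation step: fill the SELECT, FROM, WHERE and
# GROUP BY clauses from the parse-tree content.

def to_sql(mc, gc, sc, fact, joins):
    """mc: (aggregation, measure); gc: group-by attributes;
    sc: (attribute, operator, value) predicates; joins: KB-derived joins."""
    agg, measure = mc
    select = ", ".join(gc + [f"{agg}({measure})"])          # measures + group-by attrs
    sql = f"SELECT {select}\nFROM {fact} " + " ".join(joins)
    if sc:                                                   # selection predicates
        sql += "\nWHERE " + " AND ".join(f'{a} {op} "{v}"' for a, op, v in sc)
    if gc:                                                   # group-by set
        sql += "\nGROUP BY " + ", ".join(gc)
    return sql

print(to_sql(("sum", "UnitSales"), ["Type"], [("Month", "=", "July")],
             "Sales s",
             ["JOIN Product p ON (s.pid = p.id)",
              "JOIN Date d ON (s.did = d.id)"]))
```

<p>With these inputs, the call reproduces the SQL of Example 4.6 (with the joins on a single line).</p>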
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTAL TESTS</title>
        <p>
          In this section, we evaluate our framework in terms of
effectiveness and efficiency. Tests are carried out on a real-world
benchmark of analytics queries [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Since the queries from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] refer to
private datasets, we mapped the natural language queries to the
Foodmart schema5. The Automatic KB feeding populated the
KB with 1 fact, 39 attributes, 12337 values and 12449 synonyms.
Additionally, only 50 synonyms were manually added in the KB
enrichment step (e.g., "for each", "for every", "per" are synonyms
of the group by statement). While mapping the natural language
queries to the Foodmart domain, we preserved the structure of
the original queries (e.g., word order, typos, etc.). Overall, 75%
of the queries in the dataset are valid GPSJ queries, confirming
how general and standard GPSJ queries are. The filtered
benchmark includes 110 queries. We consider token sub-sequences of
maximum length n = 4 (i.e., the [1..4]-grams), as no entity in the
KB is longer than 4 words. Each sub-sequence is associated with
the top N similar entities with similarity higher than α,
such that at least a percentage β of the tokens in T is covered.
The value of β is fixed to 70% based on an empirical evaluation
of the benchmark queries. Table 4 summarizes the parameters
considered in our approach.
        </p>
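<p>The candidate-generation step governed by n, N and α can be sketched as below; the KB content and the similarity function (difflib's ratio) are toy stand-ins for the paper's KB and similarity measure:</p>

```python
# Sketch of candidate-entity generation: extract all [1..n]-grams from the
# tokenized text and keep, for each, the top-N KB entities whose similarity
# exceeds alpha. KB and similarity function are illustrative stand-ins.
from difflib import SequenceMatcher

def ngrams(tokens, n=4):
    for length in range(1, n + 1):
        for i in range(len(tokens) - length + 1):
            yield " ".join(tokens[i:i + length])

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidates(tokens, kb, top_n=6, alpha=0.4):
    out = {}
    for gram in set(ngrams(tokens)):
        scored = sorted(((similarity(gram, e), e) for e in kb), reverse=True)
        hits = [(s, e) for s, e in scored[:top_n] if s >= alpha]
        if hits:
            out[gram] = hits
    return out

kb = ["UnitSales", "Region", "Product", "Month", "New York"]
cands = candidates("avg unit sales where product in new york by regin".split(), kb)
print(cands["regin"][0][1])     # Region (despite the typo)
print(cands["new york"][0][1])  # New York
```

<p>Note how approximate string similarity lets the typo "regin" still map to the KB entity Region, consistently with the preserved typos of the benchmark.</p>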
    </sec>
    <sec id="sec-9">
      <title>Effectiveness</title>
      <p>The effectiveness of our framework quantifies how well the
translation of natural language queries meets the user desiderata.
Effectiveness is primarily evaluated as the parse tree similarity
TSim(PT, PT∗) between the parse tree PT produced by our
system and the correct one PT∗, which is manually written by us for
each query in the benchmark. Parse tree similarity is based on
the tree edit distance [23], which takes into account both the number of
correctly parsed entities (i.e., the parse tree leaves) and the tree
structure that codes the query structure (e.g., the selection clauses
"(A and B) or C" and "A and (B or C)" refer to the same parsed
entities but underlie different structures). In this subsection, we
evaluate how our framework behaves with(out) disambiguation.</p>
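<p>To illustrate how such a measure penalizes both wrong entities and wrong structure, here is a simplified alignment-based tree edit distance; it is not the Zhang-Shasha algorithm of [23], and the normalization is only one possible choice:</p>

```python
# Simplified tree edit distance (insert/delete whole subtrees, relabel nodes,
# align child forests in order). A tree is a pair (label, [children]).
# NOT the Zhang-Shasha algorithm used in the paper; for illustration only.

def size(t):
    return 1 + sum(size(c) for c in t[1])

def ted(a, b):
    relabel = 0 if a[0] == b[0] else 1         # cost of matching the roots
    return relabel + forest_dist(a[1], b[1])   # plus aligning the child forests

def forest_dist(f1, f2):
    n, m = len(f1), len(f2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                  # delete a whole subtree
        D[i][0] = D[i - 1][0] + size(f1[i - 1])
    for j in range(1, m + 1):                  # insert a whole subtree
        D[0][j] = D[0][j - 1] + size(f2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + size(f1[i - 1]),
                          D[i][j - 1] + size(f2[j - 1]),
                          D[i - 1][j - 1] + ted(f1[i - 1], f2[j - 1]))
    return D[n][m]

def tsim(pt, pt_star):                         # one possible normalization
    return 1 - ted(pt, pt_star) / (size(pt) + size(pt_star))

pt      = ("GPSJ", [("MC", [("avg", []), ("UnitSales", [])]), ("GC", [("Region", [])])])
pt_star = ("GPSJ", [("MC", [("avg", []), ("UnitSales", [])]), ("GC", [("Type", [])])])
print(ted(pt, pt_star))  # 1: a single relabeling (Region to Type)
```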
      <p>Let us first consider the system behavior without
disambiguation. Figure 8 depicts the performance of our approach with
respect to variations in the number of retrieved top similar
entities (i.e., N ∈ {2, 4, 6}). Values are reported for the top-k trees (i.e.,
the k trees with the highest score). We remind the reader that only one parse
forest is involved in the disambiguation phase; nonetheless, for
testing purposes, it is interesting to see if the best parse tree belongs
5A public dataset about food sales between 1997 and 1998 (https://github.com/
julianhyde/foodmart-data-mysql).</p>
      <p>
        to the top-k ranked ones. Effectiveness changes slightly when varying
N, ranging in [0.88, 0.91]. Effectiveness is more affected by
the entity/token similarity threshold α, and ranges in [0.83, 0.91].
In both cases, the best results are obtained when more similar
entities are admitted and more candidate mappings are
generated. Independently of the chosen thresholds, the system is
very stable (i.e., the effectiveness variations are limited) and, even
considering only one query to be returned, its effectiveness is at
the state of the art [
        <xref ref-type="bibr" rid="ref13 ref18">13, 18, 22</xref>
        ]. This confirms that (1) the choice
of proposing only one query to the user does not negatively
impact performance (while it positively impacts interaction
complexity and efficiency) and (2) our scoring function properly
ranks parse trees by similarity to the correct interpretation of the
query, since the best ranked one is in most cases the most similar
to the correct solution. As the previous tests do not include
disambiguation, only 58 queries out of 110 are not ambiguous and
produce parse trees that can be fed as-is to the generation and
execution phases. This means that 52 queries, despite being
very similar to the correct tree as shown by the aforementioned
results, are not directly executable without disambiguation (we
recall the ambiguity/error types from Table 2). Indeed, of these 52
queries, 38 contain one ambiguity annotation, 12 contain two
ambiguity annotations, and 2 contain three or more ambiguity
annotations. Figure 10 depicts the performance when the best
parse tree undergoes iterative disambiguation (i.e., an increasing
number of correcting actions is applied). Starting from the best
configuration of the previous tests (N = 6 and α = 0.4),
applying an increasing number of correcting actions increases the
effectiveness from 0.89 up to 0.94. The unsolved differences
between PT and PT∗ are mainly due to entities missing from the
mappings.
      </p>
      <p>
        Although our approach shows effectiveness comparable to
state-of-the-art proposals [
        <xref ref-type="bibr" rid="ref13 ref18">13, 18, 22</xref>
        ], it was not possible to run a
detailed comparison against them, because their implementations are
not available and the provided descriptions are far from making
them reproducible. Furthermore, despite the availability of some
natural language datasets, the latter are hardly compatible with
GPSJ queries and are not based on real analytic workloads.
      </p>
      <p>[Figure: Effectiveness varying the number |M| of entities in the optimal tree and the number of top similar entities N (with α = 0.4).]</p>
    </sec>
    <sec id="sec-10">
      <title>Efficiency</title>
      <p>We ran the tests on a machine equipped with an Intel(R) Core(TM)
i7-6700 CPU @ 3.40GHz and 8GB RAM, with the framework
implemented in Java.</p>
      <p>In Section 4.2 we have shown that the mapping search space
increases exponentially with the text length. Figure 11 confirms this
result, showing the number of generated mappings as a function
of the number of entities |M| included in the optimal parse tree
PT∗; |M| is strictly related to |T|. Note that, with reference to the
parameter values reported in Table 4, the configuration analyzed
in Figure 11 is the worst case, since it determines the largest
search space due to the high number of admitted similar entities.
Noticeably, the pruning rules strongly limit the number of mappings
to be actually parsed.</p>
      <p>Figure 12 shows the average execution time when varying |M|
and the number of allowed top similar entities N. The execution
time increases with the number of entities included in the optimal
parse tree, and the number of top similar entities also impacts
the overall execution time. We recall that effectiveness remains
high also for N = 2, which corresponds to an execution time of 1
second, rising to 101 seconds in the worst case. We emphasize
that the execution time corresponds to the time necessary for
the translation, and not to the time to actually execute the queries.
Queries are executed against the enterprise data mart, and their
performance clearly depends on the underlying multidimensional
engine.</p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSIONS AND FUTURE WORKS</title>
      <p>In this paper, we proposed a conversational OLAP framework and
an original approach to translate a text describing an OLAP query
into a well-formed GPSJ query. Tests on a real-case dataset have
shown state-of-the-art performance. In more detail, we have
shown that coupling a grammar-based text recognition approach
with a disambiguation phase yields a 94% accuracy. We are
now working in several different directions to complete the
framework by: (a) supporting an OLAP navigation session rather than
a single query, as this requires extending the grammar and keeping
memory of previous queries in order to properly modify them;
(b) designing a proper visual representation of the interpreted
text to improve the system effectiveness, which can be obtained
by defining a proper graphical metaphor of the query on top of a
conceptual visual formalism such as the DFM; (c) testing the whole
framework on real users to verify its effectiveness from points of
view that go beyond accuracy, e.g., usability, immediacy,
memorability; (d) exploiting the Log to improve the disambiguation
effectiveness and to learn synonyms not available a priori.</p>
      <p>[19] (cont.) 2019. Natural Language Querying of Complex Business Intelligence Queries. In SIGMOD Conference. ACM, 1997–2000.</p>
      <p>[20] Radwa El Shawi, Mohamed Maher, and Sherif Sakr. 2019. Automated Machine Learning: State-of-The-Art and Open Challenges. CoRR abs/1906.02287 (2019).</p>
      <p>[21] Immanuel Trummer, Yicheng Wang, and Saketh Mahankali. 2019. A Holistic Approach for Query Evaluation and Result Vocalization in Voice-Based OLAP. In SIGMOD Conference. ACM, 936–953.</p>
      <p>[22] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: Query Synthesis from Natural Language. PACMPL 1, OOPSLA (2017), 63:1–63:26.</p>
      <p>[23] Kaizhong Zhang and Dennis E. Shasha. 1989. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM J. Comput. 18, 6 (1989), 1245–1262.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] [n. d.].
          <source>Hype Cycle for Artificial Intelligence, 2018</source>
          . http://www.gartner.com/en/documents/3883863/hype-cycle-for-artificial-intelligence-2018. Accessed: 2019-06-21.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Affolter</surname>
          </string-name>
          , Kurt Stockinger, and
          <string-name>
            <given-names>Abraham</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A comparative survey of recent natural language interfaces for databases</article-title>
          .
          <source>The VLDB Journal 28</source>
          ,
          <issue>5</issue>
          (
          <year>2019</year>
          ),
          <fpage>793</fpage>
          -
          <lpage>819</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Julien</given-names>
            <surname>Aligon</surname>
          </string-name>
          , Enrico Gallinucci, Matteo Golfarelli, Patrick Marcel, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A collaborative filtering approach for recommending OLAP sessions</article-title>
          .
          <source>Decision Support Syst</source>
          .
          <volume>69</volume>
          (
          <year>2015</year>
          ),
          <fpage>20</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>John</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Beatty</surname>
          </string-name>
          .
          <year>1982</year>
          .
          <article-title>On the relationship between LL(1) and LR(1) grammars</article-title>
          .
          <source>J. ACM</source>
          <volume>29</volume>
          ,
          <issue>4</issue>
          (
          <year>1982</year>
          ),
          <fpage>1007</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Lukas</given-names>
            <surname>Blunschi</surname>
          </string-name>
          , Claudio Jossen, Donald Kossmann, Magdalini Mori, and
          <string-name>
            <given-names>Kurt</given-names>
            <surname>Stockinger</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>SODA: Generating SQL for Business Users</article-title>
          .
          <source>PVLDB 5</source>
          ,
          <issue>10</issue>
          (
          <year>2012</year>
          ),
          <fpage>932</fpage>
          -
          <lpage>943</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kedar</given-names>
            <surname>Dhamdhere</surname>
          </string-name>
          ,
          <string-name>
            <surname>Kevin S. McCurley</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ralfi Nahmias</surname>
            , Mukund Sundararajan, and
            <given-names>Qiqi</given-names>
          </string-name>
          <string-name>
            <surname>Yan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Analyza: Exploring Data with Conversation</article-title>
          .
          <source>In IUI. ACM</source>
          ,
          <volume>493</volume>
          -
          <fpage>504</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Krista</given-names>
            <surname>Drushku</surname>
          </string-name>
          , Julien Aligon, Nicolas Labroche, Patrick Marcel, and
          <string-name>
            <given-names>Verónika</given-names>
            <surname>Peralta</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Interest-based recommendations for business intelligence users</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>86</volume>
          (
          <year>2019</year>
          ),
          <fpage>79</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Lorena</given-names>
            <surname>Etcheverry</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alejandro A.</given-names>
            <surname>Vaisman</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>QB4OLAP: A Vocabulary for OLAP Cubes on the Semantic Web</article-title>
          .
          <source>In COLD (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>905</volume>
          . CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Francia</surname>
          </string-name>
          , Matteo Golfarelli, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Augmented Business Intelligence</article-title>
          .
          <source>In DOLAP (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2324</volume>
          . CEURWS.org.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ramanathan</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Guha</surname>
            , Vineet Gupta, Vivek Raghunathan, and
            <given-names>Ramakrishnan</given-names>
          </string-name>
          <string-name>
            <surname>Srikant</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>User Modeling for a Personal Assistant</article-title>
          .
          <source>In WSDM. ACM</source>
          ,
          <volume>275</volume>
          -
          <fpage>284</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Gupta</given-names>
          </string-name>
          , Venky Harinarayan, and
          <string-name>
            <given-names>Dallan</given-names>
            <surname>Quass</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Aggregate-Query Processing in Data Warehousing Environments</article-title>
          . In VLDB. Morgan Kaufmann,
          <fpage>358</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Kuchmann-Beauger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Falk</given-names>
            <surname>Brauer</surname>
          </string-name>
          , and
          <string-name>
            <surname>Marie-Aude Aufaure</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>QUASL: A framework for question answering and its Application to business intelligence</article-title>
          .
          <source>In RCIS. IEEE</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Fei</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Understanding Natural Language Queries over Relational Databases</article-title>
          .
          <source>SIGMOD Record 45</source>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>6</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Gabriel</surname>
            <given-names>Lyons</given-names>
          </string-name>
          , Vinh Tran, Carsten Binnig, Ugur Çetintemel, and
          <string-name>
            <given-names>Tim</given-names>
            <surname>Kraska</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Making the Case for Query-by-Voice with EchoQuery</article-title>
          .
          <source>In SIGMOD Conference. ACM</source>
          ,
          <volume>2129</volume>
          -
          <fpage>2132</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Christopher</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and
          <string-name>
            <surname>David McClosky</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The Stanford CoreNLP Natural Language Processing Toolkit</article-title>
          .
          <source>In ACL (System Demonstrations)</source>
          .
          <source>The Association for Computer Linguistics</source>
          ,
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>George</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Miller</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>WordNet: A Lexical Database for English</article-title>
          .
          <source>Commun. ACM</source>
          <volume>38</volume>
          ,
          <issue>11</issue>
          (
          <year>1995</year>
          ),
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Amrita</surname>
            <given-names>Saha</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitesh M. Khapra</surname>
            , and
            <given-names>Karthik</given-names>
          </string-name>
          <string-name>
            <surname>Sankaranarayanan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Towards Building Large Scale Multimodal Domain-Aware Conversation Systems</article-title>
          . In AAAI. AAAI Press,
          <fpage>696</fpage>
          -
          <lpage>704</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Diptikalyan</surname>
            <given-names>Saha</given-names>
          </string-name>
          , Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas,
          <string-name>
            <surname>Ashish R. Mittal</surname>
            , and
            <given-names>Fatma</given-names>
          </string-name>
          <string-name>
            <surname>Özcan</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>ATHENA: An OntologyDriven System for Natural Language Querying over Relational Data Stores</article-title>
          .
          <source>PVLDB 9</source>
          ,
          <issue>12</issue>
          (
          <year>2016</year>
          ),
          <fpage>1209</fpage>
          -
          <lpage>1220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Jaydeep</surname>
            <given-names>Sen</given-names>
          </string-name>
          , Fatma Ozcan, Abdul Quamar, Greg Stager,
          <string-name>
            <surname>Ashish R. Mittal</surname>
          </string-name>
          , Manasa Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>