=Paper= {{Paper |id=Vol-2578/DARLIAP15 |storemode=property |title=Question Answering on OLAP-like Data Sources |pdfUrl=https://ceur-ws.org/Vol-2578/DARLIAP15.pdf |volume=Vol-2578 |authors=Nadine Steinmetz,Samar Shahabi Ghahfarokhi,Kai-Uwe Sattler |dblpUrl=https://dblp.org/rec/conf/edbt/SteinmetzGS20 }} ==Question Answering on OLAP-like Data Sources== https://ceur-ws.org/Vol-2578/DARLIAP15.pdf
Question Answering on OLAP-like Data Sources

Nadine Steinmetz, Samar Shahabi-Ghahfarokhi, Kai-Uwe Sattler
Technische Universität Ilmenau, Germany
nadine.steinmetz@tu-ilmenau.de, samar.shahabi-ghahfarokhi@tu-ilmenau.de, kus@tu-ilmenau.de

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Today, knowledge is mostly stored in structured data sources and has to be queried with formal query languages. Natural language interfaces to these data sources enable users to query the data without knowledge of the technical details of the query language, which is particularly useful in analytics and decision support. In this paper, we present an approach to query data stored in an OLAP database using only natural language. We describe the preliminaries and the transformation process from a question to the formal SQL query on a snowflake schema. The approach takes synonyms into account to identify the relevant fact table and uses mapping techniques to determine the correct attributes and dimension tables within the schema. As an example, we utilized the Foodmart data source to implement a first prototype and prepared over 40 different questions to assess our approach and provide a benchmark for further studies.

1   INTRODUCTION
The developments in the fields of speech recognition and natural language processing (NLP) over the last few years have fostered many novel applications. Apart from the now nearly ubiquitous virtual assistants like Amazon Alexa or Apple Siri in smartphones, smart speakers, and other consumer devices, research in the field of question answering (QA) has made significant progress since IBM's DeepQA system Watson won the quiz show Jeopardy! in 2011. QA systems – or, in their special form, Natural Language Interfaces to Databases (NLIDB) [2] – provide a user-friendly way to access any type of knowledge base simply by entering a question formulated in natural language (NL). To answer such a question, the type of the question, objects, and predicates as well as entities have to be identified by parsing and analyzing the sentence. Then the question has to be mapped to a query, e.g. in SQL or SPARQL, on a database or knowledge base providing the necessary facts. In particular, RDF-based knowledge bases such as DBpedia [6, 9, 10] or databases enriched by ontologies [8] have been used successfully.

OLAP databases are used to answer multidimensional analytical queries in the business intelligence context. Typically, OLAP is based on a multidimensional data model distinguishing between facts or measures and dimensions, which together form a so-called hypercube. In such a hypercube, dimensions span the data space and provide information about the facts represented by the cells of the cube. OLAP cubes are usually queried either via visual interfaces (e.g. Tableau), which generate and execute database queries, or directly with standard query languages such as SQL or MDX. While the latter provide the most powerful access, SQL with advanced grouping features and OLAP functions in particular is too difficult for non-database experts. Visual interfaces, on the other hand, are easy to use (e.g. for drill-down/roll-up operations) even for business users, but limited in their expressiveness. Therefore, combining question answering with OLAP seems to be a natural solution to provide access to analytical databases. QA for (relational) OLAP means

   • working with star or snowflake schemas consisting of fact and dimension tables,
   • answering questions referring to facts and dimensions/dimension values.

The goal of our work is an approach that answers basic questions for facts, such as “What were the total sales of beverages in March 2018?”, but also more complex questions requiring calculations, sorting and ranking, or other advanced operations, like “What were the total sales per product class in May 2018 compared to May 2017?”.

The focus of our work is the processing of textual questions – a speech recognition interface is out of the scope of our work but can easily be added by using external cloud services such as Alexa Voice Service or Google Speech-to-Text. The main contributions of our work are twofold: we present an approach for mapping NL questions to SQL queries on a snowflake schema, and we describe a schema-agnostic technique to derive such mappings. In addition, we provide a set of NL questions as a benchmark for the evaluation of QA systems for OLAP.

The remainder of this paper is organized as follows. In Sect. 2 we discuss related approaches from NLIDB and QA. The processing of questions and the mapping approach to SQL queries is described in Sect. 3. We have implemented this approach in a prototype, which is evaluated using a set of questions on the Foodmart cube. The results of this evaluation, presented in Sect. 4, demonstrate that our QA approach is capable of answering even complex questions. Finally, we conclude in Sect. 5 and point out future work.

2   RELATED WORK
Li et al. introduced an interface (NaLIR) that transforms NL to SQL [5]. The interface is interactive and requests feedback from the user, who interacts with the interface at two different stages of the query construction process. First, the NL phrase is parsed into a lexical tree. Afterwards, the parse tree is analyzed and specific nodes are identified which can be mapped to parts of the underlying SQL schema information. At this stage, if the mapping fails, the first feedback is requested from the user. Otherwise, the successful mappings are presented to the user. After all nodes are interpreted correctly (with the help of the user), the parse tree is adapted accordingly by their system. If necessary, implicit nodes are inserted, and the user can give feedback on that process as well. The query tree – verified by the user – is transformed into a SQL query containing joins, aggregate functions, etc. The system is designed to answer rather simple questions: the authors state that the question “Return the author who has the most publications in database area” is hard to answer. Moreover, the system relies on a large amount of user feedback.
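The question cited above illustrates why such queries are hard for rule-based NLIDBs: answering it requires a join, grouping, and ranking in one statement. The following sketch shows the shape of the target SQL over a hypothetical two-table schema; the table and column names are our assumption for illustration, not NaLIR's actual benchmark schema.

```python
import sqlite3

# Hypothetical mini-schema for "Return the author who has the most
# publications in database area". Names are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE publication (id INTEGER PRIMARY KEY, author_id INTEGER,
                              area TEXT);
    INSERT INTO author VALUES (1, 'Li'), (2, 'Sen');
    INSERT INTO publication VALUES (1, 1, 'database'), (2, 1, 'database'),
                                   (3, 2, 'database'), (4, 2, 'nlp');
""")

# The NL question maps to a join + GROUP BY + ORDER BY ... LIMIT pattern -
# exactly the combination of clauses a simple rule set struggles to produce.
row = con.execute("""
    SELECT a.name, COUNT(*) AS pubs
    FROM author a JOIN publication p ON p.author_id = a.id
    WHERE p.area = 'database'
    GROUP BY a.id, a.name
    ORDER BY pubs DESC
    LIMIT 1
""").fetchone()
print(row)  # ('Li', 2)
```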
The authors of the approach described above place their work in the field of NLIDB, as such interfaces are designed to query (relational) databases in general. In addition, the research field of question answering has emerged, especially since new types of knowledge bases, such as graph databases or triple stores, have been established. Our approach is designed to transform an NL question into SQL in order to query a schema stored in a database. However, we also relate our approach to the research field of question answering, as it is based on OLAP. A star or snowflake schema constitutes a specific and powerful way of storing data, respectively knowledge.

Naeem et al. presented an approach to generate OLAP queries from NL [7]. The question is analyzed for part-of-speech tags, linguistic dependencies, and semantic roles. They utilize SBVR, a vocabulary specific to business use cases. Different parts of the NL question are mapped to different pre-defined semantic roles, and the SQL query is then generated according to these assigned roles. The approach seems to be able to process rather simple questions, but more complicated queries containing comparisons, grouping, and other functions are not considered.

Sen et al. recently presented an approach to build complex SQL queries from NL [8]. The authors focus on queries that include subqueries and comparisons between different subqueries. Their approach is based on very few rules, but tailored to a specific benchmark containing finance data. The heart of their system is an ontology containing over 150 concepts and several hundred properties. Input questions are mapped to the ontology, and the SQL query is then constructed from the ontological elements. As stated above, our approach is designed to be as agnostic as possible regarding the underlying data source. Therefore, the data source can be changed without changing (parts of) the system.

There are several further approaches to NLIDBs published in recent years. In addition to the analytical approaches described above, there are systems that transform natural language to SQL using neural networks. But as we chose an analytical approach, these systems are out of scope in terms of related work.

Recently, a survey has been published comparing NLIs for databases [1]. The authors compare 24 interfaces that transform NL to a formal query language (namely SQL or SPARQL). The comparison is based on a set of 10 NL questions that have been constructed by means of the different challenges that developers face when transforming the input into the respective query. Unfortunately, none of the presented systems is based on OLAP-like data sources, but some transform NL to SQL queries – we describe those systems above. For the evaluation of our approach, we adopt this categorization of questions by means of query challenges, as described in detail in Section 4.

In terms of Business Intelligence (BI) scenarios, Tableau is an established software tool (https://www.tableau.com/). It focusses on querying business data and visualizing the data in a very intuitive way. The user can easily manipulate the visualization and create different perspectives. The user can also query the data using NL, but the input is immediately mapped to the underlying data source, respectively to pre-defined terms, and the user is presented with concrete items of the data source – functions, relations, attributes, or concrete data terms. After choosing one of the suggestions, the user can complete or refine the input query. In contrast, we focus on complete NL questions that could also originate from a speech interface. In our approach, processing and mapping are performed on a complete sentence/question.

In recent years, question answering has mostly focused on answering questions over an underlying semantic knowledge base. NL questions are translated into a SPARQL query based on the knowledge structured as RDF triples. The challenge Question Answering over Linked Data (QALD) was established in 2011 as part of the Extended Semantic Web Conference (ESWC); the latest challenge took place as part of the 18th International Semantic Web Conference (ISWC) [3]. In addition to the many competing systems that participated in recent years, the organizers of the challenge published all datasets, each containing at least 100 questions and the corresponding SPARQL queries. As we focus on data sources stored in star/snowflake schemas, a comparison to these systems and datasets is not applicable to the approach presented in this paper.

To better convey the contribution of our approach, we compare related systems using the following characteristics:

   • the data source is OLAP-like and the approach constructs SQL queries based on facts and dimensions
   • the approach is agnostic regarding the schema of the data source
   • NL questions are supported
   • complex queries containing grouping and subqueries are supported

Table 1 shows the comparison of our approach to related approaches regarding these characteristics.

Table 1: Comparison of related approaches to our approach

                    NaLIR [5]   Naeem et al. [7]   Sen et al. [8]   Tableau   Our Approach
  SQL               ✓           ✓                  ✓                ✓         ✓
  OLAP                          ✓                                   ✓         ✓
  Schema agnostic                                                   ✓         ✓
  NL question       ✓           ✓                  ✓                          ✓
  Complex queries                                  ✓                ✓         ✓

Obviously, our approach bridges the gap between OLAP-like data sources and NL questions in a schema-agnostic way – and is also able to answer complex questions.

3   METHOD
The proposed approach considers any given knowledge base structured as a snowflake schema with one or more fact tables and several dimension tables. Only one table of each dimension is directly connected to the fact table. The length pmax of the longest path between a fact table and the deepest dimension table can therefore become long; in any case, pmax > 1.

For our approach, we utilized the Foodmart data source (https://github.com/rsim/mondrian_demo/blob/master/db/foodmart.sql), as described further in Section 4. In the schema of the Foodmart data source, the longest path from the fact table to a dimension table is pmax = 2. Therefore, and also to simplify the description of our approach, we assume a snowflake schema with pmax = 2. Hence, we use the term first level dimension table for tables directly connected to the fact table (p = 1) and second level dimension table for dimension tables that are connected via one dimension table in between (p = 2).
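The distinction between first and second level dimension tables can be illustrated as a path-length computation over the schema graph. The sketch below uses table names that follow the Foodmart schema referenced in the text (sales_fact, product, product_class); the exact adjacency list is a simplified assumption for illustration, not the full Foodmart schema.

```python
from collections import deque

# Simplified sketch of a Foodmart-like snowflake schema as an adjacency list.
# sales_fact joins its first level dimension tables directly; product_class
# is reachable only via the product dimension (second level). The table
# list is an illustrative assumption, not the complete schema.
schema = {
    "sales_fact": ["product", "store", "time_by_day"],  # p = 1 dimensions
    "product": ["product_class"],                       # p = 2 dimension
    "store": [],
    "time_by_day": [],
    "product_class": [],
}

def dimension_level(fact_table: str, dim_table: str) -> int:
    """Return the path length p from the fact table to a dimension table."""
    queue = deque([(fact_table, 0)])
    seen = {fact_table}
    while queue:
        table, depth = queue.popleft()
        if table == dim_table:
            return depth
        for neighbor in schema.get(table, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    raise ValueError(f"{dim_table} not reachable from {fact_table}")

print(dimension_level("sales_fact", "product"))        # 1 (first level)
print(dimension_level("sales_fact", "product_class"))  # 2 (second level)
```

Here pmax is simply the maximum of `dimension_level` over all dimension tables, which is 2 for this sketch.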
However, our approach is also able to handle snowflake schemas with pmax > 2.

There are certain preliminaries for the analysis of the underlying data. We extract information from the schema and use it as pre-knowledge for the queries. In addition, we consider further knowledge about the specific data and general language transformation issues. The NL questions are processed based on this preparative information, and as a result a formal SQL query is created to retrieve the requested data from the snowflake schema and present it to the user. More details about our approach are described in the following sections.

For our approach, some information about the underlying data source is required in order to transform an NL question into a formal query. Most of the required information might be provided by the data owner as database constraints, such as foreign keys, or as fact and dimension table names. However, our approach is also able to handle unknown data sources and to extract the required information from the given tables. In this way, we can handle data sources given as a set of CSV files or as an undocumented SQL database reachable via a URL. The following sections describe the required process steps.

3.1   Preliminaries
The first and foremost prerequisite for our approach is having the data in a snowflake schema. In addition, we require the information about which tables in a given database are the fact tables; all other tables are assumed to be dimension tables. If no schema containing foreign key constraints is given, we identify the constraints using an automatic detection algorithm, such as the one described in [4]. For the mapping of NL phrases to the database, the tables and attributes should – in the best case – have descriptive names. For synthetically created data sources this might not be true. We identified several solutions for this challenge:

   • automatic detection of the type of data based on the contained data – this is effective for common data, such as person names, locations, or dates
   • analysis of sample queries and respective NL questions – this requires user input
   • manual annotation of table names and attributes – this also requires user input

For the description of our QA approach, we assume descriptive names for tables and attributes – as is the case for popular benchmarks such as TPC-H or TPC-DS, but also for the Foodmart data source. The next section describes the knowledge we extract from an unknown database to satisfy our prerequisites.

3.2   Knowledge about Data
We extract the following information required to map NL to a snowflake schema:

   • information about the schema – names of fact and dimension tables, join attributes, and fact attributes, plus synonyms
   • a mapping list to identify NL phrases for SQL operators
   • a thesaurus for general synonyms

Besides the actual data source containing the facts and dimensions, we only utilize one additional data source storing the synonyms, the extracted schema information, and the SQL vocabulary. The concrete extraction of this information and the further knowledge required for processing NL questions is described in detail in the following sections.

   3.2.1   Schema information. Provided that the name(s) of the fact table(s) are known, we extract and store further required information from the information schema of the database. This includes:

   • names of dimension tables
   • join attributes
   • names of fact attributes
   • synonyms for table and attribute names

Information about foreign keys is either given by the database constraints or automatically detected by an algorithm such as [4]. In this way, the join attributes for fact and dimension tables are detected. All remaining attributes of the fact table(s) are assumed to be fact attributes containing the actual measures and numbers.

In addition, we analyze and clean the attribute and table names and store them ready to be mapped to terms of the NL question. However, the NL question might not mention the exact (annotated) name of the affected attribute or fact table. Therefore, we utilize the datamuse API (http://www.datamuse.com/api/) to add synonyms for attributes and tables.

   3.2.2   SQL operators and functions. The required application of SQL operators and functions in the formal query can mostly be identified by certain keywords in the NL question of any domain. These keywords are assumed to be static across questions from different domains. Therefore, we utilize a list of potential keywords in order to map them to the correct SQL operator resp. function. For instance, the keyword total indicates the use of the SQL function SUM for the requested attribute. We provide the complete mapping list for download (https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/sql_operators.tsv).

   3.2.3   Thesaurus. Time information can be stored in various ways, and in turn, time information can be referenced in various ways in NL. For instance, the question How much bread was sold in the first quarter of 2010? targets all sales in the first three months – namely January, February, and March. In this case, we either have to filter for dates having 1, 2, or 3 in the month field of the date, or the month information is stored separately and we have to filter on that attribute. In addition, the time information may contain a separate field holding the respective quarter. In any case, we have to analyze the underlying data source and how date information is stored there. Furthermore, it is necessary to add synonymous information on how dates might be referenced in NL. Therefore, we maintain a thesaurus as a mapping table for date information. Depending on the underlying data source, terms like first quarter are mapped to the data source, e.g. to Q1.
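The keyword-to-operator mapping list and the date thesaurus described above can be sketched as two simple lookup tables. The entries shown follow the examples given in the text (total → SUM, average → AVG, maximum → MAX, different → DISTINCT, first quarter → Q1); the exact structure of our stored mapping tables may differ.

```python
# Sketch of the two preparative mapping tables. Entries follow the examples
# in Sections 3.2.2 and 3.2.3; the structure is an illustrative assumption.
OPERATOR_MAP = {
    "total": "SUM",
    "average": "AVG",
    "maximum": "MAX",
    "different": "DISTINCT",
}

# Date thesaurus: NL time expressions mapped to the representation used by
# the data source (here: a quarter label plus the months it covers).
DATE_THESAURUS = {
    "first quarter": {"quarter": "Q1", "months": [1, 2, 3]},
    "second quarter": {"quarter": "Q2", "months": [4, 5, 6]},
}

def lookup(term: str):
    """Map an extracted NL term to a SQL operator or a date constraint."""
    term = term.lower()
    if term in OPERATOR_MAP:
        return ("operator", OPERATOR_MAP[term])
    if term in DATE_THESAURUS:
        return ("date", DATE_THESAURUS[term])
    return None

print(lookup("total"))          # ('operator', 'SUM')
print(lookup("first quarter"))  # ('date', {'quarter': 'Q1', 'months': [1, 2, 3]})
```

Whether a matched date term is translated into a quarter label or a month filter depends on how the underlying data source stores its time information, as discussed above.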
3.3   Processing the Natural Language
The NL question is preprocessed and mapped to our thesaurus and the underlying data source. After preprocessing and the extraction of specific terms, the following checks are executed:

   • which SQL operator? – "total" -> SUM, "average" -> AVG
   • which fact table? – e.g. "sales" -> sales_fact
   • which dimension table? – "Which store [...]?", "In which city [...]?"
   • which attribute of the fact table? – "How many units were ordered [...]?" -> units_ordered from inventory_fact
   • is there a term from a dimension table? – "How much bread [...]?" -> "Bread" is found as a data term in the field for product categories within the product_class dimension

After these checks, the NL question is represented as an intermediate formal description. This formal description is then transformed into the actual SQL query.

   3.3.1   Preprocessing. The NL question is tagged for part-of-speech (POS) using the NLTK library and the Brown tagset. Subsequently, the respective terms are extracted from the NL question according to a predefined white list of POS tags. Table 2 shows the white list of POS tags and respective sample words for each tag.

Table 2: POS tags that are taken into account for the extraction of terms from the NL question

  Tag(s)   Type                Example
  CD       cardinal numeral    1998
  FW       foreign word        les
  JJ*      adjective           total
  NN*      noun                sales, costs
  PDT      predeterminer       both sales and costs
  POS      possessive ending   January's
  RB*      adverb              good, better, best
  VB*      verb                sold
  WRB      wh-adverb           when, how

If there are consecutive words with a noun tag (NN, NNS, NNP, NNPS), these words are extracted as one coherent term. Words with other tags are extracted as single words. Words with tags not contained in the white list are dismissed. In addition, further word combinations are identified, such as third month, last month, how many.

For instance, for the question

How many paper wipes were sold in the first month of 1997?

the following terms are extracted: [how many], [paper wipes], [were], [sold], [first month], [1997].

Subsequently, the list of extracted terms is checked for terms contained in our thesaurus. For our example, the term first month could be translated to January or 1, depending on how dates are stored in our data source.

After this preprocessing step, the question is represented as a list of n tuples – the question description desc. Each tuple contains the extracted term (respectively its translation) and the POS tag(s):

    desc = (t_1, t_2, ..., t_n)    (1)

with n = number of extracted terms, and

    t_i = (k_i, (POS_1, ..., POS_m))    (2)

with k_i being an extracted term, which can consist of one or more words – resulting in one or more POS tags per extracted term.

In the next steps, these terms are checked against query-related operators and the schema of the data source. The list of tuples is checked item by item, and if a check is positive, the respective tuple is replaced with a new tuple depending on the different checks, as described in detail in the following sections.

   3.3.2   Operators. Based on what the question asks for, several SQL operators might be relevant for the query. We utilize a mapping list of manually identified NL phrases that indicate the use of a SQL operator. For instance, the NL phrase maximum suggests the MAX operator in the SELECT clause, and different indicates the use of DISTINCT. If a term from the NL question matches a term from our mapping list, the term is replaced with the respective SQL operator in the intermediate representation desc of the question. The tuple t_i (containing term and POS tag(s)) in desc is replaced by the operator/function s_i as a string:

    s_i ∈ S    (3)

with S being the set of all SQL operators and functions.

   3.3.3   Fact tables. As a preliminary, we know the names of the fact tables in the snowflake schema. To identify the correct fact table (in case there is more than one fact table in the schema), we check whether the remaining terms in the tuples match a fact table's name. As it is not very likely that the question contains the exact fact table name, we utilize the datamuse API (as described in Section 3.2.1) to map a term and identify the correct fact table. For our example, the datamuse API provides the term sold as one of the synonyms for sales.

The information about the correct fact table is stored separately, in addition to the intermediate representation desc of the question.

   3.3.4   Dimension tables. In some cases a question might refer directly to a dimension of the schema. For instance, in a data source with localized information, one dimension could be the region. In that case, a question Which region locates the most stores? directly refers to the dimension region. The tuple t_i of the matching term is replaced with a new tuple d_i:

    d_i = ("", "", dim_j, (key_j1, ..., key_jn))    (4)

with "" being empty strings, dim_j the name of the dimension table, and (key_j1, ..., key_jn) the list of join attributes between the dimension table and the fact table and potential second level dimension tables. The first empty string in the tuple denotes an unspecified data constraint, i.e. no concrete data term (such as Bread or January) was found in the data source. The second empty string stands for an unassigned attribute of the corresponding table – no specific attribute is affected.

   3.3.5   Attributes. Questions might also refer directly to attributes of the fact or dimension tables. For instance, a data source about sales of different products might include a dimension table listing all available products and their properties. In that case, a possible question could be What is the net weight of [...]?. The net weight for all products is stored in a field in a
                                                                                       dimension table. The respective term in desc is mapped to the
5 A detailed transformation process of a sample NL question to a SQL query is illus-
trated here: https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/sample_
transformation.pdf                                                                     attribute and the tuple t is replaced with a tuple a:
6 http://www.nltk.org/
7 http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM                                                    ai = ( ′′ ,attr j ,dim j , (key j1 , ...,key jn ))    (5)
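As a minimal illustration (not part of our prototype, all function and tag names chosen freely), the tuple types of desc can be sketched in Python:

```python
# Minimal sketch of the intermediate representation desc.
# All names and the POS tags used below are illustrative only.

# A term tuple t_i: extracted term plus its POS tag(s) (Eq. 2).
def term_tuple(term, pos_tags):
    return (term, tuple(pos_tags))

# A dimension tuple d_i: two empty strings (no data constraint, no
# attribute), the dimension table and its join attributes (Eq. 4).
def dimension_tuple(dim, join_keys):
    return ("", "", dim, tuple(join_keys))

# An attribute tuple a_i: one empty string (no data constraint),
# the mapped attribute, its table and the join attributes (Eq. 5).
def attribute_tuple(attr, dim, join_keys):
    return ("", attr, dim, tuple(join_keys))

# desc for "Which region locates the most stores?" might then contain:
desc = [
    term_tuple("most", ("JJS",)),
    dimension_tuple("region", ("region_id",)),
]
```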
with '' being an empty string, attr_j the name of the attribute mapped to the term, dim_j the name of the dimension table, and (key_j1, ..., key_jn) the list of join attributes between the dimension table and the fact table and potential second-level dimension tables. Similar to d_i, the empty string denotes an unspecified data constraint.

3.3.6 Term. Finally, for the remaining tuples t of the list it is checked whether the term is stored as data in the underlying data source. In this way, the information for the WHERE clause constraining attributes to concrete data is retrieved. For each dimension table, a view is created holding the concrete data in one field and the corresponding attribute name in a second field. These views are created for all dimension tables separately. Remaining terms from the NL question are queried against the data fields of these views. In case of a match, a tuple is created containing the matching term, the attribute name, the name of the dimension table and the join attributes for this dimension table:

    w_i = (term, attr_j, dim_j, (key_j1, ..., key_jn))    (6)

with term being the data term found in the data source, attr_j the attribute where the data term is stored, dim_j the dimension table where the data term is stored, and (key_j1, ..., key_jn) the list of join attributes between the dimension table and the fact table and potential second-level dimension tables.

The previous tuple containing the matching term in the description desc is replaced by the newly created tuple w_i.

3.4 Construction of the SQL Query
After processing all checks as described above, desc consists of processed tuples/strings only. All initial tuples that could not be processed/mapped to the data source are dismissed. In the next step, desc is transformed into the actual SQL query.

All tuples of desc are analyzed, and the distinct dimension tables they contain are extracted and stored as a list for the creation of the joins in the query. As the schema might also contain second-level dimension tables, the list is checked for second-level dimension tables whose corresponding first-level dimension table is missing. If so, the first-level dimension table is added to the list of tables.

As we identified the correct fact table from the NL question, we have to check whether the question can actually be answered based on the retrieved dimension tables. Although we are able to handle different fact tables in one schema, it is possible that a fact table is not connected to a dimension table identified in the question. In that case, we have to exit the transformation process and give feedback to the user that the question cannot be answered based on the underlying data source.

3.4.1 SELECT clause. The SELECT clause is created based on the tuples in desc that contain at least one empty string at the beginning (of type a and d), resulting from steps 3.3.4 and 3.3.5. As described above, these two types of tuples contain empty string(s) at the first (and second) position. The first empty string means that no data constraint was identified. The second empty string denotes an unassigned attribute for the corresponding table. Hence, these two types of tuples are considered when adding attributes to the SELECT clause. If only the first string in the tuple is empty, the attribute on the second position of the tuple (attr_j) is directly used in the SELECT clause. If the second string is also empty, the tuple contains the affected table on position three (dim_j), and the join attribute of this dimension table to the fact table is used in the SELECT clause. Needless to say, if desc contains more than one tuple of these types, all attributes are added, separated by commas. As described in Section 3.3.2, our approach is able to add SQL operators and functions to the query. In that case, the operator/function is part of desc at the relative position at which it occurs in the SQL query. For instance, the question How many different products [...]? requires the functions COUNT (for How many) and DISTINCT (for different). Both mentions are directly followed by the concrete reference in the data source. Therefore, SQL functions are assigned to the attribute referenced in the tuple directly following the function in desc. Another special case is the existence of a compare operator in desc. In that case, the hint that the NL question asks for a comparative query is stored in desc as the string compare. The string is set in desc at the position between the tuples stating the comparative data constraints, or before all tuples relevant for the comparison. In the NL question, such a type of query is identified by NL phrases like compared to or comparing a and b. For instance, the question What were [...] in January 1997 compared to March 1998? requires two subqueries – one for January 1997 and one for March 1998. In this case, the number of required subqueries is counted and respective aliases are created. For each alias, attributes as previously described are added to the SELECT clause with the alias as prefix. As described in Section 3.3.2, we identify NL phrases that refer to SQL functions/operators, such as SUM or AVG, and store them as a string s_i in desc. When creating the SELECT clause, this function is added to the attributes contained in the tuple following s_i within desc. In case more than one attribute is added to the SELECT clause, the remaining attributes (including table references) are added to a separate list to be processed later when the GROUP BY case is checked.

3.4.2 FROM clause. The FROM clause includes the previously identified fact table combined with INNER JOINs to the affected (first- and second-level) dimension tables. As described in Section 3.2.1, the join attributes for fact and dimension tables are provided and utilized for the joins in the FROM clause of the query. For the special case of comparative subqueries, the FROM clause contains the subqueries for each constraint from the NL question using the previously created aliases. For each subquery a tuple w_i exists in desc. w_i contains the table, the attribute and the data term required for creating the SQL query containing a constraint on an attribute.

3.4.3 WHERE clause. The WHERE clause is constructed based on the tuples w contained in desc. All consecutive tuples of type w (having the first string set to a concrete data term) are added to the WHERE clause separated by AND. Each tuple w contains the table (position 3), the attribute (position 2) and the concrete data term (position 1) required to construct the constraint: w[2].w[1]='w[0]'.

Finally, the existence of hints for the use of the GROUP, ORDER and LIMIT clauses is checked.

3.4.4 Further operators. For the GROUP clause, a separate list is created during the creation of the SELECT clause. If an aggregation function is used in the SELECT clause, the remaining attributes identified for selection are added to this separate list. The list is checked for items and, if it contains any, these attributes are added to the end of the query surrounded by the grouping function. The ORDER clause is utilized in case there is an aggregate function in the SELECT clause. If so, the attribute used for the
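The clause construction can be sketched as follows, assuming desc holds 4-tuples (data term, attribute, table, join keys) as in Eqs. (4)-(6); the attribute and table names are examples only, not prescribed by our prototype:

```python
# Sketch of SELECT/WHERE construction from desc (names illustrative).

def build_select(desc):
    cols = []
    for data_term, attr, table, keys in desc:
        if data_term == "" and attr != "":    # type a: attribute is known
            cols.append(attr)
        elif data_term == "" and attr == "":  # type d: use the join attribute
            cols.append(f"{table}.{keys[0]}")
    return "SELECT " + ", ".join(cols)

def build_where(desc):
    # consecutive w tuples (concrete data term set) joined by AND,
    # each constraint built as w[2].w[1]='w[0]':
    preds = [f"{table}.{attr}='{term}'"
             for term, attr, table, _ in desc if term != ""]
    return "WHERE " + " AND ".join(preds) if preds else ""

# desc for a question like "How many products were sold ... Bread ... 1997?"
desc = [("", "", "product", ("product_id",)),
        ("Bread", "product_category", "product_class", ("product_class_id",)),
        ("1997", "the_year", "time_by_day", ("time_id",))]
```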
aggregate function is also used for ordering the results. For the LIMIT clause, the NL question is analyzed for references to presenting only a limited number of results. For instance, the phrases first, last, or top followed by a number are hints to add a LIMIT clause with the corresponding number at the end of the SQL query.

3.5 Potential Extensions
In the current version, our approach is able to answer a variety of different question types, described in more detail in Section 4. Nevertheless, we identified several issues for which we already have a concept in mind but did not integrate a solution yet.

3.5.1 User Feedback. A NL interface might be expected to act (almost) like a human when it comes to answering questions. As stated above, there are questions that cannot be answered directly. In the current version, our prototype would simply exit the process and give feedback like This question cannot be answered based on the current data source. But we prefer to design the interface in a more conversational way. If the process comes across a problem, the interface should communicate the problem to the user and ask for advice/help. So far, we identified two different issues for which a user's feedback would be required:
    • Ambiguity
    • Missing data
The first case refers to data terms within the data source that occur multiple times. This might be the case for different cities having the same name (ambiguity within the same type, e.g. cities) or for data terms that occur in different attributes of different tables (ambiguity across different types, e.g. a person and a city having the same name). In the current version, our approach groups all results when the term occurs multiple times in the same field (as for different cities with the same name) – different cities with the same name are treated as if they were the same. In the second case, our approach uses the first occurrence of the data term in the data source when searching for table and attribute. In both cases, the best way of resolving the ambiguity would be to ask for the user's help. For the first case, the approach is required to take additional information from the data source and present it to the user when asking for advice which term is the correct one. For instance, when a question asks for something in Richmond, the user should be asked back: Do you mean the Richmond in San Francisco district or in Vancouver district?. In the second case, the schema information of the data source must be taken into account. The different tables and corresponding attributes must be presented to the user, so a selection can be communicated. For instance, the user might be asked Do you mean the region with city Camacho or the employee with last name Camacho?. Obviously, this callback could be avoided, as a more detailed analysis of the NL question would reveal that Camacho refers to a place. But our approach is as agnostic as possible regarding the data source: although we might know that the NL question refers to a place, we are not aware which dimension tables store localized information.

For the case of missing data, the NL question is assumed to ask for terms that are not contained in the data source, or for a term from a dimension table that is not connected to the fact table. In the first case, an incorrectly written term in the NL question might be the cause of the issue. Therefore, as a first step we will integrate a spell-checker to be able to automatically correct words. In the second case, the NL question contains data source information and a reference to a fact table, but the tables are not

8 https://github.com/rsim/mondrian_demo/blob/master/db/foodmart.sql

connected, and a potential query does not return any result. The user might be presented with an alternative question by simplifying the question to an answerable query. For instance, the question What were the costs for bread in January 1997? refers to expenses regarding bread for a specific time period. When a fact table about expenses is not connected to specific products, the user might be asked back: We cannot provide the costs for bread. Would you rather be presented with the costs for all products for January 1997?.

3.5.2 Rules. As mentioned, our approach is designed to be agnostic regarding the actual underlying data source. There are several prerequisites required for our transformation process, but in general we extract the essential information from the (information schema of the) data source and add additional information by using APIs, such as datamuse. However, there are questions that require more detailed knowledge about the data source. For instance, the question How many products were in stock [...]? might be answered directly from the data source if the information on items in stock is stored explicitly. But it could also be the case that this information must be calculated from the existing information. For this example, the data source might provide the information on items ordered and items sold. Items in stock are then calculated by subtracting the number of items sold from the number of items ordered.

To solve this type of issue, we consider explicit rules similar to MDX (Multidimensional Expressions – a query language for OLAP data sources, based on SQL). There, so-called calculated members can be created using various expressions and operators. A member can be stored in the OLAP cube to be available in further queries.

We are aware of the fact that this type of rule cannot be retrieved from the (unknown) data source following our agnostic approach. Therefore, this extension builds upon the extension on user feedback as described above. We consider two different ways of retrieving additional rules:
    (1) A user provides additional information about the data source when the data source is initialized. This requires the user to have further knowledge about the information schema and the contained data.
    (2) The rules are retrieved as part of the user feedback when a question cannot be answered. The unknown term could be presented to the user together with (parts of) the information schema.
In either case, the user feedback could be retrieved in a visual way. The user might mark parts of the schema and add operators or comments. Thereby, the user is not required to have detailed knowledge about SQL or the desired rule language.

4 EVALUATION

4.1 Domain Application
Data stored in an OLAP-like structure is especially useful for BI scenarios. A (filtered) fact can be queried, and the additional dimensions allow the user to drill up or down along the (hierarchically) structured data. Thus, NLIs for OLAP databases are an effective instrument for data scientists in marketing or other business contexts. We therefore initially apply our approach to a business scenario and utilize data provided by Foodmart, structured in a snowflake schema8. A similar data source is the
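The idea of calculated-member-style rules described in Section 3.5.2 can be sketched with a simple rule store; the rule syntax, the function names and the expression for in stock are our own illustration under the assumptions above, not a finished design:

```python
# Sketch of a rule store for pre-defined concepts ("calculated members").
# Rule syntax and all names are illustrative assumptions.

rules = {
    # concept -> SQL expression over existing fact-table attributes
    "in stock": "(units_ordered - units_shipped)",
}

def expand_concept(phrase):
    """Replace a concept from the NL question by its defining expression;
    phrases without a rule are returned unchanged."""
    return rules.get(phrase.lower(), phrase)

# During SELECT construction, the expanded expression can be aggregated:
select = "SELECT SUM" + expand_concept("in stock") + " FROM inventory_fact"
```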
TPC-H benchmark9. It also comprises several fact and dimension tables, including speaking attribute and table names. Our approach is therefore directly applicable to this benchmark. But due to the size of the dataset and for illustration purposes, we utilized the Foodmart data source10. To be able to test and evaluate our system, we created a set of questions involving several challenging SQL operators and functions. The benchmark is described in more detail in the next section.

4.2 Benchmark
We are not aware of a benchmark containing questions and queries based on an OLAP data source. Hence, we are not able to present a qualitative evaluation in terms of accuracy or precision in comparison to other approaches. However, we are eager to create and provide such a benchmark for future research approaches. Therefore, we utilized the Foodmart data source, where data is stored in a snowflake schema with first- and second-level dimensions. The data source contains three different fact tables – holding data about sales, inventory and expense facts – and 17 dimension tables – holding additional information about stores, regions, customers, products etc. Each fact table is connected to at least four dimension tables directly, and several dimension tables are connected at the second level via other dimension tables.

We identified different question types and levels of difficulty for a NL interface to OLAP. There are simple questions with restrictions on one dimension table as well as complicated questions requiring calculations, subqueries and/or several SQL functions/operators. For the development process and assessment, we collected over 40 different questions and grouped them according to difficulty levels. To explain the different levels, we adopt the approach presented by Affolter et al. [1]. The authors classify questions by the required usage of operators and functions in the respective formal query. The following operators are applicable to our use case: Join, Filter, Aggregation, Ordering, Subquery, Concept. Concept denotes a pre-defined term, e.g. "great movie" or "in stock" (when this information must be calculated and is not stored in the data source explicitly). Using these challenges, we identified four different levels:

Level 1 refers to rather simple questions with a simple constraint on one or more dimensions, like How many different products were sold in October 1998?. Here, only a COUNT on one attribute and a constraint in one dimension table are required. The queries contain the challenges (J, A, F).

Level 2 refers to questions containing aggregations and grouping, like Show for each brand the total sales in 1998. The word each indicates the use of an aggregation in the SELECT clause and a grouping function for the respective attribute associated with brand. The queries contain the challenges (J, A, F, O).

Level 3 refers to queries containing subqueries for compared results or comparisons using <, >, ==, like In which month was less bread sold than in January 1997?. For this type of question, a subquery containing the data for January 1997 is constructed and compared using the operator < and the HAVING clause. The queries contain the challenges (J, A, S, F, O).

Level 4 refers to complicated queries containing calculations on attributes or pre-defined constraints on data terms, such as

9 http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
10 The queries described in Section 2.4 of the TPC-H specification might reflect real-world business questions, but cannot be summarized in one NL question – which our approach is aiming at.

How many units by Jeffers were in stock in the first month of 1997? – in case the term in stock cannot be mapped directly to an attribute of the data source. In this case, in stock must be calculated from available fields, such as ordered and shipped or similar attributes. This type of challenge is defined as Concept. Therefore, the queries contain the challenges (J, A, F, C) and optionally (O, S).

Table 3 shows sample questions and the respective challenges for each difficulty level. The full dataset containing over 40 different questions based on the Foodmart data source is available for download11. The dataset includes three columns, with the NL question in the first column, the difficulty level in the second column, and the corresponding SQL query in the third column. The dataset contains 14 questions of level 1, 12 questions of level 2, 11 questions of level 3 and 4 questions of level 4.

In the current version, our prototype is able to process questions of levels 1-3 directly and questions of level 4 for specific cases, such as calculating in stock from the attributes units_ordered and units_shipped in the Foodmart data source. Regarding the benchmark, this means that our approach is able to answer 93% of the questions. Of course, we are aware that these results are not statistically firm – neither in terms of the number of questions nor regarding the subjective nature of the benchmark. Unfortunately, none of the competing systems presented in Section 2 is available as an API or in any other form. Therefore, we are not able to compare the quality of the results of our approach to other approaches. We aim to encourage other researchers to utilize and expand our benchmark, so that a qualitative evaluation can be enabled in the future. Still, the current results show that our approach is able to construct SQL queries for a good number of NL questions of different difficulty levels facing several challenges.

As described in Section 3.5, we plan to integrate rules to be able to perform calculations and constraints as required by complicated questions of level 4. Several concepts might be identified automatically or added manually by the respective user. This applies to concepts like important, in stock or great (in terms of good reviews). Another special case is What are the total sales in the business year 1997/98?. Here, business year might refer to a time period from July 1997 to June 1998. The data source does not contain sales information in terms of business years. Therefore, the query must contain constraints from pre-defined rules where concrete terms are mapped to data item constraints.

4.3 Portability of the Benchmark's Domain to other Domains
Initially, we applied our approach to an existing data source structured in a snowflake schema from the domain of business data. This type of data design makes sense when you need insights into your data and want to receive metrics or ratios, e.g. for questions that start with how many/much or what were the total xy for ab. For this type of question, it is a reasonable approach to structure the respective data in fact and dimension tables and make the data source accessible via a QA application. In this way, business analysts or marketing strategists are able to get insight into their key data without being forced to have knowledge of any formal query language. Although we utilized the Foodmart data source for our first prototype, the data source and the data's domain can easily be replaced respectively changed. For instance, similar questions as collected in our benchmark are applicable

11 https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/queryDataset.tsv
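For illustration, a Level 1 query with the challenges (J, A, F) can be executed end-to-end on a toy in-memory schema loosely modeled on Foodmart; the table contents below are invented for the example:

```python
import sqlite3

# Toy schema loosely modeled on Foodmart; data values are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_by_day (time_id INTEGER, the_month TEXT, the_year TEXT);
CREATE TABLE sales_fact (product_id INTEGER, time_id INTEGER);
INSERT INTO time_by_day VALUES (1,'October','1998'), (2,'November','1998');
INSERT INTO sales_fact VALUES (10,1), (11,1), (10,1), (12,2);
""")

# "How many different products were sold in October 1998?"
# J: join fact and dimension table, A: COUNT(DISTINCT), F: month/year filter.
row = con.execute("""
  SELECT COUNT(DISTINCT s.product_id)
  FROM sales_fact s
  INNER JOIN time_by_day t ON s.time_id = t.time_id
  WHERE t.the_month = 'October' AND t.the_year = '1998'
""").fetchone()
print(row[0])  # -> 2
```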
Table 3: Question examples for the Foodmart data source

Level 1: How many different customers were in the first quarter of 1997?
    J(sales_fact, customer, time_by_day)
    A(COUNT(customer_id))
    F(quarter='Q1', year='1997')

Level 2: How much bread was sold per city in 1997?
    J(sales_fact, product, product_class, store, time_by_day)
    A(COUNT(product_id), GROUP BY(store_city))
    O(COUNT(product_id))
    F(product_category='Bread', year='1997')

Level 3: In which month was less bread sold than in January 1997?
    J(sales_fact, time_by_day, product, product_class)
    A(COUNT(product_id))
    S(HAVING(COUNT(product_id)) < F(month='January', year='1997', product_category='Bread'))
    F(month='January', year='1997', product_category='Bread')
    O(COUNT(product_id))

Level 4: How many units by Jeffers were in stock in the first month of 1997?
    J(inventory_fact, product, time_by_day)
    A(SUM(units_ordered), SUM(units_shipped))
    F(brand_name='Jeffers', month='January', year='1997')
    C(in stock = units_ordered - units_shipped)
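The Concept challenge from the Level 4 row of Table 3 can be reproduced on a toy inventory_fact table; the data values and the omission of the brand and time filters are simplifications for the example:

```python
import sqlite3

# "in stock" computed via the concept rule units_ordered - units_shipped.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE inventory_fact (product_id INTEGER,
                             units_ordered INTEGER, units_shipped INTEGER);
INSERT INTO inventory_fact VALUES (1, 50, 30), (1, 20, 15), (2, 10, 10);
""")

# Units in stock for product 1 (brand/time filters omitted for brevity):
row = con.execute("""
  SELECT SUM(units_ordered) - SUM(units_shipped)
  FROM inventory_fact WHERE product_id = 1
""").fetchone()
print(row[0])  # -> 25
```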


for data about issue trackers and software projects, such as How many bug fixes were committed in January 2019 per developer?.

5 SUMMARY & OUTLOOK
In this paper, we presented our approach in the research field of NLIDB, specifically for OLAP data sources. We designed our prototype system to be as agnostic as possible regarding the underlying data source. Required information is extracted from the schema, and the datamuse API is utilized to add synonyms to the data, which are used to map phrases from the NL question to terms of the data source. In our approach, the NL question is analyzed and an intermediate representation of the question is created. After several processing steps, this representation contains different types of tuples which are utilized in the subsequent construction of the SQL query. Depending on its type, a tuple is used to create the SELECT, FROM, or WHERE clause. Specific terms identified in the NL question indicate the use of SQL operators and functions, such as SUM, MAX, GROUP BY, or LIMIT.

With this approach, we are able to answer questions of different difficulty levels. In the absence of an OLAP benchmark, we created a dataset containing more than 40 questions based on the Foodmart data source. The benchmark is available for download, and we are planning to collect more questions to be able to evaluate our approach and compare it to other approaches.

Future work includes the implementation of the extensions described in Section 3.5. The first and foremost prerequisite for

more descriptive hints about the schema, if essential parts are missing (such as speaking attribute or table names). The implementation of rules – as described in Section 3.5.2 – is a desired enhancement of our approach. It enables the construction of even more complex queries and gives answers to questions of the highest difficulty level.

Overall, our approach is already able to construct SQL queries for rather complex NL questions. The addressed GUI will make the prototype available to users, and the enhancements will complete our approach in terms of query complexity.

6 ACKNOWLEDGEMENTS
This work was partially funded by the German Research Foundation (DFG) under grant no. SA782/26.

REFERENCES
[1] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28, 5 (Oct 2019), 793–819. https://doi.org/10.1007/s00778-019-00567-8
[2] Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases – an introduction. Natural Language Engineering 1, 1 (1995), 29–81.
[3] Key-Sun Choi, Luis Espinosa Anke, Thierry Declerck, Dagmar Gromann, Jin-Dong Kim, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem, and Ricardo Usbeck (Eds.). 2018. Joint Proceedings of ISWC 2018 Workshops SemDeep-4 and NLIWOD-4. Workshop on Semantic Deep Learning (SemDeep-2018). Vol. 2241. CEUR-WS.
[4] Lan Jiang and Felix Naumann. 2019. Holistic primary key and foreign key detection. Journal of Intelligent Information Systems (2019), 1–23.
[5] F. Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language
these extensions is the design and development of a graphical               Interface for Relational Databases. Proc. VLDB Endow. 8, 1 (Sept. 2014), 73–84.
user interface (GUI). On the one hand, our intended user frontend           https://doi.org/10.14778/2735461.2735468
                                                                        [6] Giuseppe M. Mazzeo and Carlo Zaniolo. 2016. Answering Controlled Natural
is targeted on being interactive regarding the presented results            Language Questions on RDF Knowledge Bases. In Proceedings of the 19th Inter-
from the data source. On the other hand, further developments               national Conference on Extending Database Technology, EDBT 2016, Bordeaux,
                                                                            France, March 15-16, 2016, Bordeaux, France, March 15-16, 2016. 608–611.
aim at an conversational interface. The user is either presented        [7] M. Asif Naeem and Imran Sarwar Bajwa. 2012. Generating OLAP Queries
with concrete results from the data source, or asked for feedback           from Natural Language Specification. In ICACCI ’12. ACM, 768–773. https:
to give a relevant answer for an initially unclear question. For            //doi.org/10.1145/2345396.2345522
                                                                        [8] Jaydeep Sen, Fatma Ozcan, Abdul Quamar, Greg Stager, Ashish Mittal, Manasa
further data insights, the user is enabled to request more informa-         Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan. 2019.
tion by asking questions that build upon the preceding questions.           Natural Language Querying of Complex Business Intelligence Queries. In
Furthermore, the user frontend should provide the user with a               SIGMOD ’19. ACM, New York, NY, USA, 1997–2000. https://doi.org/10.1145/
                                                                            3299869.3320248
facility to add an own data source and ask questions on it. The         [9] Nadine Steinmetz, Ann-Katrin Arning, and Kai-Uwe Sattler. 2019. From
data source might be provided as a bunch of csv files or an SQL             Natural Language Questions to SPARQL Queries: A Pattern-based Approach.
                                                                            In Datenbanksysteme für Business, Technologie und Web (BTW 2019). 289–308.
file or as an URL to a distant database. The data source is analyzed   [10] Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, and
regarding the required information for our approach. Eventually             Dongyan Zhao. 2014. Natural Language Question Answering over RDF: A
the user is involved to extract the required information or give            Graph Data Driven Approach. In SIGMOD ’14. 313–324.