=Paper= {{Paper |id=Vol-2578/DARLIAP15 |storemode=property |title=Question Answering on OLAP-like Data Sources |pdfUrl=https://ceur-ws.org/Vol-2578/DARLIAP15.pdf |volume=Vol-2578 |authors=Nadine Steinmetz,Samar Shahabi Ghahfarokhi,Kai-Uwe Sattler |dblpUrl=https://dblp.org/rec/conf/edbt/SteinmetzGS20 }} ==Question Answering on OLAP-like Data Sources== https://ceur-ws.org/Vol-2578/DARLIAP15.pdf
Question Answering on OLAP-like Data Sources

Nadine Steinmetz, Samar Shahabi-Ghahfarokhi, Kai-Uwe Sattler
Technische Universität Ilmenau, Germany
nadine.steinmetz@tu-ilmenau.de, samar.shahabi-ghahfarokhi@tu-ilmenau.de, kus@tu-ilmenau.de

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Today, knowledge is mostly stored in structured data sources and has to be queried with formal query languages. Natural language interfaces to these data sources enable users to query the data without knowledge of the technical details of the query language, which is particularly useful in analytics and decision support. In this paper, we present an approach to query data stored in an OLAP database using only natural language. We describe the preliminaries and the transformation process from a question to the formal SQL query on a snowflake schema. The approach takes synonyms into account to identify the relevant fact table and uses mapping techniques to determine the correct attributes and dimension tables within the schema. As an example, we utilized the Foodmart data source to implement a first prototype and prepared over 40 different questions to assess our approach and provide a benchmark for further studies.

1   INTRODUCTION
The developments in the fields of speech recognition and natural language processing (NLP) over the last few years have fostered many novel applications. Apart from the now nearly ubiquitous virtual assistants like Amazon Alexa or Apple Siri in smartphones, smart speakers, and other consumer devices, research in the field of question answering (QA) has made significant progress since IBM's DeepQA system Watson won the quiz show Jeopardy! in 2011. QA systems – or, in their special form, Natural Language Interfaces to Databases (NLIDB) [2] – provide a user-friendly way to access any type of knowledge base simply by entering a question formulated in natural language (NL). To answer such a question, the type of the question, objects, and predicates as well as entities have to be identified by parsing and analyzing the sentence. Then the question has to be mapped to a query, e.g. in SQL or SPARQL, on a database or knowledge base providing the necessary facts. In particular, RDF-based knowledge bases such as DBpedia [6, 9, 10] or databases enriched by ontologies [8] have been used successfully.

OLAP databases are used to answer multidimensional analytical queries in the business intelligence context. Typically, OLAP is based on a multidimensional data model distinguishing between facts or measures and dimensions, which together form a so-called hypercube. In such a hypercube, dimensions span the data space and provide information about the facts represented by the cells of the cube. OLAP cubes are usually queried either via visual interfaces (e.g. Tableau), which generate and execute database queries, or directly with standard query languages such as SQL or MDX. While the latter provide the most powerful access, SQL with advanced grouping features and OLAP functions in particular is too difficult for non-database experts. Visual interfaces, on the other hand, are easy to use (e.g. for drill-down/roll-up operations) even for business users, but limited in their expressiveness. Therefore, combining question answering with OLAP seems to be a natural solution to provide access to analytical databases. QA for (relational) OLAP means

   • working with star or snowflake schemas consisting of fact and dimension tables,
   • answering questions referring to facts and dimensions/dimension values.

The goal of our work is an approach that answers basic questions for facts, such as “What were the total sales of beverages in March 2018?”, but also more complex questions requiring calculations, sorting and ranking, or other advanced operations, like “What were the total sales per product class in May 2018 compared to May 2017?”.

The focus of our work is the processing of textual questions – a speech recognition interface is out of the scope of our work but can easily be added by using external cloud services such as Alexa Voice Service or Google Speech-to-Text. The main contributions of our work are twofold: we present an approach for mapping NL questions to SQL queries on a snowflake schema, and we describe a schema-agnostic technique to derive such mappings. In addition, we provide a set of NL questions as a benchmark for the evaluation of QA systems for OLAP.

The remainder of this paper is organized as follows. In Sect. 2 we discuss related approaches from NLIDB and QA. The processing of questions and the mapping approach to SQL queries is described in Sect. 3. We have implemented this approach in a prototype, which is evaluated using a set of questions on the Foodmart cube. The results of this evaluation, presented in Sect. 4, demonstrate that our QA approach is capable of answering even complex questions. Finally, we conclude in Sect. 5 and point out future work.

2   RELATED WORK
Li et al. introduced an interface (NaLIR) that transforms NL to SQL [5]. The interface is interactive and requests feedback from the user, who interacts with the interface at two different stages of the query construction process. First, the NL phrase is parsed into a lexical tree. Afterwards, the parse tree is analyzed and specific nodes are identified which can be mapped to parts of the underlying SQL schema information. At this stage, if the mapping fails, the first feedback is requested from the user. Otherwise, the successful mappings are presented to the user. After all nodes are interpreted correctly (with the help of the user), the parse tree is adapted accordingly by their system. If necessary, implicit nodes are inserted, and the user can give feedback on that process as well. The query tree – verified by the user – is transformed into a SQL query containing joins, aggregate functions, etc. The system is designed to answer rather simple questions: the authors state that the question “Return the author who has the most publications in database area” is hard to answer. Moreover, the system relies on a large amount of user feedback.
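The question cited above illustrates why such queries are hard for rule-based NLIDBs: answering it requires a join, grouping, and ranking in one statement. The following sketch shows the shape of the target SQL over a hypothetical two-table schema; the table and column names are our assumption for illustration, not NaLIR's actual benchmark schema.

```python
import sqlite3

# Hypothetical mini-schema for "Return the author who has the most
# publications in database area". Names are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE publication (id INTEGER PRIMARY KEY, author_id INTEGER,
                              area TEXT);
    INSERT INTO author VALUES (1, 'Li'), (2, 'Sen');
    INSERT INTO publication VALUES (1, 1, 'database'), (2, 1, 'database'),
                                   (3, 2, 'database'), (4, 2, 'nlp');
""")

# The NL question maps to a join + GROUP BY + ORDER BY ... LIMIT pattern -
# exactly the combination of clauses a simple rule set struggles to produce.
row = con.execute("""
    SELECT a.name, COUNT(*) AS pubs
    FROM author a JOIN publication p ON p.author_id = a.id
    WHERE p.area = 'database'
    GROUP BY a.id, a.name
    ORDER BY pubs DESC
    LIMIT 1
""").fetchone()
print(row)  # ('Li', 2)
```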
The authors of the approach described above place their work in the field of NLIDB, as such interfaces are designed to query (relational) databases in general. In addition, the research field of question answering has emerged, especially since new types of knowledge bases, such as graph databases or triple stores, have been established. Our approach is designed to transform an NL question into SQL in order to query a schema stored in a database. However, we also relate our approach to the research field of question answering, as it is based on OLAP. A star or snowflake schema constitutes a specific and powerful way of storing data, respectively knowledge.

Naeem et al. presented an approach to generate OLAP queries from NL [7]. The question is analyzed for part-of-speech tags, linguistic dependencies, and semantic roles. They utilize SBVR, a vocabulary specific to business use cases. Different parts of the NL question are mapped to different pre-defined semantic roles, and the SQL query is then generated according to these assigned roles. The approach seems to be able to process rather simple questions, but more complicated queries containing comparisons, grouping, and other functions are not considered.

Sen et al. recently presented an approach to build complex SQL queries from NL [8]. The authors focus on queries that include subqueries and comparisons between different subqueries. Their approach is based on very few rules, but tailored to a specific benchmark containing finance data. The heart of their system is an ontology containing over 150 concepts and several hundred properties. Input questions are mapped to the ontology, and the SQL query is then constructed from the ontological elements. As stated above, our approach is designed to be as agnostic as possible regarding the underlying data source. Therefore, the data source can be changed without changing (parts of) the system.

There are several further approaches to NLIDBs published in recent years. In addition to the analytical approaches described above, there are systems that transform natural language to SQL using neural networks. But as we chose an analytical approach, these systems are out of scope in terms of related work.

Recently, a survey has been published comparing NLIs for databases [1]. The authors compare 24 interfaces that transform NL to a formal query language (namely SQL or SPARQL). The comparison is based on a set of 10 NL questions that have been constructed by means of the different challenges that developers face when transforming the input into the respective query. Unfortunately, none of the presented systems is based on OLAP-like data sources, but some transform NL to SQL queries – we describe those systems above. For the evaluation of our approach, we adopt this categorization of questions by means of query challenges, as described in detail in Section 4.

In terms of Business Intelligence (BI) scenarios, Tableau is an established software tool (https://www.tableau.com/). It focusses on querying business data and visualizing the data in a very intuitive way. The user can easily manipulate the visualization and create different perspectives. The user can also query the data using NL, but the input is immediately mapped to the underlying data source, respectively to pre-defined terms, and the user is presented with concrete items of the data source – functions, relations, attributes, or concrete data terms. After choosing one of the suggestions, the user can complete or refine the input query. In contrast, we focus on complete NL questions that could also originate from a speech interface. In our approach, processing and mapping are performed on a complete sentence/question.

In recent years, question answering has mostly focused on answering questions over an underlying semantic knowledge base. NL questions are translated into a SPARQL query based on the knowledge structured as RDF triples. The challenge Question Answering over Linked Data (QALD) was established in 2011 as part of the Extended Semantic Web Conference (ESWC); the latest challenge took place as part of the 18th International Semantic Web Conference (ISWC) [3]. In addition to the many competing systems that participated in recent years, the organizers of the challenge published all datasets, each containing at least 100 questions and the corresponding SPARQL queries. As we focus on data sources stored in star/snowflake schemas, a comparison to these systems and datasets is not applicable to the approach presented in this paper.

To better convey the contribution of our approach, we compare related systems using the following characteristics:

   • the data source is OLAP-like and the approach constructs SQL queries based on facts and dimensions
   • the approach is agnostic regarding the schema of the data source
   • NL questions are supported
   • complex queries containing grouping and subqueries are supported

Table 1 shows the comparison of our approach to related approaches regarding these characteristics.

Table 1: Comparison of related approaches to our approach

                    NaLIR [5]   Naeem et al. [7]   Sen et al. [8]   Tableau   Our Approach
  SQL               ✓           ✓                  ✓                ✓         ✓
  OLAP                          ✓                                   ✓         ✓
  Schema agnostic                                                   ✓         ✓
  NL question       ✓           ✓                  ✓                          ✓
  Complex queries                                  ✓                ✓         ✓

Obviously, our approach bridges the gap between OLAP-like data sources and NL questions in a schema-agnostic way – and is also able to answer complex questions.

3   METHOD
The proposed approach considers any given knowledge base structured as a snowflake schema with one or more fact tables and several dimension tables. Only one table of each dimension is directly connected to the fact table. The length pmax of the longest path between a fact table and the deepest dimension table can therefore become long; in any case, pmax > 1.

For our approach, we utilized the Foodmart data source (https://github.com/rsim/mondrian_demo/blob/master/db/foodmart.sql), as described further in Section 4. In the schema of the Foodmart data source, the longest path from the fact table to a dimension table is pmax = 2. Therefore, and also to simplify the description of our approach, we assume a snowflake schema with pmax = 2. Hence, we use the term first level dimension table for tables directly connected to the fact table (p = 1) and second level dimension table for dimension tables that are connected via one dimension table in between (p = 2).
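The distinction between first and second level dimension tables can be illustrated as a path-length computation over the schema graph. The sketch below uses table names that follow the Foodmart schema referenced in the text (sales_fact, product, product_class); the exact adjacency list is a simplified assumption for illustration, not the full Foodmart schema.

```python
from collections import deque

# Simplified sketch of a Foodmart-like snowflake schema as an adjacency list.
# sales_fact joins its first level dimension tables directly; product_class
# is reachable only via the product dimension (second level). The table
# list is an illustrative assumption, not the complete schema.
schema = {
    "sales_fact": ["product", "store", "time_by_day"],  # p = 1 dimensions
    "product": ["product_class"],                       # p = 2 dimension
    "store": [],
    "time_by_day": [],
    "product_class": [],
}

def dimension_level(fact_table: str, dim_table: str) -> int:
    """Return the path length p from the fact table to a dimension table."""
    queue = deque([(fact_table, 0)])
    seen = {fact_table}
    while queue:
        table, depth = queue.popleft()
        if table == dim_table:
            return depth
        for neighbor in schema.get(table, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    raise ValueError(f"{dim_table} not reachable from {fact_table}")

print(dimension_level("sales_fact", "product"))        # 1 (first level)
print(dimension_level("sales_fact", "product_class"))  # 2 (second level)
```

Here pmax is simply the maximum of `dimension_level` over all dimension tables, which is 2 for this sketch.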
However, our approach is also able to handle snowflake schemas with pmax > 2.

There are certain preliminaries for the analysis of the underlying data. We extract information from the schema and use it as pre-knowledge for the queries. In addition, we consider further knowledge about the specific data and general language transformation issues. The NL questions are processed based on this preparative information, and as a result a formal SQL query is created to retrieve the requested data from the snowflake schema and present it to the user. More details about our approach are described in the following sections.

For our approach, some information about the underlying data source is required in order to transform an NL question into a formal query. Most of the required information might be provided by the data owner as database constraints, such as foreign keys, or as fact and dimension table names. However, our approach is also able to handle unknown data sources and to extract the required information from the given tables. In this way, we can handle data sources given as a set of CSV files or as an undocumented SQL database reachable via a URL. The following sections describe the required process steps.

3.1   Preliminaries
The first and foremost prerequisite for our approach is having the data in a snowflake schema. In addition, we require the information about which tables in a given database are the fact tables; all other tables are assumed to be dimension tables. If no schema containing foreign key constraints is given, we identify the constraints using an automatic detection algorithm, such as the one described in [4]. For the mapping of NL phrases to the database, the tables and attributes should – in the best case – have descriptive names. For synthetically created data sources this might not be true. We identified several solutions for this challenge:

   • automatic detection of the type of data based on the contained data – this is effective for common data, such as person names, locations, or dates
   • analysis of sample queries and respective NL questions – this requires user input
   • manual annotation of table names and attributes – this also requires user input

For the description of our QA approach, we assume descriptive names for tables and attributes – as is the case for popular benchmarks such as TPC-H or TPC-DS, but also for the Foodmart data source. The next section describes the knowledge we extract from an unknown database to satisfy our prerequisites.

3.2   Knowledge about Data
We extract the following information required to map NL to a snowflake schema:

   • information about the schema – names of fact and dimension tables, join attributes, and fact attributes, plus synonyms
   • a mapping list to identify NL phrases for SQL operators
   • a thesaurus for general synonyms

Besides the actual data source containing the facts and dimensions, we only utilize one additional data source storing the synonyms, the extracted schema information, and the SQL vocabulary. The concrete extraction of this information and the further knowledge required for processing NL questions is described in detail in the following sections.

   3.2.1   Schema information. Provided that the name(s) of the fact table(s) are known, we extract and store further required information from the information schema of the database. This includes:

   • names of dimension tables
   • join attributes
   • names of fact attributes
   • synonyms for table and attribute names

Information about foreign keys is either given by the database constraints or automatically detected by an algorithm such as [4]. In this way, the join attributes for fact and dimension tables are detected. All remaining attributes of the fact table(s) are assumed to be fact attributes containing the actual measures and numbers.

In addition, we analyze and clean the attribute and table names and store them ready to be mapped to terms of the NL question. However, the NL question might not mention the exact (annotated) name of the affected attribute or fact table. Therefore, we utilize the datamuse API (http://www.datamuse.com/api/) to add synonyms for attributes and tables.

   3.2.2   SQL operators and functions. The required application of SQL operators and functions in the formal query can mostly be identified by certain keywords in the NL question of any domain. These keywords are assumed to be static across questions from different domains. Therefore, we utilize a list of potential keywords in order to map them to the correct SQL operator resp. function. For instance, the keyword total indicates the use of the SQL function SUM for the requested attribute. We provide the complete mapping list for download (https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/sql_operators.tsv).

   3.2.3   Thesaurus. Time information can be stored in various ways, and in turn, time information can be referenced in various ways in NL. For instance, the question How much bread was sold in the first quarter of 2010? targets all sales in the first three months – namely January, February, and March. In this case, we either have to filter for dates having 1, 2, or 3 in the month field of the date, or the month information is stored separately and we have to filter on that attribute. In addition, the time information may contain a separate field holding the respective quarter. In any case, we have to analyze the underlying data source and how date information is stored there. Furthermore, it is necessary to add synonymous information on how dates might be referenced in NL. Therefore, we maintain a thesaurus as a mapping table for date information. Depending on the underlying data source, terms like first quarter are mapped to the data source, e.g. to Q1.
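The keyword-to-operator mapping list and the date thesaurus described above can be sketched as two simple lookup tables. The entries shown follow the examples given in the text (total → SUM, average → AVG, maximum → MAX, different → DISTINCT, first quarter → Q1); the exact structure of our stored mapping tables may differ.

```python
# Sketch of the two preparative mapping tables. Entries follow the examples
# in Sections 3.2.2 and 3.2.3; the structure is an illustrative assumption.
OPERATOR_MAP = {
    "total": "SUM",
    "average": "AVG",
    "maximum": "MAX",
    "different": "DISTINCT",
}

# Date thesaurus: NL time expressions mapped to the representation used by
# the data source (here: a quarter label plus the months it covers).
DATE_THESAURUS = {
    "first quarter": {"quarter": "Q1", "months": [1, 2, 3]},
    "second quarter": {"quarter": "Q2", "months": [4, 5, 6]},
}

def lookup(term: str):
    """Map an extracted NL term to a SQL operator or a date constraint."""
    term = term.lower()
    if term in OPERATOR_MAP:
        return ("operator", OPERATOR_MAP[term])
    if term in DATE_THESAURUS:
        return ("date", DATE_THESAURUS[term])
    return None

print(lookup("total"))          # ('operator', 'SUM')
print(lookup("first quarter"))  # ('date', {'quarter': 'Q1', 'months': [1, 2, 3]})
```

Whether a matched date term is translated into a quarter label or a month filter depends on how the underlying data source stores its time information, as discussed above.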
3.3   Processing the Natural Language
The NL question is preprocessed and mapped to our thesaurus and the underlying data source. After preprocessing and the extraction of specific terms, the following checks are executed:

   • which SQL operator? – "total" -> SUM, "average" -> AVG
   • which fact table? – e.g. "sales" -> sales_fact
   • which dimension table? – "Which store [...]?", "In which city [...]?"
   • which attribute of the fact table? – "How many units were ordered [...]?" -> units_ordered from inventory_fact
   • is there a term from a dimension table? – "How much bread [...]?" -> "Bread" is found as a data term in the field for product categories within the product_class dimension

After these checks, the NL question is represented as an intermediate formal description. This formal description is then transformed into the actual SQL query.

   3.3.1   Preprocessing. The NL question is tagged for part-of-speech (POS) using the NLTK library and the Brown tagset. Subsequently, the respective terms are extracted from the NL question according to a predefined white list of POS tags. Table 2 shows the white list of POS tags and respective sample words for each tag.

Table 2: POS tags that are taken into account for the extraction of terms from the NL question

  Tag(s)   Type                Example
  CD       cardinal numeral    1998
  FW       foreign word        les
  JJ*      adjective           total
  NN*      noun                sales, costs
  PDT      predeterminer       both sales and costs
  POS      possessive ending   January's
  RB*      adverb              good, better, best
  VB*      verb                sold
  WRB      wh-adverb           when, how

If there are consecutive words with a noun tag (NN, NNS, NNP, NNPS), these words are extracted as one coherent term. Words with other tags are extracted as single words. Words with tags not contained in the white list are dismissed. In addition, further word combinations are identified, such as third month, last month, how many.

For instance, for the question

How many paper wipes were sold in the first month of 1997?

the following terms are extracted: [how many], [paper wipes], [were], [sold], [first month], [1997].

Subsequently, the list of extracted terms is checked for terms contained in our thesaurus. For our example, the term first month could be translated to January or 1, depending on how dates are stored in our data source.

After this preprocessing step, the question is represented as a list of n tuples – the question description desc. Each tuple contains the extracted term (respectively its translation) and the POS tag(s):

    desc = (t_1, t_2, ..., t_n)    (1)

with n = number of extracted terms, and

    t_i = (k_i, (POS_1, ..., POS_m))    (2)

with k_i being an extracted term, which can consist of one or more words – resulting in one or more POS tags per extracted term.

In the next steps, these terms are checked against query-related operators and the schema of the data source. The list of tuples is checked item by item, and if a check is positive, the respective tuple is replaced with a new tuple depending on the different checks, as described in detail in the following sections.

   3.3.2   Operators. Based on what the question asks for, several SQL operators might be relevant for the query. We utilize a mapping list of manually identified NL phrases that indicate the use of a SQL operator. For instance, the NL phrase maximum suggests the MAX operator in the SELECT clause, and different indicates the use of DISTINCT. If a term from the NL question matches a term from our mapping list, the term is replaced with the respective SQL operator in the intermediate representation desc of the question. The tuple t_i (containing term and POS tag(s)) in desc is replaced by the operator/function s_i as a string:

    s_i ∈ S    (3)

with S being the set of all SQL operators and functions.

   3.3.3   Fact tables. As a preliminary, we know the names of the fact tables in the snowflake schema. To identify the correct fact table (in case there is more than one fact table in the schema), we check whether the remaining terms in the tuples match a fact table's name. As it is not very likely that the question contains the exact fact table name, we utilize the datamuse API (as described in Section 3.2.1) to map a term and identify the correct fact table. For our example, the datamuse API provides the term sold as one of the synonyms for sales.

The information about the correct fact table is stored separately, in addition to the intermediate representation desc of the question.

   3.3.4   Dimension tables. In some cases a question might refer directly to a dimension of the schema. For instance, in a data source with localized information, one dimension could be the region. In that case, a question Which region locates the most stores? directly refers to the dimension region. The tuple t_i of the matching term is replaced with a new tuple d_i:

    d_i = ("", "", dim_j, (key_j1, ..., key_jn))    (4)

with "" being empty strings, dim_j the name of the dimension table, and (key_j1, ..., key_jn) the list of join attributes between the dimension table and the fact table and potential second level dimension tables. The first empty string in the tuple denotes an unspecified data constraint, i.e. no concrete data term (such as Bread or January) was found in the data source. The second empty string stands for an unassigned attribute of the corresponding table – no specific attribute is affected.

   3.3.5   Attributes. Questions might also refer directly to attributes of the fact or dimension tables. For instance, a data source about sales of different products might include a dimension table listing all available products and their properties. In that case, a possible question could be What is the net weight of [...]?. The net weight for all products is stored in a field in a
                                                                                       dimension table. The respective term in desc is mapped to the
5 A detailed transformation process of a sample NL question to a SQL query is illus-
trated here: https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/sample_
transformation.pdf                                                                     attribute and the tuple t is replaced with a tuple a:
6 http://www.nltk.org/
7 http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM                                                    ai = ( ′′ ,attr j ,dim j , (key j1 , ...,key jn ))    (5)
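As a minimal illustration (not part of our prototype, all function and tag names chosen freely), the tuple types of desc can be sketched in Python:

```python
# Minimal sketch of the intermediate representation desc.
# All names and the POS tags used below are illustrative only.

# A term tuple t_i: extracted term plus its POS tag(s) (Eq. 2).
def term_tuple(term, pos_tags):
    return (term, tuple(pos_tags))

# A dimension tuple d_i: two empty strings (no data constraint, no
# attribute), the dimension table and its join attributes (Eq. 4).
def dimension_tuple(dim, join_keys):
    return ("", "", dim, tuple(join_keys))

# An attribute tuple a_i: one empty string (no data constraint),
# the mapped attribute, its table and the join attributes (Eq. 5).
def attribute_tuple(attr, dim, join_keys):
    return ("", attr, dim, tuple(join_keys))

# desc for "Which region locates the most stores?" might then contain:
desc = [
    term_tuple("most", ("JJS",)),
    dimension_tuple("region", ("region_id",)),
]
```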
with '' being an empty string, attr_j the name of the attribute mapped to the term, dim_j the name of the dimension table, and (key_j1, ..., key_jn) the list of join attributes between the dimension table and the fact table and potential second-level dimension tables. Similar to d_i, the empty string denotes an unspecified data constraint.

3.3.6 Term. Finally, for the remaining tuples t of the list it is checked whether the term is stored as data in the underlying data source. In this way, the information for the WHERE clause constraining attributes to concrete data is retrieved. For each dimension table, a view is created holding the concrete data in one field and the corresponding attribute name in a second field. These views are created for all dimension tables separately. Remaining terms from the NL question are queried against the data fields of these views. In case of a match, a tuple is created containing the matching term, the attribute name, the name of the dimension table and the join attributes for this dimension table:

    w_i = (term, attr_j, dim_j, (key_j1, ..., key_jn))    (6)

with term being the data term found in the data source, attr_j the attribute where the data term is stored, dim_j the dimension table where the data term is stored, and (key_j1, ..., key_jn) the list of join attributes between the dimension table and the fact table and potential second-level dimension tables.

The previous tuple containing the matching term in the description desc is replaced by the newly created tuple w_i.

3.4 Construction of the SQL Query
After processing all checks as described above, desc consists of processed tuples/strings only. All initial tuples that could not be processed/mapped to the data source are dismissed. In the next step, desc is transformed into the actual SQL query.

All tuples of desc are analyzed, and the distinct dimension tables they contain are extracted and stored as a list for the creation of the joins in the query. As the schema might also contain second-level dimension tables, the list is checked for second-level dimension tables whose corresponding first-level dimension table is missing. If so, the first-level dimension table is added to the list of tables.

As we identified the correct fact table from the NL question, we have to check whether the question can actually be answered based on the retrieved dimension tables. Although we are able to handle different fact tables in one schema, it is possible that a fact table is not connected to a dimension table identified in the question. In that case, we have to exit the transformation process and give feedback to the user that the question cannot be answered based on the underlying data source.

3.4.1 SELECT clause. The SELECT clause is created based on the tuples in desc that contain at least one empty string at the beginning (of type a and d), resulting from steps 3.3.4 and 3.3.5. As described above, these two types of tuples contain empty string(s) at the first (and second) position. The first empty string means that no data constraint was identified. The second empty string denotes an unassigned attribute for the corresponding table. Hence, these two types of tuples are considered when adding attributes to the SELECT clause. If only the first string in the tuple is empty, the attribute on the second position of the tuple (attr_j) is directly used in the SELECT clause. If the second string is also empty, the tuple contains the affected table on position three (dim_j), and the join attribute of this dimension table to the fact table is used in the SELECT clause. Needless to say, if desc contains more than one tuple of these types, all attributes are added, separated by commas. As described in Section 3.3.2, our approach is able to add SQL operators and functions to the query. In that case, the operator/function is part of desc at the relative position at which it occurs in the SQL query. For instance, the question How many different products [...]? requires the functions COUNT (for How many) and DISTINCT (for different). Both mentions are directly followed by the concrete reference in the data source. Therefore, SQL functions are assigned to the attribute referenced in the tuple directly following the function in desc. Another special case is the existence of a compare operator in desc. In that case, the hint that the NL question asks for a comparative query is stored in desc as the string compare. The string is set in desc at the position between the tuples stating the comparative data constraints, or before all tuples relevant for the comparison. In the NL question, such a type of query is identified by NL phrases like compared to or comparing a and b. For instance, the question What were [...] in January 1997 compared to March 1998? requires two subqueries – one for January 1997 and one for March 1998. In this case, the number of required subqueries is counted and respective aliases are created. For each alias, attributes as previously described are added to the SELECT clause with the alias as prefix. As described in Section 3.3.2, we identify NL phrases that refer to SQL functions/operators, such as SUM or AVG, and store them as a string s_i in desc. When creating the SELECT clause, this function is added to the attributes contained in the tuple following s_i within desc. In case more than one attribute is added to the SELECT clause, the remaining attributes (including table references) are added to a separate list to be processed later when the GROUP BY case is checked.

3.4.2 FROM clause. The FROM clause includes the previously identified fact table combined with INNER JOINs to the affected (first- and second-level) dimension tables. As described in Section 3.2.1, the join attributes for fact and dimension tables are provided and utilized for the joins in the FROM clause of the query. For the special case of comparative subqueries, the FROM clause contains the subqueries for each constraint from the NL question using the previously created aliases. For each subquery a tuple w_i exists in desc. w_i contains the table, the attribute and the data term required for creating the SQL query containing a constraint on an attribute.

3.4.3 WHERE clause. The WHERE clause is constructed based on the tuples w contained in desc. All consecutive tuples of type w (having the first string set to a concrete data term) are added to the WHERE clause separated by AND. Each tuple w contains the table (position 3), the attribute (position 2) and the concrete data term (position 1) required to construct the constraint: w[2].w[1]='w[0]'.

Finally, the existence of hints for the use of the GROUP, ORDER and LIMIT clauses is checked.

3.4.4 Further operators. For the GROUP clause, a separate list is created during the creation of the SELECT clause. If an aggregation function is used in the SELECT clause, the remaining attributes identified for selection are added to this separate list. The list is checked for items and, if it contains any, these attributes are added to the end of the query surrounded by the grouping function. The ORDER clause is utilized in case there is an aggregate function in the SELECT clause. If so, the attribute used for the
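The clause construction can be sketched as follows, assuming desc holds 4-tuples (data term, attribute, table, join keys) as in Eqs. (4)-(6); the attribute and table names are examples only, not prescribed by our prototype:

```python
# Sketch of SELECT/WHERE construction from desc (names illustrative).

def build_select(desc):
    cols = []
    for data_term, attr, table, keys in desc:
        if data_term == "" and attr != "":    # type a: attribute is known
            cols.append(attr)
        elif data_term == "" and attr == "":  # type d: use the join attribute
            cols.append(f"{table}.{keys[0]}")
    return "SELECT " + ", ".join(cols)

def build_where(desc):
    # consecutive w tuples (concrete data term set) joined by AND,
    # each constraint built as w[2].w[1]='w[0]':
    preds = [f"{table}.{attr}='{term}'"
             for term, attr, table, _ in desc if term != ""]
    return "WHERE " + " AND ".join(preds) if preds else ""

# desc for a question like "How many products were sold ... Bread ... 1997?"
desc = [("", "", "product", ("product_id",)),
        ("Bread", "product_category", "product_class", ("product_class_id",)),
        ("1997", "the_year", "time_by_day", ("time_id",))]
```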
aggregate function is also used for ordering the results. For the LIMIT clause, the NL question is analyzed for references to presenting only a limited number of results. For instance, the phrases first, last, or top followed by a number are hints to add a LIMIT clause with the corresponding number at the end of the SQL query.

3.5 Potential Extensions
In the current version, our approach is able to answer a variety of different question types, described in more detail in Section 4. Nevertheless, we identified several issues for which we already have a concept in mind but did not integrate a solution yet.

3.5.1 User Feedback. A NL interface might be expected to act (almost) like a human when it comes to answering questions. As stated above, there are questions that cannot be answered directly. In the current version, our prototype would simply exit the process and give feedback like This question cannot be answered based on the current data source. But we prefer to design the interface in a more conversational way. If the process comes across a problem, the interface should communicate the problem to the user and ask for advice/help. So far, we identified two different issues for which a user's feedback would be required:
    • Ambiguity
    • Missing data
The first case refers to data terms within the data source that occur multiple times. This might be the case for different cities having the same name (ambiguity within the same type, e.g. cities) or for data terms that occur in different attributes of different tables (ambiguity across different types, e.g. a person and a city having the same name). In the current version, our approach groups all results when the term occurs multiple times in the same field (as for different cities with the same name) – different cities with the same name are treated as if they were the same. In the second case, our approach uses the first occurrence of the data term in the data source when searching for table and attribute. In both cases, the best way of resolving the ambiguity would be to ask for the user's help. For the first case, the approach is required to take additional information from the data source and present it to the user when asking for advice which term is the correct one. For instance, when a question asks for something in Richmond, the user should be asked back: Do you mean the Richmond in San Francisco district or in Vancouver district?. In the second case, the schema information of the data source must be taken into account. The different tables and corresponding attributes must be presented to the user, so a selection can be communicated. For instance, the user might be asked Do you mean the region with city Camacho or the employee with last name Camacho?. Obviously, this callback could be avoided, as a more detailed analysis of the NL question would reveal that Camacho refers to a place. But our approach is as agnostic as possible regarding the data source: although we might know that the NL question refers to a place, we are not aware which dimension tables store localized information.

For the case of missing data, the NL question is assumed to ask for terms that are not contained in the data source, or for a term from a dimension table that is not connected to the fact table. In the first case, an incorrectly written term in the NL question might be the cause of the issue. Therefore, as a first step we will integrate a spell-checker to be able to automatically correct words. In the second case, the NL question contains data source information and a reference to a fact table, but the tables are not

8 https://github.com/rsim/mondrian_demo/blob/master/db/foodmart.sql

connected, and a potential query does not return any result. The user might be presented with an alternative question by simplifying the question to an answerable query. For instance, the question What were the costs for bread in January 1997? refers to expenses regarding bread for a specific time period. When a fact table about expenses is not connected to specific products, the user might be asked back: We cannot provide the costs for bread. Would you rather be presented with the costs for all products for January 1997?.

3.5.2 Rules. As mentioned, our approach is designed to be agnostic regarding the actual underlying data source. There are several prerequisites required for our transformation process, but in general we extract the essential information from the (information schema of the) data source and add additional information by using APIs, such as datamuse. However, there are questions that require more detailed knowledge about the data source. For instance, the question How many products were in stock [...]? might be answered directly from the data source if the information on items in stock is stored explicitly. But it could also be the case that this information must be calculated from the existing information. For this example, the data source might provide the information on items ordered and items sold. Items in stock are then calculated by subtracting the number of items sold from the number of items ordered.

To solve this type of issue, we consider explicit rules similar to MDX (Multidimensional Expressions – a query language for OLAP data sources, based on SQL). There, so-called calculated members can be created using various expressions and operators. A member can be stored in the OLAP cube to be available in further queries.

We are aware of the fact that this type of rule cannot be retrieved from the (unknown) data source following our agnostic approach. Therefore, this extension builds upon the extension on user feedback as described above. We consider two different ways of retrieving additional rules:
    (1) A user provides additional information about the data source when the data source is initialized. This requires the user to have further knowledge about the information schema and the contained data.
    (2) The rules are retrieved as part of the user feedback when a question cannot be answered. The unknown term could be presented to the user together with (parts of) the information schema.
In either case, the user feedback could be retrieved in a visual way. The user might mark parts of the schema and add operators or comments. Thereby, the user is not required to have detailed knowledge about SQL or the desired rule language.

4 EVALUATION

4.1 Domain Application
Data stored in an OLAP-like structure is especially useful for BI scenarios. A (filtered) fact can be queried, and the additional dimensions allow the user to drill up or down along the (hierarchically) structured data. Thus, NLIs for OLAP databases are an effective instrument for data scientists in marketing or other business contexts. We therefore initially apply our approach to a business scenario and utilize data provided by Foodmart, structured in a snowflake schema8. A similar data source is the
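The idea of calculated-member-style rules described in Section 3.5.2 can be sketched with a simple rule store; the rule syntax, the function names and the expression for in stock are our own illustration under the assumptions above, not a finished design:

```python
# Sketch of a rule store for pre-defined concepts ("calculated members").
# Rule syntax and all names are illustrative assumptions.

rules = {
    # concept -> SQL expression over existing fact-table attributes
    "in stock": "(units_ordered - units_shipped)",
}

def expand_concept(phrase):
    """Replace a concept from the NL question by its defining expression;
    phrases without a rule are returned unchanged."""
    return rules.get(phrase.lower(), phrase)

# During SELECT construction, the expanded expression can be aggregated:
select = "SELECT SUM" + expand_concept("in stock") + " FROM inventory_fact"
```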
TPC-H benchmark9. It also comprises several fact and dimension tables, including speaking attribute and table names. Our approach is therefore directly applicable to this benchmark. But due to the size of the dataset and for illustration purposes, we utilized the Foodmart data source10. To be able to test and evaluate our system, we created a set of questions involving several challenging SQL operators and functions. The benchmark is described in more detail in the next section.

4.2 Benchmark
We are not aware of a benchmark containing questions and queries based on an OLAP data source. Hence, we are not able to present a qualitative evaluation in terms of accuracy or precision in comparison to other approaches. However, we are eager to create and provide such a benchmark for future research approaches. Therefore, we utilized the Foodmart data source, where data is stored in a snowflake schema with first- and second-level dimensions. The data source contains three different fact tables – holding data about sales, inventory and expense facts – and 17 dimension tables – holding additional information about stores, regions, customers, products etc. Each fact table is connected to at least four dimension tables directly, and several dimension tables are connected at the second level via other dimension tables.

We identified different question types and levels of difficulty for a NL interface to OLAP. There are simple questions with restrictions on one dimension table as well as complicated questions requiring calculations, subqueries and/or several SQL functions/operators. For the development process and assessment, we collected over 40 different questions and grouped them according to difficulty levels. To explain the different levels, we adopt the approach presented by Affolter et al. [1]. The authors classify questions by the required usage of operators and functions in the respective formal query. The following operators are applicable to our use case: Join, Filter, Aggregation, Ordering, Subquery, Concept. Concept denotes a pre-defined term, e.g. "great movie" or "in stock" (when this information must be calculated and is not stored in the data source explicitly). Using these challenges, we identified four different levels:

Level 1 refers to rather simple questions with a simple constraint on one or more dimensions, like How many different products were sold in October 1998?. Here, only a COUNT on one attribute and a constraint in one dimension table are required. The queries contain the challenges (J, A, F).

Level 2 refers to questions containing aggregations and grouping, like Show for each brand the total sales in 1998. The word each indicates the use of an aggregation in the SELECT clause and a grouping function for the respective attribute associated with brand. The queries contain the challenges (J, A, F, O).

Level 3 refers to queries containing subqueries for compared results or comparisons using <, >, ==, like In which month was less bread sold than in January 1997?. For this type of question, a subquery containing the data for January 1997 is constructed and compared using the operator < and the HAVING clause. The queries contain the challenges (J, A, S, F, O).

Level 4 refers to complicated queries containing calculations on attributes or pre-defined constraints on data terms, such as

9 http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
10 The queries described in Section 2.4 of the TPC-H specification might reflect real-world business questions, but cannot be summarized in one NL question – which our approach is aiming at.

How many units by Jeffers were in stock in the first month of 1997? – in case the term in stock cannot be mapped directly to an attribute of the data source. In this case, in stock must be calculated from available fields, such as ordered and shipped or similar attributes. This type of challenge is defined as Concept. Therefore, the queries contain the challenges (J, A, F, C) and optionally (O, S).

Table 3 shows sample questions and the respective challenges for each difficulty level. The full dataset containing over 40 different questions based on the Foodmart data source is available for download11. The dataset includes three columns, with the NL question in the first column, the difficulty level in the second column, and the corresponding SQL query in the third column. The dataset contains 14 questions of level 1, 12 questions of level 2, 11 questions of level 3 and 4 questions of level 4.

In the current version, our prototype is able to process questions of levels 1-3 directly and questions of level 4 for specific cases, such as calculating in stock from the attributes units_ordered and units_shipped in the Foodmart data source. Regarding the benchmark, this means that our approach is able to answer 93% of the questions. Of course, we are aware that these results are not statistically firm – neither in terms of the number of questions nor regarding the subjective nature of the benchmark. Unfortunately, none of the competing systems presented in Section 2 is available as an API or in any other form. Therefore, we are not able to compare the quality of the results of our approach to other approaches. We aim to encourage other researchers to utilize and expand our benchmark, so that a qualitative evaluation can be enabled in the future. Still, the current results show that our approach is able to construct SQL queries for a good number of NL questions of different difficulty levels facing several challenges.

As described in Section 3.5, we plan to integrate rules to be able to perform calculations and constraints as required by complicated questions of level 4. Several concepts might be identified automatically or added manually by the respective user. This applies to concepts like important, in stock or great (in terms of good reviews). Another special case is What are the total sales in the business year 1997/98?. Here, business year might refer to a time period from July 1997 to June 1998. The data source does not contain sales information in terms of business years. Therefore, the query must contain constraints from pre-defined rules where concrete terms are mapped to data item constraints.

4.3 Portability of the Benchmark's Domain to other Domains
Initially, we applied our approach to an existing data source structured in a snowflake schema from the domain of business data. This type of data design makes sense when you need insights into your data and want to receive metrics or ratios, e.g. for questions that start with how many/much or what were the total xy for ab. For this type of question, it is a reasonable approach to structure the respective data in fact and dimension tables and make the data source accessible via a QA application. In this way, business analysts or marketing strategists are able to get insight into their key data without being forced to have knowledge of any formal query language. Although we utilized the Foodmart data source for our first prototype, the data source and the data's domain can easily be replaced respectively changed. For instance, similar questions as collected in our benchmark are applicable

11 https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/queryDataset.tsv
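For illustration, a Level 1 query with the challenges (J, A, F) can be executed end-to-end on a toy in-memory schema loosely modeled on Foodmart; the table contents below are invented for the example:

```python
import sqlite3

# Toy schema loosely modeled on Foodmart; data values are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_by_day (time_id INTEGER, the_month TEXT, the_year TEXT);
CREATE TABLE sales_fact (product_id INTEGER, time_id INTEGER);
INSERT INTO time_by_day VALUES (1,'October','1998'), (2,'November','1998');
INSERT INTO sales_fact VALUES (10,1), (11,1), (10,1), (12,2);
""")

# "How many different products were sold in October 1998?"
# J: join fact and dimension table, A: COUNT(DISTINCT), F: month/year filter.
row = con.execute("""
  SELECT COUNT(DISTINCT s.product_id)
  FROM sales_fact s
  INNER JOIN time_by_day t ON s.time_id = t.time_id
  WHERE t.the_month = 'October' AND t.the_year = '1998'
""").fetchone()
print(row[0])  # -> 2
```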
Table 3: Question examples for the Foodmart data source

Level 1: How many different customers were in the first quarter of 1997?
    J(sales_fact, customer, time_by_day)
    A(COUNT(customer_id))
    F(quarter='Q1', year='1997')

Level 2: How much bread was sold per city in 1997?
    J(sales_fact, product, product_class, store, time_by_day)
    A(COUNT(product_id), GROUP BY(store_city))
    O(COUNT(product_id))
    F(product_category='Bread', year='1997')

Level 3: In which month was less bread sold than in January 1997?
    J(sales_fact, time_by_day, product, product_class)
    A(COUNT(product_id))
    S(HAVING(COUNT(product_id)) < F(month='January', year='1997', product_category='Bread'))
    F(month='January', year='1997', product_category='Bread')
    O(COUNT(product_id))

Level 4: How many units by Jeffers were in stock in the first month of 1997?
    J(inventory_fact, product, time_by_day)
    A(SUM(units_ordered), SUM(units_shipped))
    F(brand_name='Jeffers', month='January', year='1997')
    C(in stock = units_ordered - units_shipped)
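The Concept challenge from the Level 4 row of Table 3 can be reproduced on a toy inventory_fact table; the data values and the omission of the brand and time filters are simplifications for the example:

```python
import sqlite3

# "in stock" computed via the concept rule units_ordered - units_shipped.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE inventory_fact (product_id INTEGER,
                             units_ordered INTEGER, units_shipped INTEGER);
INSERT INTO inventory_fact VALUES (1, 50, 30), (1, 20, 15), (2, 10, 10);
""")

# Units in stock for product 1 (brand/time filters omitted for brevity):
row = con.execute("""
  SELECT SUM(units_ordered) - SUM(units_shipped)
  FROM inventory_fact WHERE product_id = 1
""").fetchone()
print(row[0])  # -> 25
```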


for data about issue trackers and software projects, such as How many bug fixes were committed in January 2019 per developer?.

5 SUMMARY & OUTLOOK
In this paper, we presented our approach in the research field of NLIDB, specifically for OLAP data sources. We designed our prototype system to be as agnostic as possible regarding the underlying data source. Required information is extracted from the schema, and the datamuse API is utilized to add synonyms to the data, which are used to map phrases from the NL question to terms of the data source. In our approach, the NL question is analyzed and an intermediate representation of the question is created. After several processing steps, this representation contains different types of tuples which are utilized in the subsequent construction of the SQL query. Depending on its type, a tuple is used to create the SELECT, FROM, or WHERE clause. Specific terms identified in the NL question indicate the use of SQL operators and functions, such as SUM, MAX, GROUP BY, or LIMIT.

With this approach, we are able to answer questions of different difficulty levels. In the absence of an OLAP benchmark, we created a dataset containing more than 40 questions based on the Foodmart data source. The benchmark is available for download, and we are planning to collect more questions to be able to evaluate our approach and compare it to other approaches.

Future work includes the implementation of the extensions described in Section 3.5. The first and foremost prerequisite for

more descriptive hints about the schema, if essential parts are missing (such as speaking attribute or table names). The implementation of rules – as described in Section 3.5.2 – is a desired enhancement of our approach. It enables the construction of even more complex queries and gives answers to questions of the highest difficulty level.

Overall, our approach is already able to construct SQL queries for rather complex NL questions. The addressed GUI will make the prototype available to users, and the enhancements will complete our approach in terms of query complexity.

6 ACKNOWLEDGEMENTS
This work was partially funded by the German Research Foundation (DFG) under grant no. SA782/26.

REFERENCES
[1] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28, 5 (Oct 2019), 793–819. https://doi.org/10.1007/s00778-019-00567-8
[2] Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases – an introduction. Natural Language Engineering 1, 1 (1995), 29–81.
[3] Key-Sun Choi, Luis Espinosa Anke, Thierry Declerck, Dagmar Gromann, Jin-Dong Kim, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem, and Ricardo Usbeck (Eds.). 2018. Joint Proceedings of ISWC 2018 Workshops SemDeep-4 and NLIWOD-4. Workshop on Semantic Deep Learning (SemDeep-2018). Vol. 2241. CEUR-WS.
[4] Lan Jiang and Felix Naumann. 2019. Holistic primary key and foreign key detection. Journal of Intelligent Information Systems (2019), 1–23.
[5] F. Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language
these extensions is the design and development of a graphical               Interface for Relational Databases. Proc. VLDB Endow. 8, 1 (Sept. 2014), 73–84.
user interface (GUI). On the one hand, our intended user frontend           https://doi.org/10.14778/2735461.2735468
                                                                        [6] Giuseppe M. Mazzeo and Carlo Zaniolo. 2016. Answering Controlled Natural
is targeted on being interactive regarding the presented results            Language Questions on RDF Knowledge Bases. In Proceedings of the 19th Inter-
from the data source. On the other hand, further developments               national Conference on Extending Database Technology, EDBT 2016, Bordeaux,
                                                                            France, March 15-16, 2016, Bordeaux, France, March 15-16, 2016. 608–611.
aim at an conversational interface. The user is either presented        [7] M. Asif Naeem and Imran Sarwar Bajwa. 2012. Generating OLAP Queries
with concrete results from the data source, or asked for feedback           from Natural Language Specification. In ICACCI ’12. ACM, 768–773. https:
to give a relevant answer for an initially unclear question. For            //doi.org/10.1145/2345396.2345522
                                                                        [8] Jaydeep Sen, Fatma Ozcan, Abdul Quamar, Greg Stager, Ashish Mittal, Manasa
further data insights, the user is enabled to request more informa-         Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan. 2019.
tion by asking questions that build upon the preceding questions.           Natural Language Querying of Complex Business Intelligence Queries. In
Furthermore, the user frontend should provide the user with a               SIGMOD ’19. ACM, New York, NY, USA, 1997–2000. https://doi.org/10.1145/
                                                                            3299869.3320248
facility to add an own data source and ask questions on it. The         [9] Nadine Steinmetz, Ann-Katrin Arning, and Kai-Uwe Sattler. 2019. From
data source might be provided as a bunch of csv files or an SQL             Natural Language Questions to SPARQL Queries: A Pattern-based Approach.
                                                                            In Datenbanksysteme für Business, Technologie und Web (BTW 2019). 289–308.
file or as an URL to a distant database. The data source is analyzed   [10] Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, and
regarding the required information for our approach. Eventually             Dongyan Zhao. 2014. Natural Language Question Answering over RDF: A
the user is involved to extract the required information or give            Graph Data Driven Approach. In SIGMOD ’14. 313–324.