Question Answering on OLAP-like Data Sources Nadine Steinmetz Samar Shahabi-Ghahfarokhi Kai-Uwe Sattler Technische Universität Ilmenau Technische Universität Ilmenau Technische Universität Ilmenau Germany Germany Germany nadine.steinmetz@tu-ilmenau.de samar.shahabi-ghahfarokhi@ kus@tu-ilmenau.de tu-ilmenau.de ABSTRACT are easily to use (e.g. for drill down/rollup operations) even for Today, knowledge is mostly stored in structured data sources business users but limited in their expressiveness. Therefore, and requires to be queried by formal query languages. Natural combining question answering with OLAP seems to be a nat- language interfaces to these data sources enable users to query ural solution to provide access to analytical databases. QA for the data without knowledge about the technical details of the (relational) OLAP means query language which is particularly useful in analytics and deci- • working with star or snowflake schemas consisting of fact sion support. In this paper, we present an approach to query data and dimension tables, that is stored in an OLAP database only using natural language. • answering questions referring to facts and dimensions/di- We describe the preliminaries and transformation process from a mension values. question to the formal SQL query on a snowflake schema. The approach takes into account synonyms to identify the relevant The goal of our work is an approach to answer basic questions fact table and uses mapping techniques to appoint the correct for facts such as “What were the total sales of beverages in March attributes and dimension tables within the schema. As an exam- 2018?” but also more complex questions requiring calculations, ple, we utilized the Foodmart data source to implement the first sorting and ranking or other more advanced operations like prototype and prepared over 40 different questions to assess our “What were the total sales per product class in May 2018 compared approach and provide a benchmark for further studies. to May 2017?”. The focus of our work is the processing of textual questions – 1 INTRODUCTION a speech recognition interface is out of the scope of our work but can be easily added by using external cloud services such as Alexa The developments in the fields of speech recognition and natural Voice Service or Google Speech-to-Text. The main contributions language processing (NLP) over the last few years have fostered of our work are twofold. We present an approach for mapping many novel applications. Apart from the now nearly ubiquitous NL questions to SQL queries on a snowflake schema. And we virtual assistants like Amazon Alexa or Apple Siri in smartphones, describe a schema-agnostic technique to derive such mappings. smart speakers, and other consumer devices, research in the field In addition, we provide a set of NL questions as benchmark for of question answering (QA) has made a significant progress since the evaluation of QA systems for OLAP. IBM’s DeepQA system Watson won the quiz show Jeopardy! in The remainder of this paper is organized as follows. In Sect. 2 2011. QA systems – or in the special form of Natural Language we discuss related approaches from NLIDB and QA. The pro- Interfaces to Databases (NLIDB) [2] – provide a user-friendly cessing of questions and the mapping approach to SQL queries way to access any type of knowledge base simply by entering a is described in Sect. 3. We have implemented this approach in question formulated in natural language (NL). To answer such a prototype which is evaluated using a set of questions on the a question, the type of the question, objects and predicates as Foodmart cube. The results of this evaluation presented in Sect. 4 well as entities have to be identified by parsing and analyzing the demonstrate that our QA approach is capable of answering even sentence. Then, the question has to be mapped to a query, e.g. in complex questions. Finally, we conclude the results in Sect. 5 and SQL or SPARQL, on a database or knowledge base providing the point out to future work. necessary facts. Particularly, RDF-based knowledge bases such as DBpedia [6, 9, 10] or databases enriched by ontologies [8] have been used successfully. 2 RELATED WORK OLAP databases are used to answer multidimensional analyti- Li et al. introduced an interface (NaLIR) that transforms NL to cal queries in the business intelligence context. Typically, OLAP is SQL [5]. The interface is interactive and requests feedback from based on a multidimensional data model distinguishing between the user. Hereby, the user interacts with the interface during the facts or measures and dimensions forming a so-called hypercube. query construction process at two different stages. First, the NL In such a hypercube, dimensions span the data space and provide phrase is parsed into a lexical tree. Afterwards, the parse tree is information about the facts represented by the cells of the cube. analyzed and specific nodes are identified which can be mapped OLAP cubes are usually queried either via visual interfaces (e.g. parts of the underlying SQL schema information. At this stage, Tableau) which generate and execute database queries, or simply if the mapping fails, the first feedback is requested by the user. by standard query languages, such as SQL or MDX. While the Otherwise, the successful mappings are presented to the user. latter provides the most powerful access, particularly SQL with After all nodes are interpreted correctly (with the help of the advanced grouping features and OLAP functions is too difficult user), the parse tree is adapted according to the their system. If for non-database experts. On the other hand, visual interfaces necessary, implicit nodes can be inserted and the user is able to © 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed- give feedback on that process. The query tree – verified by the ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, user – is transformed to a SQL query containing joins, aggregate Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- functions etc. The system is designed to answer rather simple tribution 4.0 International (CC BY 4.0) questions. The authors claim, the question Return the author who has the most publications in database area is hard to answer. Also, NL questions that could also originate from a speech interface. the system relies on a high amount of feedback by the user. In our approach, the processing and mapping is performed on a The authors of the approach described above reference the complete sentence/question. field of NLIDB for their scientific solutions as the interfaces are In recent years, question answering is mostly focussed on an- designed to query (relational) databases in general. In addition, swering questions regarding an underlying semantic knowledge the research field of question answering has emerged, especially base. NL questions are transferred to a SPARQL query based on since new types of knowledge bases have been established, such the knowledge structured as RDF triples. The challenge Question as graph databases or triple stores. Our approach is designed to Answering over Linked Data (QALD) has been established in transform a NL question to SQL to be able to query a schema 2011 as part of the Extended Semantic Web Conference (ESWC). stored in a database. However, we also relate our approach to The latest challenge took place as part of the 18th International the research field of question answering, as it is based on OLAP. Semantic Web Conference (ISWC) [3]. In addition to many con- A star or snowflake schema constitutes a specific and powerful curring systems that participated in recent years, the organizers way of storing data respectively knowledge. of the challenge published all datasets containing at least 100 Naeem et al. presented an approach to generate OLAP queries questions and the corresponding SPARQL queries. As we are based on NL [7]. The question is analyzed for Part-of-Speech, focussing on data sources stored in star/snowflake schema, the linguistic dependencies and semantic roles. They utilize SBVR comparison to the systems and datasets is not applicable for our which is a vocabulary specific to business use cases. Different approach presented in this paper. parts of the NL question are mapped to different pre-defined semantic roles. The SQL query is then generated according to Table 1: Comparison of related approaches to our ap- these assigned roles. The approach seems to be able to process proach rather simple questions, but more complicated queries containing comparisons, grouping and other functions are not considered. NaLIR Naeem Sen et Tableau Our Sen et al. just recently presented an approach to build com- [5] et al. al. [8] Approach plex SQL queries using NL [8]. The authors focus on queries [7] that include subqueries and comparisons between different sub- SQL ✓ ✓ ✓ ✓ ✓ queries. Their approach is based on very few rules, but tailored OLAP ✓ ✓ ✓ to a specific benchmark containing finance data. The heart of Schema ✓ ✓ their system is an ontology containing over 150 concepts and agnostic several hundred properties. Input questions are mapped to the NL ✓ ✓ ✓ ✓ ontology and then the SQL query is constructed based on the question ontological elements. As already stated above, our approach is Complex ✓ ✓ ✓ designed to be as agnostic as possible regarding the underlying queries data source. Therefore, a change of the data source is enabled without changing (parts of) the system. There are several further approaches to NLIDBs published In terms of a better understanding the contribution of our in recent years. In addition to the analytical approaches as de- approach, we compare related systems using the following char- scribed above, there are systems that transform natural language acteristics: to SQL using neural networks. But as we also chose an analytical • data source is OLAP-like and the approach constructs SQL approach, these systems are out of scope in terms of related work. queries based on facts and dimensions Just recently, a survey has been published comparing NLIs for • the approach is agnostic regarding the schema of the data databases [1]. The authors compare 24 interfaces that transform source NL to a formal query language (namely SQL or SPARQL). The • NL questions are supported comparison is based on a set of 10 NL questions that have been • complex queries containing grouping, subqueries are sup- constructed by means of different challenges that developers are ported facing when transforming the input into the respective query. Table 1 shows the comparison of our approach to related ap- Unfortunately, none of the presented systems are based on OLAP- proaches regarding the defined characteristics. Obviously, our like data sources, but some are transforming NL to SQL queries approach is bridging the gap between OLAP-like data sources - as we describe those systems above. For the evaluation of our and NL questions in a schema agnostic way – and is also able to approach, we adopt the categorization of questions by means of answer complex questions. query challenges, as described in detail in Section 4. In terms of Business Intelligence (BI) scenarios, Tableau is an 3 METHOD established software tool1 . It focusses on querying business data The proposed approach considers any given knowledge base and visualizing the data in a very intuitive way. The user can eas- structured as a snowflake schema having one or more fact tables ily manipulate the visualization and create different perspectives. and several dimension tables. Only one table of a dimension is The user can query the data using NL, but an input is immediately directly connected to the fact table. Thereby, the length of the mapped to the underlying data source respectively pre-defined longest path between a fact table and the deepest dimension table terms and the user is provided with concrete items of the data can be long and at least pmax > 1. source – functions, relations, attributes or concrete data terms. For our approach, we utilized the Foodmart data source2 – as After choosing one of the suggestions, the user can complete or described further in Section 4. In the schema of the Foodmart refine the input query. In contrast to that, we focus on complete data source the longest path from the fact table to a dimension 1 https://www.tableau.com/ 2 https://github.com/rsim/mondrian_demo/blob/master/db/foodmart.sql table is pmax = 2. Therefore, and also for simplification reasons Besides the actual data source containing the facts and dimen- in describing our approach, we assume a snowflake schema with sions, we only utilize an additional data source having stored the pmax = 2. Hence, we utilize the term first level dimension table for synonyms, extracted schema information and SQL vocabulary. the tables directly connected to the fact table (p = 1) and second The concrete extraction of information and further knowledge level dimension table for dimension tables that are connected via required for processing NL questions is described in detail in the one dimension table in between (p = 2). following sections. However, our approach is also able to handle snowflake sche- 3.2.1 Schema information. Provided that the name(s) of the mas having pmax > 2. fact table(s) are known, we extract and store further required There are certain preliminaries for the analysis of the under- information from the information schema of the database. This lying data. We extract information from the schema and use it as includes: pre-knowledge for the queries. In addition, we consider further knowledge about the specific data and general language trans- • names of dimension tables formation issues. The NL questions are processed based on this • join attributes preparative information and as a result a formal SQL query is • names of fact attributes created to retrieve the requested data from the snowflake schema • synonyms for table and attribute names and present it to the user. More details about our approach are Information about foreign keys is either given by the database described in the following sections. constraints or automatically detected by an algorithm, such as For our approach, some information about the underlying [4]. In this way, the join attributes for fact and dimension tables data source is required to be able to transform the NL question are detected. All remaining attributes from the fact table(s) are to a formal query. Most of the required information might be assumed to be fact attributes containing the actual measures and provided by the data owner in database constraints, such as numbers. foreign keys, fact and dimension table names. Our approach is In addition, we analyze and clean the attribute and table names able to handle unknown data sources and extract the required and store them ready to be mapped by terms of the NL question. information from the given tables. In this way, we are able to However, the NL question might not mention the exact (anno- handle data sources given as a set of csv files or an undocumented tated) name of the affected attribute or fact table. Therefore, we sql database provided by an URL. In the following sections we utilize the datamuse API3 to add synonyms for attributes and describe the required process steps. tables. 3.1 Preliminaries 3.2.2 SQL operators and functions. The required application of SQL operators and functions in the formal query mostly can The first and foremost prerequisite for our approach is having be identified by certain keywords in the NL question of any the data in a snowflake schema. In addition, we require the infor- domain. These keywords are assumed to be static throughout mation about which tables in a given database are the fact tables. questions of different domains. Therefore, we utilize a list of Obviously, all other tables are assumed to be dimension tables. potential keywords to be able to map them to the correct SQL If no schema containing foreign key constraints is given, we operator resp. function. For instance, the keyword total indicates identify the constraints using an automatic detection algorithm, the use of the SQL function SUM for the requested attribute. We such as described in [4]. For the mapping of NL phrases to the provide the complete mapping list as download4 . database, the tables and attributes require to have descriptive names – in the best case. For synthetically created data sources 3.2.3 Thesaurus. Time information can be stored in various this might not be true. We identified several solutions for that ways. In return, time information can be referenced in various challenge: ways in NL. For instance, the question How much bread was sold • automatic detection of type of data based on the containing in the first quarter of 2010? is targeted on all sales in the first data – this is effective for common data, such as person three months – namely January, February, and March. In this names, locations or dates case, we either have to filter for dates having 1, 2 or 3 in the • analysis of sample queries and respective NL questions – month field of the date or the date is stored having the month this requires user input information separately and we have to filter on this attribute. • manual annotation of table names and attributes – this In addition, the time information contains a separate field con- also requires user input taining the respective quarter. In either way, we have to analyze the underlying data source and how date information is stored. For the description of our QA approach, we assume to have Furthermore, it is required to add synonymous information of descriptive names for tables and attributes – as it is the case for how dates might be referenced in NL. Therefore, we maintain a popular benchmarks, such as TPC-H or TPC-DS, but also for the thesaurus as a mapping table for date information. Dependent on Foodmart data source. The next section describes the knowledge the underlying data source, terms like first quarter are mapped we extract from an unknown database to satisfy our prerequisites. to the data source, as e.g. to Q1. 3.2 Knowledge about Data 3.3 Processing the Natural Language We extract the following information required to map NL to a The NL question is preprocessed and mapped to our thesaurus snowflake schema: and the underlying data source. After preprocessing and extrac- • information about the schema – names of fact and di- tion of specific terms, the following checks are executed: mension tables, join attributes and fact attributes, plus • which SQL operator – "total" -> SUM, "average"->AVG synonyms • mapping list to identify NL phrases for SQL operators 3 http://www.datamuse.com/api/ 4 https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/sql_operators.tsv • thesaurus for general synonyms Table 2: POS Tags that are taken into account for the ex- with k being an extracted term which can consist of one or more traction of terms from the NL question words – which results in one or more POS tags for each extracted term. Tag(s) Type Example In the next steps, these terms are checked for query-related CD cardinal numeral 1998 operators and the schema of the data source. The list of tuples is FW foreign word les checked item per item and if a check is positive, the respective JJ* adjective total tuple is replaced with a new tuple depending on the different NN* noun sales, costs checks as described in detail in the following sections. PDT predeterminer both sales and costs 3.3.2 Operators. Based on what the question asks for, several POS possessive ending January’s SQL operators might be relevant to be used in the query. We RB* adverb good, better, best utilize a mapping list of manually identified NL phrases that VB* verb sold indicate the use of a SQL operator. For instance, the NL phrase WRB wh-adverb when, how maximum suggests to use the MAX operator in the SELECT clause or different indicates the use of DISTINCT. If a term from the NL question is matching a term from our mapping list, the term is • which fact table – e.g. "sales"-> sales_fact replaced with the respective SQL operator in the intermediate • which dimension table – "which store [...]?", "In which city representation desc of the question. The tuple ti (containing term [...]?" and POS tag(s)) in desc is replaced by the operator/function si as • which attribute from fact table – "How many units where a string: ordered [...?]" -> units_ordered from inventory_fact • is there a term from a dimension table? – "How much si ∈ S (3) bread [...]?" -> "Bread" is found as data term in the field for with S being the set of all SQL operators and functions. product categories within the product_class dimension After these checks, the NL question is presented as an interme- 3.3.3 Fact Tables. As a preliminary we know the names of the diate formal description. This formal description is then trans- fact tables in the snowflake schema. To identify the correct fact formed to the actual SQL query5 . table (in case there are more than one fact table in the schema) we check the remaining terms in the tuples if they are matching 3.3.1 Preprocessing. The NL question is tagged for Part-of- a fact table’s name. As it is not very likely having the exact fact Speech (POS) using the NLTK library6 and the Brown tagset7 . tables’ name in the question, we utilize the datamuse API (as Subsequently, the respective terms are extracted from the NL described in Section 3.2.1) to map a term and identify the correct question according to a predefined white list of POS tags. Table fact table. For our example, the datamuse API provides the term 2 shows the white list of POS tags and respective sample words sold as one of the synonyms for sales. for each tag. The information about the correct fact table is stored sepa- If there are continuous words having a noun tag (NN, NNS, rately in addition to the intermediate representation desc of the NNP, NNPS) these words are extracted as coherent terms. Words question. marked with other tags are extracted as single words. Words with tags not contained in the white list are dismissed. In addition, 3.3.4 Dimension Tables. In some cases a question might refer further word combinations are identified, such as third month, directly to a dimension of the schema. For instance, a data source last month, how many. with localized information one dimension could be the region. In For instance, for the question that case, a question Which region locates the most stores? directly refers to the dimension region. The tuple ti of the matching term How many paper wipes were sold in the first month of is replaced with a new tuple d: 1997? the following terms are extracted: [how many],[paper wipes], di = ( ′′ ,′′ ,dim j , (key j1 , ...,key jn )) (4) [were], [sold], [first month], [1997]. with ′′ being empty strings, dim j the name of the dimension table, Subsequently, the list of extracted terms is checked for terms (key j1 , ...,key jn ) the list of join attributes between the dimension contained in our thesaurus. For our example, the term first month table and the fact table and potential second level dimension could be translated to January or 1 depending on how dates are tables. The first empty string in the tuple denotes an unspecified stored in our data source. data constraint, in terms of not having a concrete data term found After this preprocessing step, the question is represented as in the data source, such as Bread or January. The second empty a list of n tuples – the question description desc. Each tuple string stands for an unassigned attribute for the corresponding contains the extracted term (respectively its translation) and the table – there is no specific attribute affected. POS tag(s). 3.3.5 Attributes. Questions might also refer directly to at- tributes from the fact or dimension tables. For instance, a data desc = (t 1 ,t 2 , ...,tn ) (1) source about sales for different products might include a dimen- with n = number of extracted terms, sion table listing all available products and their properties. In ti = (ki , (POS 1 , ...,POSm )) (2) that case, a possible question could be What is the net weight of [...]?. The net weight for all products is stored in a field in a dimension table. The respective term in desc is mapped to the 5 A detailed transformation process of a sample NL question to a SQL query is illus- trated here: https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/sample_ transformation.pdf attribute and the tuple t is replaced with a tuple a: 6 http://www.nltk.org/ 7 http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM ai = ( ′′ ,attr j ,dim j , (key j1 , ...,key jn )) (5) with ′′ being an empty string, attr j the name of the attribute on position three (dim j ) and the join attribute of this dimension mapped to the term, dim j the name of the dimension table, table to the fact table is used in the SELECT clause. Needless to (key j1 , ...,key jn ) the list of join attributes between the dimen- mention, if desc contains more than one tuple of these type, all sion table and the fact table and potential second level dimension attributes are added and separated by a comma. As described tables. Similar to di , the empty string denotes an unspecified data in Section 3.3.2, our approach is able to add SQL operators and constraint. functions to the query. In that case, the operator/function is part of desc at the relative position as mentioned in the SQL query. For 3.3.6 Term. Concluding, for the remaining tuples t of the list instance, the question How many different products [...]? requires it is checked, if the term is stored as data in the underlying data the functions COUNT (for How many) and DISTINCT (for different). source. In this way, the information for the WHERE clause con- Both mentions are followed directly by the concrete reference straining attributes to concrete data is retrieved. For all dimension in the data source. Therefore, SQL functions are assigned to the tables views are created having the concrete data in one field and attribute referenced in the tuple directly following the function the corresponding attribute name in a second field. These views in desc. Another special case is the existence of a compare opera- are created for all dimension tables separately. Remaining terms tor in desc. In that case, the reference that the NL question asks from the NL question are queried regarding the data fields in for a comparative query is stored in desc as string compare. The these views. In case of a match, a tuple is created containing the string is set in desc at the position between the tuples stating the matching term, the attribute name, the name of the dimension comparative data constraints or before all tuples relevant for the table and the join attribute for this dimension table: comparison. In the NL question such a type of query is identified by NL phrases like compared to or comparing a and b. For instance, w i = (term,attr j ,dim j , (key j1 , ...,key jn )) (6) the question What were [...] in January 1997 compared to March with term being the data term found in the data source, attr j 1998? requires two subqueries – one for January 1997 and one the attribute name where the data term is stored in the data for March 1998. In this case, the number of required subqueries is source, dim j the dimension table where the data term is stored, counted and respective aliases created. For each alias, attributes (key j1 , ...,key jn ) the list of join attributes between the dimension as previously described are added to the SELECT clause having table and the fact table and potential second level dimension the alias as prefix. As described in Section 3.3.2 we identify NL tables. phrases that refer to SQL functions/operators, such as SUM or AVG, The previous tuple containing the matching term in the de- and store as a string si in desc. When creating the SELECT clause, scription desc is replaced by the newly created tuple w i . this function is added to attributes contained in the following tuple of si within desc. In case there are more than one attribute 3.4 Construction of the SQL Query added to the SELECT clause, the remaining attributes (including table references) are added to a separate list to be processed later After processing all checks as described above, desc consists of when the GROUP BY case is checked. processed tuples/strings only. All initial tuples that could not be processed/mapped to the data source are dismissed. In the next 3.4.2 FROM clause. The FROM clause includes the previously step, desc is transformed to the actual SQL query. identified fact table combined with INNER JOINs to the affected All tuples of desc are analyzed and the containing distinct (first and second level) dimension tables. As described in Section dimension tables are extracted and stored as a list for the cre- 3.2.1, the join attributes for fact and dimension tables are provided ation of the joins in the query. As the schema might also contain and utilized for the joins in the FROM clause of the query. For the second level dimension tables, the list is analyzed if second level special case of comparative subqueries, the FROM clause contains dimension tables are contained and the corresponding first level the subqueries for each constraint from the NL question using dimension table is missing. If so, the first level dimension table is the previously created alias. For each subquery a tuple w i exists added to the list of tables. in desc. w i contains the table, the attribute and the data term As we identified the correct fact table from the NL question, required for creating the SQL query containing a constraint on we have to check if the question can actually be answered based an attribute. on the retrieved dimension tables. Although we are able to handle different fact tables in one schema, it is possible that a fact table 3.4.3 WHERE clause. The WHERE clause is constructed based is not connected to a dimension table identified in the question. on the tuples w contained in desc. All consecutive tuples of type If that is the case, we have to exit the transformation process and w (having the first string set to a concrete data term) are added give feedback to the user that the question cannot be answered to the WHERE clause separated by AND. Each tuple w contains based on the underlying data source. the table (position 3), the attribute (position 2) and the con- crete data term (position 1) required to construct the constraint: 3.4.1 SELECT clause. The SELECT clause is created based on w[2].w[1]=’w[0]’. the tuples in desc that contain at least one empty string at the Concluding, the existence of hints for the use of the GROUP, beginning (of type a and d) - resulting from steps 3.3.4 and 3.3.5. ORDER and LIMIT clauses are checked. As already described above, these two types of tuples contain empty string(s) at the first (and second) position. The first empty 3.4.4 Further operators. For the GROUP clause a separate list string in these tuples means that no data constraint is identified. is created during the creation of the SELECT clause. If there is an The second empty string denotes an unassigned attribute for the aggregation function used in the SELECT clause, the remaining corresponding table. Hence, these two types of tuples are consid- attributes identified for selection are added to this separate list. ered to add the attributes to the SELECT clause. If only the first This list is checked for containing items and if so, these attributes string in the tuple is empty, the attribute set on second position are added to the end of the query surrounded by the grouping in the tuple (attr j ) is directly used in the SELECT clause. If also function. The ORDER clause is utilized in case there is an aggregate the second string is empty, the tuple contains the affected table function in the SELECT clause. If so, the attribute used for the aggregate function is also used for ordering the results. For the connected and a potential query does not present any result. LIMIT clause the NL question is analyzed for references to only The user might be presented with an alternative question by present a limited amount of results. For instance, the phrases first, simplifying the question to an answerable query. For instance, last, top followed by a number are hints to use a LIMIT clause the question What were the costs for bread in January 1997? refers following the corresponding number at the end of the SQL query. to expenses regarding bread for specific time period. When a fact table about expenses is not connected to specific products, the 3.5 Potential Extensions user might be called back We cannot provide the costs for bread. In the current version, our approach is able to answer a various Would you rather be presented with the costs for all products for number of different question types – described more in detail in January 1997?. Section 4. Nevertheless, we identified several issues we already 3.5.2 Rules. As mentioned, our approach is considered ag- considered as a future concept but did not integrate a solution, nostic regarding the actual underlying data source. There are yet. several prerequisites required for our transformation process, but 3.5.1 User Feedback. A NL interface might be expected to act in general we extract the essential information from the (informa- (almost) like a human when it comes to answering questions. As tion schema of the) data source and add additional information already stated above, there are questions that cannot be answered by using APIs, such as datamuse. However, there are questions directly. In the current version, our prototype would simply exit that require a more detailed knowledge about the data source. the process and give a feedback like This question cannot be For instance, the question How many products were in stock [...]? answered based on the current data source. But, we prefer to design might be answered directly from the data source if the informa- the interface in a more conversational way. If the process comes tion of items in stock are stored explicitly. But, there could also across a problem, the interface should communicate the problem be the case, that this information must be calculated from the to the user and ask for advice/help. So far, we identified two existing information. For this example, the data source might different issues when a user’s feedback would be required: provide the information of items ordered and items sold. Items • Ambiguity in stock are calculated by subtracting the number of items sold • Missing data from the number of items ordered. The first case refers to data terms within the data source that To solve this type of issue, we consider explicit rules similar occur multiple times. This might be the case for different cities to MDX (Multidimensional expressions – a query language for having the same name (ambiguity for the same type, e.g. cities) OLAP data sources, based on SQL). Thereby, so-called calculated or data terms that occur in different attributes in different tables members can be created using various expressions and operators. (ambiguity over different types - stored in different tables, e.g. a A member can be stored in the OLAP cube to be available in person and a city having the same name). In the current version, further queries. our approach groups all results when the term occurs multiple We are aware of the fact, that this type of rules cannot be times in the same field (as for different cities with the same name) retrieved from the (unknown) data source following our agnostic – different cities with the same name are treated as if they are the approach. Therefore, this extension builds upon the extension same. In the second case, our approach uses the first occurrence on user feedback as described above. We consider two different of the data term in data source when searching for table and ways of retrieving additional rules: attribute. In both cases, the best way of solving the ambiguity (1) A user provides additional information about the data would be to call for the user’s help. For the first case, the approach source when the data source is initialized. This requires is required to take additional information from the data source the user to have further knowledge about the information and present it to the user when asking for advice which term is schema and the contained data. the correct one. For instance, when a question asks for something (2) The rules are retrieved as part of the user feedback when in Richmond the user should be called back: Do you mean the a question cannot be answered. The unknown term could Richmond in San Francisco district or in Vancouver district?. In be presented to the user together with (parts of) the infor- the second case, the schema information of the data source must mation schema. be taken into account. The different tables and corresponding In either case, the user feedback could be retrieved in a visual attributes must be presented to the user, so a selection can be way. The user might mark parts of the schema and add operators communicated. For instance, the user might be asked Do you or comments. Thereby, the user is not required to have detailed mean the region with city Camacho or the employee with last knowledge about SQL or the desired rule language. name Camacho?. Obviously, this call back could be avoided as we should be aware that with Camacho a place could be referred just 4 EVALUATION by analyzing the NL question more in detail. But, our approach is as agnostic as possible regarding the data source. Although, 4.1 Domain Application we might know what a place is in the NL question, we are not Data stored in an OLAP-like structure is especially useful for aware which dimension tables store localized information. BI scenarios. A (filtered) fact can be queried and the additional For the case of missing data, the NL question is assumed to dimensions allow the user to drill up or down along the (hier- ask for terms that are not contained in the data source or a term archical) structured data. Thus, NLIs for OLAP databases are from a dimension table that is not connected to the fact table. an effective instrument for data scientists in marketing or other In the first case, an incorrectly written term in the NL question business contexts. We therefore initially apply our approach to might be the cause for the issue. Therefore, in the first step we a business scenario and utilized data provided by Foodmart and will integrate a spell-checker to be able to automatically correct structured in a snowflake schema8 . A similar data source is the words. In the second case, the NL question contains data source 8 https://github.com/rsim/mondrian_demo/blob/master/db/foodmart.sql information and reference to a fact table, but the tables are not TPC-H benchmark9 . It also comprises several fact and dimen- How many units by Jeffers were in stock in the first month of 1997? sion tables including speaking attribute and table names. Our – in case the term in stock cannot be mapped directly to an at- approach is therefore directly applicable to this benchmark. But, tribute of the data source. In this case, in stock must be calculated due to the size of the dataset and for illustration purposes, we uti- from available fields, such as ordered and shipped or similar at- lized the Foodmart data source10 . To be able to test and evaluate tributes. This type of challenge is defined as Concept. Therefore, our system, we created a set of questions containing several chal- the queries contain the challenges (J, A, F, C) and optionally (O, S). lenging SQL operators or functions. The benchmark is described more in detail in the next section. Table 3 shows sample questions and the respective challenges for each difficulty level. The full dataset containing over 40 dif- 4.2 Benchmark ferent questions based on the Foodmart data source is available We are not aware of a benchmark containing questions and as download11 . The dataset includes three columns, with the NL queries based on an OLAP data source. Hence, we are not able to question in the first column, the difficulty level in the second present a qualitative evaluation in terms of accuracy or precision column, and the corresponding SQL query in the third column. in comparison to other approaches. Though, we are eager to The dataset contains 14 questions of level 1, 12 questions of level create and provide such a benchmark for further research ap- 2, 11 questions of level 3 and 4 questions of level 4. proaches. Therefore, we utilized the Foodmart data source where In the current version, our prototype is able to process ques- data is stored in a snowflake schema with first and second level tions of level 1-3 directly and questions of level 4 for specific cases, dimensions. The data source contains three different fact tables – such as calculating in stock from the attributes units_ordered and holding data about sales, inventory and expense facts – and 17 units_shipped in the Foodmart data source. Regarding the bench- dimension tables – holding data about additional information, mark, this means that our approach is able to answer 93% of such as stores, regions, customers, products etc. Each fact table the questions. Of course, we are aware of the fact that these re- is connected to at least four dimension tables directly and sev- sults are not statistically firm – neither in terms of the number eral dimension tables are connected in second level via other of questions nor regarding the subjective nature of the bench- dimension tables. mark. Unfortunately, none of the competing systems presented We identified different question types and levels of difficulty in Section 2 is available as API or other source. Therefore, we are for a NL interface to OLAP. There are simple questions with not able to compare the quality of the results of our approach restrictions on one dimension table or complicated questions to other approaches. We aim to encourage other researchers to requiring calculations, subqueries and/or several SQL function- utilize and expand our benchmark. Thereby, a qualitative evalua- s/operators. For the development process and assessment, we tion can be enabled in the future. But, the current results show collected a number of over 40 different questions and grouped that our approach is able to construct SQL queries for a good them together according to difficulty levels. To further explain number of NL questions with different difficulty levels facing the different levels, we adopt the approach presented by Affolter several challenges. et al .[1]. The authors classify different questions by the required As described in Section 3.5, we plan to integrate rules to be usage of operators and functions for the respective formal query. able to do calculations and constraints as required by compli- The following operators are applicable for our use case: Join, cated questions of level 4. Several concepts might be identified Filter, Aggregation, Ordering, Subquery, Concept. Concept de- automatically or added manually by the respective user. This notes a pre-defined term, as e.g. “great movie” or “in stock” (when applies for concepts like important, in stock or great (in terms of this information must be calculated and is not stored in the data good reviews). Another special case is What are the total sales in source explicitly). Using these challenges, we identified four dif- the business year 1997/98?. Here, business year might refer to a ferent levels: time period from July 1997 to June 1998. The data source does not Level 1 refers to rather simple questions with a simple con- contain sales information in terms of business years. Therefore, straint of one or more dimensions like How many different prod- the query must contain constraints from pre-defined rules where ucts were sold in October 1998?. Here, only a COUNT on one attribute concrete terms are mapped to data item constraints. and a constraint in one dimension table is required. The queries contain the challenges (J, A, F). 4.3 Portability of the Benchmark’s Domain to Level 2 refers to questions containing aggregations and group- other Domains ing like Show for each brand the total sales in 1998. The word each Initially, we applied our approach to an existing data source struc- indicates the use of the an aggregation in the SELECT clause and tured in a snowflake schema from the domain of business data. a grouping function for the respective attribute associated with This type of data design makes sense when you need insights brand. The queries contain the challenges (J, A, F, O). into your data and receive metrics or ratios, as e.g. mostly ques- Level 3 refers to queries containing subqueries for compared tions that start with how many/much or what were the total xy results or comparison using <, >, == like In which month less for ab. For this type of questions, it is a reasonable approach to bread was sold than in January 1997?. For this type of question, structure the respective data in fact and dimension tables and a subquery containing the data for January 1997 is constructed make the data source accessible via a QA application. In this way, and it is compared using the operator < and the HAVING clause. business analysts or marketing strategists are able to get insight The queries contain the challenges (J, A, S, F, O). into their key data without being forced to have knowledge about Level 4 refers to complicated queries containing calculations any formal query language. Although, we utilized the Foodmart on attributes or pre-defined constraints of data terms, such as data source for our first prototype, the data source and the data’s 9 http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf domain can be easily replaced respectively changed. For instance, 10 The queries described in Section 2.4 of the TPC-H specification might reflect similar questions as collected in our benchmark are applicable real-world business questions, but cannot be summarized in one NL question - which our approach is aiming at. 11 https://dbgit.prakinf.tu-ilmenau.de/code/qa-data/blob/master/queryDataset.tsv Table 3: Question examples for the Foodmart data source Level Question Challenges How many different J(sales_fact, customer, time_by_day) 1 customers were in the A(COUNT(customer_id)) first quarter of 1997? F(quarter=’Q1’, year=’1997’) J(sales_fact, product, product_class, store, time_by_day) How much bread was A(COUNT(product_id), GROUP BY(store_city)) 2 sold per city in 1997? O(COUNT(product_id)) F(product_category=’Bread’, year=’1997’) J(sales_fact, time_by_day, product, product_class) In which month less A(COUNT(product_id) 3 bread was sold than in S(HAVING(COUNT(product_id)) < F(month=’January’, year=’1997’, product_category=’Bread’)) January 1997? F(month=’January’, year=’1997’, product_category=’Bread’) O(COUNT(product_id)) J(inventory_fact, product, time_by_day) How many units by A(SUM(units_ordered, SUM(units_shipped))) 4 Jeffers were in stock in F(brand_name=’Jeffers’, month=’January’, year=’1997’) the first month of 1997? C(in stock = units_ordered - units_shipped) for data about issue trackers and software projects, such as How more descriptive hints about the schema, if essential parts are many bug fixes were committed in January 2019 per developer?. missing (such as speaking attribute or table names). The imple- mentation of rules – as described in Section 3.5.2 – is a desired 5 SUMMARY & OUTLOOK enhancement of our approach. This enables the construction of In this paper, we presented our approach in the research field even more complex queries and give answers to questions of the of NLIDB specifically for OLAP data sources. We designed our highest difficulty level. prototype system to be as agnostic as possible regarding the un- Overall, our approach is already able to construct SQL queries derlying data source. Required information is extracted from the for rather complex NL questions. The addressed GUI will make schema and the datamuse API is utilized to add synonyms to the the prototype available for users and the enhancements will data that is used to map phrases from the NL question to terms of complete our approach in terms of complexion of queries. the data source. In our approach, the NL question analyzed and an intermediate representation of the question is created. After 6 ACKNOWLEDGEMENTS several processing steps, this representation contains different This work was partially funded by the German Research Founda- types of tuples which are utilized in the subsequent construction tion (DFG) under grant no. SA782/26. of the SQL query. Depending on the type of the tuple it is used to create the SELECT, FROM, or WHERE clause. Specific terms iden- REFERENCES tified in the NL question indicate the use of SQL operators and [1] Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal functions, such as SUM, MAX, GROUP BY, or LIMIT. 28, 5 (01 Oct 2019), 793–819. https://doi.org/10.1007/s00778-019-00567-8 With this approach, we are able to answer questions of dif- [2] Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natu- ferent difficulty levels. In default of an OLAP benchmark, we ral language interfaces to databases - an introduction. Natural Language Engineering 1, 1 (1995), 29–81. created a dataset containing more than 40 questions based on [3] Key-Sun Choi, Luis Espinosa Anke, Thierry Declerck, Dagmar Gromann, Jin- the Foodmart data source. The benchmark is available for down- Dong Kim, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem, and Ricardo Usbeck (Eds.). 2018. Joint Proceedings of ISWC 2018 Workshops SemDeep-4 and load and we are planning to collect more questions to be able to NLIWOD-4. Workshop on Semantic Deep Learning (SemDeep-2018). Vol. 2241. evaluate our approach and compare it to other approaches. CEURS. Future work includes the implementation of the extensions [4] Lan Jiang and Felix Naumann. 2019. Holistic primary key and foreign key detection. Journal of Intelligent Information Systems (2019), 1–23. described in Section 3.5. The first and foremost prerequisite for [5] F. Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language these extensions is the design and development of a graphical Interface for Relational Databases. Proc. VLDB Endow. 8, 1 (Sept. 2014), 73–84. user interface (GUI). On the one hand, our intended user frontend https://doi.org/10.14778/2735461.2735468 [6] Giuseppe M. Mazzeo and Carlo Zaniolo. 2016. Answering Controlled Natural is targeted on being interactive regarding the presented results Language Questions on RDF Knowledge Bases. In Proceedings of the 19th Inter- from the data source. On the other hand, further developments national Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, Bordeaux, France, March 15-16, 2016. 608–611. aim at an conversational interface. The user is either presented [7] M. Asif Naeem and Imran Sarwar Bajwa. 2012. Generating OLAP Queries with concrete results from the data source, or asked for feedback from Natural Language Specification. In ICACCI ’12. ACM, 768–773. https: to give a relevant answer for an initially unclear question. For //doi.org/10.1145/2345396.2345522 [8] Jaydeep Sen, Fatma Ozcan, Abdul Quamar, Greg Stager, Ashish Mittal, Manasa further data insights, the user is enabled to request more informa- Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan. 2019. tion by asking questions that build upon the preceding questions. Natural Language Querying of Complex Business Intelligence Queries. In Furthermore, the user frontend should provide the user with a SIGMOD ’19. ACM, New York, NY, USA, 1997–2000. https://doi.org/10.1145/ 3299869.3320248 facility to add an own data source and ask questions on it. The [9] Nadine Steinmetz, Ann-Katrin Arning, and Kai-Uwe Sattler. 2019. From data source might be provided as a bunch of csv files or an SQL Natural Language Questions to SPARQL Queries: A Pattern-based Approach. In Datenbanksysteme für Business, Technologie und Web (BTW 2019). 289–308. file or as an URL to a distant database. The data source is analyzed [10] Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, and regarding the required information for our approach. Eventually Dongyan Zhao. 2014. Natural Language Question Answering over RDF: A the user is involved to extract the required information or give Graph Data Driven Approach. In SIGMOD ’14. 313–324.