ceur-ws.org/Vol-2399/paper14.pdf

A Cross-Domain Natural Language Interface to Databases Using an Adversarial Text Method

Wenlu Wang† (supervised by Wei-Shinn Ku† and Haixun Wang‡)
Auburn University† and WeWork Research‡
wenluwang@auburn.edu

ABSTRACT

A natural language interface (NLI) to databases is an interface that allows natural language queries to be executed by database management systems (DBMSs). However, most NLIs are domain-specific because of the complexity of natural language questions, and an NLI trained on one domain is hard to transfer to another because of discrepancies between ontologies. Inspired by the idea of stripping domain-specific information out of natural language questions, we propose a cross-domain NLI with a general-purpose question tagging strategy and a multi-language neural translation model. Our question tagging strategy extracts the "skeleton" of a question, which represents its semantic structure in any domain. With question tagging, every domain is handled equally by a single multi-language neural translation model. Our preliminary experiments show that our multi-domain model has excellent cross-domain transferability.

1. INTRODUCTION

Relational databases are widely adopted in real-world applications [15, 14]. However, operating a DBMS requires some knowledge of query languages, which motivated the study of NLIs to databases [1], whose purpose is to make DBMSs operable by anyone without training.

The challenge of NLIs to databases lies in the discrepancies between ontologies, which make a general-purpose NLI hard to achieve. Most existing general-purpose NLIs exploit syntax-guided decoding and require the grammar of the structured queries (a domain-specific grammar) as part of the model. Such a model cannot be shared between different grammars, whereas we propose a general-purpose model in which different query types and different domains share the same components.

To overcome the obstacles of generalizing one NLI model to different or even unseen domains, we perform a pre-processing step inspired by the strategy of separating domain-specific information from the question [22]. By detaching domain-specific data elements, the NLI model is able to focus on the semantic meaning of the natural language question while remaining agnostic to the domain, which facilitates cross-domain generalization.

We first "strip" a natural language question (shown in Figure 1), treating each query type (SQL and lambda expression in our examples) equally, and then translate the tagged question into a structured query. "Stripping" means enclosing each phrase that describes a data element (a table, column, value, keyword, etc.) appearing in the query: a symbol (⟨k1⟩, ⟨v1⟩, etc.) representing the type and index of the data element is inserted in front of the phrase, and an "end of element" symbol (⟨eoe⟩) is inserted at the end of the phrase. Figure 1 shows two query types (lambda expression and SQL); k represents a column field, a table name, or a keyword, and v represents a value.

(1) Question:        Which cities are located in Virginia ?
    Query:           city(A), location(A, B), const(B, stateid("Virginia"))
    Tagged question: ⟨lambda⟩ Which ⟨k1⟩ cities ⟨eoe⟩ are ⟨k2⟩ located in ⟨eoe⟩ ⟨v2⟩ Virginia ⟨eoe⟩ ?
    Tagged query:    k1(A), k2(A, B), const(B, stateid(v2))

(2) Question:        Which movies were scheduled to release on May 19 2019 ?
    Query:           SELECT movie WHERE release date = May 19 2019
    Tagged question: ⟨SQL⟩ Which ⟨k1⟩ movies ⟨eoe⟩ were scheduled ⟨k2⟩ to release on ⟨eoe⟩ ⟨v2⟩ May 19 2019 ⟨eoe⟩ ?
    Tagged query:    SELECT k1 WHERE k2 = v2

Figure 1: Two types of queries (SQL and lambda expression) with their corresponding natural language questions.
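As an illustration of the stripping step, the tagging of Figure 1's example (1) can be sketched in a few lines. The alignment between question phrases and data elements is assumed to be given here (the paper derives it with the data element detector and adversarial text method of later sections); all function and variable names are ours and merely illustrative, and ASCII `<...>` stands in for the paper's ⟨...⟩ symbols.

```python
# Minimal sketch of the symbol-insertion ("strip") step for Figure 1(1),
# assuming the phrase-to-data-element alignment is already known.

def strip_question(question, query, alignment, query_type):
    """Enclose each aligned phrase with a type/index symbol and <eoe>,
    replace the data element in the query with the same symbol, and
    prefix the query type token."""
    tagged_q, tagged_p = question, query
    for symbol, (phrase, element) in alignment.items():
        tagged_q = tagged_q.replace(phrase, f"<{symbol}> {phrase} <eoe>")
        tagged_p = tagged_p.replace(element, symbol)
    return f"<{query_type}> {tagged_q}", tagged_p

alignment = {
    "k1": ("cities", "city"),          # phrase -> table name
    "k2": ("located in", "location"),  # phrase -> grammar keyword
    "v2": ("Virginia", '"Virginia"'),  # phrase -> column value
}
q, p = strip_question(
    "Which cities are located in Virginia ?",
    'city(A), location(A, B), const(B, stateid("Virginia"))',
    alignment, "lambda")
print(q)  # <lambda> Which <k1> cities <eoe> are <k2> located in <eoe> <v2> Virginia <eoe> ?
print(p)  # k1(A), k2(A, B), const(B, stateid(v2))
```

Note that naive `str.replace` suffices only when each phrase occurs once; a real tagger would work on token spans.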
Another challenge of our cross-domain task is handling different query types. The symbol insertion strategy above handles questions of different types equally but fails to differentiate them. Inspired by Google's multilingual translation model [11], in which an artificial token introduced at the beginning of the input sentence indicates the target language, we prefix a query type symbol (e.g., ⟨SQL⟩, ⟨lambda⟩) to indicate the target query type the NLI model should convert to. For instance, consider the following question -> SQL pair:

    Which is the highest score? -> SELECT MAX(score)

It will be modified to:

    ⟨SQL⟩ Which is the highest score? -> SELECT MAX(score)

This approach only needs to prefix one additional token; our preliminary experiments validate that it is the simplest yet effective approach.

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

The core design of our symbol insertion strategy lies in how to identify the phrase that describes a data element appearing in the corresponding query. The phrase might not be the exact words of the data element: in Figure 1(2), the data element "release date" is described as "to release on" in the question. Inspired by gradient-based adversarial text methods, we propose an adversarial method built on a data element detector: given a natural language question q and a data element e, the data element detector predicts whether e is mentioned in q.

Figure 1 presents two examples. In example (1), the question ("question" implies a natural language question throughout this paper) is converted to a lambda expression; in example (2), it is converted to a SQL query.

[Figure 2 (diagram): the questions q1, q2 of Figure 1 pass through the Data Element Detector (BC) and the Adversarial Text Method, yielding the tagged questions q1', q2'; a seq2seq model translates them into tagged queries p1', p2' (e.g., SELECT k1 WHERE k2 EQUAL v2), and symbol replacement recovers the final queries p1, p2 (e.g., SELECT movie WHERE release date EQUAL May 19 2019).]

Figure 2: An example of the cross-domain framework corresponding to Figure 1; different query types are treated equally.

2. ADVERSARIAL TEXT METHOD

It has been demonstrated that adding carefully crafted small noise can fool deep neural network models into wrong predictions while the noise makes no noticeable visual difference [4]. Most adversarial attack methods on text [12, 19, 10] try to perturb the features (e.g., words, characters, and phrases) that are the most influential on the predictions. Inspired by gradient-based adversarial text attacks [5], we propose our own solution to identify the position of a data element in a question.

3. DESIGN

3.1 Overview

Given a (question, query) pair, our core methodology is to insert pre-designed symbols that enclose the data elements mentioned in the question, so that every sample (of any domain or query type) is handled equally.
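To make the methodology concrete, a toy round trip through the framework (tag the question, translate it, then restore the data elements) might look as follows. The `translate` stub merely memorizes the running SQL example and stands in for the trained seq2seq model; every name here is illustrative, not the authors' implementation.

```python
# Toy end-to-end flow: tagged question q' -> (stubbed) seq2seq -> tagged
# query p' -> symbol replacement -> final query p.

def untag(tagged_query, symbol_table):
    """Final step: convert p' back to p by substituting each symbol."""
    for symbol, element in symbol_table.items():
        tagged_query = tagged_query.replace(symbol, element)
    return tagged_query

def translate(tagged_question):
    """Stub for the multi-domain seq2seq model: p' = seq2seq(q')."""
    lookup = {  # memorized toy mapping for the running SQL example
        "<SQL> Which <k1> movies <eoe> were scheduled "
        "<k2> to release on <eoe> <v2> May 19 2019 <eoe> ?":
            "SELECT k1 WHERE k2 = v2",
    }
    return lookup[tagged_question]

symbols = {"k1": "movie", "k2": "release date", "v2": "May 19 2019"}
p_prime = translate("<SQL> Which <k1> movies <eoe> were scheduled "
                    "<k2> to release on <eoe> <v2> May 19 2019 <eoe> ?")
final = untag(p_prime, symbols)
print(final)  # SELECT movie WHERE release date = May 19 2019
```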
Figure 2 shows the framework of our approach bottom-up:

1. Build a binary classifier BC as a data element detector to predict whether a data element e appears in the query p corresponding to a question q, based on the semantic meaning of the question. BC takes q and e as inputs without referencing p.

2. Inspired by [5], search for the most influential phrase in the question using gradient-based adversarial text methods. We refer to "the most influential phrase" as the phrase that, in theory, describes the data element e.

3. Insert symbols in q to enclose the phrases that describe the data elements, producing q'. Since a query type symbol (e.g., ⟨SQL⟩) is prefixed to q', a user is able to select a desired query type.

4. Build a multi-lingual cross-domain sequence-to-sequence (seq2seq) model to translate q' to p', where p' is the query with its data elements replaced by the symbols inserted in q.

5. Replace the inserted symbols with data elements to form the original query (convert p' back to p).

Figure 2 shows two examples of (q, q', p', p) corresponding to Figure 1. In Figure 1(1), the data elements "city" and "Virginia" can be detected by string matching directly against the database, as can "movie" and "May 19 2019" in Figure 1(2). However, detecting the data elements "location" and "release date" is problematic, so we use the pre-trained binary classifier BC. If BC is well trained, it will produce positive predictions for the data elements e = ["location", "release date"]. As the true label, p is not involved in the process. After translating q' to p', where data elements are represented as symbols, we perform the final step of converting p' back to p.

3.2 Data Element Search

Data elements include table names, column fields, and column values in databases, as well as keywords in the query grammar. For example, "movies", "release date", and "May 19 2019" are data elements in the SQL query "SELECT movie WHERE release date EQUAL May 19 2019", while the other elements SELECT, WHERE, and EQUAL belong to the template of the SQL sketch. In "city(A), location(A, B), const(B, stateid(Virginia))", "city" is a table name, "stateid" is a column field, "Virginia" is a column value, and "location" is a keyword of the lambda expression grammar. The challenge is to discover all the data elements from the question; for symbol insertion, we also need to discover which phrase describes each data element. Such symbol insertion relieves the seq2seq model of the burden of identifying the data elements and lets it focus on learning the semantic structure of the question and the logic of the data elements.

We have two challenges to tackle for data element search (using Figure 2 as an example):

1. Identify whether a data element is described in the question. Ultimately, we are trying to detect all the data elements that constitute the query, and we have to infer those data elements from the natural language question based on its semantics. In q1, we need to identify all the data elements that are described in the question (e.g., "city", "location", and "Virginia"). In q2, we need to identify "release date", "movie", and "May 19 2019".

2. The phrase that describes a data element needs to be identified by its semantic meaning and contextual comprehension. In q1 of Figure 2, "located in" is identified as the most influential phrase describing the keyword "location". In q2, "to release on" is identified as the phrase that describes "release date".

To address these two challenges, we propose a general-purpose data element search strategy with two steps:

- We propose a Data Element Detector (Sec 3.2.1) for the first challenge: a binary classifier that takes a question q and a data element e as input and detects whether e is mentioned in q. As presented in Figure 2, one Data Element Detector is shared among all the domains.

- In the case of a positive prediction in the previous step, we propose an Adversarial Text Method (Sec 3.2.2) for the second challenge, which relies on the information learned by the binary classifier in the first step.
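A minimal sketch of this two-stage search: explicit data elements are found by string matching the question against candidates drawn from the database, and the rest are deferred to the binary classifier BC. The real BC is the neural detector of Sec 3.2.1; the toy stand-in below simply hard-codes two phrase hints, and all names are ours.

```python
# Two-stage data element search: literal string match first, learned
# detector (stubbed) for elements not mentioned verbatim.

def search_elements(question, candidates, detector):
    """Split candidate data elements into string-matched and detector-found."""
    q = question.lower()
    matched = [e for e in candidates if e.lower() in q]
    deferred = [e for e in candidates if e.lower() not in q]
    # fall back to the learned detector for implicit mentions
    detected = [e for e in deferred if detector(question, e)]
    return matched, detected

def toy_bc(question, element):
    """Stand-in for the trained classifier BC (hard-coded hints)."""
    hints = {"location": "located in", "release date": "to release on"}
    hint = hints.get(element)
    return hint is not None and hint in question.lower()

m, d = search_elements("Which movies were scheduled to release on May 19 2019 ?",
                       ["movie", "May 19 2019", "release date"], toy_bc)
print(m)  # ['movie', 'May 19 2019']
print(d)  # ['release date']
```

This mirrors the Figure 1(2) account: "movie" and "May 19 2019" match directly, while "release date" must be inferred semantically.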
3.2.1 Data Element Detector

We use a bi-directional attentive recurrent neural network for question understanding. For a question q composed of n tokens [q_1, ..., q_n] and a data element e composed of m tokens [e_1, ..., e_m], we use pre-trained GloVe embeddings to initialize a word embedding layer. On top of the embedding layer, we use LSTM cells to produce hidden states for each time step (each word in q). We build a similar structure for e. We denote the top-layer hidden states as

    h^q = [h^q_1, ..., h^q_n],    h^e = [h^e_1, ..., h^e_m].

We then build a bi-directional LSTM over h^e with attention over h^q. The forward direction is computed as

    \vec{d}_0 = 0,
    s_{tj} = v^T \tanh(W_0 h^q_j + W_1 h^e_t + W_2 \vec{d}_{t-1}),
    \alpha_{tj} = \exp(s_{tj}) / \sum_{j'} \exp(s_{tj'}),
    \omega_t = \sum_{j=1}^{n} \alpha_{tj} h^q_j,
    \vec{x}_t = [h^e_t ; \omega_t],
    \vec{d}_t = LSTM(\vec{x}_t, \vec{d}_{t-1}),

where W_0, W_1, W_2, and v are model parameters. Here t enumerates the time steps of e, and j enumerates the tokens of q. We compute the bi-directional output d_t = [\vec{d}_t ; \overleftarrow{d}_t] and feed it to a multi-layer perceptron for binary prediction.

3.2.2 Adversarial Text Method

Given a data element e with a positive prediction from the binary classifier, the adversarial text method searches for the phrase of the question that describes e. It proceeds as follows.

1. We have trained a Data Element Detector that takes a question q and a data element e as inputs and predicts whether e is described in q.

2. We search for the most influential phrase in q using gradient-based adversarial text methods [5]. Let \nabla L(q, e) denote the loss gradient of the Data Element Detector with q and e as inputs. There are three possible directions:

   - DeepFool [13]. We iteratively search for the optimal direction in which only a minimum perturbation is needed to affect the prediction. Theoretically, the optimal direction is (f(q, e) / ||\nabla L(q, e)||^2_2) \nabla L(q, e), where f(·) denotes the Data Element Detector.

   - Fast Gradient Method (FGM) [6]. We add noise proportional to either \nabla L(q, e) or sign(\nabla L(q, e)) to the original sample to change the prediction of the Data Element Detector. In particular, the noise for each token q_i is proportional to \partial L(q, e) / \partial q_i.

   - JSMA [16]. We calculate the Jacobian-based saliency map based on \nabla L(q, e) and perturb one token at a time, choosing the token with the highest saliency value.

   Since all these methods try to add the minimum noise that influences the prediction the most, the locations where the noise is added will be the positions of the tokens that constitute the most influential phrase, i.e., the phrase that describes the data element e.

3. We thus identify the phrase in the question where adding a small perturbation affects the prediction dramatically.

The challenge of our adversarial text method is the discreteness of the text domain: words and characters are discrete variables and thus non-differentiable. To overcome this problem, we calculate the loss gradient \nabla L of the target model with respect to the embedding of each word, where the embedding space is smooth.
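Assuming the per-token gradients ∂L(q, e)/∂q_i have already been computed by backpropagation through the detector (any autodiff framework would do; the numbers below are fabricated to simulate that), locating the most influential phrase reduces to ranking tokens by gradient norm, FGM-style, and taking the highest-scoring contiguous window. The fixed window width is a simplification of ours, not part of the paper.

```python
import numpy as np

# Sketch of phrase location from token-level gradients: per-token saliency is
# the L2 norm of dL/dq_i; the most influential phrase is the contiguous
# window with the largest total saliency.

def influential_phrase(tokens, grads, width):
    """grads: (n_tokens, embed_dim) array of dL/dq_i; returns best window."""
    saliency = np.linalg.norm(grads, axis=1)  # per-token influence
    scores = [saliency[i:i + width].sum()
              for i in range(len(tokens) - width + 1)]
    start = int(np.argmax(scores))
    return tokens[start:start + width]

tokens = ["Which", "movies", "were", "scheduled", "to", "release", "on",
          "May", "19", "2019", "?"]
rng = np.random.default_rng(0)
grads = 0.1 * rng.standard_normal((len(tokens), 4))  # fabricated gradients
grads[4:7] += 2.0  # pretend backprop found "to release on" most influential
print(influential_phrase(tokens, grads, width=3))  # ['to', 'release', 'on']
```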
3.3 Neural Machine Translation

We denote a question after symbol insertion as q' and the corresponding query after symbol replacement as p'. We train a seq2seq model to translate q' to p':

    p' = seq2seq(q')

The encoder is a stacked bi-directional multi-layer recurrent neural network (RNN), and the decoder is a one-layer attentive RNN. We use a single multilingual neural translation model for our cross-domain NLI task. We believe that, with a prefixed query type symbol (e.g., ⟨SQL⟩), a multi-domain model is able to handle different query types, with each query type treated equally.

4. RELATED WORK

NLI to databases was first formally introduced in [1]. Semantic parsing [17, 23, 9] and cross-domain semantic parsing [7, 20] have been applied to NLIs to databases; however, due to the incompatibility among different domains, the cross-domain task remains unsolved. Sketch-based solutions, first proposed in [24], have also been studied extensively: a deep model is trained to fill the slots of a sketch. An extension of the sketch-based solution [26] relies on a knowledge base to identify column values. Such a strategy is confined to a pre-defined sketch and an existing knowledge base. seq2seq models have also been exploited as translators [8, 28], which places no limitation on the query sketch.
5. PRELIMINARY EXPERIMENTS

Test Domain   Dataset     Method          Acc_qm   Acc_ex
Single        WikiSQL     Seq2SQL [28]    51.6%    60.4%
Single        WikiSQL     SQLNet [24]     61.3%    68.0%
Single        WikiSQL     TypeSQL [26]    75.4%    82.6%
Multi         WikiSQL     Ours-multi      74.5%    82.7%
Multi         OVERNIGHT   Ours-multi      76.8%    -
Multi         Geo880      Ours-multi      84.1%    -

Table 1: Comparison of models.

5.1 Evaluation

We conduct our preliminary experiments using a seq2seq model with stacked GRUs. We use query-match accuracy Acc_qm for evaluation: we match synthesized queries against the ground truth p. We also compare execution results as in [28], denoted Acc_ex, where applicable.

We jointly train our multi-domain model on WikiSQL [28], Geo880 [27], and OVERNIGHT [23]; their query types are SQL, lambda expression, and SQL, respectively (for OVERNIGHT we use the version manually converted to SQL in [22]). Some of the domains are oversampled to balance the number of training samples among all the domains. Our method is shown in Table 1 as Ours-multi. Since all the domains are trained within a single model, the accuracy of the multi-domain model does not exhibit a large improvement. However, we believe this is a model capacity issue, since the accuracies of all the domains are very close to or better than state-of-the-art performance. We observe that the seq2seq model is able to infer both SQL and lambda expressions as long as a tag (e.g., ⟨SQL⟩, ⟨lambda⟩) indicating the query type is provided.

5.2 Spatial Domain

We conduct preliminary evaluations on the spatial domain (Geo880 [27] and Restaurant [21]), which is a more difficult task since the datasets in the spatial domain are sparse. We adopt the data element detector as a spatial comprehension model and inject spatial semantics using symbol insertions. In this setting, we jointly train a multi-domain model on both the Geo880 and Restaurant training sets (Restaurant is oversampled to balance the domains) and evaluate on the test set of each separately (since both use lambda expression queries, no prefix symbol is inserted). As shown in Table 2 (we use denotation-match accuracy Acc_dm for evaluation), our method (Ours-multi) outperforms previous methods.

              Geo880                      Restaurant
Domain        Method          Acc_dm     Method        Acc_dm
Single        SEQ2TREE [2]    87.1%      PEK03 [18]    97.0%
Single        TRANX [25]      88.2%      TM00 [21]     99.6%
Single        JL16 [9]        89.3%      FKZ18 [3]     100%
Multi         Ours-multi      90.7%      Ours-multi    100%

Table 2: Comparison of models in the spatial domain.

6. REFERENCES

[1] I. Androutsopoulos, G. D. Ritchie, and P. Thanisch. Natural language interfaces to databases - an introduction. Natural Language Engineering, 1(01):29-81, 1995.
[2] L. Dong and M. Lapata. Language to logical form with neural attention. arXiv preprint arXiv:1601.01280, 2016.
[3] C. Finegan-Dollak, J. K. Kummerfeld, L. Zhang, K. Ramanathan, S. Sadasivam, R. Zhang, and D. Radev. Improving text-to-SQL evaluation methodology. arXiv preprint arXiv:1806.09029, 2018.
[4] Z. Gong, W. Wang, and W.-S. Ku. Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960, 2017.
[5] Z. Gong, W. Wang, B. Li, D. Song, and W.-S. Ku. Adversarial texts with gradient methods. arXiv preprint arXiv:1801.07175, 2018.
[6] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[7] J. Herzig and J. Berant. Neural semantic parsing over multiple knowledge-bases. arXiv preprint arXiv:1702.01569, 2017.
[8] S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer. Learning a neural semantic parser from user feedback. In ACL, volume 1, pages 963-973, 2017.
[9] R. Jia and P. Liang. Data recombination for neural semantic parsing. 2016.
[10] R. Jia and P. Liang. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.
[11] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339-351, 2017.
[12] B. Liang, H. Li, M. Su, P. Bian, X. Li, and W. Shi. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006, 2017.
[13] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, pages 2574-2582, 2016.
[14] E. Ngai, Y. Hu, Y. Wong, Y. Chen, and X. Sun. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559-569, 2011.
[15] E. W. Ngai, L. Xiu, and D. C. Chau. Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2):2592-2602, 2009.
[16] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In IEEE EuroS&P, pages 372-387, 2016.
[17] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In ACL, 2015.
[18] A. Popescu, O. Etzioni, and H. A. Kautz. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces, 2003.
[19] S. Samanta and S. Mehta. Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812, 2017.
[20] Y. Su and X. Yan. Cross-domain semantic parsing via paraphrasing. arXiv preprint arXiv:1704.05974, 2017.
[21] L. R. Tang and R. J. Mooney. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In ACL, pages 133-141, 2000.
[22] W. Wang, Y. Tian, H. Xiong, H. Wang, and W.-S. Ku. A transfer-learnable natural language interface for databases. arXiv preprint arXiv:1809.02649, 2018.
[23] Y. Wang, J. Berant, and P. Liang. Building a semantic parser overnight. In ACL, volume 1, pages 1332-1342, 2015.
[24] X. Xu, C. Liu, and D. Song. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436, 2017.
[25] P. Yin and G. Neubig. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv preprint arXiv:1810.02720, 2018.
[26] T. Yu, Z. Li, Z. Zhang, R. Zhang, and D. Radev. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. arXiv preprint arXiv:1804.09769, 2018.
[27] J. M. Zelle and R. J. Mooney. Learning to parse database queries using inductive logic programming. In AAAI, pages 1050-1055, 1996.
[28] V. Zhong, C. Xiong, and R. Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.