ceur-ws.org/Vol-2399/paper14.pdf

A Cross-Domain Natural Language Interface to Databases Using an Adversarial Text Method

Wenlu Wang† (supervised by Wei-Shinn Ku† and Haixun Wang‡)
Auburn University† and WeWork Research‡
wenluwang@auburn.edu

ABSTRACT

A natural language interface (NLI) to databases is an interface that allows natural language queries to be executed by database management systems (DBMSs). However, most NLIs are domain-specific because of the complexity of natural language questions, and an NLI trained on one domain is hard to transfer to another because of discrepancies between ontologies. Inspired by the idea of stripping domain-specific information out of natural language questions, we propose a cross-domain NLI with a general-purpose question tagging strategy and a multi-language neural translation model. Our question tagging strategy extracts the "skeleton" of a question, which represents its semantic structure in any domain. With question tagging, every domain is handled equally by a single multi-language neural translation model. Our preliminary experiments show that our multi-domain model has excellent cross-domain transferability.

1. INTRODUCTION

Relational databases are widely adopted in real-world applications [15, 14]. However, operating a DBMS requires some knowledge of query languages, which motivated the study of NLIs to databases [1], whose purpose is to make DBMSs operable by anyone without training.

The challenge of NLIs to databases lies in the discrepancies between ontologies, which make a general-purpose NLI hard to achieve. Most existing general-purpose NLIs exploit syntax-guided decoding and require the grammar of the structured queries (a domain-specific grammar) as part of the model. Such a model cannot be shared between different grammars, whereas we propose a general-purpose model in which different query types and different domains share the same components.

To overcome the obstacles of generalizing one NLI model to different or even unseen domains, we perform a pre-processing step inspired by the strategy of separating domain-specific information from the question [22]. By detaching domain-specific data elements, the NLI model is able to focus on the semantic meaning of the natural language question while remaining agnostic to the domain, which facilitates cross-domain generalization.

We first "strip" a natural language question (shown in Figure 1), treating each query type (SQL and lambda expression in our examples) equally, and then translate the tagged question into a structured query. "Stripping" means enclosing each phrase that describes a data element (a table, column, value, keyword, etc.) appearing in the query: a symbol (⟨k1⟩, ⟨v1⟩, etc.) representing the type and index of the data element is inserted in front of the phrase, and an "end of element" symbol (⟨eoe⟩) is inserted at the end of the phrase. Figure 1 shows two query types (lambda expression and SQL); k represents a column field, a table name, or a keyword, and v represents a value.

(1) Question:        Which cities are located in Virginia ?
    Query:           city(A), location(A, B), const(B, stateid("Virginia"))
    Tagged question: ⟨lambda⟩ Which ⟨k1⟩ cities ⟨eoe⟩ are ⟨k2⟩ located in ⟨eoe⟩ ⟨v2⟩ Virginia ⟨eoe⟩ ?
    Tagged query:    k1(A), k2(A, B), const(B, stateid(v2))

(2) Question:        Which movies were scheduled to release on May 19 2019 ?
    Query:           SELECT movie WHERE release date = May 19 2019
    Tagged question: ⟨SQL⟩ Which ⟨k1⟩ movies ⟨eoe⟩ were scheduled ⟨k2⟩ to release on ⟨eoe⟩ ⟨v2⟩ May 19 2019 ⟨eoe⟩ ?
    Tagged query:    SELECT k1 WHERE k2 = v2

Figure 1: Two types of queries (SQL and lambda expression) with their corresponding natural language questions.
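As an illustration of the stripping step, the tagging of Figure 1's example (1) can be sketched in a few lines. The alignment between question phrases and data elements is assumed to be given here (the paper derives it with the data element detector and adversarial text method of later sections); all function and variable names are ours and merely illustrative, and ASCII `<...>` stands in for the paper's ⟨...⟩ symbols.

```python
# Minimal sketch of the symbol-insertion ("strip") step for Figure 1(1),
# assuming the phrase-to-data-element alignment is already known.

def strip_question(question, query, alignment, query_type):
    """Enclose each aligned phrase with a type/index symbol and <eoe>,
    replace the data element in the query with the same symbol, and
    prefix the query type token."""
    tagged_q, tagged_p = question, query
    for symbol, (phrase, element) in alignment.items():
        tagged_q = tagged_q.replace(phrase, f"<{symbol}> {phrase} <eoe>")
        tagged_p = tagged_p.replace(element, symbol)
    return f"<{query_type}> {tagged_q}", tagged_p

alignment = {
    "k1": ("cities", "city"),          # phrase -> table name
    "k2": ("located in", "location"),  # phrase -> grammar keyword
    "v2": ("Virginia", '"Virginia"'),  # phrase -> column value
}
q, p = strip_question(
    "Which cities are located in Virginia ?",
    'city(A), location(A, B), const(B, stateid("Virginia"))',
    alignment, "lambda")
print(q)  # <lambda> Which <k1> cities <eoe> are <k2> located in <eoe> <v2> Virginia <eoe> ?
print(p)  # k1(A), k2(A, B), const(B, stateid(v2))
```

Note that naive `str.replace` suffices only when each phrase occurs once; a real tagger would work on token spans.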
Another challenge of our cross-domain task is handling different query types. The symbol insertion strategy above handles questions of different types equally but fails to differentiate them. Inspired by Google's multilingual translation model [11], in which an artificial token introduced at the beginning of the input sentence indicates the target language, we prefix a query type symbol (e.g., ⟨SQL⟩, ⟨lambda⟩) to indicate the target query type the NLI model should convert to. For instance, consider the following question -> SQL pair:

    Which is the highest score? -> SELECT MAX(score)

It will be modified to:

    ⟨SQL⟩ Which is the highest score? -> SELECT MAX(score)

This approach only needs to prefix one additional token; our preliminary experiments validate that it is the simplest yet effective approach.

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

The core design of our symbol insertion strategy lies in how to identify the phrase that describes a data element appearing in the corresponding query. The phrase might not be the exact words of the data element: in Figure 1(2), the data element "release date" is described as "to release on" in the question. Inspired by gradient-based adversarial text methods, we propose an adversarial method built on a data element detector: given a natural language question q and a data element e, the data element detector predicts whether e is mentioned in q.

Figure 1 presents two examples. In example (1), the question ("question" implies a natural language question throughout this paper) is converted to a lambda expression; in example (2), it is converted to a SQL query.

[Figure 2 (diagram): the questions q1, q2 of Figure 1 pass through the Data Element Detector (BC) and the Adversarial Text Method, yielding the tagged questions q1', q2'; a seq2seq model translates them into tagged queries p1', p2' (e.g., SELECT k1 WHERE k2 EQUAL v2), and symbol replacement recovers the final queries p1, p2 (e.g., SELECT movie WHERE release date EQUAL May 19 2019).]

Figure 2: An example of the cross-domain framework corresponding to Figure 1; different query types are treated equally.

2. ADVERSARIAL TEXT METHOD

It has been demonstrated that adding carefully crafted small noise can fool deep neural network models into wrong predictions while the noise makes no noticeable visual difference [4]. Most adversarial attack methods on text [12, 19, 10] try to perturb the features (e.g., words, characters, and phrases) that are the most influential on the predictions. Inspired by gradient-based adversarial text attacks [5], we propose our own solution to identify the position of a data element in a question.

3. DESIGN

3.1 Overview

Given a (question, query) pair, our core methodology is to insert pre-designed symbols that enclose the data elements mentioned in the question, so that every sample (of any domain or query type) is handled equally.
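To make the methodology concrete, a toy round trip through the framework (tag the question, translate it, then restore the data elements) might look as follows. The `translate` stub merely memorizes the running SQL example and stands in for the trained seq2seq model; every name here is illustrative, not the authors' implementation.

```python
# Toy end-to-end flow: tagged question q' -> (stubbed) seq2seq -> tagged
# query p' -> symbol replacement -> final query p.

def untag(tagged_query, symbol_table):
    """Final step: convert p' back to p by substituting each symbol."""
    for symbol, element in symbol_table.items():
        tagged_query = tagged_query.replace(symbol, element)
    return tagged_query

def translate(tagged_question):
    """Stub for the multi-domain seq2seq model: p' = seq2seq(q')."""
    lookup = {  # memorized toy mapping for the running SQL example
        "<SQL> Which <k1> movies <eoe> were scheduled "
        "<k2> to release on <eoe> <v2> May 19 2019 <eoe> ?":
            "SELECT k1 WHERE k2 = v2",
    }
    return lookup[tagged_question]

symbols = {"k1": "movie", "k2": "release date", "v2": "May 19 2019"}
p_prime = translate("<SQL> Which <k1> movies <eoe> were scheduled "
                    "<k2> to release on <eoe> <v2> May 19 2019 <eoe> ?")
final = untag(p_prime, symbols)
print(final)  # SELECT movie WHERE release date = May 19 2019
```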
Figure 2 shows the framework of our approach bottom-up:

1. Build a binary classifier BC as a data element detector to predict whether a data element e appears in the query p corresponding to a question q, based on the semantic meaning of the question. BC takes q and e as inputs without referencing p.

2. Inspired by [5], search for the most influential phrase in the question using gradient-based adversarial text methods. We refer to "the most influential phrase" as the phrase that, in theory, describes the data element e.

3. Insert symbols in q to enclose the phrases that describe the data elements, producing q'. Since a query type symbol (e.g., ⟨SQL⟩) is prefixed to q', a user is able to select a desired query type.

4. Build a multi-lingual cross-domain sequence-to-sequence (seq2seq) model to translate q' to p', where p' is the query with its data elements replaced by the symbols inserted in q.

5. Replace the inserted symbols with data elements to form the original query (convert p' back to p).

Figure 2 shows two examples of (q, q', p', p) corresponding to Figure 1. In Figure 1(1), the data elements "city" and "Virginia" can be detected by string matching directly against the database, as can "movie" and "May 19 2019" in Figure 1(2). However, detecting the data elements "location" and "release date" is problematic, so we use the pre-trained binary classifier BC. If BC is well trained, it will produce positive predictions for the data elements e = ["location", "release date"]. As the true label, p is not involved in the process. After translating q' to p', where data elements are represented as symbols, we perform the final step of converting p' back to p.

3.2 Data Element Search

Data elements include table names, column fields, and column values in databases, as well as keywords in the query grammar. For example, "movies", "release date", and "May 19 2019" are data elements in the SQL query "SELECT movie WHERE release date EQUAL May 19 2019", while the other elements SELECT, WHERE, and EQUAL belong to the template of the SQL sketch. In "city(A), location(A, B), const(B, stateid(Virginia))", "city" is a table name, "stateid" is a column field, "Virginia" is a column value, and "location" is a keyword of the lambda expression grammar. The challenge is to discover all the data elements from the question; for symbol insertion, we also need to discover which phrase describes each data element. Such symbol insertion relieves the seq2seq model of the burden of identifying the data elements and lets it focus on learning the semantic structure of the question and the logic of the data elements.

We have two challenges to tackle for data element search (using Figure 2 as an example):

1. Identify whether a data element is described in the question. Ultimately, we are trying to detect all the data elements that constitute the query, and we have to infer those data elements from the natural language question based on its semantics. In q1, we need to identify all the data elements that are described in the question (e.g., "city", "location", and "Virginia"). In q2, we need to identify "release date", "movie", and "May 19 2019".

2. The phrase that describes a data element needs to be identified by its semantic meaning and contextual comprehension. In q1 of Figure 2, "located in" is identified as the most influential phrase describing the keyword "location". In q2, "to release on" is identified as the phrase that describes "release date".

To address these two challenges, we propose a general-purpose data element search strategy with two steps:

- We propose a Data Element Detector (Sec 3.2.1) for the first challenge: a binary classifier that takes a question q and a data element e as input and detects whether e is mentioned in q. As presented in Figure 2, one Data Element Detector is shared among all the domains.

- In the case of a positive prediction in the previous step, we propose an Adversarial Text Method (Sec 3.2.2) for the second challenge, which relies on the information learned by the binary classifier in the first step.
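A minimal sketch of this two-stage search: explicit data elements are found by string matching the question against candidates drawn from the database, and the rest are deferred to the binary classifier BC. The real BC is the neural detector of Sec 3.2.1; the toy stand-in below simply hard-codes two phrase hints, and all names are ours.

```python
# Two-stage data element search: literal string match first, learned
# detector (stubbed) for elements not mentioned verbatim.

def search_elements(question, candidates, detector):
    """Split candidate data elements into string-matched and detector-found."""
    q = question.lower()
    matched = [e for e in candidates if e.lower() in q]
    deferred = [e for e in candidates if e.lower() not in q]
    # fall back to the learned detector for implicit mentions
    detected = [e for e in deferred if detector(question, e)]
    return matched, detected

def toy_bc(question, element):
    """Stand-in for the trained classifier BC (hard-coded hints)."""
    hints = {"location": "located in", "release date": "to release on"}
    hint = hints.get(element)
    return hint is not None and hint in question.lower()

m, d = search_elements("Which movies were scheduled to release on May 19 2019 ?",
                       ["movie", "May 19 2019", "release date"], toy_bc)
print(m)  # ['movie', 'May 19 2019']
print(d)  # ['release date']
```

This mirrors the Figure 1(2) account: "movie" and "May 19 2019" match directly, while "release date" must be inferred semantically.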
3.2.1 Data Element Detector

We use a bi-directional attentive recurrent neural network for question understanding. For a question q composed of n tokens [q_1, ..., q_n] and a data element e composed of m tokens [e_1, ..., e_m], we use pre-trained GloVe embeddings to initialize a word embedding layer. On top of the embedding layer, we use LSTM cells to produce hidden states for each time step (each word in q). We build a similar structure for e. We denote the top-layer hidden states as

    h^q = [h^q_1, ..., h^q_n],    h^e = [h^e_1, ..., h^e_m].

We then build a bi-directional LSTM over h^e with attention over h^q. The forward direction is computed as

    \vec{d}_0 = 0,
    s_{tj} = v^T \tanh(W_0 h^q_j + W_1 h^e_t + W_2 \vec{d}_{t-1}),
    \alpha_{tj} = \exp(s_{tj}) / \sum_{j'} \exp(s_{tj'}),
    \omega_t = \sum_{j=1}^{n} \alpha_{tj} h^q_j,
    \vec{x}_t = [h^e_t ; \omega_t],
    \vec{d}_t = LSTM(\vec{x}_t, \vec{d}_{t-1}),

where W_0, W_1, W_2, and v are model parameters. Here t enumerates the time steps of e, and j enumerates the tokens of q. We compute the bi-directional output d_t = [\vec{d}_t ; \overleftarrow{d}_t] and feed it to a multi-layer perceptron for binary prediction.

3.2.2 Adversarial Text Method

Given a data element e with a positive prediction from the binary classifier, the adversarial text method searches for the phrase of the question that describes e. It proceeds as follows.

1. We have trained a Data Element Detector that takes a question q and a data element e as inputs and predicts whether e is described in q.

2. We search for the most influential phrase in q using gradient-based adversarial text methods [5]. Let \nabla L(q, e) denote the loss gradient of the Data Element Detector with q and e as inputs. There are three possible directions:

   - DeepFool [13]. We iteratively search for the optimal direction in which only a minimum perturbation is needed to affect the prediction. Theoretically, the optimal direction is (f(q, e) / ||\nabla L(q, e)||^2_2) \nabla L(q, e), where f(·) denotes the Data Element Detector.

   - Fast Gradient Method (FGM) [6]. We add noise proportional to either \nabla L(q, e) or sign(\nabla L(q, e)) to the original sample to change the prediction of the Data Element Detector. In particular, the noise for each token q_i is proportional to \partial L(q, e) / \partial q_i.

   - JSMA [16]. We calculate the Jacobian-based saliency map based on \nabla L(q, e) and perturb one token at a time, choosing the token with the highest saliency value.

   Since all these methods try to add the minimum noise that influences the prediction the most, the locations where the noise is added will be the positions of the tokens that constitute the most influential phrase, i.e., the phrase that describes the data element e.

3. We thus identify the phrase in the question where adding a small perturbation affects the prediction dramatically.

The challenge of our adversarial text method is the discreteness of the text domain: words and characters are discrete variables and thus non-differentiable. To overcome this problem, we calculate the loss gradient \nabla L of the target model with respect to the embedding of each word, where the embedding space is smooth.
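Assuming the per-token gradients ∂L(q, e)/∂q_i have already been computed by backpropagation through the detector (any autodiff framework would do; the numbers below are fabricated to simulate that), locating the most influential phrase reduces to ranking tokens by gradient norm, FGM-style, and taking the highest-scoring contiguous window. The fixed window width is a simplification of ours, not part of the paper.

```python
import numpy as np

# Sketch of phrase location from token-level gradients: per-token saliency is
# the L2 norm of dL/dq_i; the most influential phrase is the contiguous
# window with the largest total saliency.

def influential_phrase(tokens, grads, width):
    """grads: (n_tokens, embed_dim) array of dL/dq_i; returns best window."""
    saliency = np.linalg.norm(grads, axis=1)  # per-token influence
    scores = [saliency[i:i + width].sum()
              for i in range(len(tokens) - width + 1)]
    start = int(np.argmax(scores))
    return tokens[start:start + width]

tokens = ["Which", "movies", "were", "scheduled", "to", "release", "on",
          "May", "19", "2019", "?"]
rng = np.random.default_rng(0)
grads = 0.1 * rng.standard_normal((len(tokens), 4))  # fabricated gradients
grads[4:7] += 2.0  # pretend backprop found "to release on" most influential
print(influential_phrase(tokens, grads, width=3))  # ['to', 'release', 'on']
```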
3.3 Neural Machine Translation

We denote a question after symbol insertion as q' and the corresponding query after symbol replacement as p'. We train a seq2seq model to translate q' to p':

    p' = seq2seq(q')

The encoder is a stacked bi-directional multi-layer recurrent neural network (RNN), and the decoder is a one-layer attentive RNN. We use a single multilingual neural translation model for our cross-domain NLI task. We believe that, with a prefixed query type symbol (e.g., ⟨SQL⟩), a multi-domain model is able to handle different query types, with each query type treated equally.

4. RELATED WORK

NLI to databases was first formally introduced in [1]. Semantic parsing [17, 23, 9] and cross-domain semantic parsing [7, 20] have been applied to NLIs to databases; however, due to the incompatibility among different domains, the cross-domain task remains unsolved. Sketch-based solutions, first proposed in [24], have also been studied extensively: a deep model is trained to fill the slots of a sketch. An extension of the sketch-based solution [26] relies on a knowledge base to identify column values. Such a strategy is confined to a pre-defined sketch and an existing knowledge base. seq2seq models have also been exploited as translators [8, 28], which places no limitation on the query sketch.
5. PRELIMINARY EXPERIMENTS

Test Domain   Dataset     Method          Acc_qm   Acc_ex
Single        WikiSQL     Seq2SQL [28]    51.6%    60.4%
Single        WikiSQL     SQLNet [24]     61.3%    68.0%
Single        WikiSQL     TypeSQL [26]    75.4%    82.6%
Multi         WikiSQL     Ours-multi      74.5%    82.7%
Multi         OVERNIGHT   Ours-multi      76.8%    -
Multi         Geo880      Ours-multi      84.1%    -

Table 1: Comparison of models.

5.1 Evaluation

We conduct our preliminary experiments using a seq2seq model with stacked GRUs. We use query-match accuracy Acc_qm for evaluation: we match synthesized queries against the ground truth p. We also compare execution results as in [28], denoted Acc_ex, where applicable.

We jointly train our multi-domain model on WikiSQL [28], Geo880 [27], and OVERNIGHT [23]; their query types are SQL, lambda expression, and SQL, respectively (for OVERNIGHT we use the version manually converted to SQL in [22]). Some of the domains are oversampled to balance the number of training samples among all the domains. Our method is shown in Table 1 as Ours-multi. Since all the domains are trained within a single model, the accuracy of the multi-domain model does not exhibit a large improvement. However, we believe this is a model capacity issue, since the accuracies of all the domains are very close to or better than state-of-the-art performance. We observe that the seq2seq model is able to infer both SQL and lambda expressions as long as a tag (e.g., ⟨SQL⟩, ⟨lambda⟩) indicating the query type is provided.

5.2 Spatial Domain

We conduct preliminary evaluations on the spatial domain (Geo880 [27] and Restaurant [21]), which is a more difficult task since the datasets in the spatial domain are sparse. We adopt the data element detector as a spatial comprehension model and inject spatial semantics using symbol insertions. In this setting, we jointly train a multi-domain model on both the Geo880 and Restaurant training sets (Restaurant is oversampled to balance the domains) and evaluate on the test set of each separately (since both use lambda expression queries, no prefix symbol is inserted). As shown in Table 2 (we use denotation-match accuracy Acc_dm for evaluation), our method (Ours-multi) outperforms previous methods.

              Geo880                      Restaurant
Domain        Method          Acc_dm     Method        Acc_dm
Single        SEQ2TREE [2]    87.1%      PEK03 [18]    97.0%
Single        TRANX [25]      88.2%      TM00 [21]     99.6%
Single        JL16 [9]        89.3%      FKZ18 [3]     100%
Multi         Ours-multi      90.7%      Ours-multi    100%

Table 2: Comparison of models in the spatial domain.

6. REFERENCES

[1] I. Androutsopoulos, G. D. Ritchie, and P. Thanisch. Natural language interfaces to databases - an introduction. Natural Language Engineering, 1(01):29-81, 1995.
[2] L. Dong and M. Lapata. Language to logical form with neural attention. arXiv preprint arXiv:1601.01280, 2016.
[3] C. Finegan-Dollak, J. K. Kummerfeld, L. Zhang, K. Ramanathan, S. Sadasivam, R. Zhang, and D. Radev. Improving text-to-SQL evaluation methodology. arXiv preprint arXiv:1806.09029, 2018.
[4] Z. Gong, W. Wang, and W.-S. Ku. Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960, 2017.
[5] Z. Gong, W. Wang, B. Li, D. Song, and W.-S. Ku. Adversarial texts with gradient methods. arXiv preprint arXiv:1801.07175, 2018.
[6] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[7] J. Herzig and J. Berant. Neural semantic parsing over multiple knowledge-bases. arXiv preprint arXiv:1702.01569, 2017.
[8] S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer. Learning a neural semantic parser from user feedback. In ACL, volume 1, pages 963-973, 2017.
[9] R. Jia and P. Liang. Data recombination for neural semantic parsing. 2016.
[10] R. Jia and P. Liang. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.
[11] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339-351, 2017.
[12] B. Liang, H. Li, M. Su, P. Bian, X. Li, and W. Shi. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006, 2017.
[13] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, pages 2574-2582, 2016.
[14] E. Ngai, Y. Hu, Y. Wong, Y. Chen, and X. Sun. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559-569, 2011.
[15] E. W. Ngai, L. Xiu, and D. C. Chau. Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2):2592-2602, 2009.
[16] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In IEEE EuroS&P, pages 372-387, 2016.
[17] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In ACL, 2015.
[18] A. Popescu, O. Etzioni, and H. A. Kautz. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces, 2003.
[19] S. Samanta and S. Mehta. Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812, 2017.
[20] Y. Su and X. Yan. Cross-domain semantic parsing via paraphrasing. arXiv preprint arXiv:1704.05974, 2017.
[21] L. R. Tang and R. J. Mooney. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In ACL, pages 133-141, 2000.
[22] W. Wang, Y. Tian, H. Xiong, H. Wang, and W.-S. Ku. A transfer-learnable natural language interface for databases. arXiv preprint arXiv:1809.02649, 2018.
[23] Y. Wang, J. Berant, and P. Liang. Building a semantic parser overnight. In ACL, volume 1, pages 1332-1342, 2015.
[24] X. Xu, C. Liu, and D. Song. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436, 2017.
[25] P. Yin and G. Neubig. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv preprint arXiv:1810.02720, 2018.
[26] T. Yu, Z. Li, Z. Zhang, R. Zhang, and D. Radev. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. arXiv preprint arXiv:1804.09769, 2018.
[27] J. M. Zelle and R. J. Mooney. Learning to parse database queries using inductive logic programming. In AAAI, pages 1050-1055, 1996.
[28] V. Zhong, C. Xiong, and R. Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.