<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Meaningful Keyword Search over Databases with Complex Schema</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mehdi Kargar</string-name>
          <email>mkargar@uwindsor.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aijun An</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parke Godfrey</string-name>
          <email>godfrey@cse.yorku.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaroslaw Szlichta$</string-name>
          <email>jaroslaw.szlichta@uoit.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaohui Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Windsor</institution>
          ,
          <addr-line>Windsor</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>York University</institution>
          ,
          <addr-line>Toronto</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Keyword search in databases helps users who are not familiar with the database schema or a query language to explore the database e ciently. An answer is a join tree of tuples that contains the query</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Much of the world's high-quality data stays under lock and key in relational
databases. Access is gained through relational query languages such as SQL.
However, a lay user|anyone who does not know SQL or who is not well versed in
the given schema|is e ectively locked out [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. As the schemas of the datasets
that organizations eld become more complex, we all e ectively become lay
users. Keyword search has recently been used for extracting information from
relational databases.
      </p>
      <p>
        An answer to the query is a set of tuples from the database that cover the
keywords of the query, and a natural structure (i.e., a tree from the database's
schema) that spans those tuples. An important issue in keyword search is to score
answers for relevance. Prior work has addressed relevance. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a simplistic
solution of scoring relevance as the reciprocal of the number of edges in an
answer's tree was proposed. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors apply information retrieval (IR)
measures to rank answers. However, both of these approaches fail to return
relevant answers when the schema is large and complex.
      </p>
      <p>To motivate our approach and illustrate the shortcomings of existing
methods, consider the query "Arden Cynthia" over the TPC-E1 schema. Assume that
Arden refers to a company name and Cynthia refers to a customer name. Fig. 1
shows three possible join trees of di erent sizes that could produce answers for
this query. If we rank the trees according to their size (i.e., the number of edges),
the rst tree (a) gets the highest rank. An answer derived from this tree would
1 http://www.tpc.org/tpce/
CUSTOMER</p>
      <p>Cynthia</p>
      <p>SECURITY
COMPANY</p>
      <p>Arden</p>
      <p>CUSTOMER</p>
      <p>Cynthia</p>
      <p>COMPANY
ADDRESS</p>
      <p>COMPANY</p>
      <p>Arden
STATUS TYPE
(a) (b) (c)
Fig. 1: Possible trees (i.e., MJNSs) for the query \Arden:company
Cynthia:customer".
say that both Arden and Cynthia are active. Looking at the TPC-E database, it
turns out that all the customers and companies have the Active status.
Therefore, any given customer and any given company in the TPC-E database share
the same status through the status type table. Thus, the rst found relationship
is in fact, not that interesting or relevant. The second tree (b) states that
company Arden's stock is traded by customer Cynthia. This is one of the strongest
relations between a customer and a company and is related to the purpose of the
TPC-E schema. The third tree (c) is the largest and therefore it is not
straightforward to interpret. This tree states that the company Arden has the same
status as a security which is related to a company that has the same address as
customer Cynthia. Two relations, address and status type, that are involved in
this tree makes it less meaningful and more di cult to interpret for the user.
Using the IR score ranking also ranks the rst tree as the best answer.</p>
      <p>
        In this paper, we use schema-based ranking to solve the problem of keyword
search over relational databases. We rede ne answer via roles that captures
important answers missed by previous techniques. We devise importance
measures for nodes, importance measures for edges, and a hybrid measure of the
two. Then, we devise relevance measures for join trees that are derived from the
schema relevance. Last, we like to mention this short paper is a summary of our
recently published results in [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Ranking Models</title>
      <p>
        In our framework, the role of a query keyword is identi ed rst. For each query
keyword, we rst nd the list of columns (and relations) that contain the keyword
using an inverted index. Then, a user-interface helps the user to specify the role
of each keyword without needing to know the database schema (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for screen
shots of the interface). With speci ed keyword roles, our algorithm searches for
answers to cover the input keywords. Each answer is a minimal joining network
of tuples that covers the input roles. For brevity, we refer to such answers as nal
answer. The nal answers de ned above can be generated through a sequence
of join operations on the database. To generate such answers, we rst
generate the minimal joining networks of schemas (MJNSs) that represent the join
operations for producing the nal answers. Three possible MJNSs for the query
\Arden:company Cynthia:customer" over TPC-E schema are shown in Fig. 1. To
generate MJNSs, we use a breadth- rst search algorithm [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ]. After the MJNSs
are generated, nal answers can be produced by creating an execution plan (i.e.,
SQL query) to evaluate the MJNSs. We focus on how to rank MJNSs so that
the most interesting answers will be presented to the user rst.
      </p>
      <p>Since many nal answers can be generated for a query but some answers may
not be interesting, we aim at producing top-k most interesting nal answers. To
achieve this, we rst limit the number of nodes in an MJNS. This limits the
size of nal answers too. The rationale is two-fold. First, if two tuples in a nal
answer are far away from each other, it is not easy to interpret the answer.
Second, executing the query associated with a large MJNS is time consuming.
Thus, a size control parameter, Dmax, is used to specify the maximum number
of allowed nodes in an MJNS. In addition and more importantly, we rank the
generated MJNSs according to an interestingness measure so that nal answers
from the top-ranking MJNS are produced rst. If the number of nal answers
produced so far is less than k, the next MJNS is used to produce more nal
answers until k nal answers are produced. We propose and use the following
methods for ranking minimal joining networks of schemas (MJNSs).</p>
      <p>Ranking by Importance of Nodes: Given a database D, this method
assumes that the importance of an MJNS M is related to the importance of the
tables in D that instantiate the schemas in M . We propose a ranking method
called key entropy transfer (KE) that ranks the tables in D. Consider GD as an
undirected graph representing a database D. The KE method builds a
node-tonode transition probability matrix M based on the entropies of table attributes,
and performs a random walk on GD with M . The steady-state probabilities of the
random walk are assigned as the importance scores of the tables. The entropies
of table attributes are calculated as follows. Let r:A denote an attribute A in
table r, and let a represent a value of r:A. The entropy of r:A is: H(r:A) =</p>
      <p>P8a2r:A p(a) log p(a) where p(a) is the probability that a occurs in column
r:A (i.e. P (r:A = a)).</p>
      <p>Ranking by Importance of Edges: Intuitively, edge strength can be
measured by the fraction of the join key values being instantiated. The more
fraction of primary key values are instantiated, the more important the edge is.
We propose the following measure, called instantiation fraction (IF), to
quantify the importance of an edge based on the fractions of instantiated key values:
STIF (r:A; r0:A0) = NNri:nrAst NNrin0r:sA0t0 . In this equation, Nri:nAst is the number of tuples
in relation (i.e., table) r that instantiates the edge between r:A and r0:A0, and Nr
is the total number of tuples of table r. To rank the MJNSs by edge importance,
we compute a score for each MJNS using its average edge importance score.</p>
      <p>The Hybrid Ranking Model: Our last approach to ranking MJNSs is to
rank them based on the importance of both nodes and edges. The hybrid formula
is de ned as follows: STIF KE(R:A; R0:A0) = STIF (R:A; R0:A0) KE(R)+2KE(R0) .
Since the value of STIF (R:A; R0:A0) lies between zero and one, KE(R) and
KE(R0) should be normalized into range [0; 1]. To rank the MJNSs, the score of
an MJNS is computed as the average hybrid score of the relations in the MJNS.</p>
      <p>
        Runtime Discussion: The runtime of our methods are similar to that of
Discover I [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. All MJNSs have to be found for the keyword query under the Dmax
threshold. The primary overhead is evaluating the SQL queries associated with
the MJNSs. For us, this overhead is marginally more than Discover I because
our networks can be marginally larger due to roles.
top-10 DISCOVER I
Q1 29.9
Q2 37
Q3 38.4
Q4 37.7
DISCOVER II
29.8
37
39
39.2
IF
82.5
81.7
80.7
84.4
We evaluate our proposed ranking methods for nding the most meaningful
MJNSs. All of the evaluated methods are implemented in Java. The experiments
are conducted over the TPC-E database. TPC provides a transaction log for
TPC-E which we use to generate a gold standard for ranking MJNSs. The focus
of this work is ranking the MJNSs. The input to our MJNS generator is the
keyword roles. That is, the main input to the ranking methods is a set of relation
names (i.e., query keyword roles). We use the following four query keyword
roles in the experiments: Q1: Customer, Company; Q2: Customer, Broker; Q3:
Customer, Company, Industry; Q4: Customer, Company, Broker, Security. Note
that in addition of the following experiment, additional experiments in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also
verify the e ectiveness of our ranking methods.
      </p>
      <p>
        The results of the top-10 overlap with the gold standard are presented in Fig.
2. We also present the results of the Discover I algorithm that ranks nal answers
based on their size [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Note that Discover II produces the same ranking of MJNSs
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Generally, the edge importance ranking method IF works better than others.
To see how e ective ranking of MJNSs impacts the nal answers, we compare
the top-10 nal answers from our IF method with the nal answers produced
by Discover I [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Discover II [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We use four sets of keywords related to the
above keyword roles. For example, the rst query is "Jacob Insurance" in which
"Jacob" is associated with the customer table and "Insurance" is associated with
the company table. We ask eight graduate students to judge the relevancy of the
answers. A user assigns a score between 0 and 1 to each nal answer, where 1
means completely relevant and 0 means completely irrelevant to the query. The
top-10 precision for each query are presented in Fig. 2. Clearly, the IF method
achieves much better precision in all the queries. The reason for Discover-I and
Discover-II to have a lower precision is that in most of the cases, the tuples of the
nal answers are connected together by the dimension tables (e.g. status type)
and fact tables are not involved.
4
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>We improve relevance scoring of answers based on their networks (join trees)
in light of complex database schema. We propose a series of measures, and
algorithms to compute them based on the importance of nodes and edges to
capture the intended semantic of queries. In this work we sought to demonstrate
how e ective deriving relevance of the nodes and edges of the database schema
could be based on just the schema and data, by no means are we advocating
that auxiliary information (such as Linked Data) cannot improve it.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domnori</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guerra</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lado</surname>
          </string-name>
          , R.T.,
          <string-name>
            <surname>Velegrakis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Keyword Search over Relational Databases: A Metadata Approach</article-title>
          . In: SIGMOD. pp.
          <volume>565</volume>
          {
          <issue>576</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hristidis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gravano</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papakonstantinou</surname>
          </string-name>
          , Y.:
          <article-title>E cient IR-Style Keyword Search over Relational Databases</article-title>
          . In: VLDB. pp.
          <volume>850</volume>
          {
          <issue>861</issue>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hristidis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papakonstantinou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Discover: Keyword Search in Relational Databases</article-title>
          . In: VLDB. pp.
          <volume>670</volume>
          {
          <issue>681</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kargar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>An</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>E cient Top-k Keyword Search in Graphs with Polynomial Delay</article-title>
          . In: ICDE. pp.
          <volume>1269</volume>
          {
          <issue>1272</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kargar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>An</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Godfrey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlichta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>MeanKS: Meaningful Keyword Search in Relational Databases with Complex Schema</article-title>
          . In: SIGMOD. pp.
          <volume>905</volume>
          {
          <issue>908</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kargar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>An</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Godfrey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlichta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Meaningful Keyword Search in Relational Databases with Large and Complex Schema</article-title>
          . In: ICDE. pp.
          <volume>411</volume>
          {
          <issue>422</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>