<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Yaanii: Effective Keyword Search over Semantic dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto De Virgilio</string-name>
          <email>devirgilio@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Miscione</string-name>
          <email>miscione@dia.uniroma3.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science University of Alberta Edmonton</institution>
          ,
          <addr-line>Alberta</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Informatica e Automazione Universitá Roma Tre Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dipartimento di Informatica e Automazione Universitá Roma Tre Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>Nowadays data is disseminated in a number of different sources, from databases systems to the Web, from a traditional structured organization (relational) to a semi-structured (XML), up to the unstructured ones (text in Web documents). Although availability of data is constantly increasing, one principal difficulty users have to face is to find and retrieve the information they are looking for. To this aim keywords search based systems are increasingly capturing the attention of researchers. In this paper, we present Yaanii1, a tool for the effective Keyword Search over semantic datasets. It is based on a novel keyword search paradigm for graph-structured data, focusing in particular on the RDF data model. While many techniques search the best answer trees, we propose an effective algorithm for the exploration and computation of all matching subgraphs. We provide a clustering technique that identifies and groups graph substructures based on template match. A scoring function, IR inspired, evaluates the relevance of the substructures and the clusters. A strong point of our approach is that the ranking supports the generation of Top-k solutions during its execution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>INTRODUCTION</p>
      <p>
        Keyword-based search approaches have the huge benefit that users
can ignore both the language and the structure of the data they are
going to query. A keyword based search engine returns a list of
candidate pages, documents or set of data that match keywords
provided in input. Then a user has to dedicate time and efforts
navigating each result returned from the engine in order to discover the
desired information, i.e. the answer he is looking for. Therefore,
attention around searching and query processing of graph-structured
data continue to increase as the Web, XML documents and even
relational database can be represented as a graph. Current approaches
rely on a combination of IR and tree/graph exploration techniques
whose goal is to rank results according to a relevance criterion.
Keyword search on tree-structured data counts a good number of
approaches already [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Actual efforts [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ] focus on RDF data
querying, given the great momentum of Semantic Web in which
Web pages carry information that can be read and understood by
machines in a systematic way. Simplifying, a generic approach first
identifies the parts of the data structure containing the keywords of
1Yaanii, literally “path” in Sanskrit.
interest, possibly by using an indexing system or a database engine,
then explores the data structure in order to discover a connection
between such identified parts. The common exploration paradigm
is similar to the triple of RDF, that is subject, property, object .
Candidate solutions, built out of found connections, are then all
generated and finally ranked through a scoring function. To return
the top-k best solutions, pruning techniques reducing the list of
candidate solutions down to those whose score is above a threshold are
implemented. In this framework to achieve efficiency above all,
current algorithms compute the best answers only in an
approximate way. This is because they use an exploration paradigm that
is inefficient and the scoring function takes place only when
solutions were generated all together. Moreover pruning techniques can
have a sensible impact in both the quality of the solutions, as low
scoring results are not shown or even computed, as well as on
efficiency, as an early pruning reduces the space of candidate solutions
to investigate.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we proposed a novel approach to keyword search in the
graph-structure data in a RDF representation. The main
contributions of our approach are:
• A clustering technique that identifies and groups graph
substructures based on template match. The idea is to group
paths with respect to the template (i.e schema) they
correspond to. A solution is a composition of paths belonging to
different clusters. In this way we avoid the exploration of
overlapping solutions and we build cleaner results for the
users, gaining in terms of computation cost. Usually, the
most promising algorithms of an efficient solution for
keyword based search are in PTIME class complexity. To this
aim, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we demonstrated how Yaanii is more efficient
with respect to the others, presenting a quadratic complexity
as upper-bound.
• An algorithm that ranks solutions while it builds the
solutions. Unlike most of the approaches to keyword search, that
first identify all the solutions and then rank them according to
a function, our approach leverages on the clusters to
assembly a solution starting with the most relevant path in the most
relevant cluster. As a result, the most relevant solution is the
first to come out of the algorithm, then decreasing
monotonically to the less relevant solutions. This allows users to
explore the returned solutions, starting with the most relevant,
while the elaboration of remaining solutions is undergoing.
2.
      </p>
      <p>AN ARCHITECTURE OF REFERENCE
We implemented our approach into a tool, called Yaanii. A
flexible architecture of the system was design, as shown in Figure 1.
1. The RDF Parser takes as input a collection of RDF
Documents and parses them into triples. Here we use the Jena
framework 2;
2. The Indexer builds an index on top of the triple collections to
achieve structural information useful for the query process.
Here the indexing is supported by Lucene3 and WordNet 4.</p>
      <p>The last allows query expansion;
3. A user performs a query through a GUI helper, handling
events and the query itself;
4. The parsed query is given to the Searcher for processing;
5. The Searcher processes the query over the Indexed Resource
Base and returns the search result to the caller. It
communicates with the Indexer to extract the instances matching input
keywords (i.e. informative paths), group them into clusters
and compose elements from clusters into the final solutions
(i.e. subgraph structures). Each structure (i.e. path, cluster
and solution) is evaluated by a scoring function.</p>
      <p>CONCLUSION AND FUTURE WORKS
We presented Yaanii, a tool for effective Keyword Search over
semantic datasets. It is based on a clustering technique and a
scoring function that support the generation of Top-k solutions during
its execution in the first k steps.</p>
      <p>From a theoretical point of view, future directions focus on
improving the search algorithm of Yaanii to reach a linear time
complexity. From a practical point of view, we would improve the
indexing capabilities by embedding Lucene into a DBMS (e.g.
Oracle) and provide a query-by-example interface to support the user
to perform the query and navigate the results.
4.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cappellari</surname>
          </string-name>
          , R. De Virgilio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Miscione</surname>
          </string-name>
          .
          <article-title>Keyword based Search over Semantic Data in Polynomial Time</article-title>
          .
          <source>In Technical Report RT-DIA-160</source>
          , Universitá Roma Tre, Rome, Italy,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>R. De Virgilio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Cappellari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Miscione</surname>
          </string-name>
          .
          <article-title>Cluster-based exploration for Effective Keyword Search over Semantic Datasets</article-title>
          .
          <source>In Proc. of the 28th International Conference on Conceptual Modeling (ER'09)</source>
          , Gramado, Brazil,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          <string-name>
            <surname>Blinks</surname>
          </string-name>
          <article-title>: ranked keyword searches on graphs</article-title>
          .
          <source>In Int. Conf. on Management of Data (SIGMOD'07)</source>
          , China,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kimelfeld</surname>
          </string-name>
          , Sagiv,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>Finding and approximating top-k answers in keyword proximity search</article-title>
          .
          <source>In Int. Symposium on Principles of Database Systems (PODS'06)</source>
          , USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Yu,
          <string-name>
            <given-names>C.T.</given-names>
            ,
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            ,
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Effective keyword search in relational databases</article-title>
          .
          <source>In Int. Conf. on Management of Data (SIGMOD'06)</source>
          , USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          , Wang,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Rudolph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <article-title>Top-k exploration of query graph candidates for efficient keyword search on rdf</article-title>
          .
          <source>In Int. Conf. on Data Engineering (ICDE'09)</source>
          , China,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>