<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Open Data Search Framework based on Semi-structured Query Patterns</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marut Buranarach</string-name>
          <email>marut.bur@nectec.or.th</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chonlatan Treesirinetr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pattama Krataithong</string-name>
          <email>pattama.kra@nectec.or.th</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Somchoke Ruengittinun</string-name>
          <email>somchoke.r@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Faculty of Science, Kasetsart University</institution>
          ,
          <addr-line>Bangkok</addr-line>
          ,
          <country country="TH">Thailand</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Language and Semantic Technology Laboratory, National Electronics and Computer Technology Center (NECTEC)</institution>
          ,
          <country country="TH">Thailand</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Open government data (OGD) is a global initiative to promote transparency, service innovation and citizen participation. OGD is usually made available in the form of datasets on OGD web portals, and searching OGD is usually conducted using metadata search on OGD catalogs. Although searching OGD based on metadata or full-text search is common, it cannot take full advantage of the structured data content in the datasets. By querying the data within the datasets, the user can find the relevant information more effectively. This paper proposes an open data search framework based on semi-structured query patterns. The proposed semi-structured query pattern is more structured than a typical keyword search, which allows for more expressive queries. It is also less rigid than a structured query, which reduces the user effort in forming a query. Three query patterns are currently supported and can be converted to API requests to the existing dataset APIs of Data.go.th. The query suggestion module of the system suggests possible queries based on the user's initial typing. A prototype system was created to demonstrate searching some datasets from Data.go.th using this approach. Finally, we discuss some lessons learned and current limitations that should be addressed in future work.</p>
      </abstract>
      <kwd-group>
        <kwd>open data search</kwd>
        <kwd>semi-structured question</kwd>
        <kwd>dataset API</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Open government data (OGD) is a global initiative to promote transparency, service
innovation and citizen participation. The most common means of publishing OGD is
in the form of datasets made available on OGD portals such as Data.gov,
Data.gov.uk and many others. Searching OGD datasets usually relies on the search functions
of OGD portal software such as CKAN (https://ckan.org/) to search the data catalogs. These search
functions are usually based on keyword search over metadata fields or
tag-based search. Although searching datasets based on metadata is straightforward and
can help the user to find relevant datasets, the user still needs to look into each dataset to
find the information he or she is looking for. For example, if the user
is looking for the phone number of a school, the user may have to search for the
datasets whose metadata contains the term “school” and then look into each returned
dataset to check whether it contains the telephone number information. Even when full-text
indexing and searching is applied, the user may only find the datasets containing the
search terms but not the “answer” the user is looking for. An effective mechanism that
allows for “data-level” querying in addition to “dataset-level” querying is needed
for querying OGD datasets.</p>
      <p>There are typically two main approaches to querying structured data:
keyword-based and structured query. With a keyword-based query, the search system searches
the data in every field. Thus, the structure information of the data is not used in the
query. This approach has the advantage of reducing user effort in forming a query
and the disadvantage of limited query expressiveness. With a structured query, which is
typically specified via a form-based interface, the search system transforms the user
query into a structured query language expression, e.g. SQL, to search the data. This
approach has the advantage that the user can specify expressive queries and the
disadvantage of requiring more user effort in forming a query.</p>
      <p>
        In this paper, we propose a semi-structured query approach based on query patterns
as an additional form of querying OGD datasets. In this approach, the user can specify
search conditions in free-text form with auto-complete suggestions for the possible
query terms and conditions based on some defined query patterns. For example, the
user can define a query such as “rajini school telephone” to search for the telephone
number of the school. Currently, three query patterns are defined. The search system
utilizes dataset APIs created for some datasets on Data.go.th [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The APIs are
provided on top of an RDF database. Specifically, the OGD datasets were converted to
the RDF data format. The query patterns were mapped to some pre-defined API
and SPARQL query templates. We developed a prototype system for searching some
OGD datasets from Data.go.th using this approach. Finally, some potentials and
limitations of the framework are discussed.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Our approach relies on RDF data querying using SPARQL query templates. We
briefly review some related work on linked open data search, focusing on the query
interface. RDF Xpress [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provides a form-based search interface for searching
linked data sources. The user can combine triple patterns with keywords to form
queries with an auto-complete feature. This work also defines the following components of a
linked data search system: RDF knowledge base, search interface, retrieval engine,
query relaxer and result diversifier. Hogan et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] discussed some unique challenges for linked
data search engines, including the user interface issue. Freitas et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] investigated a natural
language query mechanism for linked data that maps user queries into query
graph patterns. To the best of our knowledge, our work is the first to propose a
generic framework for querying OGD datasets based on data-level querying using
semi-structured query patterns.
      </p>
      <sec id="sec-2-1">
        <title>3.1 Conceptual Architecture</title>
        <p>A conceptual architecture of the open data search framework based on
semi-structured query patterns is shown in Fig. 1. The system consists of four major
modules: Dataset APIs, Query Translation, Query Suggestion and Result Formatter. Each
module is briefly described as follows.</p>
        <p>
          Dataset APIs: Publishing RDF and data APIs from existing OGD datasets can
further promote the application and integration of OGD. Our previous work proposed a
semi-automatic mechanism for such a process [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The data publishing and querying
system was extended from the OAM framework [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Some datasets from Data.go.th
have been transformed and published as RDF datasets (via direct mapping) and as
RESTful APIs. The API requests are translated into SPARQL queries based on
predefined query patterns. The returned results are formatted as JSON.
        </p>
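        <p>As an illustration of how an API request might be translated into a SPARQL query, the following Python sketch fills a query template from the request parameters. The template text, the namespace http://example.org/ and the names build_sparql and escape_literal are assumptions made for this example only; the actual templates behind the Data.go.th dataset APIs are not given in this paper.</p>
        <preformat><![CDATA[
```python
from string import Template

# Hypothetical namespace; the real URIs of the RDF datasets are not specified here.
BASE = "http://example.org/"

# Hypothetical SPARQL template for a CONTAINS condition on one property.
CONTAINS_TEMPLATE = Template(
    "SELECT ?s ?p ?o WHERE {\n"
    "  ?s a <${base}${cls}> .\n"
    "  ?s <${base}${prop}> ?v .\n"
    "  FILTER(CONTAINS(LCASE(STR(?v)), \"${value}\"))\n"
    "  ?s ?p ?o .\n"
    "}"
)


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes so the value is a safe SPARQL string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_sparql(path: str, prop: str, operator: str, value: str) -> str:
    """Fill the template for a request such as path=income, property=province."""
    if operator != "CONTAINS":
        raise ValueError("only the CONTAINS operator is sketched here")
    # Keep class and property names simple so they cannot break the IRI syntax.
    if not (path.isidentifier() and prop.isidentifier()):
        raise ValueError("class and property names must be plain identifiers")
    return CONTAINS_TEMPLATE.substitute(
        base=BASE, cls=path, prop=prop, value=escape_literal(value.lower())
    )
```
]]></preformat>
        <p>Calling build_sparql("income", "province", "CONTAINS", "bangkok") would yield a query that selects all triples of the instances of the assumed class income whose province value contains "bangkok".</p>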
        <p>Query Translation: In our framework, three semi-structured query patterns are
defined. The user can post a query in one of the patterns. The query is
then translated into an API request to the available dataset APIs. If the
query does not match any of the defined patterns, it is treated as a typical keyword search.</p>
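        <p>The translation step with its keyword-search fallback could be sketched as follows. This is a minimal illustration, not the prototype's implementation: the vocabularies CLASSES and PROPERTIES and the function name translate are hypothetical, and in the framework the known terms would come from the term index rather than from hard-coded sets.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Set

# Hypothetical vocabularies; in the framework these would come from the term index.
CLASSES: Set[str] = {"income", "school"}
PROPERTIES: Set[str] = {"province", "telephoneno", "year"}


def translate(query: str) -> Dict[str, str]:
    """Classify a free-text query into Pattern 1, 2, 3 or plain keyword search."""
    terms = query.lower().split()
    if len(terms) >= 3 and terms[0] in CLASSES and terms[1] in PROPERTIES:
        # Pattern 1: <class> <property> <value>
        return {"pattern": "1", "class": terms[0], "property": terms[1],
                "value": " ".join(terms[2:])}
    if len(terms) >= 2 and terms[0] in PROPERTIES:
        # Pattern 2: <property> <subject>
        return {"pattern": "2", "property": terms[0],
                "subject": " ".join(terms[1:])}
    if len(terms) >= 2 and terms[-1] in PROPERTIES:
        # Pattern 3: <subject> <property>
        return {"pattern": "3", "property": terms[-1],
                "subject": " ".join(terms[:-1])}
    # No pattern matched: fall back to keyword search over the whole query.
    return {"pattern": "keyword", "keywords": " ".join(terms)}
```
]]></preformat>
        <p>With these assumed vocabularies, “income province bangkok” is recognised as Pattern 1, “telephoneNo Rajini School” as Pattern 2, and a query such as “rajini school” falls back to keyword search.</p>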
        <p>Query Suggestion: In our framework, a semi-structured query pattern is defined as
a query that does not have a rigid structure but has a more controlled form than
keyword search. Thus, in order to prevent the user from forming a malformed
query, a query suggestion module was developed. The module relies on an index
of the relations between properties, classes and values built from the data in the datasets. It
suggests possible classes, properties and values based on the user’s initial typing of
the query.</p>
        <p>Result Formatter: The results from the dataset APIs in the JSON format are
transformed into a table format. In addition to presenting the results in table form, the
likely answer is highlighted within the table cells.</p>
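        <p>A possible shape for the result formatter is sketched below. It assumes that the API returns a JSON array of flat objects, which is not specified in this paper, and the function name to_table and the field names in the usage note are hypothetical. Each cell carries a flag that a user interface could use to highlight the likely answer.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Optional, Tuple

Row = List[Tuple[str, bool]]  # (cell text, highlighted?)


def to_table(records: List[Dict[str, object]],
             answer_property: Optional[str] = None) -> Tuple[List[str], List[Row]]:
    """Turn a list of JSON objects into a header plus rows, flagging answer cells."""
    # Collect column names in first-seen order so the table layout is stable.
    header: List[str] = []
    for record in records:
        for key in record:
            if key not in header:
                header.append(key)
    rows: List[Row] = []
    for record in records:
        rows.append([
            # A cell is highlighted only if it holds the requested property.
            (str(record.get(col, "")), col == answer_property and col in record)
            for col in header
        ])
    return header, rows
```
]]></preformat>
        <p>For a Pattern 2 or Pattern 3 query, answer_property would be the property term of the query, e.g. to_table(json.loads(payload), "telephoneNo"); for Pattern 1 it can be left as None so that no cell is highlighted.</p>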
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Query Patterns and API Request Translation</title>
        <p>In our framework, three semi-structured query patterns were defined. The user can
post a query in one of the following patterns in the triple format.</p>
        <p>Pattern 1: &lt;class&gt; &lt;property&gt; &lt;value&gt;</p>
        <p>Pattern 2: &lt;property&gt; &lt;subject&gt;</p>
        <p>Pattern 3: &lt;subject&gt; &lt;property&gt;</p>
        <p>In Pattern 1, the objective is to retrieve the instances of a class that match
the query condition &lt;property&gt; = &lt;value&gt;. For example, a query “income province
bangkok” will retrieve instances of the class ‘income’ whose ‘province’ property has
the value ‘bangkok’. A specified class name must be mapped to dataset tags and
resolved to some target datasets. Then a query is formed and run against the
datasets. The following is an example API request for such a query.</p>
        <p>query?dsname=income&amp;path=income&amp;property=province&amp;operator=CONTAINS&amp;value=bangkok</p>
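        <p>The request above can be assembled programmatically. The following Python sketch builds the same query string with the standard library; the function name pattern1_request is hypothetical, and setting both dsname and path to the dataset name is an assumption taken from the single example request rather than a documented rule of the API.</p>
        <preformat><![CDATA[
```python
from urllib.parse import urlencode


def pattern1_request(dataset: str, prop: str, value: str,
                     endpoint: str = "query") -> str:
    """Build a Pattern 1 request: instances of a dataset where prop CONTAINS value."""
    params = [
        ("dsname", dataset),
        ("path", dataset),        # assumed equal to dsname, as in the example request
        ("property", prop),
        ("operator", "CONTAINS"),
        ("value", value),
    ]
    # urlencode keeps the parameter order and percent-encodes unsafe characters.
    return endpoint + "?" + urlencode(params)
```
]]></preformat>
        <p>Using urlencode rather than string concatenation means that values containing spaces or non-ASCII characters, which are likely in Thai datasets, are encoded safely.</p>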
        <p>In Pattern 2, the objective is to retrieve the value of a given property of a given
instance. For example, a query “telephoneNo Rajini School” will retrieve the instance
of ‘Rajini school’ and highlight the value of the ‘telephoneNo’ property in the
result. In this pattern, the instance and property terms must be checked against the datasets
to find those that contain the terms. A query to search the data related to this instance is then run
against the datasets. The value of the given property is highlighted in the results.
The following is an example API request for such a query.</p>
        <p>Pattern 3 is similar to Pattern 2 except that the positions of the subject and
property terms are switched. The API translation is the same as that of Pattern 2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3 Query Suggestions</title>
        <p>The system suggests possible queries to the user given the user’s initial
characters of the query. In order to make suggestions, the possible classes (dataset
tags), properties and values must be collected and indexed from the text data in the
datasets. An ER diagram showing entities and relationships of terms for making query
suggestions is shown in Fig. 2. The diagram presents a ternary relationship between
dataset or class, property and value terms. Given this database design, the list of terms and the
possible relationships between datasets, properties and value terms can be retrieved
from the database. The value terms only include string values within a given length
limit. This allows the auto-complete function to be applied while the user is typing
characters and terms. A query produced by the auto-complete function will always be
a valid query to the API.</p>
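        <p>The ternary term index and the prefix lookup can be illustrated with a small in-memory SQLite table. This is only a sketch under stated assumptions: the table and column names, the length limit of 50 characters and the functions build_index and suggest are invented for this example, and the schema of the prototype's database is not reproduced here.</p>
        <preformat><![CDATA[
```python
import re
import sqlite3
from typing import Iterable, List, Tuple

MAX_VALUE_LENGTH = 50  # assumed limit; the actual limit is not stated in the paper


def build_index(rows: Iterable[Tuple[str, str, str]]) -> sqlite3.Connection:
    """Store (class, property, value) term relations, skipping over-long values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE term_relation (cls TEXT, prop TEXT, val TEXT)")
    conn.executemany(
        "INSERT INTO term_relation VALUES (?, ?, ?)",
        [(c.lower(), p.lower(), v.lower())
         for c, p, v in rows if len(v) <= MAX_VALUE_LENGTH],
    )
    return conn


def suggest(conn: sqlite3.Connection, typed: str) -> List[str]:
    """Suggest classes, then properties, then values for a Pattern 1 query prefix."""
    text = re.sub(r"\s+", " ", typed.lower().lstrip())
    parts = text.split(" ", 2)  # at most: class, property, value prefix
    prefix = parts[-1]
    if len(parts) == 1:
        sql = ("SELECT DISTINCT cls FROM term_relation "
               "WHERE substr(cls, 1, ?) = ? ORDER BY cls")
        args: tuple = (len(prefix), prefix)
    elif len(parts) == 2:
        sql = ("SELECT DISTINCT prop FROM term_relation "
               "WHERE cls = ? AND substr(prop, 1, ?) = ? ORDER BY prop")
        args = (parts[0], len(prefix), prefix)
    else:
        sql = ("SELECT DISTINCT val FROM term_relation "
               "WHERE cls = ? AND prop = ? AND substr(val, 1, ?) = ? ORDER BY val")
        args = (parts[0], parts[1], len(prefix), prefix)
    return [row[0] for row in conn.execute(sql, args)]
```
]]></preformat>
        <p>Because every suggestion is read from stored term relations, a query completed entirely from suggestions refers only to classes, properties and values that exist in the indexed datasets, which mirrors the validity property described above.</p>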
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Case Study</title>
      <p>A prototype system was developed using about ten datasets from Data.go.th to
demonstrate the framework. Dataset APIs were created for these datasets. The terms
in these datasets were indexed for the query suggestion module. The total numbers of
indexed properties and term relations were over 160 and 25,000 entries,
respectively. Fig. 3 shows example query suggestions for query pattern 1. In this
example, the user initially types “income” and the suggested terms are the list of possible
properties for this class (dataset). Once a property is selected, e.g. “income province”,
the list of possible values in the dataset, which are province names, is suggested. The
user can select a value, e.g. “income province bangkok”. The system then converts
the query to an API request to query the dataset API with the given criteria. Fig. 4a
and 4b show the query result in JSON and table formats, respectively.</p>
      <p>Fig. 4. (a) Example query results from the income statistics dataset API in JSON format; (b) example query results of the system in table format.</p>
      <p>This paper proposes a framework for searching data in OGD datasets. The framework
allows the user to post semi-structured query patterns to query the data in the OGD
datasets. The proposed semi-structured query pattern is more structured than a typical
keyword search, which allows for more expressive queries. It is also less rigid than a
structured query, which reduces the user effort in forming a query. The result is similar
to the result of database querying. Three query patterns are currently supported and
can be converted to API requests to the existing dataset APIs of Data.go.th. The query
suggestion module of the system suggests possible queries based on
the user’s initial typing. The module requires indexing of terms and their relationships
in the datasets in terms of classes, properties and values. A preliminary prototype
system was created to demonstrate searching a small number of datasets from Data.go.th
using this approach.</p>
      <p>Based on our prototype system, we discuss some lessons learned as follows.
Although the system works well with a small number of datasets, it is currently not
highly scalable. As the number of datasets increases, the number of indexed
terms and their relations grows rapidly. This can greatly reduce the performance of
the system in making query suggestions. In the future, the index may be created in a
NoSQL database to improve its scalability. In addition, more query patterns
should be supported. For example, a query pattern consisting of multiple query
conditions, e.g. “income province bangkok year 2015”, should be
provided. Currently, the property terms rely on the terms used in the column headers.
However, some header labels in the datasets are ambiguous or not meaningful, e.g. the
‘TelNo’ label to represent a telephone number. This can result in query
suggestions that are ambiguous or not meaningful. Future work should focus on these issues
to improve the performance and usability of the framework.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>This project was partially supported by the Electronic Government Agency (EGA)
and the National Science and Technology Development Agency (NSTDA), Thailand.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Krataithong</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buranarach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Supnithi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>RDF Dataset Management Framework for Data.go.th</article-title>
          .
          <source>In: Proceedings of the 10th International Conference on Knowledge, Information and Creativity Support Systems (KICSS2015)</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Elbassuoni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanath</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>RDF Xpress: A Flexible Expressive RDF Search Engine</article-title>
          .
          <source>In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . p.
          <fpage>1013</fpage>
          .
          <publisher-name>ACM</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinsella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine</article-title>
          .
          <source>Web Semant</source>
          .
          <volume>9</volume>
          ,
          <fpage>365</fpage>
          -
          <lpage>401</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Riain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>da Silva</surname>
            ,
            <given-names>J.C.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curry</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Querying linked data graphs using semantic relatedness: A vocabulary independent approach</article-title>
          .
          <source>Data Knowl. Eng</source>
          .
          <volume>88</volume>
          ,
          <fpage>126</fpage>
          -
          <lpage>141</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buranarach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Supnithi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thein</surname>
            ,
            <given-names>Y.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruangrajitpakorn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rattanasawad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wongpatikaseree</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>A.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Assawamakin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>OAM: An Ontology Application Management Framework for Simplifying Ontology-Based Semantic Web Application Development</article-title>
          .
          <source>Int. J. Softw. Eng. Knowl. Eng</source>
          .
          <volume>26</volume>
          ,
          <fpage>115</fpage>
          -
          <lpage>145</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>