<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Semantic Discovery for Heterogenous Open Data by Interlinking Metadata of Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiseong Son</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youngsung Son</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haklae Kim</string-name>
          <email>haklaekimg@kisti.re.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Electronics and Telecommunications Research Institute</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Korea Institute of Science and Technology Information</institution>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Open data refers to data that everyone can freely use, reuse and redistribute. A large amount of open data is released by various organisations, governments and communities. However, it is difficult for users to discover the datasets they want, since most data portals only support searching their datasets by simple keywords over file names, descriptions, etc. This paper proposes a novel way of discovering disclosed government datasets by using linked data technologies. To achieve this objective, a set of datasets is collected from the public data portal in Korea, and all data fields are extracted and transformed into linked data using an ontology model. We also provide a simple evaluation, which compares search performance between the portal and the proposed method.</p>
      </abstract>
      <kwd-group>
        <kwd>Open Data</kwd>
        <kwd>Government Data</kwd>
        <kwd>Semantic Discovery</kwd>
        <kwd>Ontology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        While the big data phenomenon is becoming increasingly common, it is not easy
for everyone to use the data freely. A large amount of big data is owned by service
providers or platform owners, and only a limited portion of it is shared. Open data,
on the other hand, gives users a significant opportunity to use a variety of data
across heterogeneous data sources and domains.
The key value of open data is that a piece of data contained in published data
can be interlinked with other data. In an open data environment, data can
be interchanged between institutions, between institutions and governments, or
between governments, and new value can be created by interlinking
datasets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        One issue with open data is that discovering datasets is becoming
difficult [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Most data portals provide the ability to discover datasets. For
example, CKAN (Comprehensive Knowledge Archive Network), a data
portal platform, can search over the file name and description of files, and over
the tags and file types added to a dataset [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. However, searching for the information that an individual dataset
contains is limited. If a user wants
to find datasets that contain 'population', most data portals return a list
of datasets that have the keyword (i.e. 'population') in their descriptions or file
names instead of retrieving their content [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>This study proposes a method of discovering disclosed government datasets
by extracting the data fields of individual datasets and constructing them as linked
data. Section 2 describes the research approach, including data collection and
transformation based on a proposed ontology model. Section 3 presents a
small evaluation that retrieves the collected datasets and compares the results.
Section 4 concludes and introduces future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Research approach</title>
      <p>We collect a set of public open datasets from the public data portal (http://data.go.kr) and extract
all data fields from the datasets. This site provides governmental open datasets
released by the Republic of Korea. Currently, 689 organisations provide 22,334
file datasets (CSV or other types), 2,547 open APIs, and 91 standard datasets.</p>
      <p>
        This paper focuses on analysing the standard data, since the metadata quality of
the other datasets is not sufficient for our purposes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Note that the standard data in the
portal refers to a set of datasets that follow the government's public data open standard
guidelines, which define item names (data fields) and their values for 93
domains. A total of 1,480 item names were extracted from the collected standard
datasets, of which 903 remain after eliminating redundancy. The selected fields
need no further clustering, since each data field is already normalised
using standardised terms. Note that 53 and 44 of the collected datasets contain the
road-name address and the land-number address, respectively, and
55 datasets include latitude and longitude. The latitude and longitude data fields
are defined together in all datasets. There are 5 cases where latitude/longitude
is included in a dataset in which the road-name address does not exist as an item name,
and 12 cases where there is no land-number address. On the
other hand, when there is no latitude and longitude item name, the road-name
address and the land-number address correspond to one case of three. There are
14 datasets that do not have both address and latitude/longitude information.
      </p>
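      <p>The extraction and de-duplication of item names described above can be sketched as follows. This is only a minimal illustration, assuming that each standard dataset is available as CSV text whose header row lists its item names; the sample datasets and the helper names (extract_item_names, unique_item_names) are hypothetical and are not part of the portal or of the original study.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io


def extract_item_names(csv_text):
    """Return the header row (item names) of one CSV dataset."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name.strip() for name in header if name.strip()]


def unique_item_names(datasets):
    """Map each distinct item name to the set of datasets that use it."""
    index = {}
    for dataset_name, csv_text in datasets.items():
        for item in extract_item_names(csv_text):
            index.setdefault(item, set()).add(dataset_name)
    return index


# Hypothetical sample data, not taken from the portal.
SAMPLE = {
    "traditional_market": (
        "market name,road-name address,latitude,longitude\n"
        "A,B,1,2\n"
    ),
    "public_toilet": (
        "toilet name,road-name address,land-number address,latitude,longitude\n"
        "C,D,E,3,4\n"
    ),
}

index = unique_item_names(SAMPLE)
total = sum(len(extract_item_names(text)) for text in SAMPLE.values())
# Total extracted item names versus item names left after de-duplication.
print(total, len(index))
```
]]></preformat>
      <p>Keeping the set of datasets per item name, rather than only the distinct names, also makes it straightforward to count how many datasets share a field such as latitude or a road-name address.</p>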
      <p>A simple ontology model is designed to represent the relationship between
a dataset and its data fields, as shown in Figure 1. Each dataset has a set of data
fields, and this relation is represented by the data:hasDataItem property.
The data:relatedTo property describes a relationship between
specific terms. For example, 'location' may be related to 'address', 'latitude', or
'longitude'. There is no dataset with an item name of 'location', but most
datasets have 'address' or 'latitude and longitude'. For this reason, this property
is used to expand a specific query. As shown in Figure 1, a traditional market
dataset does not have any fields associated with a toilet. However, it is possible to
discover toilets around a traditional market, because both datasets have
address or locational information.</p>
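      <p>To illustrate how the two properties could support query expansion, the following sketch stores the model as plain subject-predicate-object tuples. Only the property names data:hasDataItem and data:relatedTo come from the model above; the dataset identifiers (prefixed ex:), the item names, and the helper functions are hypothetical, and an actual implementation would more likely rely on an RDF store queried with SPARQL.</p>
      <preformat preformat-type="code"><![CDATA[
```python
HAS_DATA_ITEM = "data:hasDataItem"
RELATED_TO = "data:relatedTo"

# Hypothetical triples in (subject, predicate, object) form.
TRIPLES = {
    ("ex:TraditionalMarket", HAS_DATA_ITEM, "address"),
    ("ex:TraditionalMarket", HAS_DATA_ITEM, "latitude"),
    ("ex:TraditionalMarket", HAS_DATA_ITEM, "longitude"),
    ("ex:PublicToilet", HAS_DATA_ITEM, "address"),
    ("ex:PublicToilet", HAS_DATA_ITEM, "toilet name"),
    ("ex:Library", HAS_DATA_ITEM, "library name"),
    ("location", RELATED_TO, "address"),
    ("location", RELATED_TO, "latitude"),
    ("location", RELATED_TO, "longitude"),
}


def expand_term(term, triples):
    """Return the term itself plus every term it is related to."""
    related = {o for s, p, o in triples if s == term and p == RELATED_TO}
    return {term} | related


def find_datasets(term, triples):
    """Find datasets with a data item matching the expanded term."""
    terms = expand_term(term, triples)
    return sorted(
        {s for s, p, o in triples if p == HAS_DATA_ITEM and o in terms}
    )


# No dataset has an item named 'location', yet the expanded query
# still reaches the datasets that have address or coordinate fields.
print(find_datasets("location", TRIPLES))
```
]]></preformat>
      <p>In this sketch the expansion is a single hop; whether related terms should be followed transitively is a design choice that the model description leaves open.</p>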
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We report the measurements obtained in Figure 2. We compare three cases
for the data portal and the proposed method. Case 1 searches for datasets
with a single keyword. For a specific topic (i.e. 'toilet'), the portal and the
proposed model return 1 and 3 results, respectively. Two of the results of the proposed
model have no related keywords in the file name or description. Case 2
searches across heterogeneous datasets. Consider the following query: which
datasets contain both market and toilet information nationwide? Such queries
depend on the information contained in the datasets. Although a particular
dataset can be discovered if it has both fields, searching across fragmented datasets
is difficult. As shown in Figure 2, the portal returns no search results for
multiple keywords (i.e. 'market' and 'toilet'), but the proposed model gives two
results. However, these results provide only simple yes-or-no information about
a toilet. Case 3 finds a specific relationship between datasets. For example,
the data:relatedTo property can be used to discover a relationship between a
traditional market and a toilet. First, the method retrieves a list of matching administrative
areas from both datasets based on address information, and then calculates the
distance between the search results using the latitude and longitude information.
Compared to Case 2, this result shows the specific location of a toilet around the
market.</p>
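      <p>The two steps of Case 3 can be sketched as below: records are first matched on administrative area, and the distance between the remaining pairs is then computed from latitude and longitude. The haversine formula, the distance threshold max_km, the record layout, and the sample coordinates are all assumptions made for illustration; the text does not state which distance measure or threshold was used.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearby_toilets(markets, toilets, max_km):
    """Pair each market with toilets in the same area within max_km."""
    pairs = []
    for m in markets:
        for t in toilets:
            if m["area"] != t["area"]:
                continue  # step 1: keep only the same administrative area
            d = haversine_km(m["lat"], m["lon"], t["lat"], t["lon"])
            if d <= max_km:  # step 2: filter by distance
                pairs.append((m["name"], t["name"], round(d, 2)))
    return sorted(pairs, key=lambda pair: pair[2])


# Hypothetical records, not taken from the portal.
MARKETS = [{"name": "Market A", "area": "Area 1", "lat": 37.570, "lon": 126.977}]
TOILETS = [
    {"name": "Toilet X", "area": "Area 1", "lat": 37.571, "lon": 126.978},
    {"name": "Toilet Y", "area": "Area 2", "lat": 37.5701, "lon": 126.9771},
    {"name": "Toilet Z", "area": "Area 1", "lat": 37.670, "lon": 126.977},
]

for market, toilet, km in nearby_toilets(MARKETS, TOILETS, max_km=1.0):
    print(market, toilet, km)
```
]]></preformat>
      <p>Filtering by administrative area before computing distances keeps the number of pairwise comparisons small, at the cost of missing toilets that lie just across an area boundary.</p>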
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper proposes a new approach to discovering datasets on a data portal by
using linked data technologies. Most data portals allow users to retrieve their
datasets with search options such as keywords, data types, or user-generated
tags. However, discovering datasets based on their content is limited. For
this reason, users need to check the search results themselves to see whether the
datasets suit their purposes. To solve this problem, this paper introduces a simple
semantic search that aims to discover the internal content of individual datasets by
constructing linked data that includes the data fields of individual datasets and their
relationships. Although the experimental data are relatively small, the evaluation
shows that the proposed method is more effective than existing search methods.
Future research will apply the data model and search method proposed in this
paper to all the data provided by the public data portal.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hand</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Data, not dogma: Big data, open data, and the opportunities ahead</article-title>
          . In:
          <string-name>
            <surname>Tucker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Höppner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siebes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swift</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Intelligent Data Analysis XII, Lecture Notes in Computer Science</source>
          , vol.
          <volume>8207</volume>
          , pp.
          <fpage>1</fpage>
          –
          <lpage>12</lpage>
          . Springer Berlin Heidelberg (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Janssen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charalabidis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuiderwijk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Benefits, adoption barriers and myths of open data and open government</article-title>
          .
          <source>IS Management</source>
          <volume>29</volume>
          (
          <issue>4</issue>
          ),
          <fpage>258</fpage>
          –
          <lpage>268</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Quality evaluation of the open government data: The case of the open data portal of Korea</article-title>
          .
          <source>International Journal of Contents</source>
          (in press)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kostovski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jovanovik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trajanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Open data portal based on semantic web technologies</article-title>
          .
          In:
          <source>Proceedings of the 7th Annual South-East European Doctoral Student Conference (DSC 2012)</source>
          , pp.
          <fpage>504</fpage>
          –
          <lpage>516</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lourenço</surname>
            ,
            <given-names>R.P.</given-names>
          </string-name>
          :
          <article-title>Evidence of an open government data portal impact on the public sphere</article-title>
          .
          <source>IJEGR</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <fpage>21</fpage>
          –
          <lpage>36</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Scholz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tcholtchev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lämmel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schieferdecker</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A CKAN plugin for data harvesting to the Hadoop distributed file system</article-title>
          . In:
          <string-name>
            <surname>Ferguson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>V.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardoso</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helfert</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pahl</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>CLOSER</source>
          , pp.
          <fpage>19</fpage>
          –
          <lpage>28</lpage>
          .
          <publisher-name>SciTePress</publisher-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>