
Towards a Semantic Discovery for Heterogenous Open Data by Interlinking Metadata of Datasets

Jiseong Son

Youngsung Son

Haklae Kim

Electronics and Telecommunications Research Institute and Korea Institute of Science and Technology Information, Korea
haklaekimg@kisti.re.kr

Open data refers to data that everyone can freely use, reuse and redistribute. A large amount of open data is released by various organisations, governments and communities. However, it is difficult for users to discover the datasets they want, since most data portals only support searching datasets by simple keywords matched against file names, descriptions and similar metadata. This paper proposes a novel way of discovering disclosed government datasets by using linked data technologies. To achieve this objective, a set of datasets is collected from the public data portal in Korea, and all data fields are extracted and transformed into linked data using an ontology model. We also provide a simple evaluation that compares the search performance of the portal and the proposed method.

Keywords: Open Data, Government Data, Semantic Discovery, Ontology

1 Introduction

While the big data phenomenon is becoming increasingly common, it is not easy for everyone to use the data freely. A large amount of big data is owned by service providers or platform owners, and only a limited portion of it is shared. Open data, on the other hand, gives users a significant opportunity to use a variety of data across heterogeneous data sources and domains. The key value of open data is that a piece of data contained in published data can be interlinked with other data. In an open data environment, data can be interchanged between institutions, between institutions and governments, or between governments, and new value can be created through the interlinking of datasets [2].

One of the issues surrounding open data is that discovering datasets is becoming difficult [1]. Most data portals provide the ability to discover datasets. For example, CKAN (Comprehensive Knowledge Archive Network), a data portal platform, can search the file names and descriptions of files, as well as the tags and file types added to a dataset [5, 6]. However, searching the information inside an individual dataset is limited. If a user wants to find datasets that contain `population', most data portals return a list of datasets that contain the keyword (i.e. `population') in their descriptions or file names instead of searching their content [4].

This study proposes a method of discovering disclosed government datasets by extracting the data fields of individual datasets and constructing them as linked data. Section 2 describes the research approach, including data collection and transformation based on a proposed ontology model. Section 3 presents a small evaluation that retrieves the collected datasets and compares the results. Section 4 concludes and introduces future research.

2 Research approach

We collect a set of public open datasets from the public data portal (http://data.go.kr) and extract all data fields from the datasets. This site provides governmental open datasets released by the Republic of Korea. Currently, 689 organisations provide 22,334 file datasets (CSV or other types), 2,547 open APIs, and 91 standard datasets.

This paper focuses on analysing the standard data, since the metadata quality of the other datasets is not good enough for our purposes [3]. Note that the standard data in the portal refers to a set of datasets that follow the government's public data open standard guidelines, which define an item name (data field) and its value for 93 domains. A total of 1,480 item names were extracted from the collected standard datasets, of which 903 remain after eliminating redundancy. The selected fields need no further clustering, since a data field is already normalised by using standardised terms. Note that 53 of the collected datasets contain the road-name address, 44 contain the land-number address, and 55 contain the latitude and longitude. Latitude and longitude data fields are defined together in all datasets. There are 5 cases where latitude/longitude is included in a dataset in which the road-name address does not exist as an item name, and 12 cases in which there is no land-number address. On the other hand, when there is no latitude and longitude item name, the road-name address and the land-number address correspond to one case of three. There are 14 datasets that do not have both address and latitude/longitude information.
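The counting described above can be sketched roughly as follows. This is only an illustration, not the pipeline used in the study: the dataset names and the English field labels in `DATASETS` are hypothetical placeholders for the portal's standardised item names, and the helper functions are assumptions introduced for this example.

```python
# Hypothetical sample: dataset name -> item names (data fields).
# The English labels stand in for the portal's standardised terms.
DATASETS = {
    "traditional_market": [
        "name", "road_name_address", "land_number_address",
        "latitude", "longitude",
    ],
    "public_toilet": ["name", "road_name_address", "latitude", "longitude"],
    "parking_lot": ["name", "land_number_address", "fee"],
}


def summarise_item_names(datasets):
    """Return (total item names, number of distinct item names)."""
    all_names = [name for fields in datasets.values() for name in fields]
    return len(all_names), len(set(all_names))


def datasets_with(datasets, required, missing=()):
    """List datasets that define every `required` field and no `missing` field."""
    required, missing = set(required), set(missing)
    return sorted(
        name for name, fields in datasets.items()
        if required <= set(fields) and not (missing & set(fields))
    )


total, distinct = summarise_item_names(DATASETS)
with_coordinates = datasets_with(DATASETS, ["latitude", "longitude"])
coordinates_without_land_number = datasets_with(
    DATASETS, ["latitude", "longitude"], missing=["land_number_address"]
)
```

Because the item names are already standardised, exact string comparison is enough to remove duplicates in this sketch; free-form field names would need the clustering step that the standard data makes unnecessary.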

A simple ontology model is designed to represent the relationship between a dataset and its data fields, as shown in Figure 1. Each dataset has a set of data fields, and this relation is represented by the data:hasDataItem property. The data:relatedTo property describes a relationship between specific terms. For example, a `location' may be related to `address', `latitude', or `longitude'. There is no dataset with an item name of `location', but most datasets have `address' or `latitude and longitude'. For this reason, this property is used to expand a specific query. As shown in Figure 1, a traditional market dataset does not have any fields associated with a toilet. However, it is possible to discover toilets around a traditional market, because both datasets have address or locational information.
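The role of the two properties can be illustrated with a minimal sketch that stores the model as plain (subject, predicate, object) triples. Only the property names data:hasDataItem and data:relatedTo come from the model above; the `ex:` dataset identifiers, the field labels, and the helper functions are hypothetical, and a real implementation would more likely use an RDF store and SPARQL.

```python
HAS_DATA_ITEM = "data:hasDataItem"
RELATED_TO = "data:relatedTo"

# Hypothetical triples: dataset identifiers and field labels are illustrative.
TRIPLES = {
    ("ex:TraditionalMarket", HAS_DATA_ITEM, "address"),
    ("ex:TraditionalMarket", HAS_DATA_ITEM, "latitude"),
    ("ex:TraditionalMarket", HAS_DATA_ITEM, "longitude"),
    ("ex:PublicToilet", HAS_DATA_ITEM, "address"),
    ("ex:PublicToilet", HAS_DATA_ITEM, "latitude"),
    ("ex:PublicToilet", HAS_DATA_ITEM, "longitude"),
    ("ex:Library", HAS_DATA_ITEM, "opening_hours"),
    ("location", RELATED_TO, "address"),
    ("location", RELATED_TO, "latitude"),
    ("location", RELATED_TO, "longitude"),
}


def expand_term(triples, term):
    """Expand a query term with every term linked to it by data:relatedTo."""
    related = {o for s, p, o in triples if s == term and p == RELATED_TO}
    return {term} | related


def discover(triples, term):
    """Return datasets that have a data item matching the expanded term."""
    wanted = expand_term(triples, term)
    return sorted(
        {s for s, p, o in triples if p == HAS_DATA_ITEM and o in wanted}
    )


location_datasets = discover(TRIPLES, "location")
```

In this sketch no dataset has an item named `location`, yet the query still finds the market and toilet datasets because the term is expanded to `address`, `latitude` and `longitude` before matching.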

3 Evaluation

Figure 2 reports the measurements obtained. We compare the data portal and the proposed method in three cases.

Case 1 searches for datasets with a single keyword. For a specific topic (i.e. `toilet'), the portal and the proposed model return 1 and 3 results, respectively. Two of the results of the proposed model have no related keywords in the file name or description.

Case 2 searches across heterogeneous datasets. Consider the following query: which datasets contain both market and toilet information nationwide? Such queries depend on the information contained in the datasets. Although a particular dataset can be discovered if it has both fields, searching across fragmented datasets is difficult. As shown in Figure 2, the portal returns no search results for multiple keywords (i.e. `market' and `toilet'), whereas the proposed model gives two results. However, these results only indicate whether a toilet exists, as yes or no.

Case 3 finds a specific relationship between datasets. For example, the data:relatedTo property can be used to discover a relationship between a traditional market and a toilet. The method first retrieves a list of entries from both datasets whose administrative areas match exactly, based on address information, and then calculates the distance between the search results using the latitude and longitude information. Compared to Case 2, this result shows the specific location of a toilet around the market.
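The two steps of Case 3 can be sketched as below. The paper does not state which distance formula is used, so the haversine great-circle distance is an assumption here, as are the record layout, the `max_km` threshold, and all names and coordinates, which are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def toilets_near(market, toilets, max_km=0.5):
    """Step 1: keep toilets in the market's administrative area.
    Step 2: keep those within max_km, sorted by distance."""
    same_area = [t for t in toilets if t["area"] == market["area"]]
    hits = [
        (t["name"], haversine_km(market["lat"], market["lon"], t["lat"], t["lon"]))
        for t in same_area
    ]
    return sorted((h for h in hits if h[1] <= max_km), key=lambda h: h[1])


# Hypothetical records; names, areas and coordinates are illustrative only.
MARKET = {"name": "market_x", "area": "area_a", "lat": 37.5700, "lon": 126.9990}
TOILETS = [
    {"name": "toilet_a", "area": "area_a", "lat": 37.5705, "lon": 126.9995},
    {"name": "toilet_b", "area": "area_a", "lat": 37.6000, "lon": 127.0300},
    {"name": "toilet_c", "area": "area_b", "lat": 37.5701, "lon": 126.9991},
]

nearby = toilets_near(MARKET, TOILETS)
```

Filtering by administrative area first keeps the number of distance computations small, at the cost of missing a nearby toilet that lies just across an area boundary (such as `toilet_c` in this sample).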

4 Conclusion

This paper proposes a new approach to discovering datasets on a data portal by using linked data technologies. Most data portals allow users to retrieve datasets with search options such as keywords, data types, or user-generated tags. However, discovering datasets based on their content is limited. For this reason, users need to check whether the datasets in the search results are suitable for their purposes. To solve this problem, this paper introduces a simple semantic search that aims to discover the internal content of individual datasets by constructing linked data that includes the data fields of individual datasets and their relationships. Although the experimental data are relatively small, the evaluation shows that the proposed method is more effective than existing search methods. Future research will apply the data model and search method proposed in this paper to all of the data provided by the public data portal.

1. Hand, D.: Data, not dogma: Big data, open data, and the opportunities ahead. In: Tucker, A., Höppner, F., Siebes, A., Swift, S. (eds.) Advances in Intelligent Data Analysis XII, Lecture Notes in Computer Science, vol. 8207, pp. 1–12. Springer Berlin Heidelberg (2013)

2. Janssen, M., Charalabidis, Y., Zuiderwijk, A.: Benefits, adoption barriers and myths of open data and open government. IS Management 29(4), 258–268 (2012)

3. Kim, H.: Quality evaluation of the open government data: The case of the open data portal of Korea. International Journal of Contents (in press)

4. Kostovski, M., Jovanovik, M., Trajanov, D.: Open data portal based on semantic web technologies. In: Proceedings of the 7th Annual South-East European Doctoral Student Conference (DSC 2012). pp. 504–516 (2012)

5. Lourenço, R.P.: Evidence of an open government data portal impact on the public sphere. IJEGR 12(3), 21–36 (2016)

6. Scholz, R., Tcholtchev, N., Lämmel, P., Schieferdecker, I.: A CKAN plugin for data harvesting to the Hadoop distributed file system. In: Ferguson, D., Muñoz, V.M., Cardoso, J.S., Helfert, M., Pahl, C. (eds.) CLOSER. pp. 19–28. SciTePress (2017)