Linked Open Data System for Scientific Data Sets
       Frederik Simon Bäumer                                 Jangwon Gim*                                Do-Heon Jeong
        Heinz Nixdorf Institute, HNI                 Korea Institute of Science and              Korea Institute of Science and
         University of Paderborn                     Technology Information, KISTI               Technology Information, KISTI
          Paderborn, Germany                            Daejeon, South Korea                        Daejeon, South Korea
        fbaeumer@hni.upb.de                               jangwon@kisti.re.kr                           heon@kisti.re.kr

           Michaela Geierhos                                  Hanmin Jung
        Heinz Nixdorf Institute, HNI                 Korea Institute of Science and
         University of Paderborn                     Technology Information, KISTI
          Paderborn, Germany                            Daejeon, South Korea
         geierhos@hni.upb.de                                 jhm@kisti.re.kr

ABSTRACT                                                                1. INTRODUCTION
In this paper, we present a system which makes scientific data
available following the linked open data principle using standards      The increase in the number of datasets including scholarly
like RDF and URI as well as the popular D2R server (D2R) and            publications in the Linked Data cloud shows the importance of
the customizable D2RQ mapping language. Our scientific data             Linked Data for the scientific community. There are many
sets include acronym data and expansions, as well as researcher         different providers for scientific data available on the Web.
data such as author name, affiliation, coauthors, and abstracts. The    Publishing houses, digital libraries, and resellers have popular
system can easily be extended to other records. Regarding this, a       data sources. The information types range from author
domain adaptation to patent mining seems possible. For this             information (e.g. full name, affiliation, email), over publications
reason, obvious similarities and differences are presented here.        (e.g. title, abstract, coauthors), to specific data like acronyms and
                                                                        their related expansion. Each of these types has its own requirements
The data set is collected from several different providers like         on the data presentation and storage, but they all are somehow
publishing houses and digital libraries, which follow different         interlinked with each other.
standards in data format and structure. Most of them do not
support semantic web technologies, but only the legacy HTML             One common way to represent these interlinks is the network model.
standard. The integration of these large amounts of scientific data     This database model is a generalized graph structure without any
into the Semantic Web is challenging and it needs flexible data         hierarchical restriction, which allows storing objects with their
structures to access this information and interlink them.               individual relationships. This is a common way to store data, but
                                                                        does not fit our requirements for modern data publishing.
Based on these data sets, we will be able to derive a general
technology trend as well as the individual research domain for          A more promising way to share data is the Web of Linked Data.
each researcher. The goal of our Linked Open Data System for            The main idea of Linked Open Data (LOD) is to publish freely
scientific data is to provide access to this data set for other         accessible, structured data and to interlink it with other data. This
researchers using the Web of Linked Data. Furthermore, we               interlinking generates more valuable information under the
implemented an application for visualization, which allows us to        consistent implementation of standard Web technologies such as
explore the relations between single data sets.                         RDF or URI. The core benefit for further data exploration is the
                                                                        ability to apply complex graph queries, which allow the
Categories and Subject Descriptors                                      interlinking, combining and modifying of data.
D.2.12 [Interoperability]: Data mapping                                 For publishing scientific data stored in relational databases, a data
E.2 [Data Storage Representations]: Linked representations              bridge is needed. For that reason we present our three-component
                                                                        LOD system for scientific data sets, based on the popular D2R
General Terms                                                           server and the customizable D2RQ mapping language. For further
Standardization, Languages                                              data exploration, we integrated RelFinder, an application that
                                                                        visualizes relationships between RDF objects and enables the
                                                                        exploration of data interactively. We will demonstrate the
Keywords                                                                functionalities of our system on a test set.
Linked Open Data, Researcher Data, Acronym Data, D2R

 *: Corresponding author
                                                                        2. RELATED WORK
                                                                        A lot of work regarding Linked Open Data and the Semantic Web
 Copyright © 2014 for the individual papers by the papers' authors.     has already been done. Popular tools like the D2R server or standards
 Copying permitted for private and academic purposes. This volume is
                                                                        like RDF, HTTP and SPARQL are well proven and used in many
 published and copyrighted by its editors. Published at Ceur-ws.org
 Proceedings of the First International Workshop on Patent Mining and
                                                                        LOD systems [1].
 Its Applications (IPAMIN) 2014. Hildesheim. Oct. 7th. 2014.            A major technological advantage is the backward compatibility,
 At KONVENS’14, October 8–10, 2014, Hildesheim, Germany.                for example, to relational databases (RDB) because the majority
of data on the current Web is stored in this kind of database. The     opens up new possibilities in the way they can be accessed and
process of mapping RDB to RDF is a subject of current research         queried. For that purpose, a force-directed graph layout which
and different approaches like D2RQ, Triplify or R2RML were             supports every RDF knowledge base that provides standardized
created [5]. R2RML is currently developed by the W3C with the          SPARQL access is implemented [9]. Relationships between
main goal to define an RDB to RDF mapping standard for read-           objects can be identified much more easily by exploring the data step by
only data access. A different approach is Triplify. It is a very       step. Features like highlighting, previewing, and filtering are
lightweight plugin for existing Web applications, which makes          available for further support [4]. The class filter and the link filter
database content available as RDF and other formats.                   allow hiding objects which contain specific classes or links.
RDB to RDF mapping can be done by applications such as D2R             Primarily, these are objects which are in this particular case not of
server. It is a Java-based application for publishing the content of   interest. In addition to these, the length filter and the connectivity
relational databases on the Semantic Web and for providing RDF         filter allow a further selection by hiding objects with a specific
and HTML representations of resources. The D2RQ language is            number of links and relationships [9].
used for the mapping, which is popular because of the possibility      RelFinder is often used as a stand-alone application but also as an
to provide access via SPARQL queries very easily [5].                  integrated visualization application. It is for example part of the
The ability to interlink resources under the use of “typed             DBpedia Viewer, an integrative interface for DBpedia, which
relationships” allows a goal-oriented navigation through the           offers LodLive as a further visualization approach
database content by web browsers as well as crawler applications       next to RelFinder. The LodLive application is a web-based tool
[2]. Because these LOD systems and components are very                 that allows the exploration of Linked Data in an intuitive and
flexible, it is possible to adapt them to different domains.           interactive way [10]. In comparison, both applications provide a
Interlinked user-generated content from social networks, a Linked      good visualization. RelFinder stands out with a higher browser
Open Drug Data (LODD) for pharmaceutical research and                  compatibility. Especially in Internet Explorer,
development or a LOD live database of semantically enriched            LodLive throws JavaScript errors when drawing several relations
sensor data are only a few successful examples using the D2R           between classes.
server [3].
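
The basic idea behind the RDB to RDF mapping discussed above can be illustrated with a minimal Python sketch that turns the rows of a relational table into (subject, property, value) triples. It is neither D2RQ nor the D2R server. The table "researcher", its columns, the base URI and the helper rows_to_triples are hypothetical and only serve as an illustration; the two property names are taken from the vocabulary used later in this paper.

```python
import sqlite3

# Hypothetical base URI, chosen only for this illustration.
BASE = "http://localhost:2020/resource/"

# Hypothetical mapping from column names to property names.
COLUMN_TO_PROPERTY = {"affiliation": "sch:affiliation", "email": "sch:email"}


def rows_to_triples(conn, table, key_column, column_to_property):
    """Map every row of `table` to (subject, property, value) triples."""
    # Table and column names are trusted here; this is an illustration only.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, prop in column_to_property.items():
            value = record.get(column)
            if value is not None:  # NULL columns produce no triple
                triples.append((subject, prop, value))
    return triples


def demo():
    """Build a tiny in-memory database and map it to triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE researcher "
        "(id INTEGER PRIMARY KEY, affiliation TEXT, email TEXT)"
    )
    conn.execute("INSERT INTO researcher VALUES (1, 'KISTI', NULL)")
    conn.execute(
        "INSERT INTO researcher VALUES "
        "(2, 'University of Paderborn', 'a@example.org')"
    )
    return rows_to_triples(conn, "researcher", "id", COLUMN_TO_PROPERTY)
```

A declarative mapping language such as D2RQ describes this kind of correspondence in a mapping file instead of in program code, so that it can be evaluated on demand against the live database.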
Latif, Afzal and Maurer [13] utilized the D2R server to publish        3. LOD SYSTEM FOR SCIENTIFIC DATA
unstructured datasets of the Journal of Universal Computer             The scientific data sets are collected from different resources and
Science (J.UCS) as Linked Data on the Web. The Linked Open             based on different formats. For that reason, it is a difficult task to
Data project provides a new way to publish machine readable            combine information such as researchers, affiliations or
structured data on the Web and best practices for interlinking         publications and to provide them as a clean, interlinked data set
these structured datasets. Moreover, the increase in the number of     under the requirements of quality. To make a contribution to the
datasets including scholarly publications in the Linked Data cloud     existing Web of Linked Data, it is necessary to publish it in a
shows its importance in the scientific community. In order to take     standardized, structured way under the use of common
advantage of benefits of LOD projects, the legacy HTML data in         vocabularies and over a common server framework. Furthermore,
this journal is converted into machine-readable and structured         it has to be identified which information can be bundled to a data
RDF data using the D2R server. It is considered to be                  set containing relevant data and how this data can be usefully
appropriate for data conversion due to its good performance,           interlinked. For this reason, we developed a LOD system for
scalability and the availability of a SPARQL endpoint and explorer     scientific data sets, which can handle all related information like
features. An RDF graph converted from the legacy HTML data has         researcher data or publications and publish them in a structured
been made available in the Linked Data cloud for data reuse and        way under the use of common semantic web technologies like
interlinking. Moreover, structured journal data was interlinked        RDF and SPARQL. The interlinked data becomes more useful
with Linked Data resources and it successfully disambiguated and       and can be easily adapted to other research topics.
interlinked datasets of authors and publications with DBpedia,
DBLP and Faceted DBLP as well as CiteULike.                            3.1 System architecture
                                                                       The system architecture in Figure 1 can be divided into three main
In addition, Mitrevski, Jovanovik and Stojanov [11] identified
                                                                       components. The first component contains the datasets, which
some issues regarding the D2R server, which appear during the
                                                                       include the specific data, as a result of different acquisition and
process of publishing the open data of the faculty of computer
                                                                       preprocessing steps we applied before. Because these data sets are
science and engineering in Ss. Cyril and Methodius University.
                                                                       stored in a relational database, a Linked Data view on the existing
One problem is that these relational databases include some
                                                                       database is needed (data bridge). A SPARQL endpoint with the
confidential data about employees and students. However, the
                                                                       ability for serving Linked Data views on relational databases is
D2R server does not provide a way to convert only specific parts       D2R Server, which is part of the second step.
from the database into data in a semantic web format. Moreover,
it lacks functionality enabling the user to link existing              We chose D2R because it is one of the most important and most
ontologies to the tables. These issues can be solved by creating a     mature relevant solutions.
new database called Open Data DB including only the data with          The second step, called “Linked Data System”, is responsible for
no privacy infringement as well as building the mapping tool           the declarative mapping between the schemata of the database and
utilizing functionalities of the D2R server, but linking the data      the target RDF terms, based on mapping rules. These rules are
with ontologies. Although a mapping tool was created, the study        stored in mapping files and are formalized under the use of the
presents some improvements of this system such as automatically        popular D2RQ mapping language. Each rule defines in detail how
proposed ontology annotations.                                         resources are identified and how they have to be handled (e.g. find
For a structured representation of semantically annotated data and     property values) in the SPARQL endpoint. SPARQL is a powerful
a more intuitive exploration, RelFinder has been created by Heim       query language for databases, which allows accessing and
et al. The main idea is that a structured representation of RDF data   modifying RDF data. The endpoint in the third step is able to
translate SPARQL based queries into SQL queries, which allows         introduced our own “InSciTe” prefix and related properties. They
a live database access, even for non-SQL compatible applications.     may be replaced during the further working process and should be
The “Application System” is the third step. It allows exploring the   seen as temporary.
data sets visually. With this application, researchers can find new
latent interconnections in the database, such as an
exceptional usage of rare acronyms by an individual author.             Table 1. Common vocabulary used for scientific data sets
This exploring is based on the RDF query language SPARQL                          Data field                        Property
from the second step.
                                                                       Abstract                               pmlp:hasAbstract
                                                                       Acronym                                InSciTe:Acronym
                                                                       Affiliation                              sch:affiliation
                                                                       Author                                       dc:creator
                                                                       Coauthor                                sch:contributor
                                                                       Editor                                       sch:editor
                                                                       Editor Email                                 sch:email
                                                                       Element type (e.g. journal)          InSciTe:ElementType
                                                                       Authors Email                                sch:email

  Figure 1. LOD system architecture for scientific data sets.          Expansion                             InSciTe:Expansion
                                                                       Source URL                             sch:isBasedOnUrl

3.2 Data Set                                                           Title                                     deri:hasTitle
Digital resources like the Digital Bibliography & Library Project
                                                                       Year of publication                    sch:datePublished
(DBLP) and publishing houses like Springer, IEEE or Elsevier
serve as data providers. Unfortunately, as already mentioned, the
data quality is inconsistent between the data providers. For that
reason, several refinement steps were applied to the data set,        3.4 Design of URIs
especially for the ambiguity detection. As the work is still in       Linked data is based on URIs which identify things and enable
progress, we plan to apply more algorithms for named entity           users and computer-based agents to refer to these things or look
disambiguation with the goal of increasing data set quality.          them up. In this case URIs identify entities like researchers and
The researcher data set contains 8,370,074 publications like PhD      show their relationships to other researchers or publications.
theses, articles or online resources. These publications come with    The D2R server manages the URIs mostly automatically. For
additional information, for example abstracts, author names,          the class overview pages, we applied the following structure:
coauthors, year of publication and affiliations. Furthermore, we
identified acronyms as well as their individual expansions based on   “http://{ip}:{port}/directory/{classname}s”
the publications’ abstracts. This information is also part of the     For individual resources like researchers we applied:
researcher data set, because it can support a future
disambiguation of researchers. One of our research subjects           “http://{ip}:{port}/page/{classname}/{id}”
applied on these datasets is for example the detection of
technology trends and the identification of the research domain of    4. IMPLEMENTATION
individual researchers. Through the data visualization, we expect to  In this section, we explain the current state of the system’s
find more latent relationships between objects and classes, which     implementation, including the D2R server and the RelFinder
allow us to disambiguate single named entities.                       visualization.
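
The two URI patterns of Section 3.4 and the vocabulary of Table 1 can be combined in a short Python sketch. It only illustrates the naming scheme and is not part of the D2R server, which generates these URIs itself. The host "localhost", the port 2020, the class name "publication", the record keys and the helper names are hypothetical; the property names are taken from Table 1.

```python
# Property names taken from Table 1; the keys on the left are
# hypothetical field names of an input record.
FIELD_TO_PROPERTY = {
    "title": "deri:hasTitle",
    "abstract": "pmlp:hasAbstract",
    "author": "dc:creator",
    "coauthor": "sch:contributor",
    "year": "sch:datePublished",
}


def class_overview_uri(ip, port, classname):
    """URI of the overview page of a class, e.g. all researchers."""
    return f"http://{ip}:{port}/directory/{classname}s"


def resource_uri(ip, port, classname, resource_id):
    """URI of the page of one individual resource."""
    return f"http://{ip}:{port}/page/{classname}/{resource_id}"


def describe(subject_uri, record):
    """Turn a record into (subject, property, value) triples."""
    triples = []
    for field, prop in FIELD_TO_PROPERTY.items():
        values = record.get(field, [])
        if not isinstance(values, list):
            values = [values]  # single values are treated as one-element lists
        triples.extend((subject_uri, prop, value) for value in values)
    return triples
```

For example, resource_uri("localhost", 2020, "researcher", 42) yields "http://localhost:2020/page/researcher/42", and describe attaches one triple per value to such a URI, so a publication with two coauthors receives two sch:contributor triples.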
3.3 Common vocabularies                                               4.1 Test data set
Common vocabularies are necessary to enhance the                      For this project, we created a test set of 60 researchers and 400
interoperability between concepts. For that reason we use them, as    related publications as well as 454 publication-related acronyms
far as they already exist and exactly describe the data field.        and expansions. Researchers were selected by their number of
For the interlinking of the data and the automatic processing, the    publications, which had to be at least two. KISTI has diverse
exact description of an attached common vocabulary is very            scientific data sets, which are derived from papers, patents and
important. For person related information we use for example the      others. In order to find more valuable relationships from those
schema.org types and properties.                                      data sets, we extracted acronyms and expansions. The number of
                                                                      test data is 491,982. Using these acronyms and expansions we can
One example for a prefix is “dc”, which is commonly used for the
                                                                      apply these data sets to analyze technology trends and we can find
Dublin Core Metadata Initiative Terms (http://www.dublincore.org).
                                                                      specific researchers who may be experts on these acronyms or
We introduced “sch” as another prefix, which describes the
                                                                      expansions [7]. The average number of expansions per
schemas of schema.org (http://www.schema.org). For the data
                                                                      acronym is 2.9. We observed that the distribution of expansions
fields “Acronym”, “Element type” and “Expansion” no fitting
                                                                      follows Zipf’s law. We separated each acronym name based on its
vocabularies are known at the moment. Because of that, we
semantics because an acronym name can be ambiguous. Therefore,            our current work, because more standardization and comparability
each expansion name can have its own acronym name and its own             is needed in order to interlink this data field. The acronyms are
URI. In order to get more valuable analysis reports by using              interlinked within a separate class, which contains the expansion and
acronyms and their expansions, we have to create relationships            another related publication.
between these data sets and other resources such as Linked Open           An example graph made by RelFinder’s visualization application
Data, SNS data, Freebase, DBpedia and others. Finally, we get             is shown in Figure 4. Instead of an additional HTTP server like
more information from those relationships.                                Apache, we implemented the RelFinder application in the already
                                                                          existing D2R web server. That way, no difficult setup is needed in
4.2 DataHub System based on LOD                                           order to start our system – all components are loaded during
The publications view is shown in Figure 2. This view contains
                                                                          the program’s initialization.
information concerning the publication itself, but also further
information like coauthors, which have the property
“sch:contributor” or acronyms. There are two detected acronyms,
which are interlinked with a separate class. This class contains the
expansion and another related publication, which also contains
this acronym in the same meaning. This allows us to find related
publications based on acronyms as a first indication for the
following classification.
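
The lookup of related publications through shared acronyms, as described for the publications view, can be sketched in a few lines of Python. This is only an illustration of the idea and not the implementation of the system: the input format, the publication identifiers and the sample expansions are hypothetical. Two publications count as related only if they use the same acronym with the same expansion, i.e. in the same meaning.

```python
from collections import defaultdict


def related_publications(acronym_links):
    """Relate publications that share an acronym in the same meaning.

    `acronym_links` is an iterable of (publication_id, acronym, expansion)
    tuples. The result maps each publication to the set of other
    publications using at least one identical (acronym, expansion) pair.
    """
    by_sense = defaultdict(set)
    for publication, acronym, expansion in acronym_links:
        by_sense[(acronym, expansion)].add(publication)

    related = defaultdict(set)
    for publications in by_sense.values():
        for publication in publications:
            related[publication] |= publications - {publication}
    return dict(related)


# Hypothetical sample data: "LOD" occurs with two different expansions.
LINKS = [
    ("pub1", "LOD", "Linked Open Data"),
    ("pub2", "LOD", "Linked Open Data"),
    ("pub3", "LOD", "Level of Detail"),
]
```

With the sample data, pub1 and pub2 are related to each other, while pub3 has no related publication because its expansion of the same acronym differs.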


                                                                                    Figure 4. Data visualization with RelFinder
                                                                          Here, we added two researcher resources and one acronym. An
                                                                          edge shows the relation between two RDF objects in a
                                                                          unidirectional way. RelFinder looks up the relations between
                                                                          these resources and draws a graph. In this example, the two
                                                                          researchers have one publication in common (red relation). Or in
                                                                          other words: Both researchers are creators (authors) of this
                                                                          publication. Furthermore, the acronym is related to the second
                                                                          researcher through another paper.
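
The kind of relation described in this example can be reproduced with a small breadth-first search over a set of triples. The following Python sketch is not RelFinder's implementation, which works on a SPARQL endpoint; it only illustrates how paths between two resources can be found. All identifiers and the function find_paths are hypothetical, and edges are followed in both directions during the search.

```python
from collections import defaultdict, deque

# Hypothetical triples mirroring the example: two researchers share one
# publication, and an acronym is linked to the second researcher through
# another paper.
TRIPLES = [
    ("pub:1", "dc:creator", "researcher:A"),
    ("pub:1", "dc:creator", "researcher:B"),
    ("pub:2", "dc:creator", "researcher:B"),
    ("pub:2", "InSciTe:Acronym", "acronym:LOD"),
]


def find_paths(triples, start, goal, max_length=3):
    """Return all cycle-free paths from start to goal.

    A path is a list alternating between resources and properties.
    `max_length` limits the number of edges on a path.
    """
    neighbours = defaultdict(list)
    for subject, predicate, obj in triples:
        neighbours[subject].append((predicate, obj))
        neighbours[obj].append((predicate, subject))

    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if (len(path) - 1) // 2 >= max_length:
            continue  # edge limit reached, do not extend this path
        for predicate, nxt in neighbours[node]:
            if nxt not in path[::2]:  # resources sit at even positions
                queue.append(path + [predicate, nxt])
    return paths
```

Calling find_paths(TRIPLES, "researcher:A", "researcher:B") returns the single path over the common publication, and the acronym reaches the second researcher over the other paper in the same way.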
                                                                          To build these relations, an n:m database table is needed, which
       Figure 2. View on a specific publication (excerpt).                contains one identifying ID for the publication and one for the
Furthermore, the email address of the main author is shown. It is         author, which we call ‘sequence number’. The D2R server
part of the publication, because email addresses can change from          interlinks the affiliation data with the publication data, based on
publication to publication. As the work is still in progress, the         this additional table. This table has to be extended for all
“sch:editor” data field is not finished yet, because it contains more     publications and researchers as part of the further research.
than one author divided by a pipe symbol. This will be part of the
following work. Figure 3 shows for example the person view,               5. TRANSFERABILITY
which contains information about a specific researcher.                   Because of the open architecture, this system can easily be
                                                                          adapted to other domains. In recent years, patent mining has
                                                                          gained in popularity. Considerations for data acquisition and data
                                                                          providing were investigated considering several aspects [14][15].
                                                                          The use of RDF technology is also discussed in the area of patent
                                                                          mining and implemented, for instance, as information retrieval
                                                                          system for biomedical patents by Mukherjea and Bamba [14].
                                                                          Furthermore, it visualizes the connections between patents, but
                                                                          does not allow any user interaction.
                                                                          The presented LOD system can be applied very well to patent
                                                                          mining and expand existing approaches by the factor of
                                                                          information integration in the semantic web (data providing). The
                                                                          objects of interest, for example, are inventors, assignees, titles,
             Figure 3. View on a specific researcher
                                                                          abstracts etc. This information can be interlinked by relations like
                                                                          “refers”, “invented”, “assigned” etc. [14]. The semantic
In this case the researcher has one publication which is interlinked      representation of patents as well as of academic documents is
by the “is dc:creator of” property. Furthermore, an email address         similar. Both document types can be divided into two parts: The
and the affiliation are given. The affiliation data field is subject of
document structure and the content [15]. Common vocabularies              RDF knowledge bases. In Semantic Multimedia, pp. 182-
are available for both information sources.                               187, Springer Berlin Heidelberg.
                                                                      [5] Hert, M., Reif, G. and Gall, H. C. 2011. A comparison of
6. CONCLUSIONS                                                            RDB-to-RDF mapping languages. In Proceedings of the 7th
The main idea of this paper is to unify scientific data such as           International Conference on Semantic Systems, pp. 25-32,
researcher information, publications and acronyms as well as their        ACM.
expansions from several original resources and to provide them as
                                                                      [6] Jentzsch, A., Zhao, J., Hassanzadeh, O., Cheung, K. H.,
machine-readable and structured RDF graphs, which allow
                                                                          Samwald, M. and Andersson, B. 2009. Linking open drug
interlinking and automatic processing. For this reason we
                                                                          data. In I-SEMANTICS.
introduced a LOD system for scientific data sets.
                                                                      [7] Kim, J., Hwang, M., Jeong, D. H. and Jung, H. 2012.
Based on the data visualization through RelFinder, the system can         Technology trends analysis and forecasting application
further help to identify latent relations between researchers based       based on decision tree and statistical feature analysis. In
on publications, acronyms or for example co-authors and further           Expert Systems with Applications, 39(16), pp. 12618-12625.
to disambiguate single objects and classes.
                                                                      [8] Le-Phuoc, D., Parreira, J. X., Hausenblas, M., Han, Y. and
As mentioned above, this is work in progress and we want to               Hauswirth, M. 2010. Live linked open sensor database. In
apply further algorithms on the data sets to solve existing               Proceedings of the 6th International Conference on Semantic
disambiguation problems, especially in the researcher data set.           Systems, p. 46, ACM.
Additionally we will expand the linked properties between single
                                                                      [9] Lohmann, S., Heim, P., Stegemann, T. and Ziegler, J. 2010.
classes in order to improve the identification of relations between
                                                                          The RelFinder User Interface: Interactive Exploration of
these classes.
                                                                          Relationships between Objects of Interest. In Proceedings of
It could further be shown that the system is portable due to its          the 15th international conference on Intelligent user
flexible adaptation to other domains (e.g. patent mining) although        interfaces, pp. 421-422, ACM.
the prototypical implementation was designed for other sources        [10] Lukovnikov, D., Stadler, C., Kontokostas, D., Hellmann, S.,
(academic publications).                                                   and Lehmann, J. 2014. DBpedia Viewer-An Integrative
In the near future, we will publish all data sets for research             Interface for DBpedia Leveraging the DBpedia Service Eco
purposes. They will be made available online via our homepage              System. In Proceedings of the 7th Workshop on Linked Data
(http://inscite.kisti.re.kr/ or http://semantic.kisti.re.kr).              on the Web.
                                                                      [11] Mitrevski, M., Jovanovik, M., Stojanov, R. and Trajanov, D.
                                                                           2012. Open University Data. In Proceedings of the 9th
7. ACKNOWLEDGMENTS                                                         Conference for Informatics and Information Technology.
This work was supported by the KISTI [K-14-L02-C03-S03,               [12] Samwald, M., Jentzsch, A., Bouton, C., Kallesøe, C. S.,
Development of Technologies for S&T Text Big Data Analytics                Willighagen, E., Hajagos, J. and Stephens, S. 2011. Linked
Application Platform] and a grant from the University of                   open drug data for pharmaceutical research and
Paderborn, Germany.                                                        development. In Journal of cheminformatics, 3(1), p. 19.
                                                                      [13] Latif, A., Afzal, M. T. and Maurer, H. A. 2012. Weaving
8. REFERENCES                                                              Scholarly Legacy Data into Web of Data. J. UCS, 18(16), pp.
[1] Berners-Lee, T., Bizer, C. and Heath, T. 2009. Linked data-            2301-2318.
    the story so far. International Journal on Semantic Web and
    Information Systems, 5(3), pp. 1-22.                              [14] Mukherjea, S. and Bamba, B. 2004. BioPatentMiner: an
                                                                           information retrieval system for biomedical patents. In
[2] Bizer, C. and Cyganiak, R. 2006. D2r server - publishing               Proceedings of the Thirtieth international conference on Very
    relational databases on the semantic web. Poster at the 5th            large data bases.
    International Semantic Web Conference.
                                                                      [15] Ghoula, N., Khelif, K. and Dieng-Kuntz, R. 2007.
[3] Deng, D. P., Mai, G. S., Hsu, C. H., Chang, C. L., Chuang,             Supporting patent mining by using ontology-based semantic
    T. R., Shao, K. T. 2012. Linking Open Data Resources for               annotations. In Proceedings of the Web intelligence,
    Semantic Enhancement of User-Generated Content. In the                 IEEE/WIC/ACM international conference, pp. 435-438.
    book of Semantic Technology. pp. 362-367.
[4] Heim, P., Hellmann, S., Lehmann, J., Lohmann, S. and
    Stegemann, T. 2009. RelFinder: Revealing relationships in