Linked Open Data System for Scientific Data Sets

Frederik Simon Bäumer
Heinz Nixdorf Institute, HNI, University of Paderborn, Paderborn, Germany
fbaeumer@hni.upb.de

Jangwon Gim*
Korea Institute of Science and Technology Information, KISTI, Daejeon, South Korea
jangwon@kisti.re.kr

Do-Heon Jeong
Korea Institute of Science and Technology Information, KISTI, Daejeon, South Korea
heon@kisti.re.kr

Michaela Geierhos
Heinz Nixdorf Institute, HNI, University of Paderborn, Paderborn, Germany
geierhos@hni.upb.de

Hanmin Jung
Korea Institute of Science and Technology Information, KISTI, Daejeon, South Korea
jhm@kisti.re.kr

*: Corresponding author

Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at Ceur-ws.org. Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014. Hildesheim. Oct. 7th. 2014. At KONVENS’14, October 8–10, 2014, Hildesheim, Germany.

ABSTRACT
In this paper, we present a system that makes scientific data available following the linked open data principle, using standards like RDF and URI as well as the popular D2R server (D2R) and the customizable D2RQ mapping language. Our scientific data sets include acronym data and expansions, as well as researcher data such as author name, affiliation, coauthors, and abstracts. The system can easily be extended to other records. In this regard, a domain adaptation to patent mining seems possible. For this reason, obvious similarities and differences are presented here.

The data set is collected from several different providers, such as publishing houses and digital libraries, which follow different standards in data format and structure. Most of them do not support Semantic Web technologies but only the legacy HTML standard. Integrating these large amounts of scientific data into the Semantic Web is challenging and requires flexible data structures to access this information and interlink it.

Based on these data sets, we will be able to derive a general technology trend as well as the individual research domain of each researcher. The goal of our Linked Open Data System for scientific data is to give other researchers access to this data set via the Web of Linked Data. Furthermore, we implemented a visualization application that allows us to explore the relations between single data sets.

Categories and Subject Descriptors
D.2.12 [Interoperability]: Data mapping
E.2 [Data Storage Representations]: Linked representations

General Terms
Standardization, Languages

Keywords
Linked Open Data, Researcher Data, Acronym Data, D2R

1. INTRODUCTION
The increase in the number of datasets including scholarly publications in the Linked Data cloud shows the importance of Linked Data for the scientific community. There are many different providers of scientific data on the Web. Publishing houses, digital libraries, and resellers operate popular data sources. The information types range from author information (e.g. full name, affiliation, email), through publications (e.g. title, abstract, coauthors), to specific data like acronyms and their related expansions. Each of these types has its own requirements for data presentation and storage, but they are all interlinked with each other in some way.

One common way to represent these interlinks is the network model. This database model is a generalized graph structure without any hierarchical restriction, which allows objects to be stored with their individual relationships. This is a common way to store data, but it does not fit our requirements for modern data publishing.

A more promising way to share data is the Web of Linked Data. The main idea of Linked Open Data (LOD) is to publish freely accessible, structured data and to interlink it with other data. This interlinking generates more valuable information through the consistent use of standard Web technologies such as RDF and URIs. The core benefit for further data exploration is the ability to apply complex graph queries, which allow data to be interlinked, combined and modified.

To publish scientific data stored in relational databases, a data bridge is needed. For that reason, we present our three-component LOD system for scientific data sets, based on the popular D2R server and the customizable D2RQ mapping language. For further data exploration, we integrated RelFinder, an application that visualizes relationships between RDF objects and enables interactive exploration of the data. We demonstrate the functionality of our system on a test set.

2. RELATED WORK
A lot of work on Linked Open Data and the Semantic Web has already been done. Popular tools like the D2R server and standards like RDF, HTTP and SPARQL are well proven and used in many LOD systems [1].

A major technological advantage is the backward compatibility, for example, to relational databases (RDB), because the majority of data on the current Web is stored in this kind of database. The process of mapping RDB to RDF is a subject of current research, and different approaches like D2RQ, Triplify and R2RML have been created [5]. R2RML is developed by the W3C with the main goal of defining an RDB-to-RDF mapping standard for read-only data access. A different approach is Triplify. It is a very lightweight plugin for existing Web applications that makes database content available as RDF and in other formats.

RDB-to-RDF mapping can be done by applications such as the D2R server. It is a Java-based application for publishing the content of relational databases on the Semantic Web and for providing RDF and HTML representations of resources. The D2RQ language is used for the mapping; it is popular because it makes it very easy to provide access via SPARQL queries [5].

The ability to interlink resources by means of “typed relationships” allows goal-oriented navigation through the database content by web browsers as well as crawler applications [2]. Because these LOD systems and components are very flexible, they can be adapted to different domains. Interlinked user-generated content from social networks [3], Linked Open Drug Data (LODD) for pharmaceutical research and development [6][12], and a live LOD database of semantically enriched sensor data [8] are only a few successful examples that use the D2R server.

Latif, Afzal and Maurer [13] used the D2R server to publish unstructured datasets of the Journal of Universal Computer Science (J.UCS) as Linked Data on the Web. The Linked Open Data project provides a new way to publish machine-readable structured data on the Web and best practices for interlinking these structured datasets. Moreover, the increase in the number of datasets including scholarly publications in the Linked Data cloud shows its importance in the scientific community. In order to take advantage of the benefits of LOD projects, the legacy HTML data of this journal was converted into machine-readable and structured RDF data using the D2R server. The server is considered appropriate for data conversion due to its good performance, its scalability and the availability of SPARQL endpoint and explorer features. An RDF graph converted from the legacy HTML data has been made available in the Linked Data cloud for data reuse and interlinking. Moreover, the structured journal data was interlinked with Linked Data resources: the datasets of authors and publications were successfully disambiguated and interlinked with DBpedia, DBLP and Faceted DBLP as well as CiteULike.

In addition, Mitrevski, Jovanovik, Stojanov and Trajanov [11] identified some issues with the D2R server that appeared while publishing the open data of the Faculty of Computer Science and Engineering at Ss. Cyril and Methodius University. One problem is that the relational databases include some confidential data about employees and students, but the D2R server does not provide a way to convert only specific parts of the database into a Semantic Web format. Moreover, it lacks functionality that enables the user to link existing ontologies to the tables. These issues can be solved by creating a new database, called Open Data DB, that includes only data without privacy concerns, and by building a mapping tool that uses the functionality of the D2R server but also links the data with ontologies. Although a mapping tool was created, the study proposes further improvements of this system, such as automatically proposed ontology annotations.

For a structured representation of semantically annotated data and a more intuitive exploration, RelFinder was created by Heim et al. [4]. The main idea is that a structured representation of RDF data opens up new possibilities in the way the data can be accessed and queried. For that purpose, a force-directed graph layout is implemented that supports every RDF knowledge base that provides standardized SPARQL access [9]. Relationships between objects can be identified much more easily by exploring the data step by step. Features like highlighting, previewing, and filtering are available for further support [4]. The class filter and the link filter make it possible to hide objects that contain specific classes or links. Primarily, these are objects that are not of interest in the particular case. In addition, the length filter and the connectivity filter allow a further selection by hiding objects with a specific number of links and relationships [9].

RelFinder is often used as a stand-alone application but also as an integrated visualization application. It is, for example, part of the DBpedia Viewer, an integrative interface for DBpedia, which offers LodLive as a further visualization approach next to RelFinder. The LodLive application is a web-based tool that allows the exploration of Linked Data in an intuitive and interactive way [10]. In comparison, both applications provide a good visualization, but RelFinder offers higher browser compatibility. Especially in Internet Explorer, LodLive throws JavaScript errors when drawing several relations between classes.

3. LOD SYSTEM FOR SCIENTIFIC DATA
The scientific data sets are collected from different resources and are based on different formats. For that reason, it is a difficult task to combine information such as researchers, affiliations or publications and to provide it as a clean, interlinked data set that meets quality requirements. To contribute to the existing Web of Linked Data, the data has to be published in a standardized, structured way, using common vocabularies and a common server framework. Furthermore, it has to be identified which information can be bundled into a data set containing relevant data and how this data can be usefully interlinked. For this reason, we developed a LOD system for scientific data sets that can handle all related information, like researcher data or publications, and publish it in a structured way using common Semantic Web technologies like RDF and SPARQL. The interlinked data becomes more useful and can easily be adapted to other research topics.

3.1 System architecture
The system architecture in Figure 1 can be divided into three main components. The first component contains the data sets, which hold the specific data resulting from the acquisition and preprocessing steps we applied beforehand. Because these data sets are stored in a relational database, a Linked Data view on the existing database is needed (data bridge). D2R Server, which is part of the second step, is a SPARQL endpoint capable of serving Linked Data views on relational databases.

We chose D2R because it is one of the most important and most mature of the relevant solutions.

The second step, called “Linked Data System”, is responsible for the declarative mapping between the schemata of the database and the target RDF terms, based on mapping rules. These rules are stored in mapping files and are formalized in the popular D2RQ mapping language. Each rule defines in detail how resources are identified and how they have to be handled (e.g. how property values are found) in the SPARQL endpoint. SPARQL is a powerful query language that allows RDF data to be accessed and modified. The Endpoint in the third step is able to translate SPARQL-based queries into SQL queries, which allows live database access, even for applications that are not SQL-compatible.

The “Application System” is the third step. It allows the data sets to be explored visually. With this application, researchers can find new latent interconnections in the database, for example an exceptional use of rare acronyms by an individual author. This exploration is based on the RDF query language SPARQL from the second step.

Figure 1. LOD system architecture for scientific data sets.

3.2 Data Set
Digital resources like the Digital Bibliography & Library Project (DBLP) and publishing houses like Springer, IEEE or Elsevier serve as data providers. Unfortunately, as already mentioned, the data quality is inconsistent between the data providers. For that reason, several refinement steps were applied to the data set, especially for ambiguity detection. Since the work is still in progress, we plan to apply more algorithms for named entity disambiguation with the goal of increasing the quality of the data set.

The researcher data set contains 8,370,074 publications such as PhD theses, articles or online resources. These publications come with additional information, for example abstracts, author names, coauthors, year of publication and affiliations. Furthermore, we identified acronyms as well as their individual expansions based on the publications' abstracts. This information is also part of the researcher data set, because it can support a future disambiguation of researchers. One of our research subjects applied to these datasets is, for example, the detection of technology trends and the identification of the research domain of individual researchers. Through the data visualization, we expect to find more latent relationships between objects and classes, which allow us to disambiguate single named entities.

3.3 Common vocabularies
Common vocabularies are necessary to enhance the interoperability between concepts. For that reason, we use them as far as they already exist and exactly describe the data field.

For the interlinking of the data and the automatic processing, the exact description of an attached common vocabulary is very important. For person-related information we use, for example, the schema.org types and properties.

One example of a prefix is “dc”, which is commonly used for the Dublin Core Metadata Initiative Terms (http://www.dublincore.org). We introduced “sch” as another prefix, which describes the schemas of schema.org (http://www.schema.org). For the data fields “Acronym”, “Element type” and “Expansion”, no fitting vocabularies are known at the moment. Because of that, we introduced our own “InSciTe” prefix and related properties. They may be replaced in the further course of the work and should be seen as temporary.

Table 1. Common vocabulary used for scientific data sets

Data field                     Property
Abstract                       pmlp:hasAbstract
Acronym                        InSciTe:Acronym
Affiliation                    sch:affiliation
Author                         dc:creator
Coauthor                       sch:contributor
Editor                         sch:editor
Editor Email                   sch:email
Element type (e.g. journal)    InSciTe:ElementType
Author's Email                 sch:email
Expansion                      InSciTe:Expansion
Source URL                     sch:isBasedOnUrl
Title                          deri:hasTitle
Year of publication            sch:datePublished

3.4 Design of URIs
Linked Data is based on URIs, which identify things and enable users and computer-based agents to refer to these things or look them up. In this case, URIs identify entities like researchers and show their relationships to other researchers or publications. The D2R server manages the URIs mostly automatically. For the class overview pages, we applied the following structure:

“http://{ip}:{port}/directory/{classname}s”

For individual resources, like a researcher, we applied:

“http://{ip}:{port}/page/{classname}/{id}”

4. IMPLEMENTATION
In this section, we explain the current state of the system’s implementation, including the D2R server and the RelFinder visualization.

4.1 Test data set
For this project, we created a test set of 60 researchers and 400 related publications as well as 454 publication-related acronyms and expansions. Researchers were selected by their number of publications, which has to be at least two. KISTI has diverse scientific data sets, which are derived from papers, patents and other sources. In order to find more valuable relationships in those data sets, we extracted acronyms and expansions. The number of test data is 491,982. Using these acronyms and expansions, we can analyze technology trends and find specific researchers who may be experts on these acronyms or expansions [7]. The average number of expansions per acronym is 2.9. We observed that the distribution of expansions follows Zipf’s law. We separated each acronym name based on its semantics, because an acronym name can be ambiguous. Therefore, each expansion name can have its own acronym name and its own URI. In order to get more valuable analysis reports from acronyms and their expansions, we have to create relationships between these data sets and other resources such as Linked Open Data, SNS data, Freebase, DBpedia and others. These relationships ultimately yield more information.

4.2 DataHub System based on LOD
The publication view is shown in Figure 2. This view contains information about the publication itself, but also further information like acronyms or coauthors, which have the property “sch:contributor”. There are two detected acronyms, each of which is interlinked with its own class. This class contains the expansion and another related publication, which contains the acronym with the same meaning. This allows us to find related publications based on acronyms as a first indication for the subsequent classification.

Figure 2. View on a specific publication (excerpt).

Furthermore, the email address of the main author is shown. It is part of the publication, because email addresses can change from publication to publication. Since the work is still in progress, the “sch:editor” data field is not finished yet, because it contains more than one author, separated by a pipe symbol. This will be part of future work. Figure 3 shows, as an example, the person view, which contains information about a specific researcher.

Figure 3. View on a specific researcher

In this case, the researcher has one publication, which is interlinked by the “is dc:creator of” property. Furthermore, an email address and the affiliation are given. The affiliation data field is a subject of our current work, because more standardization and comparability are needed in order to interlink this data field. The acronyms are interlinked within their own class, which contains the expansion and other related publications.

An example graph produced by the RelFinder visualization application is shown in Figure 4. Instead of using an additional HTTP server like Apache, we integrated the RelFinder application into the already existing D2R web server. That way, no difficult setup is needed in order to start our system; all components are loaded during the program’s initialization.

Figure 4. Data visualization with RelFinder

Here, we added two researcher resources and one acronym. An edge shows the relation between two RDF objects in a unidirectional way. RelFinder looks up the relations between these resources and draws a graph. In this example, the two researchers have one publication in common (red relation). In other words, both researchers are creators (authors) of this publication. Furthermore, the acronym is related to the second researcher through another paper.

To build these relations, an n:m database table is needed that contains one identifying ID for the publication and one for the author, which we call the ‘sequence number’. The D2R server interlinks the affiliation data with the publication data based on this additional table. This table has to be extended to all publications and researchers as part of further research.

5. TRANSFERABILITY
Because of its open architecture, this system can easily be adapted to other domains. In recent years, patent mining has gained in popularity. Considerations for data acquisition and data provision have been investigated with regard to several aspects [14][15]. The use of RDF technology is also discussed in the area of patent mining and has been implemented, for instance, as an information retrieval system for biomedical patents by Mukherjea and Bamba [14]. That system also visualizes the connections between patents, but does not allow any user interaction.

The presented LOD system can be applied very well to patent mining and can extend existing approaches by integrating information into the Semantic Web (data provision). The objects of interest are, for example, inventors, assignees, titles, abstracts etc. This information can be interlinked by relations like “refers”, “invented”, “assigned” etc. [14]. The semantic representations of patents and of academic documents are similar. Both document types can be divided into two parts: the document structure and the content [15]. Common vocabularies are available for both information sources.

6. CONCLUSIONS
The main idea of this paper is to unify scientific data such as researcher information, publications, and acronyms as well as their expansions from several original resources and to provide them as machine-readable and structured RDF graphs, which allow interlinking and automatic processing. For this reason, we introduced a LOD system for scientific data sets.

Based on the data visualization through RelFinder, the system can further help to identify latent relations between researchers based on publications, acronyms or, for example, co-authors, and to disambiguate single objects and classes.

As mentioned above, this is work in progress, and we want to apply further algorithms to the data sets to solve existing disambiguation problems, especially in the researcher data set. Additionally, we will expand the linked properties between single classes in order to improve the identification of relations between these classes.

It could further be shown that the system is portable due to its flexible adaptation to other domains (e.g. patent mining), although the prototypical implementation was designed for other sources (academic publications).

In the near future, we will publish all data sets for research purposes. They will be made available online via our homepage (http://inscite.kisti.re.kr/ or http://semantic.kisti.re.kr).

7. ACKNOWLEDGMENTS
This work was supported by the KISTI [K-14-L02-C03-S03, Development of Technologies for S&T Text Big Data Analytics Application Platform] and a grant from the University of Paderborn, Germany.

8. REFERENCES
[1] Berners-Lee, T., Bizer, C. and Heath, T. 2009. Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3), pp. 1-22.
[2] Bizer, C. and Cyganiak, R. 2006. D2R server - publishing relational databases on the semantic web. Poster at the 5th International Semantic Web Conference.
[3] Deng, D. P., Mai, G. S., Hsu, C. H., Chang, C. L., Chuang, T. R. and Shao, K. T. 2012. Linking Open Data Resources for Semantic Enhancement of User-Generated Content. In the book of Semantic Technology, pp. 362-367.
[4] Heim, P., Hellmann, S., Lehmann, J., Lohmann, S. and Stegemann, T. 2009. RelFinder: Revealing relationships in RDF knowledge bases. In Semantic Multimedia, pp. 182-187, Springer Berlin Heidelberg.
[5] Hert, M., Reif, G. and Gall, H. C. 2011. A comparison of RDB-to-RDF mapping languages. In Proceedings of the 7th International Conference on Semantic Systems, pp. 25-32, ACM.
[6] Jentzsch, A., Zhao, J., Hassanzadeh, O., Cheung, K. H., Samwald, M. and Andersson, B. 2009. Linking open drug data. In I-SEMANTICS.
[7] Kim, J., Hwang, M., Jeong, D. H. and Jung, H. 2012. Technology trends analysis and forecasting application based on decision tree and statistical feature analysis. In Expert Systems with Applications, 39(16), pp. 12618-12625.
[8] Le-Phuoc, D., Parreira, J. X., Hausenblas, M., Han, Y. and Hauswirth, M. 2010. Live linked open sensor database. In Proceedings of the 6th International Conference on Semantic Systems, p. 46, ACM.
[9] Lohmann, S., Heim, P., Stegemann, T. and Ziegler, J. 2010. The RelFinder User Interface: Interactive Exploration of Relationships between Objects of Interest. In Proceedings of the 15th international conference on Intelligent user interfaces, pp. 421-422, ACM.
[10] Lukovnikov, D., Stadler, C., Kontokostas, D., Hellmann, S. and Lehmann, J. 2014. DBpedia Viewer - An Integrative Interface for DBpedia Leveraging the DBpedia Service Eco System. In Proceedings of the 7th Workshop on Linked Data on the Web.
[11] Mitrevski, M., Jovanovik, M., Stojanov, R. and Trajanov, D. 2012. Open University Data. In Proceedings of the 9th Conference for Informatics and Information Technology.
[12] Samwald, M., Jentzsch, A., Bouton, C., Kallesøe, C. S., Willighagen, E., Hajagos, J. and Stephens, S. 2011. Linked open drug data for pharmaceutical research and development. In Journal of Cheminformatics, 3(1), p. 19.
[13] Latif, A., Afzal, M. T. and Maurer, H. A. 2012. Weaving Scholarly Legacy Data into Web of Data. J. UCS, 18(16), pp. 2301-2318.
[14] Mukherjea, S. and Bamba, B. 2004. BioPatentMiner: an information retrieval system for biomedical patents. In Proceedings of the Thirtieth international conference on Very large data bases.
[15] Ghoula, N., Khelif, K. and Dieng-Kuntz, R. 2007. Supporting patent mining by using ontology-based semantic annotations. In Proceedings of the Web intelligence, IEEE/WIC/ACM international conference, pp. 435-438.
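
The two URI patterns given in Section 3.4 can be illustrated with a small helper. This is only a sketch: the function names, the example host `127.0.0.1`, the port `2020` and the class name `researcher` are assumptions chosen for illustration and are not taken from the described system.

```python
from urllib.parse import quote


def class_overview_uri(ip: str, port: int, classname: str) -> str:
    """Build the overview URI of a class: http://{ip}:{port}/directory/{classname}s"""
    return f"http://{ip}:{port}/directory/{quote(classname, safe='')}s"


def resource_uri(ip: str, port: int, classname: str, resource_id) -> str:
    """Build the URI of one resource: http://{ip}:{port}/page/{classname}/{id}"""
    safe_class = quote(classname, safe="")
    safe_id = quote(str(resource_id), safe="")
    return f"http://{ip}:{port}/page/{safe_class}/{safe_id}"


if __name__ == "__main__":
    # Host, port and class name are hypothetical example values.
    print(class_overview_uri("127.0.0.1", 2020, "researcher"))
    print(resource_uri("127.0.0.1", 2020, "researcher", 42))
```

Percent-encoding the class name and the ID is a design choice of this sketch so that unusual characters cannot break the path structure.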
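
To make the role of the vocabulary in Table 1 more concrete, the following sketch turns one flat record into (subject, property, value) triples. The property names are copied from Table 1. The record keys (`title`, `editor`, ...), the subject string and the function names are hypothetical. The sketch also splits the pipe-separated “sch:editor” field mentioned in Section 4.2, which is one possible way to handle it and not necessarily the way the authors will.

```python
from __future__ import annotations

# Data field -> property, copied from Table 1 (dictionary keys are hypothetical).
VOCABULARY = {
    "abstract": "pmlp:hasAbstract",
    "acronym": "InSciTe:Acronym",
    "affiliation": "sch:affiliation",
    "author": "dc:creator",
    "coauthor": "sch:contributor",
    "editor": "sch:editor",
    "email": "sch:email",
    "element_type": "InSciTe:ElementType",
    "expansion": "InSciTe:Expansion",
    "source_url": "sch:isBasedOnUrl",
    "title": "deri:hasTitle",
    "year": "sch:datePublished",
}

# Fields assumed to hold several values separated by a pipe symbol.
PIPE_SEPARATED = {"editor"}


def split_pipe(value: str) -> list[str]:
    """Split 'A | B' into ['A', 'B'], dropping empty parts."""
    return [part.strip() for part in value.split("|") if part.strip()]


def record_to_triples(subject: str, record: dict[str, str]) -> list[tuple[str, str, str]]:
    """Map each known field of a record to one or more triples."""
    triples = []
    for field, value in record.items():
        prop = VOCABULARY.get(field)
        if prop is None:
            continue  # field has no property in the table
        values = split_pipe(value) if field in PIPE_SEPARATED else [value]
        triples.extend((subject, prop, item) for item in values)
    return triples
```

For example, `record_to_triples("pub:1", {"editor": "Ann Lee | Bo Kim"})` yields one `sch:editor` triple per editor name.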
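
The n:m table described in Section 4.2 can be sketched with an in-memory SQLite database. All table names, column names and rows below are made up for illustration; the actual schema of the system is not given in the paper. The query shows how such a link table lets one find the publications that two researchers have in common, which corresponds to the shared relation in the RelFinder example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny hypothetical schema with an n:m link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE researcher (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE publication (id INTEGER PRIMARY KEY, title TEXT);
        -- n:m link table: one row per (publication, author) pair
        CREATE TABLE authorship (
            publication_id INTEGER REFERENCES publication(id),
            researcher_id  INTEGER REFERENCES researcher(id),
            PRIMARY KEY (publication_id, researcher_id)
        );
        INSERT INTO researcher VALUES (1, 'Researcher A'), (2, 'Researcher B');
        INSERT INTO publication VALUES (10, 'Paper X'), (11, 'Paper Y');
        INSERT INTO authorship VALUES (10, 1), (10, 2), (11, 2);
        """
    )
    return conn


def shared_publications(conn: sqlite3.Connection, first_id: int, second_id: int) -> list:
    """Return the titles of publications authored by both researchers."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM publication p
        JOIN authorship a1 ON a1.publication_id = p.id AND a1.researcher_id = ?
        JOIN authorship a2 ON a2.publication_id = p.id AND a2.researcher_id = ?
        ORDER BY p.title
        """,
        (first_id, second_id),
    ).fetchall()
    return [title for (title,) in rows]


if __name__ == "__main__":
    print(shared_publications(build_demo_db(), 1, 2))
```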
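
Section 3.1 states that applications reach the data through SPARQL. As a hedged illustration, the sketch below builds a SPARQL query for the publications that two researchers share via `dc:creator` (the property from Table 1) and encodes it as an HTTP GET URL with a `query` parameter, as defined by the SPARQL protocol. The endpoint address `http://localhost:2020/sparql` and the researcher URIs are assumptions; nothing is sent over the network.

```python
from urllib.parse import urlencode

# Namespace of the Dublin Core element set, bound to the "dc" prefix.
DC = "http://purl.org/dc/elements/1.1/"


def shared_publications_query(researcher_a: str, researcher_b: str) -> str:
    """Build a SPARQL query for publications created by both researchers."""
    return (
        f"PREFIX dc: <{DC}>\n"
        "SELECT ?publication WHERE {\n"
        f"  ?publication dc:creator <{researcher_a}> .\n"
        f"  ?publication dc:creator <{researcher_b}> .\n"
        "}"
    )


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request URL for a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Endpoint and resource URIs are hypothetical example values.
    query = shared_publications_query(
        "http://localhost:2020/resource/researcher/1",
        "http://localhost:2020/resource/researcher/2",
    )
    print(sparql_get_url("http://localhost:2020/sparql", query))
```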
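
Section 4.1 explains that ambiguous acronyms are separated by meaning, so that each expansion gets its own acronym entry, and reports an average number of expansions per acronym. The following sketch shows one way such a grouping and average could be computed. The identifier format (`LOD_1`, `LOD_2`) and the sample pairs are invented for illustration and say nothing about the real data set.

```python
from collections import defaultdict


def index_senses(pairs):
    """Group distinct expansions under their acronym, keeping first-seen order."""
    senses = defaultdict(list)
    for acronym, expansion in pairs:
        if expansion not in senses[acronym]:
            senses[acronym].append(expansion)
    return dict(senses)


def sense_ids(senses):
    """Give every (acronym, expansion) sense its own identifier, e.g. 'LOD_1'."""
    return {
        (acronym, expansion): f"{acronym}_{number}"
        for acronym, expansions in senses.items()
        for number, expansion in enumerate(expansions, start=1)
    }


def average_expansions(senses):
    """Average number of distinct expansions per acronym."""
    return sum(len(expansions) for expansions in senses.values()) / len(senses)


if __name__ == "__main__":
    # Made-up sample pairs, as they might be extracted from abstracts.
    sample = [
        ("LOD", "Linked Open Data"),
        ("LOD", "Level of Detail"),
        ("RDF", "Resource Description Framework"),
        ("LOD", "Linked Open Data"),
    ]
    grouped = index_senses(sample)
    print(sense_ids(grouped))
    print(average_expansions(grouped))
```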