Semantically Mapping Science (SMS) Platform

Ali Khalili¹, Peter van den Besselaar², Al Koudous Idrissou¹, Klaas Andries de Graaf¹ and Frank van Harmelen¹

¹ Department of Computer Science, Vrije Universiteit Amsterdam, NL
{a.khalili,o.a.k.idrissou,ka.de.graaf,frank.van.harmelen}@vu.nl
² Department of Organization Sciences, Vrije Universiteit Amsterdam, NL
p.a.a.vanden.besselaar@vu.nl

Abstract. Up to now, STI (Science, Technology, Innovation) studies have been either rich but small-scale (qualitative case studies) or large-scale but under-complex, because they generally use only a single dataset such as Patstat, Scopus, WoS (Web of Science) or the OECD STI indicators, and therefore deploy only a few variables, determined by the data available. However, progress in the STI research field (and in the social sciences in general) depends, in our view, on the ability to do large-scale studies, often with many variables specified by relevant theories. There is a need for studies that are at the same time big and rich. The aim of the Semantically Mapping Science (SMS) platform is to enable the enrichment and integration of heterogeneous data, ranging from tabular statistical data to unstructured data found on the Web, in order to exploit the huge amount of data that is ‘out there’ in an innovative and meaningful way.

1 Introduction

Social phenomena are generally complex, and understanding them requires integrating and analyzing data from multiple sources. Up to now, STI (Science, Technology, Innovation) studies have been either rich but small-scale (qualitative case studies) or large-scale but under-complex, because they generally use only a single dataset such as Patstat, Scopus, WoS (Web of Science) or the OECD STI indicators, and therefore deploy only a few variables, determined by the data available.
However, progress in the STI research field (and in the social sciences in general) depends, in our view, on the ability to do large-scale studies, often with many variables specified by relevant theories. There is a need for studies that are at the same time big and rich. In this paper, we present the Semantically Mapping Science (SMS) platform as a means to enrich and integrate heterogeneous public and private data, ranging from tabular statistical data to unstructured data found on the Web, in an innovative and meaningful way. SMS is built as an open source platform³ and is available online at http://sms.risis.eu.

Fig. 1: The SMS Platform Architecture.

2 Architecture

As shown in Figure 1, the SMS platform consists of three main layers: the data layer, the services layer and the application layer. The data layer deals with data conversion, storage and access plans. The services layer provides a set of Web services on top of the created Linked Data to allow the development of innovative applications. The application layer is where end-users interact with the SMS platform. In this system paper, we briefly describe the main services and applications provided by the SMS platform.

2.1 Conceptual Model

At its conceptual level, the SMS platform employs an entity-centric approach to interlink heterogeneous datasets in the STI domain. As shown in Figure 2, the following entity types were extracted after an analysis of the existing RISIS datasets and their related open datasets: Funding Programs, Projects, Publications, Patents, Persons, Organizations, Organization Rankings, Geo locations, Geo boundaries and Geo statistical data. New entity types can also be added, depending on the research questions that the SMS infrastructure needs to answer. The main idea is to create a data network by linking and enriching the data, a network that social science users can access through the faceted browser.
By selecting the required entities and properties from the data network, users get an overview of the data they are interested in. In the background, the platform generates the SPARQL queries needed to retrieve the selected data from multiple datasets, in the format required for further analysis.

³ https://github.com/risis-eu/sms-platform

Fig. 2: The Main Entity Types Involved in the SMS System.

2.2 Data Curation

Metadata helps potential users of a dataset decide whether the dataset is appropriate for their purposes. The SMS platform holds a collection of heterogeneous datasets that are not always publicly accessible due to privacy issues, and that often require a researcher to be physically present at the dataset location. To access these datasets, one needs to have an access request granted. This administrative detour, which a researcher has to endure before even knowing which dataset to use for a particular research question, can reduce the number of visitors to SMS datasets. It has been shown that research publications that provide access to their base data yield consistently higher citation rates than those that do not. Therefore, to attract more users to visit and cite RISIS datasets, SMS provides a dataset metadata service and application, modeled using the Resource Description Framework (RDF), that allows researchers to search for data and gain an in-depth understanding of it without needing to access it directly [2]. The metadata service, powered by an intuitive UI, allows dataset holders to describe their datasets in a detailed, consistent and uniform way, to store the description and, if needed, to modify the stored metadata.⁴ The curated metadata are then reflected on the RISIS datasets portal, available at http://datasets.risis.eu.

2.3 Browsing and Querying Datasets

One of the objectives in developing the SMS platform was to enable non-Linked Data experts to query and browse RDF datasets without knowledge of the SPARQL query language.
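
To illustrate the kind of query that such a platform writes on the user's behalf, the following Python sketch turns a selection of one entity type, a few properties and an optional restriction into a SPARQL SELECT query. It is only a minimal sketch: the `Selection` structure, the `build_sparql` function and the `example.org` vocabulary are hypothetical and are not taken from the SMS code base.

```python
from dataclasses import dataclass, field


@dataclass
class Selection:
    """A user's choice in a faceted UI (hypothetical structure, not the SMS model)."""
    entity_type: str                 # IRI of the selected entity type
    properties: dict[str, str]       # variable name -> property IRI
    filters: dict[str, str] = field(default_factory=dict)  # variable name -> required literal


def build_sparql(selection: Selection, limit: int = 100) -> str:
    """Translate a selection into a SPARQL SELECT query string."""
    variables = " ".join(f"?{name}" for name in selection.properties)
    lines = [
        f"SELECT ?entity {variables}",
        "WHERE {",
        f"  ?entity a <{selection.entity_type}> .",
    ]
    # One triple pattern per selected property.
    for name, prop in selection.properties.items():
        lines.append(f"  ?entity <{prop}> ?{name} .")
    # One FILTER per facet restriction; escape backslashes and quotes in the literal.
    for name, value in selection.filters.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  FILTER(STR(?{name}) = "{escaped}")')
    lines.append("}")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


# Example: organizations with their name and country, restricted to one country.
# The example.org IRIs are placeholders, not a real STI vocabulary.
selection = Selection(
    entity_type="http://example.org/sti/Organization",
    properties={
        "name": "http://example.org/sti/name",
        "country": "http://example.org/sti/country",
    },
    filters={"country": "Netherlands"},
)
query = build_sparql(selection)
print(query)
```

The user only ever manipulates the selection; the query text is derived from it, which is what makes the approach usable without SPARQL knowledge.
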
There are currently two main approaches to making information retrieval from SPARQL endpoints more usable: user interaction and natural language (NL). In the category of user interaction-based query generation, faceted browsing user interfaces are a well-known technique that provides a convenient and user-friendly way to navigate through a wide range of data collections [1]. Faceted browsing UIs allow users to find information without a priori knowledge of its schema [4]. A faceted interface has several advantages over keyword search or NL queries: it allows exploration of an unknown dataset, since the system suggests restriction values at each step; it is a visual interface, removing the need to write explicit queries; and it prevents dead-end queries by offering only restriction values that do not lead to empty results [4]. SMS provides an adaptive component-based faceted browser environment⁵ on top of the LD-R framework [3] to allow end-users to explore STI-related datasets in an integrated way, and to incorporate additional features for serendipitous knowledge discovery (see Figure 3 for a screenshot).

⁴ See a screencast of the SMS metadata editor at https://youtu.be/p_2D3ydcx1U

Fig. 3: A screenshot of the SMS faceted browser.

2.4 Semantic Enrichment of Data

SMS provides a set of services and applications that allow users to enrich their data with complementary data. Three categories of data-enrichment services are provided:

Named Entity Recognition. Named-entity recognition (NER), also known as entity identification and entity extraction, is a subtask of information extraction that seeks to locate and classify named entities in text into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc.
Given a dataset that has one or more attributes with textual values, the SMS NER service can extract named entities from the text and, more importantly, connect the extracted entities to a knowledge graph or taxonomy (which can then provide more data about those entities). By default, SMS employs the DBpedia Spotlight service for NER. However, any NER service can be plugged into the SMS NER service, as long as its output is reconciled with the SMS named-entity annotation model. SMS provides an interactive UI to annotate a dataset using the NER service⁶.

⁵ See a screencast of the SMS faceted browser at https://youtu.be/9TMLKdGZExY

Geo-enrichment. Geo-enrichment is an instrument to enrich data by linking through geo-location. Many (open) datasets provide variables that are measured at some level of geographical aggregation, e.g., environmental data, educational data or socio-economic data. In order to exploit these linking and enriching possibilities, the SMS platform provides a variety of geo-services. The geo-services are based on a series of open geo-resources, such as GADM, OpenStreetMap and Flickr geotagged data. By integrating these geo-resources, the service can return, for an entity's address, its geo-location at up to 11 different levels. One practical application we built for batch processing of addresses is a Google spreadsheet add-on⁷ that chains the Google Geocoding API with our geo-boundary services. Addresses given in a spreadsheet are enriched with different levels of administrative boundaries and FUAs (functional urban areas). Users can then export the extracted boundaries and process them in geodata analysis tools such as CartoDB.⁸ We have also developed a user interface for automatic geo-enrichment of linked datasets in the SMS platform. The interface allows users to select an existing dataset and geocode the whole dataset by selecting the right attributes in the dataset⁹.
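
The chaining of geocoding and boundary lookup described above can be sketched in a few lines of Python. The example assumes that an address has already been geocoded to a longitude/latitude pair, and it uses made-up rectangular boundaries in place of real GADM or OpenStreetMap polygons. The function names, level labels and coordinates are illustrative assumptions and do not reflect the actual SMS geo-services API.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]   # (longitude, latitude)
Polygon = List[Point]         # ring of vertices, implicitly closed


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray-casting test: toggle 'inside' for every edge crossed by a ray going east."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def enrich(point: Point, boundaries: Dict[str, Dict[str, Polygon]]) -> Dict[str, Optional[str]]:
    """For each boundary level, return the name of the first area containing the point."""
    result: Dict[str, Optional[str]] = {}
    for level, areas in boundaries.items():
        result[level] = next(
            (name for name, polygon in areas.items() if point_in_polygon(point, polygon)),
            None,
        )
    return result


# Toy boundaries (rough rectangles, NOT real GADM data), keyed by a hypothetical level label.
BOUNDARIES = {
    "level0_country": {
        "Netherlands": [(3.3, 50.7), (7.3, 50.7), (7.3, 53.6), (3.3, 53.6)],
    },
    "level2_municipality": {
        "Amsterdam": [(4.7, 52.3), (5.1, 52.3), (5.1, 52.45), (4.7, 52.45)],
        "Rotterdam": [(4.3, 51.85), (4.6, 51.85), (4.6, 52.0), (4.3, 52.0)],
    },
}

# Coordinates as a geocoder might return them for an address in Amsterdam.
address_location = (4.865, 52.334)
enriched = enrich(address_location, BOUNDARIES)
print(enriched)
```

With these toy boundaries the address is assigned to "Netherlands" at the country level and "Amsterdam" at the municipality level. A production service would use real boundary polygons and a spatial index instead of scanning every area.
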

2.5 Data Linking

Linking entities across different datasets is a crucial element of the SMS platform. Whether or not two entities should be considered equal depends not only on their intrinsic properties, but also on the purpose or task for which the entities are used. As an example, to study the success of scientific organizations, STI researchers need to align research organizations across datasets such as GRID¹⁰ and OrgRef¹¹, which describe organizations in various countries, including public and private research organizations. The 3M corporation, a large multinational organization with a substantial patent portfolio, occurs in both datasets. GRID distinguishes between national 3M branches in six countries, namely 3M (Canada), 3M (France), 3M (Germany), 3M (Israel), 3M (United Kingdom) and 3M (United States), while OrgRef refers only to a single 3M entity. Should these entities be designated as “the same” across these datasets? It depends. For a study that aims to compare organizations at a global level, all branches of ‘3M’ should be considered the same. For a study that compares organizations across countries, however, the Canadian and U.S. branches of ‘3M’ should be considered separately.

SMS provides a novel approach called “Lenticular Lens” for building context-specific links between entities of interest. These links are decorated with rich metadata describing how, why, when and by whom they were generated. As shown in Figure 4, SMS exposes an intuitive UI¹², available at http://lenticular-lens.risis.eu, that allows end-users to create their own lenticular lenses.

Fig. 4: A screenshot of the SMS data linking UI.

⁶ See a screencast of the NER UI at https://youtu.be/OcYNpVRP9_Q
⁷ https://docs.google.com/document/d/1JoJM7VF_ZaaAPbSjtgpydzRDYLvr-tROzhITGj0cH3w
⁸ See a screencast of the SMS Google spreadsheet add-on at https://youtu.be/qZGDD5RN7pI
⁹ See a screencast of the geo-enricher UI at https://youtu.be/PFalWjluMR8
¹⁰ See https://grid.ac/
¹¹ See http://www.orgref.org/web/download.htm
¹² See a screencast of the linking UI at https://youtu.be/CcffBlCBF54?list=PLo4YbUaRFSnwJ9XJvp6rlIMsaw_rfKT9C

3 Use Cases

In order to demonstrate how the SMS platform can be used for research, we describe several use cases at http://sms.risis.eu/usecases. The use cases demonstrate different features of the platform in addressing specific challenges, covering topics such as investigating the network structure of research organizations, browsing research data on the temporal evolution of higher education, analyzing the geography of innovation and the structure of research portfolios, and predicting the Leiden Ranking from university environment factors.

References

1. M. Hildebrand, J. van Ossenbruggen, and L. Hardman. /facet: A browser for heterogeneous semantic web repositories. In ISWC, pages 272–285, 2006.
2. A. K. Idrissou, A. Khalili, R. Hoekstra, and P. van den Besselaar. Managing metadata for science, technology and innovation studies: The RISIS case. In A. Adamou, E. Daga, and L. Isaksen, editors, WHiSe, volume 1608 of CEUR Workshop Proceedings, pages 15–20. CEUR-WS.org, 2016.
3. A. Khalili, A. Loizou, and F. van Harmelen. Adaptive linked data-driven web components: Building flexible and reusable semantic web interfaces. In ESWC, volume 9678 of Lecture Notes in Computer Science, pages 677–692. Springer, 2016.
4. E. Oren, R. Delbru, and S. Decker. Extending faceted navigation for RDF data. In International Semantic Web Conference, volume 4273, pages 559–572. Springer, 2006.