<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Managing metadata for science, technology and innovation studies: The RISIS case</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Al Koudous Idrissou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Khalili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rinke Hoekstra</string-name>
          <email>rinke.hoekstra@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter van den Besselaar</string-name>
          <email>p.a.a.vanden.besselaar@vu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Vrije Universiteit Amsterdam</institution>
          ,
          <addr-line>NL</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Organization Sciences, Vrije Universiteit Amsterdam</institution>
          ,
          <addr-line>NL</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Law, University of Amsterdam</institution>
          ,
          <addr-line>NL</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>15</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Here, we describe the RISIS-SMS metadata system, developed to support the use of heterogeneous datasets in the field of Science, Technology and Innovation Studies (STIS). These data are partly within the RISIS infrastructure, but often elsewhere. The system has three aims: (i) to help researchers search for and understand data that will help to answer specific research questions, without having to access or download the data. As datasets often have restricted access, browsing metadata is a key feature of the system: researchers need help identifying the relevant data from different sources for their research, and deciding for which data it is worthwhile asking for access; (ii) to support the enrichment of data by linking the metadata system to the Linked Open Data environment (LOD); (iii) to facilitate application-driven data integration.</p>
      </abstract>
      <kwd-group>
        <kwd>metadata</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Science &amp; Technology Studies</kwd>
        <kwd>Research Infrastructures</kwd>
        <kwd>digital humanities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The field of Science, Technology and Innovation Studies (STIS) is an
interdisciplinary field between the social sciences and the humanities. It covers many fields,
from the economics of science and innovation to the history and philosophy of
science [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It relies on the availability of a large volume of highly heterogeneous
data: structured and unstructured, qualitative and quantitative. STIS studies the
dynamics of scientific ideas by analysing the content of scientific publications
and project descriptions, for example to help understand the selection processes
taking place in the scientific community or to better understand the life histories of
scientists and research organizations.
      </p>
      <p>
        Based on requirements extracted from interviews we conducted, we
identify two needs: researchers need to search across datasets, and data providers
need to attract researchers while keeping access to their datasets restricted. To address these
problems, we describe in this paper a rich RDF metadata vocabulary that overcomes
access limitations and facilitates data discovery, integration and use by humans
and machines [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The vocabulary is used by the Semantically-Mapping Science<sup>4</sup>
infrastructure (SMS) to provide metadata services alongside geo services,
integration services and category services. It enables the integration of qualitative and
quantitative approaches, which have been strongly diverging over time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Requirements</title>
      <p>Problem Summary. At its core, SMS comprises a collection of proprietary and
public databases relevant to the field of STIS. For most RISIS datasets, access
is restricted to authorized users, and authorization is only granted after an explicit request
is submitted to the data owner. Other data can only be accessed on site at a
physical location. This gets in the way of a good understanding of the content
and coverage of a dataset: if the data is not reachable, how can a researcher decide whether a closed
dataset is relevant enough to their research question before requesting
access or travelling to visit a dataset owner? This access limitation not
only hinders findability and relevance assessment ex ante, but also hinders the ex
post integration of heterogeneous datasets after they have been identified.
</p>
      <p>Methodology. To address this problem, we designed a metadata vocabulary guided
by informal interviews conducted with STIS researchers and data-owners. This
helped to identify and categorize (Figure 1) the information the metadata should
cover. After the first version was developed, the owners of some 12 datasets used the
system. We visited all the data-owners and discussed their user experiences as
well as the benefits and problems. This was used to improve the vocabulary
design. Finally, exchanging email with users helped in fine-tuning the metadata.
</p>
      <p>Interviews Outcome &amp; Solutions. Protect proprietary datasets - To protect
datasets that contain private and sensitive data, or data for which a specific
permission or subscription is required, we categorised RISIS datasets as confidential
data, publicly accessible data, and all other relevant public data on the
Web. Overcome access limitation - Because data access is limited, we
provide users with means for browsing dataset metadata rather than inspecting the
data itself. The metadata should meet six requirements: R1 - Facilitate the
display of information at the user interface level. R2 - Provide information guiding the
use of data. R3 - Provide detailed information about the datasets available on
RISIS-SMS. R4 - Help users to gain an in-depth understanding of the data
at hand, so that they can easily identify how the data should be
interpreted, used, or linked to other data. R5 - Facilitate trust by providing details
about the quality of the underlying data. R6 - Facilitate simple and advanced
search for relevant data. The latter is considered a crucial task for data
discovery and link discovery across datasets. Trigger research opportunities
- Use LOD to organize and integrate databases far beyond the internal RISIS
datasets to create new opportunities for research. For example, a link to a city
or a university described in DBpedia or GeoNames could be exploited to infer
new knowledge such as the location or geographical boundaries of the entity.
<fn id="fn4"><label>4</label><p>More details are available at http://sms.risis.eu</p></fn>
</p>
      <p>Metadata Operationalization. From the interviews, we concluded that our
metadata should cover a broad range of different aspects, which we categorised into
the following nine metadata types: dataset details {overview, temporal aspects,
content, structure, technical aspects, legal aspects, access, visit and data
quality/used methodologies} and person details. A detailed description of each
of these aspects is beyond the scope of this paper. However, we
briefly describe here how the categories are mapped to the requirements.
Technical aspects, legal aspects and access-visit, which provides information on the conditions for visiting
one or more data-owners, satisfy R2. The data overview, the
description of the content, the temporal aspects and the structure of the data
support R3 and R4. The provenance, which helps to know the origin, creator,
when and how of the data, satisfies R5 (trust). The methodologies followed
for addressing some dimensions of data quality, such as record de-duplication,
resource disambiguation, and data consistency and correctness, also
satisfy R5 (trust). All aspects of the metadata can be used for simple search. For
complex search, only attributes that link to external knowledge are exploitable
(R6). R1 does not follow the above mapping. Instead, it is covered by the creation
of a User-friendly Interface<sup>5</sup> that addresses the categorization of metadata types,
the use of non-technical words and text hints.
</p>
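      <p>The mapping from metadata categories to requirements described above can be read as a small lookup table. The following Python sketch is only an illustration under stated assumptions: the category labels paraphrase the text, and the names CATEGORY_TO_REQUIREMENTS and categories_for are hypothetical and not part of RISIS-SMS. R1 (covered by the user interface) and R6 (search over the metadata as a whole) are deliberately left out of the table.</p>
      <preformat>
```python
# Illustrative only: labels paraphrase the text, identifiers are hypothetical.
# R1 (user interface) and R6 (search) are not tied to single categories,
# so they do not appear in this table.
CATEGORY_TO_REQUIREMENTS = {
    "technical aspects": {"R2"},
    "access-visit": {"R2"},
    "legal aspects": {"R2"},
    "overview": {"R3", "R4"},
    "content": {"R3", "R4"},
    "temporal aspects": {"R3", "R4"},
    "structure": {"R3", "R4"},
    "provenance": {"R5"},
    "data quality/used methodologies": {"R5"},
}


def categories_for(requirement):
    """Return the sorted metadata categories that the text maps to a requirement."""
    return sorted(
        category
        for category, requirements in CATEGORY_TO_REQUIREMENTS.items()
        if requirement in requirements
    )


if __name__ == "__main__":
    for req in ("R2", "R3", "R4", "R5"):
        print(req, "->", ", ".join(categories_for(req)))
```
      </preformat>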
    </sec>
    <sec id="sec-3">
      <title>Reused Vocabularies</title>
      <p>
        Since the RISIS metadata covers a broad range of aspects, no single platform or
vocabulary covers all of them. We therefore selected a set of nine
domain-specific dataset metadata vocabularies, each designed for one or more of the aspects
discussed in the requirements. The main vocabularies, which share many terms
with the RISIS metadata concepts, are DCMI,<sup>6</sup> PROV-O,<sup>7</sup> VoID<sup>8</sup> and
FOAF.<sup>9</sup> Although provenance is not shown in Figure 1, we use it extensively
behind the scenes for describing data manipulations. Other reused vocabularies,
which cover fewer of the RISIS requirements, include DCAT,<sup>10</sup> DISCO,<sup>11</sup>
WAIVER,<sup>12</sup> PAV [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and SKOS.<sup>13</sup> Figure 1 illustrates the mapping between
requirements and existing shared vocabularies: the table header expresses
a domain, the namespace prefix indicates the vocabulary and the suffix (local
name) indicates the term mapped to a requirement term. Yet, reusing these
vocabularies did not entirely satisfy RISIS's view on describing a dataset. As a
result, we created new terms such as risis:usecase (see Figure 1) for concepts
that were not covered by any of the selected vocabularies.
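<fn id="fn5"><label>5</label><p>User-friendly Interface designed for this purpose: http://datasets.risis.eu/</p></fn>
<fn id="fn6"><label>6</label><p>Dublin Core Metadata Initiative: http://dublincore.org/documents/dcmi-terms/</p></fn>
<fn id="fn7"><label>7</label><p>PROV Ontology: https://www.w3.org/TR/2013/REC-prov-o-20130430/</p></fn>
<fn id="fn8"><label>8</label><p>Vocabulary of Interlinked Datasets: https://www.w3.org/TR/void/</p></fn>
<fn id="fn9"><label>9</label><p>FOAF Vocabulary: http://xmlns.com/foaf/spec/</p></fn>
<fn id="fn10"><label>10</label><p>Data Catalog Vocabulary: https://www.w3.org/TR/vocab-dcat/</p></fn>
<fn id="fn11"><label>11</label><p>Disco Vocabulary: http://rdf-vocabulary.ddialliance.org/discovery.html</p></fn>
<fn id="fn12"><label>12</label><p>Waivers of rights vocabulary: http://vocab.org/waiver/terms</p></fn>
<fn id="fn13"><label>13</label><p>SKOS Vocabulary: https://www.w3.org/TR/swbp-skos-core-spec</p></fn>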
</p>
      <p>The work done so far is a pilot that needs a more exhaustive investigation of six
other potential vocabularies (see footnotes 14 to 19) before finalizing the RISIS ontology.
A better understanding of these vocabularies is expected to lead to (1) an in-depth
understanding for better data integration at the schema level;<sup>14</sup> (1,2) easier
publishing of the mapping between a tabular dataset and its RDF-converted version;<sup>15</sup>
(3) means for providers and consumers to assess the quality of datasets;<sup>16</sup> (4,5)
use of the right concepts for describing statistical information relevant for RISIS and
for publishing RISIS multi-dimensional statistical data;<sup>17,18</sup> and (6) application of
the concept of "One ontology to bind them all" [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to the RISIS problem, with
better coverage of legal aspects such as licensing.<sup>19</sup>
</p>
      <p>Related work. Projects - Open PHACTS<sup>20</sup> gathers pharmacological resources in
an integrated and interoperable infrastructure to connect, for example,
information about chemistry to biological information, so that users can determine the
potential impact of a chemical on a biological system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. CLARIAH<sup>21</sup> also
gathers large collections of data and software from different humanities disciplines.
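<fn id="fn14"><label>14</label><p>Vocabulary for Tabular Data: https://www.w3.org/TR/tabular-metadata/</p></fn>
<fn id="fn15"><label>15</label><p>D2RQ Mapping Language: http://d2rq.org/d2rq-language</p></fn>
<fn id="fn16"><label>16</label><p>Data Quality Vocabulary: https://www.w3.org/TR/vocab-dqv/</p></fn>
<fn id="fn17"><label>17</label><p>SDMX vocabulary: http://lov.okfn.org/dataset/lov/vocabs/sdmx</p></fn>
<fn id="fn18"><label>18</label><p>Data Cube vocabulary: https://www.w3.org/TR/vocab-data-cube/</p></fn>
<fn id="fn19"><label>19</label><p>Meta-Share OWL meta model: http://purl.org/ms-lod/MetaShare.ttl</p></fn>
<fn id="fn20"><label>20</label><p>The Open PHACTS project: https://www.openphacts.org/</p></fn>
<fn id="fn21"><label>21</label><p>CLARIAH: http://www.clariah.nl</p></fn>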
CEDAR<sup>22</sup> inter-links Dutch census data with other data hubs to create a
semantic data-web of historical information. Such a construct allows researchers to bridge
information diversity [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for historical research and to ask complex questions that
involve historical, socio-economic and demographic data, among other domains. The
Center for Expanded Data Annotation and Retrieval, also known as CEDAR,<sup>23</sup>
provides a unified framework for researchers in all scientific disciplines who need
to create consistent and easily searchable metadata. Platforms - Datahub.io
provides a public registry for datasets and metadata-based data discovery. However,
its metadata coverage is not adequate, because the language, provenance and license
of a resource are not properly represented. http://lodlaundromat.org/ gives free
access to LOD collected on the web. Although it produces valuable information
for data comparison, analytics and more, it fails to provide a sufficient description
to satisfy R1-6. Vocabularies - DCAT does not provide format-specific
information or information on the content of a dataset, nor does it
describe the provenance of data, and its coverage of the legal aspects of a dataset is
limited. Open PHACTS uses VoID, a data-specific vocabulary.<sup>24</sup> Through
usage, the specifications defined there turned out to be so strict that creating
a proper VoID document was too hard for developers to manage, something
RISIS planned to avoid with its metadata. Summary - Projects that gather data
from various sources and domains exist in disciplines such as pharmacology,
art, the humanities, socio-economics and history, but not in STIS. Platforms
that provide dataset metadata also exist, but they are limited or too specific.
Although many shared vocabularies exist, they are not rich enough.
</p>
      <p>Discussion. The access limitations and related work described in this paper underline
the increasing demand for bringing together data from multiple domains to allow
for complex and cross-domain analysis. We argue that this cannot be done
without the creation of metadata for the systematic and consistent description of
collections of datasets within a flexible data model. The need for a User-friendly
Interface<sup>25</sup> (UI) arose from choosing RDF as the metadata model, a choice made because RDF
facilitates integration and information sharing on the web. RISIS data-owners,
who need to generate and maintain metadata about their datasets, are not
familiar with Semantic Web technologies or with the ways to generate standard
and valid RDF. As a result, following a user-centred approach, the proposed
vocabulary was implemented as a UI to help data-owners easily auto-generate
and manipulate RDF metadata based on the RISIS metadata vocabulary. The
UI has been in use by RISIS data providers. Exposing the metadata through the
RISIS User-friendly Interface [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] stimulated data providers to check the
quality of their data before opening it for access or visits. Data quality, in
terms of the standards agreed upon in RISIS, was secured by satisfying the elements of
a RISIS-defined checklist before the data was opened. Given the broad scope
and generic domain of the problem, SMS is intended to be useful not only for
STIS but also for the humanities and social sciences.
<fn id="fn22"><label>22</label><p>Census data open linked: https://www.cedar-project.nl/</p></fn>
<fn id="fn23"><label>23</label><p>CEDAR: https://med.stanford.edu/cedar/our-solution.html</p></fn>
<fn id="fn24"><label>24</label><p>http://www.openphacts.org/specs/2013/WD-datadesc-20130912/</p></fn>
<fn id="fn25"><label>25</label><p>An example of the RISIS UI: http://datasets.risis.eu/metadata/eupro</p></fn>
      </p>
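      <p>To make the idea of auto-generating RDF metadata from a form more concrete, the following Python sketch turns filled-in form fields into subject-predicate-object triples that reuse shared vocabulary terms. It is a minimal illustration under stated assumptions, not the RISIS implementation: the field names, the field-to-term mapping, the dataset IRI and the risis namespace IRI are all hypothetical (the paper does not give them), and literal values are kept as plain strings for brevity. Only the DCMI Terms, VoID and RDF namespace IRIs are the published ones.</p>
      <preformat>
```python
# Namespace IRIs of reused vocabularies. The "risis" IRI is a placeholder:
# the paper does not state the real namespace.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "void": "http://rdfs.org/ns/void#",
    "risis": "http://example.org/risis#",  # hypothetical
}

# Hypothetical mapping from user-interface form fields to vocabulary terms.
FIELD_TO_TERM = {
    "title": "dcterms:title",
    "description": "dcterms:description",
    "license": "dcterms:license",
    "usecase": "risis:usecase",
}


def expand(curie):
    """Expand a prefixed name such as 'dcterms:title' into a full IRI."""
    prefix, local_name = curie.split(":", 1)
    return PREFIXES[prefix] + local_name


def form_to_triples(dataset_iri, form):
    """Turn filled-in form fields into (subject, predicate, object) triples.

    Unknown fields and empty values are skipped, so a data owner only
    produces statements for what was actually entered.
    """
    triples = [(dataset_iri, expand("rdf:type"), expand("void:Dataset"))]
    for field, value in form.items():
        if field in FIELD_TO_TERM and value:
            triples.append((dataset_iri, expand(FIELD_TO_TERM[field]), value))
    return triples


if __name__ == "__main__":
    form = {"title": "Example dataset", "usecase": "Example use case", "license": ""}
    for triple in form_to_triples("http://example.org/dataset/example", form):
        print(triple)
```
      </preformat>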
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>This paper presents an approach for managing metadata in the field of science,
technology and innovation studies. The approach was developed and applied in
the context of the RISIS-SMS project with the goal of supporting data
integration, discovery and search across datasets, maintaining privacy, and obtaining
user trust while focusing on data that are not directly accessible. One
contribution of this work is the set of requirements elicited by interviewing the stakeholders.
The requirements analysis guided the design of a new vocabulary, together with a
review of existing metadata vocabularies that helped us fill in part of the
metadata needed to accommodate the domain needs. Additionally, to meet the
requirements, we designed and implemented a user-friendly interface which
allows non-expert users to easily author metadata in RDF.</p>
      <p>As future work, we plan to extend our vocabulary to cover aspects related
to the quality and provenance of data. We also plan to conduct a usability
evaluation with end-users of the system to ensure that our user interface and
metadata specifications fulfil the user needs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Van den Besselaar, P.:
          <article-title>The cognitive and the social structure of STS</article-title>
          .
          <source>Scientometrics</source>
          <volume>51</volume>
          (
          <issue>2</issue>
          ),
          <fpage>441</fpage>
          –
          <lpage>460</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ciccarese</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soiland-Reyes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belhajjame</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>PAV ontology: Provenance, authoring and versioning</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Daraio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leporelli</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moed</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naggar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonaccorsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartolucci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Data integration for research and innovation policy: an ontology-based data management approach</article-title>
          .
          <source>Scientometrics</source>
          pp.
          <fpage>1</fpage>
          –
          <lpage>15</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loizou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harland</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettifer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>API-centric linked data integration: The Open PHACTS discovery platform case study</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>29</volume>
          ,
          <fpage>12</fpage>
          –
          <lpage>18</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hackett</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amsterdamska</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lynch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wajcman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The handbook of science and technology studies</article-title>
          . The MIT Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Khalili</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loizou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Adaptive linked data-driven Web components: Building flexible and reusable Semantic Web interfaces</article-title>
          .
          <source>Semantic Web Conference (ESWC) 2016</source>
          (
          <year>2016</year>
          ), to appear
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labropoulou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gracia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez-Doncel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>One ontology to bind them all: The META-SHARE OWL ontology for the interoperability of linguistic datasets on the Web</article-title>
          .
          <source>In: The Semantic Web: ESWC 2015 Satellite Events</source>
          , pp.
          <fpage>271</fpage>
          –
          <lpage>282</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Meroño-Peñuela</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashkpour</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Erp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandemakers</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breure</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scharnhorst</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Semantic technologies for historical research: A survey</article-title>
          .
          <source>Semantic Web</source>
          <volume>6</volume>
          (
          <issue>6</issue>
          ),
          <fpage>539</fpage>
          –
          <lpage>564</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>