Core semantic model for generic research activity(*)

© Vasily Bunakov
Scientific Computing Department, Science and Technology Facilities Council, Harwell OX11 0QX, United Kingdom
vasily.bunakov@stfc.ac.uk

Proceedings of the 15th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections", RCDL-2013, Yaroslavl, Russia, October 14-18 2013.

(*) This work is related to the ENGAGE project www.engage-project.eu and the projects of PaNdata collaboration www.pan-data.eu supported by the EU 7th Framework Programme for Research and Technological Development. The author would like to thank his colleagues in ENGAGE and PaNdata for their input for this paper, although the views expressed are the views of the author and not necessarily of the projects.

Abstract

A simple research activity model is suggested that is agnostic to research domain and allows independent curation of the research information lifecycle by a variety of its stakeholders, with a potential to further link individual activities into meaningful research provenance or research value chains. We consider the drivers for conceiving the model, its main aspects, an RDF manifestation of it, and a particular business case for its application, and discuss its potential for future applications.

1 Introduction

Different stages of the research lifecycle in natural sciences as well as in social and economic research produce multiple data artefacts under control of different data management solutions and software platforms. (We use the term "data" here and there in a broad sense: not necessarily numeric data resulting from measurements but also research proposals, software components, configuration files, electronic publications, etc.) Data curators working in a particular research domain tend to develop a specific metadata model that aims to cover the entire research lifecycle, from the research inception to the dissemination of research outputs. Such a metadata model quite often serves as a foundation for the design of the actual information systems and services. An example of a comprehensive metadata model for the research performed at large facilities like synchrotrons, powerful lasers or neutron sources is the Core Scientific MetaData model [5]; an example in social research is DDI-Lifecycle [7].

Substantial effort of renowned information experts has been spent in order to extend some established metadata models with new semantic features; an example in social research is DDI semantic modelling ([8], [9]). The richness and expressivity of a metadata model that has evolved through decades can be considered a limitation that makes it harder to agree on what should constitute the "true" semantic representation, or what format of it should be the "canonical" one. Also, the attempts to transform an entire domain-specific metadata model into a semantic representation, and then offer it for common adoption and data linkage, may contradict the social nature of Linked Data, as its curation can be reasonably considered an incremental and opportunistic effort of multiple parties (as brilliantly illustrated by [1]).

This is not to say that semantic modelling of an entire research domain is not sensible or does not have a potential for implementation. Collaborative projects of a multinational scale such as PaNdata-ODI ([2], also see under [16]) consider semantic representation of the popular domain-specific metadata model [5] with the purpose of system integration. The motive for this consideration is that, although the actual information systems in different research centres may be based on implementations of the same generic metadata model and even on the same software platform for the data catalogue [14], the practices of catalogue configuration, the interpretation and use of the model elements, and hence the actual semantics of these elements may vary dramatically. A common semantic layer, probably in the form of an ontology, is then considered a viable architecture solution that should allow retaining the existing local practices of data cataloguing and, at the same time, should give the IT teams an ability to meaningfully integrate distributed data and services.

That semantic layer, however, will require inclusion into a certain best practices framework to sustain it through time [4]; otherwise divergent business needs and business practices of the collaboration participants can make a thoroughly designed semantic model obsolete the day after its implementation in a real IT solution. Keeping a comprehensive semantic model up to date can be quite an expensive endeavour, with substantial overheads on continuous business analysis and communication with multiple parties.

Another concern about the attempts of semantic representation of comprehensive metadata models is a tendency for them to reflect the information needs of only a few types of the research lifecycle stakeholders: commonly Researchers and Data Archivists. The information needs of other stakeholders from Funding, Industry, or Education are often under-represented. To resolve this issue, one can take two approaches:

A) As a responsible information curator, conduct thorough business analysis of the research lifecycle stakeholders' types and their information needs, then incorporate the knowledge acquired into a comprehensive model that, in order to be effective, should be validated by the stakeholders themselves (then, ideally, permanently amended).

B) Give different stakeholders a reasonable modeling means to express their role in the research lifecycle so that each of them becomes an information curator who cares about the quality and the currency of her contribution into the shared pool of information.

The latter approach seems more adequate in the present situation, when the advance of Linked Data principles allows various stakeholders to meaningfully model their part of the information universe and to re-use the results of similar modeling effort made elsewhere.

We suggest a small but quite universal "core" model in the spirit of Linked Data principles [1], with low barriers for its adoption and use for semantic annotation of the research activity in different local information contexts, with their further inclusion into a global information context.

We think that such a model should not focus on data but on common patterns of research activity observed in different research domains (for which we give examples further in this paper); various data can then be considered artefacts or a "footprint" of different types of research activity.

2 Research activity model

2.1 Types and common patterns of research activity

Research lifecycles analyzed and structured by digital curators in the respective research domains can be a good source for discovering granular research activities and their interrelations. In this work, we consider two lifecycles: in facilities science(1) and in social research. They are most relevant to the projects which contributed to the development of our model ([11], [16]), and their respective research domains are quite far apart, so they may help us with testing the universality of our model.

(1) For the sake of clarity, we use the term "facilities science" for the research performed on large-scale scientific instruments (synchrotrons, powerful lasers and the like) by visitor teams or individual researchers who obtain, via the application process, access to the common facility resource in order to conduct their experiments or observations, and to collect the resulting data.

The lifecycle in facilities science that underpins the CSMD model [5] includes the submission of a research proposal to the facility user office in order to get the facility resource for research (e.g. beam time on a synchrotron); the further approval of the proposal by the facility's user office; experiment scheduling; conduct of the actual experiment with data collection; data storage; data analysis; and eventually publishing research results with record keeping for them. Beyond this lifecycle that is supported by the facility itself, there is research funding activity, research policy making, or the researchers' social communication, all of which can be considered elements of a larger "research value chain".

Figure 1. Research lifecycle in facilities science (as captured by CSMD model).

The lifecycle of social research that underpins the DDI-Lifecycle model [7] includes the formulation of the study concept, further data collection, its processing, archiving, distribution, discovery, analysis, and repurposing. Funding, policy making, or social communication are again beyond the immediate scope of DDI, although there are some placeholders for references to these types of activity.

Figure 2. Research lifecycle in social science (as captured by DDI-L model).

Each activity yields certain outputs: e.g. in facilities science, the research proposal preparation results in the investigation (experiment) description, data analysis yields derived data, etc. A previous activity may provide an input for another activity or give it a context: e.g. it is quite common for researchers to refer to previous investigations (experiments) when they apply for a new investigation to be conducted at the same facility.

Although there are similarities between the two aforementioned lifecycles and between the roles of stakeholders involved in them, there are differences, too. Even more differences come up if we consider the context or scope of each research activity, or the means for their description that are present in each model. As an example, in facilities science, the scope of an experiment can be understood by considering what samples or chemical substances have been under investigation; in social research, it can be meaningful parameters describing the human audience at which the study has been aimed. It is not these details, which may differ, but the very presence of Context and Scope, as well as the Inputs and Outputs of the research activity, the Actors who perform it, and the Effects of the research, that represents a common pattern: very generic but universal across research fields.

These patterns are common not only across different research domains for similar types of research activity (when we draw parallels e.g. between a facilities science Experiment and a social research Study); this is also the case for different types of research activity within the same lifecycle: e.g. funding, data analysis, or record publication have their Inputs and Outputs, their Actors, Effects, Context (Conditions) and Scope.

These basic patterns contribute to a reasonable model that should not be too burdensome for the respective stakeholders (or information specialists working for them) to apply, yet is expressive enough to promote the principles and best practices of Linked Data in various research domains. We consider a potential for such an application below in the section devoted to a particular business case; in the meantime, we formally introduce the major aspects of a generic research activity and suggest a practical RDF-based manifestation for them.

2.2 Generic research activity (research activity "cell")

We deem important the following aspects of a generic research activity:

Aspect | Description | Example: research per se | Example: research data analysis
Input | Something that is taken in or operated on by Activity | Previous research | Raw data
Output | Something that is intentionally produced by Activity | Raw data | Derived (analyzed) data
Scope | Something that Activity is aimed at or deals with | Sample properties | One or more experiments
Condition | Something that affects or supports Activity, or gives it a specific context | Scientific instrument | IT environment
Actor | Something or somebody who participates in Activity | Investigator | Data analyst
Effect | Something that is a consequence of Activity | Environment pollution | New software module

Schematically, the granular research activity can be represented by the following diagram:

Figure 3. Research activity "cell".

Research activities can be combined as "cells" in chains where the Output of one can be an Input to another, but in fact the model allows other sorts of links between activities. As an example, a piece of regulation such as a data management policy can be an Output of one activity (policy making) and a Condition that affects another activity (research per se); a new software module that is a side Effect of a certain activity (data analysis) can be a non-human Actor that participates in another activity (e.g. automated indexing of experimental data). This shows that activity aspects in fact do not have "types": a modeler can use and combine them as dictated by the semantics of the respective subject area.

This view is inspired, to some extent, by the SADT activity model [17] with its idea of combining activities into a hierarchy or a grid, but it is quite different in introducing some other activity aspects and not imposing their typization. Also, SADT promotes a top-down approach to structured analysis and systems design, whereas we suggest a bottom-up approach that allows combining the granular activities into more complex information structures.

Compared to other project-driven attempts to model research activity ([10], [15]), our model is going to be simpler, more universal, and deliberately aimed at semantic modeling of a granular activity rather than of the entire research lifecycle, thus providing a "building block" for more sophisticated information modeling as and when required.

2.3 RDF manifestation of activity model

The outlined model may imply different manifestations; we feel that one expressed in RDFS Plus (RDF Schema with a few OWL terms) has a good potential for adoption by information curators and implementation in real IT solutions. The Appendix of this paper suggests an RDFS Plus manifestation of the activity model that can be extended by domain-specific entities and properties. As an example, an information modeler in facilities science might want to extend the model as follows:

    @prefix rdfs: .
    @prefix am: .
    @prefix rm: .

    # For Activities
    rm:Research rdfs:subClassOf am:Activity .
    rm:Experiment rdfs:subClassOf rm:Research .

    # For Conditions
    rm:Condition rdfs:subClassOf am:Condition .
    rm:Regulation rdfs:subClassOf rm:Condition .
    rm:DataManagementPolicy rdfs:subClassOf rm:Regulation .

    # For Output
    rm:Output rdfs:subClassOf am:Output .
    rm:Publication rdfs:subClassOf rm:Output .
    rm:Dataset rdfs:subClassOf rm:Output .

    # For Scope
    rm:Scope rdfs:subClassOf am:Scope .
    rm:ExperimentalTechnique rdfs:subClassOf rm:Scope .
    rm:SubjectCoverage rdfs:subClassOf rm:Scope .

    # For properties
    rm:activity_location rdfs:subPropertyOf am:hasScope .
    rm:activity_subject rdfs:subPropertyOf am:hasScope .

The user of the information system where the RDF data prepared according to our model is published can then use reasonable SPARQL requests to inquire about different aspects of research activities, e.g. trying to find out first how much research output, and how much of each type, is out there:

    SELECT ?output_type (COUNT(?output) AS ?total)
    WHERE {
      ?output_type rdfs:subClassOf am:Output .
      ?output a ?output_type .
    }
    GROUP BY ?output_type

or try to discover the chains of interrelated activities:

    SELECT ?previous_activity ?current_activity
    WHERE {
      ?previous_activity am:hasOutput ?output .
      ?output am:inputFor ?current_activity .
    }

The User may be familiar with just our activity model, knowing very little about a certain research domain at the start, then accumulating more and more knowledge through sensible incremental requests. In case the information modeler, in addition to our basic activity model, has followed good practices of data curation so that e.g. instances of Scope or Condition subclasses are not literals but dereferenceable URIs, the User will have even more opportunities to become familiar with the semantics of a particular research domain. When we speak of the "User" we of course mean software agents, too, as the prospect of employing them is a strong incentive for any semantic modeling.

2.4 Business case for semantic categorization and annotation of existing metadata

As we mentioned, it may not be easy to give birth to the semantic representation of a comprehensive metadata model because of its richness and complexity, and because of substantial overheads for communication among information curators who apply the model in different contexts. Another observation is that detailed metadata records may in fact represent different activities performed by different stakeholders of the research information lifecycle, while the records that actually circulate in the information management solutions are focused on particular types of stakeholders only and support their specific roles in the first place. A certain stakeholder, e.g. a Data Librarian or a Data Archivist, may claim that her information management solution is focused on data in pursuit of some common interest when, in fact, the information management solution primarily supports this particular stakeholder's specific role in the information lifecycle, with only some types of other stakeholders well served.

As an example, DDI [7] suggests some means to model information about funding, but European funding bodies are likely to use their own information systems, many of them based on the CERIF standard [6]. So the richness and expressivity of DDI, as well as the actual information systems based on it, are in fact aimed at researchers in social science and data archivists, not at funders who are likely to have their own information systems based on other metadata standards, and not at other types of stakeholders in Business or Education, or at researchers in other research domains.

We feel that it will be more productive to admit this natural attitude of the information management solutions and their owners to cater for only one or a few roles; it may be better to provide a reasonable means to model different roles and their activities on a granular level than to try to capture an elusive information context in more and more complex versions of a comprehensive semantic model. If we take the existing records in a certain rich metadata format, this approach results in categorization and annotation of the entire metadata records with other metadata based on a smaller but semantically meaningful and universal information model, like our activity model.

Let us see how our core semantic model may serve DDI metadata categorization and annotation.(2) The analysis shows that one DDI record typically represents different types of research activity:

Figure 4. Research activities represented by a DDI record.

(2) This approach was applied to DDI records harvested from the UK Data Archive and GESIS archive ([18], [13]) in the interests of the ENGAGE project [11] and was communicated in [3] as a prolegomenon to the generic model that we are presenting now.

As we have identified different types of research activity, we can model them accordingly; we can also identify specific Actors (Funding Agency, Author, Distributor), activity Outputs (Publication, Dataset), Scopes (Spatial Coverage, Subject Coverage) and Conditions (Copyright, Access Terms). Different granular activities will then be modeled with different amounts of detail, but we can enrich them with data from other information systems: for research funding, through funding agency portals; for research, through the project and the individual investigators' Web pages. This information enrichment should ideally be done by the Actors of the respective Activities (Funding, Research per se, Distribution) as they best understand the information context and the semantics of their business.

Our activity model then should allow curating the data and data context (metadata) in a distributed manner, and the combination of granular activities in sensible information context chains. This should eventually give us a more dispersed but more complete description of the research discourse for a particular Study: more complete if compared to what the Data Archivist deemed valuable to capture and describe in a DDI record for the same Study. Our core model then serves as a "glue" to support the common information context and facilitate the interoperability of different digital curation frameworks that are operated by different Actors in support of their own Activities.

The existing well curated archives of DDI records can then be considered a valuable "fuel" to support the launch of the research discourse "Web" or "grid". Its role-centric nodes will perform their part of digital curation, sharing its results via a simple and commonly understandable semantic model that can be interpreted not only by data archivists or researchers in social science but by various stakeholders from other research domains, or business, or education, or policy making.

2.5 Conclusion

We outlined the motivation for why a simple model would be valuable for the semantic representation of a generic research lifecycle. We introduced the major aspects of the model, suggested an RDF manifestation for them and showed how domain-agnostic requests might work for information discovery. We then considered a particular business case of applying the model to the existing rich metadata records in social science, but there are more promising cases to consider.

One of the immediate candidates is facilities science with its CSMD metadata [5] that we already mentioned. The diverse business practices for using the existing mature data management solutions based on the CSMD model [14] may become a barrier to the meaningful sharing of facilities science data as Linked Data. Our model then may be of help for the re-engineering of the existing data archives in the spirit of Linked Data and Semantic Web principles, through semantic annotation of the CSMD metadata records (which may involve some decomposition, too, similarly to what we demonstrated for DDI metadata).

Another prospective area where we think our model may prove to be valuable is long-term digital preservation with its two well-known problems: accountable data provenance, and meaningful data representation for the future (and changing) community of data consumers. The ability of our model to combine individual data curation activities into traceable chains, as well as its very focus on the Activity (with data being an artefact or footprint of it), may contribute to the satisfactory resolution of the data provenance problem. The model's data discovery capabilities, based on standard information requests and profiles of them, where it is enough for the User to be familiar with our basic semantic model in order to start the incremental knowledge discovery, may contribute to the meaningful data representation.

Also, we find the multi-disciplinary and distributed curation, discovery and re-use of the research information to be in high demand; it is already on the agenda of a few current European projects (see under [11], [12], [16]) and it is reasonable to expect more of them to come. The domain-agnostic nature of our model, as well as its very manageable core size and expandability where required, let us hope for its application in some of the existing and future e-infrastructure initiatives.

3 Appendix: RDFS Plus manifestation of the activity model

    @prefix rdfs: .
    @prefix owl: .
    @prefix am: .

    ############### Core entities of Activity model ###############

    # Comments are based on the Oxford dictionary, with some
    # generalization or amendment where appropriate

    am:Activity rdf:type rdfs:Class ;
        rdfs:label "Activity" ;
        rdfs:comment "Something that Actor does, or has done, or is going to do, or can do" .

    am:Input rdf:type rdfs:Class ;
        rdfs:label "Activity Input" ;
        rdfs:comment "Something that is taken in or operated on by Activity" .

    am:Output rdf:type rdfs:Class ;
        rdfs:label "Activity Output" ;
        rdfs:comment "Something that is intentionally produced by Activity" .

    am:Actor rdf:type rdfs:Class ;
        rdfs:label "Activity Actor" ;
        rdfs:comment "Something or somebody who participates in Activity" .

    am:Effect rdf:type rdfs:Class ;
        rdfs:label "Activity Effect" ;
        rdfs:comment "Something that is a consequence of Activity" .

    am:Condition rdf:type rdfs:Class ;
        rdfs:label "Activity Condition" ;
        rdfs:comment "Something that affects or supports Activity, or gives it a specific context" .

    am:Scope rdf:type rdfs:Class ;
        rdfs:label "Activity Scope" ;
        rdfs:comment "Something that Activity is aimed at or deals with" .

    ########### Core properties of Activity model ###########

    # am:hasInput or am:inputFor
    # links Activity to its Input
    am:hasInput owl:inverseOf am:inputFor .

    # am:hasOutput or am:outputOf
    # links Activity to its Output
    am:hasOutput owl:inverseOf am:outputOf .

    # am:hasActor or am:actorFor
    # links Activity to its Actor
    am:hasActor owl:inverseOf am:actorFor .

    # am:hasEffect or am:effectOf
    # links Activity to its Effect
    am:hasEffect owl:inverseOf am:effectOf .

    # am:hasCondition or am:ConditionFor
    # links Activity to its Condition
    am:hasCondition owl:inverseOf am:ConditionFor .

    # am:hasScope or am:scopeOf
    # links Activity to its Scope
    am:hasScope owl:inverseOf am:scopeOf .

References

[1] Tim Berners-Lee. Open, Linked Data for a Global Community. A talk given on Gov 2.0 Expo, Washington, DC, 26 May 2010. http://www.gov2expo.com/gov2expo2010/public/schedule/detail/14247
[2] Juan Bicarregui, Vasily Bunakov, and Michael Wilson. PANdata international information infrastructure for synchrotrons: opportunity for collaboration. Presentation on the 19th Russian Synchrotron Radiation Conference (SR-2012), Novosibirsk, Russia, 25-28 June 2012. http://epubs.stfc.ac.uk/work-details?w=63074
[3] Vasily Bunakov. Semantic categorization of DDI metadata. Presentation on the 4th Annual European DDI User Conference (EDDI12), Bergen, Norway, 03-04 Dec 2012. http://epubs.stfc.ac.uk/work-details?w=64315
[4] Vasily Bunakov and Brian Matthews. Data curation framework for facilities science. In Proceedings of DATA 2013: the 2nd International Conference on Data Management Technologies and Applications, p.211-216, Reykjavík, Iceland, 29-31 July 2013.
[5] Brian Matthews et al., 2012. Model of the data continuum in Photon and Neutron Facilities. PaNdata ODI, Deliverable D6.1. http://pan-data.eu/sites/pan-data.eu/files/PaNdataODI-D6.1.pdf
[6] Common European Research Information Format. See under www.eurocris.org
[7] Data Documentation Initiative - Lifecycle Specification. http://www.ddialliance.org/Specification/DDI-Lifecycle/
[8] Semantic Statistics for Social, Behavioural, and Economic Sciences: Leveraging the DDI Model for the Web. Schloss Dagstuhl, September 11-16, 2011. http://www.dagstuhl.de/en/program/calendar/evhp/?semnr=11372
[9] DDI Lifecycle: Moving Forward. Schloss Dagstuhl, October 21-26, 2012. http://www.dagstuhl.de/en/program/calendar/evhp/?semnr=12432
[10] DARIAH-EU: Digital Research Infrastructure for the Arts and Humanities. http://www.dariah.eu/
[11] ENGAGE: An Infrastructure for Open, Linked Governmental Data Provision towards Research Communities and Citizens. http://www.engage-project.eu/
[12] EUDAT: European Data Infrastructure. http://www.eudat.eu/
[13] GESIS - Leibniz-Institut für Sozialwissenschaften. http://www.gesis.org/
[14] ICAT project. http://www.icatproject.org/
[15] Infrastructure for Integration in Structural Sciences (I2S2) Project. http://www.ukoln.ac.uk/projects/I2S2/
[16] PaNdata: Photon and Neutron Data Infrastructure. http://pan-data.eu/
[17] Structured Analysis and Design Technique. http://en.wikipedia.org/wiki/Structured_Analysis_and_Design_Technique
[18] UK Data Archive (for social sciences and humanities). http://data-archive.ac.uk/
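
Supplementary sketch. The two SPARQL requests in Section 2.3 seem to rely on some entailment being applied to the published data. The first only reaches rm:Publication and rm:Dataset if rdfs:subClassOf is treated as transitive. The second only finds am:inputFor links if they are asserted directly or derived from am:hasInput via owl:inverseOf. The following Python sketch illustrates this on a toy in-memory triple set. It is not part of the paper's method: the ex: instance names and the data are invented for illustration, the prefixed names are kept as plain strings because the am: and rm: namespace IRIs are not given, and the entail function covers only a small hand-picked subset of RDFS/OWL rules.

```python
from collections import Counter

SUBCLASS, TYPE, INVERSE = "rdfs:subClassOf", "rdf:type", "owl:inverseOf"

# Schema triples taken from Section 2.3 and the Appendix.
SCHEMA = {
    ("rm:Output", SUBCLASS, "am:Output"),
    ("rm:Publication", SUBCLASS, "rm:Output"),
    ("rm:Dataset", SUBCLASS, "rm:Output"),
    ("am:hasInput", INVERSE, "am:inputFor"),
    ("am:hasOutput", INVERSE, "am:outputOf"),
}

# Invented instance data (hypothetical ex: names): an experiment yields raw
# data; a data analysis takes it as input and yields derived data and a paper.
DATA = {
    ("ex:experiment1", "am:hasOutput", "ex:rawData1"),
    ("ex:rawData1", TYPE, "rm:Dataset"),
    ("ex:analysis1", "am:hasInput", "ex:rawData1"),
    ("ex:analysis1", "am:hasOutput", "ex:derivedData1"),
    ("ex:derivedData1", TYPE, "rm:Dataset"),
    ("ex:analysis1", "am:hasOutput", "ex:paper1"),
    ("ex:paper1", TYPE, "rm:Publication"),
}


def entail(triples):
    """Return triples plus those derived by three rules, applied to a fixpoint:
    subclass transitivity, type propagation to superclasses, owl:inverseOf."""
    triples = set(triples)
    while True:
        sub = {(s, o) for s, p, o in triples if p == SUBCLASS}
        inv = {(s, o) for s, p, o in triples if p == INVERSE}
        inv |= {(b, a) for a, b in inv}  # inverseOf is symmetric
        new = {(a, SUBCLASS, d) for a, b in sub for c, d in sub if b == c}
        new |= {(x, TYPE, d) for x, p, c in triples if p == TYPE
                for c2, d in sub if c2 == c}
        new |= {(o, q, s) for s, p, o in triples for p2, q in inv if p2 == p}
        if new <= triples:
            return triples
        triples |= new


def count_outputs_by_type(triples):
    """Mirror of the first request: instances per subclass of am:Output."""
    output_types = {s for s, p, o in triples if p == SUBCLASS and o == "am:Output"}
    return Counter(o for s, p, o in triples if p == TYPE and o in output_types)


def activity_chains(triples):
    """Mirror of the second request: (previous, current) activity pairs."""
    return sorted({
        (prev, cur)
        for prev, p, out in triples if p == "am:hasOutput"
        for out2, q, cur in triples if q == "am:inputFor" and out2 == out
    })


if __name__ == "__main__":
    asserted = SCHEMA | DATA
    inferred = entail(asserted)
    print("asserted only:", dict(count_outputs_by_type(asserted)), activity_chains(asserted))
    print("with entailment:", dict(count_outputs_by_type(inferred)), activity_chains(inferred))
```

With this invented data, both functions return nothing on the asserted triples alone. After entailment the first reports two rm:Dataset instances, one rm:Publication and three rm:Output instances, and the second links ex:experiment1 to ex:analysis1. A real deployment would more plausibly delegate this to the entailment regime of the triple store or to a reasoner.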