Ready, Set, GO FAIR: Accelerating Convergence to an Internet of FAIR Data and Services

© Erik Schultes
Leiden University Medical Centre, GO FAIR International Support and Coordination Office, Poortgebouw N-01, Rijnsburgerweg 10, 2333 AA Leiden, The Netherlands
erik.schultes@go-fair.org

© George Strawn
Board Director, Board on Research Data and Information (BRDI), US National Academies of Sciences, Engineering, and Medicine, USA
gstrawn@nas.edu

© Barend Mons
Leiden University Medical Centre, GO FAIR International Support and Coordination Office, Poortgebouw N-01, Rijnsburgerweg 10, 2333 AA Leiden, The Netherlands
barend.mons@go-fair.org

Abstract. As Moore’s Law and associated technical advances continue to bulldoze their way through society, both exciting possibilities and severe challenges emerge. The upside is the explosive growth of data and compute resources that promise revolutionary modes of discovery and innovation not only within traditional knowledge disciplines, but especially between them. The challenge, however, is to build the large-scale, widely accessible, and automated infrastructures that will be necessary for navigating and managing the unprecedented complexity of exponentially increasing quantities of distributed and heterogeneous data. This will require innovations in both the technical and social domains. Inspired by the successful development of the Internet and leveraging the FAIR Principles (for making data Findable, Accessible, Interoperable and Reusable by machines), the GO FAIR initiative works with voluntary stakeholders to accelerate convergence on minimal standards and working implementations leading to an Internet of FAIR Data and Services (IFDS).

Keywords: analytics and data management, data intensive domains, digital libraries, FAIR Data, GO FAIR Initiative, Internet of FAIR Data and Services (IFDS).

1 Introduction

Existing data stewardship practices are highly inefficient. Numerous studies indicate that data scientists both in academia and industry spend 70-80% of their time on mundane, manual procedures to locate, access, and format data for reuse [1,2]. Methodological legacies inherited from a pre-digital era (e.g., poor capture of metadata, broken links to various research assets) and outdated professional incentives (e.g., only rewarding publication of research articles rather than also datasets and other research outputs) contribute to massive data loss and a well-documented reproducibility crisis [3-5]. Coupled with the exponential increases in data volumes (driven by, among other things, high-throughput instrumentation and IoT data streams), the urgency for automated, commonly usable data infrastructures (i.e., an Internet for Machines) is increasingly recognised by numerous national and international organisations, science funders and industry [6-11].

Despite the urgent need, building a generalised, ubiquitous data infrastructure that is widely used by diverse stakeholders is an inherently distributed and difficult process to direct. Knowing this to be the case, the GO FAIR initiative was launched to accelerate data infrastructure development by leveraging general patterns of phased development described in other revolutionary infrastructures, including the Internet and the World Wide Web (WWW) [12].

Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018
2 Learning from previous Revolutionary Infrastructures

Revolutionary Infrastructures (for example, transportation, electrification, telecommunications, and computer networks) follow five phases of development [12,13]: (1) Vision: New discoveries and technologies lead to the anticipation of broad new application spaces; (2) Creolization: Inspired by the Vision, numerous experimental implementations are created, resulting in an uneven landscape of independently developed prototypes; (3) Attraction: Some solutions prove more viable, and are effectively generalised to achieve a simplified set of ‘universal principles’ that attract the attention of others working in the field; (4) Convergence: Various Attractors voluntarily decide to bridge otherwise isolated application solutions, and a compelling global infrastructure begins to emerge at the expense of the many other possibilities; (5) Exploitation: As widespread commitment to a particular implementation emerges, economy of scale kicks in, and what was hard and cost-prohibitive now becomes easy and affordable. Users in the Exploitation phase might not even be aware of the infrastructure systems they routinely use (e.g., most users of the Internet are blissfully ignorant of TCP/IP).

In the specific case of the Internet, there had been early Visions of interlinked computers throughout the 1950s and 1960s. By 1969, ARPAnet had initiated the phases of Creolization (and later Attraction) with the co-existence of multiple, specialised solutions, e.g., X.25, Ethernet, ARCNET, and others. This work demonstrated the feasibility of computer networks and drew the attention of large investors (e.g., IBM, DEC). But this investment resulted in numerous incompatible standards that in some ways slowed progress. Convergence was eventually triggered with TCP/IP protocols (early 1970s) and the 7-layer ISO/OSI reference model (early 1980s). This was because these minimal standards allowed various networks to interoperate while at the same time maintaining maximum freedom to engineer solutions at the implementation layer ‘below’ and application layer ‘above’ (creating the so-called “hourglass” architecture of the Internet, with TCP/IP at the narrow waist). It was working implementations (however embryonic) and the simplicity of the hourglass approach that motivated influential decision makers “to move towards using TCP/IP as universal for implementing global computer networking” [13]. With a stabilized universal in place, Exploitation soon followed, with rapid investment in both hardware and software, which is the now familiar story of the Internet. By 1992, the Internet Society was set up to coordinate the further development of TCP/IP approaches to networking.

It is important to note that the use of TCP/IP has always been voluntary, and at no time was its use ever required. Indeed, top-down enforcement policies would likely have killed its effectiveness as an attractor. Instead, once a ‘critical mass’ of influential users had adopted TCP/IP, the larger community followed, driving convergence. An analogous pattern of development (voluntary use, attractor effect in the community) occurred soon after with the formation of the WWW, in this case with HTTP playing the role of TCP/IP. The significance of this historical insight cannot be overstated. It enables some degree of control in the development of new infrastructures, because only a relatively few (albeit influential) users need be convinced to invest in a particular technology. Once the ‘critical mass’ is assembled, the ‘long tail’ of community stakeholders will likely follow.

Even before the 2000s, visionaries had already anticipated the need for a general-purpose data infrastructure. Digital Object Architectures (DOA), systems supporting Persistent Identifiers (PIDs) and the Semantic Web (a framework for knowledge representation built on top of existing Internet and WWW infrastructures) appeared as important components, ensuring both data interoperation and machine readability. Since then, difficult problems in this space have been investigated, resulting in a plethora of new, co-existing methods, languages, software and specialised hardware, and producing, by now, a protracted period of Creolization. By 2012 the Attraction phase was underway with public discussions about component specifications, principles and procedures for semantically enabled data infrastructures [14-16]. By early 2014, in a workshop hosted by the Lorentz Center (Leiden), this discussion culminated in the generalised and broadly applicable FAIR Principles for data reuse [17]. In a now widely cited commentary (indicative of the Attraction phase) [18], the FAIR approach had been defined as “Data and services that are findable, accessible, interoperable, and re-usable both for machines and for people” and 15 high-level Principles had been articulated (Figure 1). Since their publication (April 2016), the FAIR Principles (and later, the corresponding FAIR Metrics [20]) have been acting as a powerful attractor in the emerging data infrastructure.

Figure 1 The 15 FAIR Principles ensuring machine Findability, Accessibility, Interoperation and Re-use of digital resources [18][19]

Following the previous examples, the Convergence phase of the data infrastructure will commence once a ‘critical mass’ of users commits to a particular, minimal specification for automatic routing of FAIR data and services (see for example the continuing discussions around Digital Object Architecture [14,15,21]). This globally distributed data infrastructure will likely be substantially more complex than its predecessors in that an Internet of FAIR Data and Services (IFDS) necessitates elaborate semantically enabled metadata descriptions. The ‘FAIRification’ of digital resources is not trivial, and widespread application will require an ecosystem of methods, tooling, services and training that help communities of diverse stakeholders to create and use FAIR resources. GO FAIR supports and coordinates bottom-up community initiatives that aim to ‘Make FAIR easy’ [11, 22].

3 GO FAIR

3.1 Accelerating Convergence toward a FAIR data infrastructure

Given that many different combinations of technology choices and use of standards could conceivably implement the FAIR Principles, the GO FAIR initiative was launched in late 2017 by the Dutch, German and French governments as a means to pragmatically accelerate community Convergence. The initial vehicle for GO FAIR is the International Support and Coordination Office (GFISCO) [11]. Following the examples of the Internet and WWW, the GFISCO operates through voluntary stakeholder participation, attempting to reach a ‘critical mass’ of users committed to a set of absolute minimal technology specifications. Beyond these minimal specifications, there is unrestricted room to innovate.

GFISCO is stakeholder governed, and includes researchers from specialized knowledge domains (e.g., earth sciences [23], chemistry [24]) but also policy bodies (e.g., CODATA, RDA, FORCE11), publishers (e.g., Elsevier, Springer-Nature), repositories (e.g., Figshare), and funding agencies (e.g., the American NSF and NIH, the Health Research Board of Ireland, and the Dutch ZonMW). GFISCO brokers, among stakeholders, the choice of standards implementing the functions of the FAIR Principles and emerging best practices leading to the Internet of FAIR Data and Services. GFISCO operates via supporting and coordinating Implementation Networks (INs), which are voluntary international consortia that self-organize to implement elements of the IFDS. GO FAIR INs belong to 3 broad topical pillars: GO BUILD, GO TRAIN and GO CHANGE.

3.2 GO BUILD

GO BUILD focuses on the technological aspects of the IFDS, including the design and building of reference implementations for elements composing the IFDS such as FAIR Metrics [20], FAIR Data Points [25,26], FAIRification tools and other FAIR-compliant services. Furthermore, via ongoing “Metadata for Machines” workshops and “Community Challenges”, GO BUILD supports and coordinates communities who aim to achieve adoption of globally unique and persistent identifiers, agree on common metadata representation formats, agree on a minimal set of generic metadata content and define domain-relevant community standards. Currently, there are 8 INs under the GO BUILD pillar.

3.3 GO TRAIN

The overall objective of the GO TRAIN pillar is to create a scalable framework that is used in higher education programs and throughout industry to train large numbers of certified data stewards (estimated to be 500,000 for Europe [27], millions more worldwide). GO TRAIN supports and coordinates two activities: 1) the development of canonical training curricula focused on FAIR Data Stewardship; 2) the development of certification schema for competencies in FAIR Data Stewardship (providing professional career trajectories that, in turn, are intended to drive rapid uptake of FAIR practices among diverse stakeholders). Currently there are two GO TRAIN INs. The first is the Training Frameworks IN, which aims to develop schema for FAIR Data Stewardship education (including train-the-trainer curricula and endorsement specifications), with lenses for Managers, Principal Investigators and Data Stewards themselves. The second, the FAIR Curriculum IN, will re-use the Carpentries' open, community-based curriculum development model [28] to develop novel modular lessons for FAIR data stewardship.

3.4 GO CHANGE

The overall purpose of the GO CHANGE pillar is to support and coordinate systemic culture change that transforms existing data management practices into the respected profession of data stewardship. This includes the development of new funding schema, sustainability strategies, and business models. GO CHANGE stakeholders range from international policy makers and national governments to organisation managers and front-line data producers and data stewards. A key IN for GO CHANGE is a FAIR resource hub that aggregates multiple resources for FAIR data stewardship planning, compliance, and assessment.

4 Participating in GO FAIR

4.1 Implementation Networks

GO FAIR INs foster a collaborative community of harmonized practice which leads to Convergence and allows members to ‘speak with one voice’ on critical issues regarding FAIR data infrastructures. Anyone (i.e., a person, an institution or a network organisation) can join an existing or create a new GO FAIR IN [29]. The list of current GO FAIR INs can be found at the GO FAIR website [30].

4.2 Launching an IN

The requirements to become an IN are minimal: 1) have a plan to implement an element of the IFDS (including adequate resourcing to accomplish the proposed goals); 2) comply with the GO FAIR Rules of Engagement (essentially, commitment to the FAIR Principles and ‘no vendor lock-in’ [https://www.go-fair.org/implementation-networks/rules-of-engagement/]); 3) have sufficient critical mass to be regarded as thought leaders in the field of expertise. Moreover, IN leaders will compose a ‘manifesto’ describing the goals and mode of operation of the IN [31]. The manifesto can be drafted with the assistance of the GFISCO as part of ongoing, periodic, 1-day Manifesto Workshops [32]. Interested parties can initiate the application process by completing an online intake form [33].

Acknowledgments. We thank Peter Wittenburg and Laurence Lannom for reviewing the manuscript and offering constructive commentary.

References

[1] Stehouwer H & Wittenburg P. RDA Europe: Data Practices Analysis. (Jan 11, 2018) http://hdl.handle.net/11304/6e1424cc-8927-11e4-ac7e-860aa0063d1f
[2] Data Scientist Report, Crowdflower. (2017) https://visit.crowdflower.com/WC-2017-Data-Science-Report_LP.html
[3] Schloss PD. Identifying and Overcoming Threats to Reproducibility, Replicability, Robustness, and Generalizability in Microbiome Research. mBio, 9(3), e00525–18. (2018) http://doi.org/10.1128/mBio.00525-18
[4] Gorgolewski KJ & Poldrack RA. A Practical Guide for Improving Transparency and Reproducibility in Neuroimaging Research. PLOS Biology 14(7): e1002506. (2016) https://doi.org/10.1371/journal.pbio.1002506
[5] Barend Mons. Data Stewardship for Open Science: Implementing FAIR Principles, 1st Edition. Chapman and Hall/CRC. (2018)
[6] Research Data Alliance https://www.rd-alliance.org
[7] Implementation Roadmap for the European Open Science Cloud (14 March 2018) http://www.esfri.eu/ri-world-news/implementation-roadmap-european-open-science-cloud
[8] New Models of Data Stewardship, NIH Data Commons https://commonfund.nih.gov/commons
[9] How expensive is FAIR compliance and how expensive is it to not be FAIR compliant. RDA 11th Plenary BoF meeting (2018) https://rd-alliance.org/how-expensive-fair-compliance-and-how-expensive-it-not-be-fair-compliant-rda-11th-plenary-bof
[10] G7 SCIENCE MINISTERS’ COMMUNIQUÉ. Turin, 27 – 28 September http://www.g7italy.it/sites/default/files/documents/G7%20Science%20Communiqué.pdf
[11] Progress Towards the European Open Science Cloud: GO FAIR Office Established, Global Action Platform (2017) http://globalactionplatform.org/post/progress-towards-the-european-open-science-cloud-go-fair-office-established
[12] Thomas P Hughes. Networks of Power: Electrification in Western Society, 1880–1930. Baltimore: Johns Hopkins University Press. (1983)
[13] Wittenburg P & Strawn G. Common Patterns in Revolutionary Infrastructures and Data. US National Academy of Sciences (February, 2018) https://www.rd-alliance.org/sites/default/files/Common_Patterns_in_Revolutionising_Infrastructures-final.pdf
[14] International DAITF Workshop at the ICRI 2012 Conference http://www.icri2012.dk/www.ereg.me/ehome/index06e1.html
[15] Research Data Alliance, Data Foundation & Terminology Group, Core Terms and Model http://hdl.handle.net/11304/5d760a3e-991d-11e5-9bb4-2b0aad496318
[16] The FAIR Data Principles, FORCE11 https://www.force11.org/group/fairgroup/fairprinciples
[17] Jointly designing a data FAIRPORT (13-16 January 2014), Lorentz Center, Faculty of Science of Leiden University, Leiden, The Netherlands https://www.lorentzcenter.nl/lc/web/2014/602/info.php3?wsid=602
[18] Wilkinson MD, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3 (2016), 160018. doi:10.1038/sdata.2016.18
[19] GO FAIR, FAIR Principles Explained. https://www.go-fair.org/fair-principles/
[20] Wilkinson MD, et al. A design framework and exemplar metrics for FAIRness. Sci. Data 5:180118 doi: 10.1038/sdata.2018.118 (2018)
[21] Research Data Alliance, Data Type Registries Recommendations (Endorsed) https://www.rd-alliance.org/group/data-type-registries-wg/outcomes/data-type-registries
[22] GO FAIR International Support and Coordination Office (GFISCO), http://go-fair.org
[23] American Geophysical Union’s Enabling FAIR Data Project http://www.copdess.org/enabling-fair-data-project/
[24] GO FAIR Chemistry Implementation Network (ChIN), Supporting FAIR Exchange of Chemical Data Through Standards Development https://iupac.org/event/supporting-fair-exchange-chemical-data-standards-development/
[25] Wilkinson MD, et al. Interoperability and FAIRness through a novel combination of Web technologies. PeerJ Computer Science 3:e110 https://doi.org/10.7717/peerj-cs.110 (2017)
[26] FAIR Data Point Specification https://github.com/DTL-FAIRData/FAIRDataPoint/wiki/FAIR-Data-Point-Specification
[27] 500,000 data scientists needed in European open research data, JoinUp Platform, European Commission (2016) https://joinup.ec.europa.eu/news/500000-data-scientists-need
[28] The Carpentries https://carpentries.org
[29] GO FAIR Implementation Networks https://www.go-fair.org/implementation-networks/
[30] GO FAIR Current Implementation Networks https://www.go-fair.org/implementation-networks/overview/
[31] GO FAIR Implementation Network manifesto template https://www.go-fair.org/manifesto-template/
[32] GO FAIR Manifesto Workshops https://www.go-fair.org/implementation-networks/starting-new-implementation-network/manifesto-workshop
[33] GO FAIR Implementation Network intake form https://www.go-fair.org/implementation-networks/starting-new-implementation-network/implementation-network-application-form/