To be findable, accessible, interoperable and reusable:
language data and technology infrastructure for
supporting the FAIR data approach

Danguolė Kalinauskaitė
Faculty of Humanities, Centre of Computational Linguistics
Vytautas Magnus University
Kaunas, Lithuania
e-mail: danguole.kalinauskaite@vdu.lt


    Abstract–This paper summarizes the FAIR principles, with a focus on
(meta)data management infrastructure as the conduit for putting these
principles into practice. It gives an overview of the European Open
Science Cloud as support for the FAIR data approach in general, and of
Lithuanian practice in supporting the sharing, use and sustainability of
language resources specifically: the CLARIN infrastructure is presented
as a networked federation of repositories and service providers, the
Lithuanian experience in developing a national language resources
infrastructure is detailed, and the practical benefits that Lithuania
gained by joining the European infrastructure are reviewed.

    Keywords–FAIR; data management; infrastructures; European Open
Science Cloud; CLARIN; language resources

                     I. INTRODUCTION
    Humans and machines often face distinct barriers when attempting to
find and process data on the web. Humans have an intuitive sense of
“semantics” (the meaning or intent of a digital object): they are
capable of identifying and interpreting a wide variety of contextual
cues, whether those take the form of structural/visual/iconic cues in
the layout of a web page, or the content of narrative notes. The primary
limitation of humans, however, is that they are unable to operate at the
scope, scale, and speed necessitated by the scale of contemporary
scientific data and the complexity of e-science [1]. For this reason,
humans increasingly rely on computers to undertake tasks on their
behalf.
    Machines therefore need to be capable of acting autonomously and
appropriately when faced with a wide range of data types, formats, and
access mechanisms/protocols. They also need to keep an exquisite record
of provenance so that the data they collect can be accurately and
adequately cited [1]. All participants in the data management process
(from researchers and data producers to data repository hosts) are
therefore of paramount importance in assisting computers and improving
this process.
    Assisting machines in their discovery and exploration of data,
through the application of more generalized interoperability
technologies and standards at the data/repository level, thus becomes a
first priority for good data management. A set of guidelines, summarized
by the acronym FAIR, puts specific emphasis on enhancing the ability of
machines to automatically find and use data, in addition to supporting
its reuse by individuals. These guidelines are discussed below.
    The paper is organized as follows. The FAIR principles are presented
in section II, infrastructures for implementing these principles in
practice are described in section III, and the Lithuanian practical
results in supporting the FAIR approach are detailed in section IV.
    The goal of this paper is to present infrastructures as one of the
means to overcome data discovery and reuse obstacles in order to be more
FAIR, with a focus on Lithuanian experience in developing a language
resources infrastructure.

                     II. TO BE “FAIR”

A. Cooperation of humans and machines
    Data-intensive sciences face the challenge of facilitating knowledge
discovery. Computational analysis to discover meaningful patterns in
massive, interlinked datasets is rapidly becoming a routine research
activity. However, datasets must be prepared for such analysis, so this
kind of activity, in turn, calls for ever closer cooperation of humans
and machines in the access to, and the integration and analysis of,
scientific data.
    Providing machine-readable data as the main substrate for knowledge
discovery, so that e-scientific processes run smoothly and sustainably,
is considered to be one of the biggest challenges of e-science. As has
been observed, the reason that we often need several weeks (or months)
of specialist technical effort to gather the data necessary to answer
research questions is not a lack of appropriate technology; the reason
is that we do not pay our valuable digital objects the careful attention
they deserve when we create and preserve them [1]. The cooperation of
humans and machines therefore needs to follow certain principles to
enable optimal use of data and methods.


  Copyright © 2017 held by the authors


B. FAIR principles
    A key enabler of international-grade data management is for research
data and information to be published in a “FAIR” manner. “FAIR”
summarizes a set of guiding principles regarding data, both for machines
and for people, and these principles assist the interaction between
those who want to use data and those who provide them. In the FAIR data
approach, data should be [1]:
     • Findable (F)
    F1. (Meta)data are assigned a globally unique and eternally
persistent identifier.
    F2. Data are described with rich metadata (defined by R1 below).
    F3. (Meta)data are registered or indexed in a searchable resource.
    F4. Metadata specify the data identifier.
     • Accessible (A)
    A1. (Meta)data are retrievable by their identifier using a
standardized communications protocol.
    A1.1. The protocol is open, free, and universally implementable.
    A1.2. The protocol allows for an authentication and authorization
procedure, where necessary.
    A2. Metadata are accessible, even when the data are no longer
available.
     • Interoperable (I)
    I1. (Meta)data use a formal, accessible, shared, and broadly
applicable language for knowledge representation.
    I2. (Meta)data use vocabularies that follow FAIR principles.
    I3. (Meta)data include qualified references to other (meta)data.
     • Reusable (R)
    R1. (Meta)data have a plurality of accurate and relevant attributes.
    R1.1. (Meta)data are released with a clear and accessible data usage
license.
    R1.2. (Meta)data are associated with their provenance.
    R1.3. (Meta)data meet domain-relevant community standards.
    Putting data on the web is therefore not enough. The FAIR principles
call for a “knowledge representation” language for data representation.
To be actually interoperable and reusable, data should not only be
properly licensed, but the methods to access and/or download them should
also be well described, preferably fully automated, and based on
well-established protocols [2]. Adopting any combination of the FAIR
principles is desirable and important (especially for data management in
small countries), and it is worth noting that these principles are
intended to apply not only to “data” in the conventional sense, but also
to the algorithms, tools, and workflows that led to that data. In other
words, the principles suggest that contemporary data resources, tools,
vocabularies and infrastructures should exhibit specific
characteristics, norms, and practices in order to assist discovery and
reuse by third parties and to be considered “FAIR” [1]. In this way, the
FAIR principles provide steps toward machine-actionability.
    The idea of being machine-actionable applies in two contexts: first,
when referring to the contextual metadata surrounding a digital object
(“what is it?”), and second, when referring to the content of the
digital object itself (“how do I process it/integrate it?”). A
distinction should also be drawn between data that are machine-actionable
as a result of specific investment in software supporting that data
type, and data that are machine-actionable exclusively through the
utilization of general-purpose, open technologies [1]. This paper
concentrates on the latter.
    FAIRness can be achieved with a wide range of technologies and
implementations. One of the solutions for supporting FAIRness is data
infrastructures. The FAIR principles guide the development of
infrastructure and tooling to make data optimally reusable for machines
and people alike, which is considered to be a crucial step forward.

     III. INFRASTRUCTURES FOR SUPPORTING THE “FAIR” DATA APPROACH
    Infrastructure in general is defined as the (usually large-scale)
basic physical and organizational resources, structures and services
needed for the operation of a society or enterprise [3]. Specifically, a
research infrastructure is an infrastructure intended for carrying out
research: facilities, resources and related services used by the
scientific community to conduct top-level research [3]. A research
infrastructure thus covers various means for researchers in different
fields. One of those means is digital resources, including both data and
software. Infrastructure initiatives enable the FAIR principles to be
implemented in practice and are therefore of paramount importance. Some
of these initiatives are discussed in the rest of this paper.

A. European Open Science Cloud
    The European Commission proposed the “European Cloud Initiative” [4]
(issued in April 2016), which set out an ambitious vision for the
European Open Science Cloud (EOSC) (see Fig. 1), to be realised by 2020.

Fig. 1. European Open Science Cloud [5]
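    As a rough illustration of how the principles listed in section II.B
could be checked by a machine, the following Python sketch tests a
metadata record against a handful of them. The record layout, the field
names (identifier, description, indexed_in, access_url, license,
provenance), the helper functions check_fair and unmet, and the example
values are all assumptions made for this illustration; they are not part
of the FAIR principles or of any infrastructure discussed in this paper.
Treating an HTTP(S) address as evidence for A1 is likewise a simplifying
assumption, and passing these checks would at best be a necessary
condition for FAIRness.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str = ""                               # F1: persistent identifier
    description: dict = field(default_factory=dict)    # F2/R1: descriptive attributes
    indexed_in: list = field(default_factory=list)     # F3: searchable resources
    access_url: str = ""                               # A1: where the record is retrievable
    license: str = ""                                  # R1.1: data usage licence
    provenance: str = ""                               # R1.2: origin of the data


def check_fair(record: MetadataRecord) -> dict:
    """Return a mapping from principle label to True/False for this record."""
    return {
        "F1": bool(record.identifier),
        "F2": len(record.description) > 0,
        "F3": len(record.indexed_in) > 0,
        # Assumption: HTTP(S) stands in for a "standardized protocol".
        "A1": record.access_url.startswith(("http://", "https://")),
        "R1.1": bool(record.license),
        "R1.2": bool(record.provenance),
    }


def unmet(record: MetadataRecord) -> list:
    """List the principle labels that the record does not yet satisfy."""
    return [label for label, ok in check_fair(record).items() if not ok]


# Made-up example record with a missing licence.
example = MetadataRecord(
    identifier="hdl:00000/example",
    description={"title": "Example corpus", "language": "lt"},
    indexed_in=["example-catalogue"],
    access_url="https://repository.example.org/item/1",
    license="",
    provenance="Collected by the example project",
)
print(unmet(example))  # prints ['R1.1']
```

    Such a check covers only the formal side of the principles: whether a
licence is clear, or whether the metadata are rich enough for a given
community, still has to be judged by people.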
    The EOSC is a data infrastructure to support Open Research Data and
Open Science in Europe: it is intended as an open and trusted
environment where research data can be safely stored and made openly
available, and is dedicated to enabling trusted access to services and
systems and the re-use of shared scientific data across disciplinary,
social and geographical borders [6]. In other words, the EOSC serves as
an ecosystem of infrastructures, and although it is a European
infrastructure, it is intended to be globally interoperable and
accessible.

B. EOSC implementation and opportunities
    The European Open Science Cloud will start by federating existing
scientific data infrastructures, which are today scattered across
disciplines and geographic locations. This step will make access to
scientific data easier, cheaper and more efficient. It will enable the
creation of new market opportunities and new solutions in key areas such
as health, environment, or transport [4].

Fig. 2. EOSC opportunities [5]

    The European Open Science Cloud will also be open for education and
training purposes in higher education and, over time, to government and
business users, as the technologies developed will be promoted for wider
application [4]. The EOSC includes the required human expertise,
resources, standards and best practices as well as the underpinning
technical infrastructures [6]. In this way, the EOSC is intended to
bring tangible benefits to society (see Fig. 3).

Fig. 3. EOSC benefits [5]

    Practically, the European Open Science Cloud will offer 1.7 million
European researchers and 70 million professionals in science and
technology a virtual environment with free at the point of use, open and
seamless services for storage, management, analysis and re-use of
research data, across borders and scientific disciplines [4]. In other
words, this infrastructure is a federated environment for scientific
data sharing and re-use, based on existing and emerging resources of the
members of the infrastructure. It will provide only lightweight
international guidance and governance and a large degree of freedom
regarding practical implementation. One of the most important aspects of
the EOSC is therefore systematic and professional data management and
long-term stewardship of scientific data assets and services in Europe
and globally [6].
    While the European Open Science Cloud is still under development,
another European infrastructure, dedicated to research on language
resources, is already operational and is an example of good practice in
creating infrastructures.

     IV. “FAIR” IN LITHUANIAN PRACTICE. LANGUAGE RESOURCES INFRASTRUCTURE
    In January 2015 Lithuania became a full member of CLARIN ERIC
(Common Language Resources and Technology Infrastructure) (see [7]),
which aims to provide easy and sustainable access to digital language
data (in written, spoken, video or multimodal form), and advanced tools
to discover, explore, exploit, annotate, analyse or combine such data
sets and tools, wherever they are located. Soon afterwards a national
consortium (CLARIN-LT) was founded by three partner universities:
Vytautas Magnus University, Kaunas University of Technology and Vilnius
University. This part of the paper therefore reflects on Lithuanian
experience in storing and accessing language resources.
    The CLARIN infrastructure at the European scale is intended for the
humanities and the social sciences, and these two domains, in turn,
include a wide range of disciplines. It is distributed, i.e.,
implemented in a network of CLARIN
Centres, and virtual, i.e., it provides services via the Internet [8].
This European infrastructure covers a wide spectrum of digital data
types:
     • Data in natural language (texts, lexicons, grammars, etc.).
     • Databases about natural language (typological databases, dialect
       databases, lexical databases, etc.).
     • Audio-visual data containing (written, spoken, signed) language
       (e.g. pictures of manuscripts, audiovisual data for language
       description, description of sign language, interviews, radio and
       TV programmes, etc.) [3].
    It also includes software to browse and search in digital language
data (e.g. software to search in a linguistically annotated text
corpus), as well as software to analyze, enrich, process, and visualize
digital language data (e.g., a parser, which enriches each sentence in a
text corpus with a syntactic structure) [3].
    The case of CLARIN ERIC points to the means for overcoming data
discovery and reuse obstacles in language technologies. The CLARIN
infrastructure supports the sharing, use and sustainability of language
data and tools through a networked federation of centres: language data
repositories, service centres and knowledge centres, with single sign-on
access for all members of the academic community in all participating
countries. Tools and data from different centres are interoperable, so
that data collections can be combined and tools from different sources
can be chained to perform complex operations to support researchers in
their work. Interoperability is ensured through the standards adopted in
the CLARIN framework. A large number of participating centres offer
access services to data, tools and expertise [7].
    The access provided to data and content is in principle sustainable
and in accordance with the principles of open science, and thus free for
scholars. All members adhere to the same criteria for the
interoperability of data and services, access conditions, and the
quality of data and services; however, countries have a large degree of
freedom to decide what they contribute to the CLARIN infrastructure [7].
The general scheme of the CLARIN services is presented in Fig. 4 to give
an overall picture of CLARIN functionality.

Fig. 4. CLARIN Services

    As shown in Fig. 4, CLARIN offers a number of basic services to
researchers, in particular:
     • Services to find digital data and get access to digital data.
     • Services to find software for processing digital data.
     • Services to apply the software to the digital data in a
       user-friendly manner.
     • Services to create and describe new data and software.
     • Services to store new data and software in CLARIN for long-term
       preservation and for making them accessible to other
       researchers [8].

A. Infrastructure as the conduit for FAIR data
    Before Lithuania became a member of the CLARIN infrastructure, its
language resources and advanced tools were accessible only via dedicated
websites. Since joining the infrastructure, Lithuanian resources and
services have been available to the research communities at large, and
Lithuania as a member of the infrastructure has gained practical
benefits:
     • Access to services provided by the infrastructure: existing tools
       and datasets, e.g. online access to national corpora, lexica,
       audio and video recordings, annotations, grammars, etc.
     • Long-term archiving: a storage guarantee can be given for a long
       period (up to 50 years in some cases). Resources can be archived
       and made available to the community in a reliable manner.
     • Persistent identifiers for data. Resources can be cited easily
       with a persistent identifier.
     • The resources and their metadata are integrated into the
       infrastructure, making it possible to search them efficiently.
     • Password-protected resources can be made available via an
       institutional login.
     • Once resources are integrated in the CLARIN infrastructure, they
       can be analyzed and enriched more easily with various linguistic
       tools (e.g. automated
       part-of-speech tagging, phonetic alignment or audio/video
       analysis).
     • Practical workshops at which a diverse group of people, including
       those with research questions and software developers, get
       together to work on producing real solutions over a short but
       intense period. The outputs of those workshops are tools,
       datasets, linked data, etc. that are to be maintained. Such
       activity stimulates the adoption or better use of digital methods
       in the community of the infrastructure.
     • Repository. Lithuanian language data and services are stored in
       the CLARIN-LT Repository [9]. It contains 5 resources:
       1. Lemmatised Wordlist of 1 m. Corpus of Contemporary Lithuanian;
       2. LITIS v.1; 3. Lithuanian Treebank ALKSNIS; 4. Lithuanian
       morphologically annotated corpus MATAS; 5. Wordlist of the
       Contemporary Corpus of Lithuanian language.

B. CLARIN-LT Repository
    Data in the CLARIN-LT Repository (Fig. 5) are made available under
the licences attached to the resources. Where there is no licence, data
are made freely available for access, printing and download for the
purposes of non-commercial research or private study [9]. The Repository
works like a library for linguistic data and tools: users can search for
data and tools and easily download them, or deposit data and be sure
that it is safely stored and that everyone can find it, use it, and
correctly cite it (Findable).

Fig. 5. CLARIN-LT Repository

    Lithuanian language resources and tools can thus now be reached not
only via the dedicated websites of those resources and tools, but also
via a specific platform provided by the CLARIN infrastructure
(Accessible).

C. Metadata
    An important step in the FAIR data approach is to publish existing
and new datasets in a semantically interoperable format that can be
understood by computer systems. By semantically annotating data items
and metadata, we can use computer systems to (semi)automatically combine
different data sources, resulting in richer knowledge discovery
activities. Rich metadata facilitate such discovery, including clear
rules regarding the process for accessing the data.
    Metadata for language resources and tools exist in a multitude of
formats. Often these descriptions contain specialized information for a
specific research community (e.g. TEI headers for text, IMDI for
multimedia collections [10]). To overcome this dispersion CLARIN has
initiated the Component MetaData Infrastructure (CMDI). It provides a
framework to describe and reuse metadata blueprints. Description
building blocks (“components”, which include field definitions) can be
grouped into a ready-made description format (a “profile”)
(Interoperable). Both are stored and shared with other users in the
Component Registry [11] to promote reuse (Reusable).
    For metadata purposes, CLARIN ERIC provides a special metadata-based
portal for language resources, the Virtual Language Observatory [12]. It
is completely based on the Component Metadata (CMDI) and ISOcat
standards. This approach allows for the use of heterogeneous metadata
schemas while maintaining semantic compatibility [13].

D. Benefits to individual groups of people and to institutions
    There are numerous and diverse stakeholders who benefit from being a
part of the infrastructure:
     • researchers share and reuse each other’s data;
     • professional data publishers offer their services;
     • software and tool-builders provide data analysis and processing
       services;
     • funding agencies (private and public) are increasingly concerned
       with long-term data stewardship;
     • data science communities mine, integrate and analyse new and
       existing data.
    Lithuania as a member of the CLARIN infrastructure makes its
resources (Fig. 6) available and accessible in the infrastructure for
other researchers. Other members, in turn, provide access to their tools
and resources incorporated in the infrastructure.
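    The component and profile mechanism described in section IV.C can be
pictured with a small Python sketch. It is a deliberately simplified
model and not the CMDI format itself: the class names (Field, Component,
Profile), the helper to_shared_view, the toy components and the
"concept:..." labels are all invented for this illustration. The sketch
shows two ideas from the text: components are defined once and reused by
several profiles, and fields that point to shared concepts let records
described with different profiles be viewed through one common set of
keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str      # field name as used in one description format
    concept: str   # shared concept the field refers to (illustrative label)


@dataclass(frozen=True)
class Component:
    name: str
    fields: tuple  # tuple of Field objects


@dataclass(frozen=True)
class Profile:
    name: str
    components: tuple  # tuple of Component objects

    def concepts(self) -> dict:
        """Map every field name in the profile to its shared concept."""
        return {f.name: f.concept for c in self.components for f in c.fields}


# Toy registry: the general component is defined once and reused twice.
general = Component("GeneralInfo", (Field("title", "concept:title"),
                                    Field("lang", "concept:language")))
text_part = Component("TextInfo", (Field("tokens", "concept:size"),))
audio_part = Component("AudioInfo", (Field("duration", "concept:size"),))

text_profile = Profile("TextCorpus", (general, text_part))
audio_profile = Profile("SpeechCorpus", (general, audio_part))


def to_shared_view(profile: Profile, record: dict) -> dict:
    """Re-key a record by shared concepts; unknown fields are dropped."""
    mapping = profile.concepts()
    return {mapping[k]: v for k, v in record.items() if k in mapping}


a = to_shared_view(text_profile,
                   {"title": "Corpus A", "lang": "lt", "tokens": 1000})
b = to_shared_view(audio_profile,
                   {"title": "Recordings B", "lang": "lt", "duration": 3600})
print(sorted(a) == sorted(b))  # prints True: both records expose the same keys
```

    In this toy model a search interface would only need to know the
shared concept labels, while each community keeps its own field names.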
Fig. 6. Lithuanian resources in CLARIN-LT Repository

    CLARIN has a lot to offer: it has adapted existing data and
software, and it has created new, user-friendly software for searching,
analysing and visualising data. However, there is still work to do: some
improvements and extensions are needed, especially in terms of
(meta)data functionality (the same also applies to CLARIN-LT).

                     V. CONCLUDING REMARKS
    In this paper the FAIR principles were summarized, with a focus on
infrastructure initiatives as the conduit for putting these principles
into practice: the European Open Science Cloud was presented as support
for the FAIR data approach, and Lithuanian experience in developing a
language resources infrastructure was reviewed.
    Good data management is not a goal in itself, but rather is the key
conduit leading to knowledge discovery and innovation, and to subsequent
data and knowledge integration and reuse by the community after the data
publication process.
    Infrastructures are one of the means to overcome data discovery and
reuse obstacles for those wishing to become more FAIR. The Lithuanian
case of membership in the infrastructure shows that being part of an
infrastructure means having a greater and wider reach.

                     REFERENCES
[1]  M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific
     data management and stewardship,” Scientific Data 3, 15 March 2016,
     doi:10.1038/sdata.2016.18. [Online]. Available:
     https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/.
[2]  “Guiding Principles for Findable, Accessible, Interoperable and
     Re-usable Data Publishing version b1.0,” FORCE11. [Online].
     Available: https://www.force11.org/fairprinciples.
[3]  J. Odijk, “The CLARIN infrastructure in the Netherlands: What is it
     and how can you use it?” 2014. [Online]. Available:
     http://www.clarin.nl/sites/default/files/CLARIN%20General%20Introduction.pdf.
[4]  “Communication: European Cloud Initiative – Building a competitive
     data and knowledge economy in Europe.” 2016. [Online]. Available:
     http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX:52016DC0178.
[5]  “European Open Science Cloud,” European Research & Innovation.
     [Online]. Available:
     https://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud.
[6]  “Realising the European Open Science Cloud.” 2016. [Online].
     Available:
     https://ec.europa.eu/research/openscience/pdf/realising_the_european_open_science_cloud_2016.pdf.
[7]  CLARIN ERIC: https://www.clarin.eu/.
[8]  J. Odijk, “Linguistic research using CLARIN,” Lingua, 178, pp. 1-4,
     2016.
[9]  CLARIN-LT Repository: https://clarin.vdu.lt/xmlui/?locale-attribute=en.
[10] CLARIN ERIC Glossary: https://www.clarin.eu/glossary.
[11] CMDI Component Registry:
     https://catalog.clarin.eu/ds/ComponentRegistry/#/?_k=arj093.
[12] CLARIN Virtual Language Observatory: https://vlo.clarin.eu/?2.
[13] D. Van Uytvanck, H. Stehouwer, L. Lampen, “Semantic metadata mapping
     in practice: the Virtual Language Observatory,” in Proceedings of
     LREC 2012: 8th International Conference on Language Resources and
     Evaluation, N. Calzolari (Ed.), pp. 1029-1034, 2012.