
June

Comparing FAIR Assessment Tools and their Alignment with FAIR Implementation Profiles Using Digital Humanities Datasets

Andre Valdestilhas

Menzo Windhouwer

Ronald Siebes

Shuai Wang

0 Department of Computer Science, Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
1 KNAW - Humanities Cluster, Oudezijds Achterburgwal 185, 1012 DK Amsterdam, The Netherlands
2 University Library, Maastricht University, Grote Looiersstraat 17, 6211 JH Maastricht, The Netherlands

2025


The FAIR principles serve as guidelines for implementing data and metadata to improve Findability, Accessibility, Interoperability, and Reusability. In recent years, numerous tools have been developed to assess how well datasets adhere to each FAIR principle. However, due to their diverse designs, these tools interpret the FAIR principles differently and provide varying assessment results, which can be confusing. Many communities publish datasets that follow similar data management practices, and some of these common practices have recently been compiled into community standards known as FAIR Implementation Profiles (FIPs). This paper compares the metrics of FAIR assessment tools with the FIP. We illustrate the differences by analyzing the assessment results of two datasets in the Digital Humanities domain and further explore how these results compare with their corresponding FIPs.

Keywords: FAIR assessment, Digital Humanities, FAIR Implementation Profile

1. Introduction

Digital Humanities (DH). We compare the assessment results of the representative datasets of two DH communities with their corresponding FIPs.

Due to the interdisciplinary nature of the research field, DH researchers often encounter various types of data. The datasets used and published in one project can differ significantly from those in another, leading to complex approaches to data management. Some datasets are available through modern data infrastructures, while others can only be accessed on outdated websites, contributing to the diversity of implementation choices. Certain implementation choices may depend on the selected data infrastructure, particularly in terms of metadata. To complicate matters further, the interpretation of the FAIR principles can vary among FAIR assessment tools (FATs), resulting in inconsistent scores [26]. How do the assessment results of DH datasets differ between FATs, and what do the differences imply? How do the differences between datasets impact findability, interoperability, and other aspects? How does a FIP capture the diversity of choices between datasets published by researchers in a community? Addressing these questions would not only guide better implementation decisions in data management but also ease the comparison between communities.

Given the variance in how different FATs interpret the FAIR principles and the differences between communities mentioned above, we study the following research questions (RQs). RQ1: How do the FATs and the FIP differ in their interpretation of the FAIR principles? RQ2: How do the FAIR assessment results of DH datasets differ between FATs? RQ3: How do the FAIR assessment results of representative datasets from various DH communities align with their respective FIPs? In particular, we compare the assessment results of two selected representative datasets within DH communities. In addition, we examine the aspects included in the FIPs and investigate how the assessment results correspond to these profiles. The evaluation results reveal disparities in FAIR assessment outcomes, since different DH communities have developed their own approaches.

The remainder of this paper is organized as follows. In Section 2, we introduce the FATs and the FIP concept in more detail. Section 3 describes the methodology and implementation decisions: the selected DH communities, their FIPs, and the representative datasets. Section 4 provides a concise comparison of the FATs’ assessment metrics with the FIP and studies how well the assessment results align with the FIPs. Finally, Section 5 discusses options to mitigate the problems identified.²

2. Preliminaries and Related Work

The past years have witnessed the development of many FAIR assessment tools³ using metrics tailored to different needs and offering different interfaces [2]. A recent study compared 20 FATs and their 1180 relevant metrics and highlighted the different characteristics of the tools and the trends over time [4]. Seven automatic FATs are web services that can perform evaluations on datasets: AutoFAIR [3], FAIR Checker [8], FAIR Enough [7], FAIR Evaluator [25], F-UJI [11], FAIR EVA [1], and FAIROs [10]. FAIROs is intended for a specific format of digital objects, namely RO-Crate [4]. AutoFAIR is further tailored towards bioinformatics and FAIR Checker towards life sciences. Some of these FATs could be adapted to specific disciplines [4], but none is specialized in social science and humanities. Sun et al. compared FAIR Evaluator, FAIR Checker, and F-UJI, focusing on the characteristics of the evaluation tools, the FAIRness evaluation metrics, and the testing results on some public datasets [18].

The EOSC FAIR Metrics subgroup reported that these FATs have comparability issues, leading to inconsistency [26]. They showed by example that the same dataset could end up with completely different assessment results [26]. A closer examination shows that the FATs studied use different numbers of tests and that the distribution of tests differs: F-UJI has most of its tests in the “Reusability” domain, while the FAIR Evaluator has most of its tests on Findability and Interoperability [26]. A recent study reported that some items in the assessment metrics of 20 FATs do not have a one-to-one match with a FAIR principle (i.e., the match could be one-to-many), are misaligned, or are not about FAIR at all [4]. Moreover, the assessment results can differ even if the metrics align [12].

² The supplementary material is available on Zenodo with DOI: 10.5281/zenodo.15261773.
³ As of 18 April 2025, 30 tools and services are listed and compared on FAIRassist: https://fairassist.org/.

Alternatively, FAIRness can be assessed by manual questionnaires such as the FAIR Data Self-Assessment Tool⁴ and FAIR Implementation Profiles (FIPs) [14]. This paper focuses on FIPs. A FIP serves as a comprehensive framework that captures the strategies for the management of digital assets of a FAIR Implementation Community (FIC), a self-identified organization (often with more than one person) with a common interest that aspires to the creation of FAIR data and services [14]. The creation of a FIP starts with a FIP template consisting of 21 questions that address various facets of the FAIR principles, focused explicitly on datasets and their associated metadata. The answer to each question is restricted to FAIR Enabling Resources (FERs): a list of tools, services, licenses, infrastructures, and other resources that help researchers, data stewards, and institutions make data FAIR. These answers are considered community standards. The questions prompt thoughtful responses that illustrate how resources are allocated and utilized to facilitate adherence to the FAIR principles. A FIP not only captures the commitment of the FIC to implementing FAIR practices but also serves as a valuable tool for evaluating and enhancing the effectiveness of these efforts over time.

In recent years, FIPs have gained popularity in many domains, including Social Science and Humanities (SSH) [22]. These FIPs can serve as references for decisions in data management, especially for the R1.3 principle: (Meta)data meet domain-relevant community standards. The applications of FIPs for data management include providing suggestions for Data Management Plans [16] and decisions for the upcycling of legacy data [21]. To facilitate the creation of FIPs, the FIP Wizard⁵, a user-friendly web interface, guides users through the questions to document their FAIR implementation strategies. This process fosters introspection within communities, helping them recognize their strengths and the aspects of their FAIR practices that require enhancement. Furthermore, FIPs can raise awareness within a particular field and act as references for other communities to create policies that align with the FAIR principles. The resulting FIPs are published as nanopublications [14]. Published FIPs, FERs, and other related resources can be found through FAIR Connect’s search engine.⁶ Next, we introduce the FIPs of two communities.

CLARIN (Common LAnguage Resources INfrastructure) is one of the oldest European ERICs (European Research Infrastructure Consortia) and serves the community of researchers developing and interested in Language Resources and Tools (LRT)⁷. Nowadays, its network spans 243 institutions in 24 countries in Europe and South Africa and their associated communities of researchers. Over the years, this infrastructure has developed a set of requirements for datasets and their metadata. These requirements are vetted by the CLARIN Technical Centre Committee in the “Checklist for CLARIN B-Centres”⁸. To become a CLARIN B centre, a candidate centre does a self-assessment, which is reviewed by the CLARIN Assessment Committee and, in general, results in the B centre status being granted. This procedure is repeated every 3 years. Although these requirements predate the FAIR principles, they largely overlap with them. When a FIP was created for CLARIN, these requirements therefore became the basis, resulting in the answers to the FIP questionnaire shown in Table 1⁹. The FIP also includes some resources under development, shown in italics to indicate future use. This shows that long-established communities, like the LRT community now represented by CLARIN, often have a well-established and still actively developed set of technologies in place to meet the FAIR principles. CLARIN develops and maintains the ‘CLARIN Virtual Language Observatory’ [20], which provides access to the joint metadata domain of the European CLARIN infrastructure on LRT. The FIP was created in collaboration with the FAIR Expertise Hub using the FIP template 4.3.4.

The ODISSEI Portal Community is a subcommunity of ODISSEI¹⁰, the Dutch Research Infrastructure for Social Sciences and Economics, which is a collaborative consortium of 45 member organisations, including social sciences faculties in the Netherlands, the Central Statistics Office (CBS), public research agencies, and research institutes. This community advances the implementation of the FAIR principles by developing and maintaining the ODISSEI Portal [6].¹¹ Unlike CLARIN, this community does not produce datasets, but collects and provides metadata through its portal. Thus, its FIP mostly focuses on metadata. The FIP was created in a similar fashion to that of CLARIN, with the same template.

⁴ https://ardc.edu.au/resource/fair-data-self-assessment-tool/
⁵ https://fip-wizard.ds-wizard.org/
⁶ https://fairconnect.pro/search-fair-nanopublications/
⁷ https://www.clarin.eu/
⁸ http://hdl.handle.net/11372/DOC-78
⁹ Answering a FIP question by items taken from a registry is a feature not yet implemented in the FIP Wizard. Thus, the FIP in the supplementary material does not have it.
¹⁰ https://odissei-data.nl/

3. Methodology and Implementation Decisions

Our methodology consists of three steps: 1) align the assessment metrics of the FATs with the FIPs, 2) compare the assessment results of representative datasets, and 3) reflect on the assessment results in light of the FIPs.

Step 1. To better understand how the FATs align with FIPs, we compare the metrics of two FATs and align them with each other and with the FIP template (RQ1). This clarifies why the assessment results differ and how the differences can be explained. As explained in Section 2, there are many automated FATs. Given the comparability issues addressed by [26], in this paper we use FAIR Enough and FAIR Checker, as they both use the FAIR Maturity Indicators (MI) [13] as their assessment criteria. They were selected for their state-of-the-art development, usability, and robust performance. We leave the comparison of the assessment results with the remaining FATs for future work. FAIR Enough is built on the FAIR Test Python library¹² and evaluates how well online resources follow the FAIR principles [7], integrating insights from previous assessment initiatives such as F-UJI [11], FOOPS! [9], and FAIR Evaluator [25]. Assessments can be performed using the web service¹³ as well as the FAIR Extension [15], its Google Chrome extension. FAIR Checker¹⁴ [8] is another assessment tool that includes a collection of SPARQL queries for evaluating the FAIR principles and a SHACL constraints generator for enhancing metadata completeness. It can be used as a web service as well as through its RESTful API [8]. Moreover, it specializes in digital objects in the life sciences. During alignment, where compatibility issues occur between metrics, we choose the closest possible one-to-one match while avoiding misalignment.
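The alignment in Step 1 can be pictured as a small lookup table keyed by FAIR subprinciple. The following is a minimal sketch only: the class, field, and function names are hypothetical, and the two rows merely restate the F4 and I1 examples given in Section 4.1, not the full alignment in the supplementary material.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlignmentRow:
    """One row of a hypothetical metric-to-FIP alignment table."""
    principle: str               # FAIR subprinciple, e.g. "I1"
    fair_checker: Optional[str]  # what FAIR Checker tests (None: no example in the text)
    fair_enough: Optional[str]   # what FAIR Enough tests (None: no example in the text)
    fip_question: str            # the FIP question the metrics are aligned with


# Illustrative rows only; wording paraphrases the examples in Section 4.1.
ALIGNMENT = [
    AlignmentRow(
        principle="F4",
        fair_checker=None,
        fair_enough="data can be found in Bing",
        fip_question="Which service is used to publish metadata/data?",
    ),
    AlignmentRow(
        principle="I1",
        fair_checker="at least one RDF triple can be found in metadata",
        fair_enough="strong/weak variants for metadata and data",
        fip_question="What knowledge representation language do you use "
                     "for metadata records?",
    ),
]


def tools_covering(principle: str) -> List[str]:
    """List the tools for which this sketch records a metric on the principle."""
    tools = []
    for row in ALIGNMENT:
        if row.principle != principle:
            continue
        if row.fair_checker is not None:
            tools.append("FAIR Checker")
        if row.fair_enough is not None:
            tools.append("FAIR Enough")
    return tools
```

A table of this shape makes the one-to-one matches explicit and leaves visible gaps where one side has no counterpart.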

Step 2. We choose the representative datasets of the two communities and obtain their assessment results for comparison to answer RQ2. Although it is easy to obtain the assessment results of an individual dataset and aggregate the scores corresponding to each subprinciple [12], there is no aggregation scheme, as far as the authors are aware, that compiles the results of all the datasets and other digital assets a community publishes into one aggregated assessment result. Such an aggregation is not as simple as taking the average of the assessment results of the datasets. Many factors must be taken into account: the evolution of data management strategies, different versions of datasets, changes of data infrastructure, duplicates, exceptional datasets, different licenses, etc. As a pilot study, we focus on the communities’ representative datasets, which are assessed by the selected FATs. Apart from the regular communities that produce and publish data, some communities have an impact on datasets and their management by providing data infrastructures and services. These communities are often not the curators of the datasets themselves. Thus, we choose one community of each kind. Our paper takes a similar approach to [18] but with a focus on DH datasets. Next, we introduce the two chosen DH communities and their representative datasets. We then obtain the assessment results of the representative datasets using the two selected tools. To answer RQ2, we rely on the answer to RQ1 and examine three aspects: 1) coverage of the FAIR principles, 2) the differences in assessment results, and 3) the format and reuse of the results.
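For a single dataset, aggregating scores per principle is straightforward, as noted above. A minimal sketch, assuming each metric result is a hypothetical triple of subprinciple, points earned, and points possible (the example values are invented for illustration and are not assessment results from this study):

```python
from collections import defaultdict


def aggregate_by_principle(results):
    """Sum (earned, possible) points per FAIR principle letter."""
    totals = defaultdict(lambda: [0, 0])
    for subprinciple, earned, possible in results:
        letter = subprinciple[0]  # "F1" -> "F", "R1.1" -> "R"
        totals[letter][0] += earned
        totals[letter][1] += possible
    return {letter: tuple(points) for letter, points in totals.items()}


# Invented example input, not taken from the paper.
example = [
    ("F1", 2, 2),
    ("F2", 1, 2),
    ("A1", 0, 2),
    ("I1", 2, 2),
    ("R1.1", 0, 2),
]
summary = aggregate_by_principle(example)
```

The sketch deliberately stops at one dataset: combining many datasets of a community would need to handle the versions, duplicates, and infrastructure changes listed above, which a plain sum or average does not.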

For CLARIN, we choose the Awetí dataset¹⁵ in The Language Archive at the Max Planck Institute for Psycholinguistics (MPI-PL) [5]. The MPI-PL offers resources for the long-term archiving of language resources and for tool development. The institute was the birthplace of the CLARIN infrastructure and is the primary archive of the DOBES project, funded by the Volkswagen Foundation, for multimodal resources on endangered languages collected by trained field linguists. The Language Archive has been a CLARIN B centre for many years and provides its (meta)data in compliance with the CLARIN requirements, so the FAIR assessment of the dataset could be a good representative of the CLARIN community.

¹¹ https://portal.odissei.nl/
¹² https://github.com/MaastrichtU-IDS/fair-test
¹³ https://fair-enough.semanticscience.org/
¹⁴ https://fair-checker.france-bioinformatique.fr/
¹⁵ https://hdl.handle.net/1839/74209f7d-c6f-4129-8afd-64c78f4d300e

For the ODISSEI Portal community, we select the dataset “Banen en lonen op basis van de Polisadministratie” (“Jobs and wages based on the Policy Administration” in English)¹⁶, which contains data on jobs and wages in Dutch companies, derived from the Polisadministratie (the “Policy Administration” in English), which records all income relationships subject to wage tax. For simplicity, we refer to this dataset as the ‘Banen’ dataset. The dataset is managed by Statistics Netherlands (CBS) and is accessible through the ODISSEI portal, requiring approval for use. This dataset is vital for the Semantic Web because it provides structured, detailed data about employment and wages in the Netherlands, which can be linked and integrated with other datasets to enhance understanding of labor markets, economic trends, and social factors. By using standardized formats and ontologies, the data can be made machine-readable and interoperable, facilitating automated analysis and richer insights across disciplines. Its availability as open data also supports transparent research and evidence-based policy-making. Integrating such datasets into the Semantic Web ensures that they are accessible, discoverable, and useful for a wide range of users and applications. Since the Banen dataset is not publicly accessible, some assessment entries are expected to fail.

Step 3. Finally, our RQ3 requires us to use the assessment results and compare them with the FIPs regarding each aligned assessment metric. We discuss how FERs are used and how they are reflected in the assessment results. We compare the FIPs and discuss how to improve FAIRness scores.

4. Evaluation and Discussion

4.1. Comparing FATs’ assessment metrics with FIP questions

We answer RQ1 by comparing the evaluation metrics with the questions of the FIP. They address FAIR at different levels and emphasize different aspects: the FAIR assessment has much more technical detail on the “behavior” of a digital asset. FATs detect whether certain standards or specified resources are being used and return Success/Failure. The FIP, however, is broader, with open questions on the use of resources, i.e., community choices. For example, FAIR Checker awards points when the condition “Metadata includes provenance” is met, by verifying that at least one provenance property from the PROV, DCTerms, or PAV ontologies is found in the metadata. In contrast, the FIP asks “What metadata schema do you use for describing the provenance of your datasets?” For this reason, aggregating the assessment results requires the FATs to include the detected resources in their output. The assessment can also differ in specificity. For F4, the FIP asks which service is used to publish metadata/data, unlike FAIR Enough, which tests whether the data can be found in Bing. For I1, FAIR Checker verifies that at least one RDF triple can be found in the metadata. In contrast, the corresponding FIP question is much more specific: “What knowledge representation language (allowing machine interoperation) do you use for metadata records?” For FAIR Enough, the metrics are different, with a split into strong and weak variants for metadata and data, respectively. In addition, the assessment of interoperability can be muddled. For example, the assessment of the use of ontologies is in I3 for the FIP but in I2 for FAIR Checker, and is not well addressed by FAIR Enough. As shown above, there are compatibility issues between FAIR assessment metrics and the FIP. This adds complexity to RQ2 and RQ3, as shown below.
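The provenance example above can be made concrete. The sketch below is not FAIR Checker’s implementation (which, per Section 3, relies on SPARQL queries); it is a simplified illustration over plain (subject, predicate, object) tuples. The three namespace IRIs are those of the vocabularies named in the text, while the choice of terms within each vocabulary is an assumption made for this example. Returning the detected vocabularies, rather than only a pass/fail verdict, is what would let a result be compared with a FIP answer.

```python
# Namespace IRIs of the three vocabularies named in the text.
PROVENANCE_NAMESPACES = {
    "PROV": "http://www.w3.org/ns/prov#",
    "DCTerms": "http://purl.org/dc/terms/",
    "PAV": "http://purl.org/pav/",
}

# Assumption: an illustrative subset of provenance-related terms,
# not the list any particular FAT actually checks.
PROVENANCE_TERMS = {
    "PROV": {"wasGeneratedBy", "wasDerivedFrom", "wasAttributedTo"},
    "DCTerms": {"creator", "source", "provenance"},
    "PAV": {"createdBy", "derivedFrom", "version"},
}


def detect_provenance(triples):
    """Return sorted names of vocabularies whose provenance terms occur as predicates."""
    found = set()
    for _subject, predicate, _obj in triples:
        for name, namespace in PROVENANCE_NAMESPACES.items():
            if not predicate.startswith(namespace):
                continue
            if predicate[len(namespace):] in PROVENANCE_TERMS[name]:
                found.add(name)
    return sorted(found)


def passes_provenance_metric(triples):
    """Success/Failure view: at least one provenance property is present."""
    return bool(detect_provenance(triples))


# Invented metadata for illustration.
sample = [
    ("https://example.org/ds", "http://purl.org/dc/terms/creator", "A. Researcher"),
    ("https://example.org/ds", "http://purl.org/dc/terms/title", "Example dataset"),
]
```

Here `passes_provenance_metric(sample)` corresponds to the FAT-style verdict, while `detect_provenance(sample)` is closer to the kind of answer a FIP question asks for.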

4.2. Comparing the assessment results of representative datasets

Next, we answer RQ2 (how the assessment results differ between the selected FATs) by studying how the results differ per aspect of FAIR. We then study the differences in the assessment results and discuss the reasons. Finally, we evaluate the reuse of these results.

Both FATs skip some subprinciples. FAIR Checker has only 12 metrics, compared with the more sophisticated assessment of 22 metrics by FAIR Enough. They overlap on 10 metrics across the FAIR principles, including F1, F2, A1, I1, I2, I3, and R1.1. Both FATs have more metrics about metadata than about data. Furthermore, the metrics of FAIR Enough contain more metadata-data pairs.

Looking at the overall result, both FATs give higher marks to the Banen dataset: FAIR Checker assigns 70.83% to Banen compared with 16.67% for Awetí, and FAIR Enough scores Awetí 3/22 and Banen 10/22. On the overlapping metrics, the two FATs agree on 7 out of 10 for Awetí. For Banen, by contrast, they agree on only half of the criteria, even when FAIR Checker’s ‘1/2’ is matched with FAIR Enough’s ‘Success’. The differences lie mostly in the identifier, the communication protocol, authentication and authorization, external links and outward references, and the licenses. This reflects the differences in implementation design between the FATs, which could be explored further. Notice that the FATs have a strong preference for metadata represented as Linked Data. Only F-UJI, which is run as part of the FAIR Enough run, recognizes the use of CMDI by CLARIN and is able to interact with it, although in a very limited way. FAIR Enough also failed to detect the use of Handle, so its assessment result for F1 contains errors.
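The two tools report on different scales, so a small amount of arithmetic helps when reading the figures above. The sketch below rests on an assumption that is not stated in the text: that each of FAIR Checker’s 12 metrics is scored 0, 1, or 2, giving 24 points in total. Under that assumption the reported percentages correspond to 17/24 and 4/24; these point totals are inferred, not reported. The agreement helper uses invented verdicts purely to show the counting.

```python
def percentage(earned, possible):
    """Express a score as a percentage rounded to two decimals."""
    return round(100 * earned / possible, 2)


def agreement(verdicts_a, verdicts_b):
    """Return (number of matching verdicts, number of shared metrics)."""
    shared = sorted(set(verdicts_a) & set(verdicts_b))
    same = sum(1 for metric in shared if verdicts_a[metric] == verdicts_b[metric])
    return same, len(shared)


# Assumption: 12 metrics x 2 points = 24 points for FAIR Checker.
banen_fair_checker = percentage(17, 24)   # inferred point total
aweti_fair_checker = percentage(4, 24)    # inferred point total

# FAIR Enough scores as reported in the text, put on the same scale.
aweti_fair_enough = percentage(3, 22)
banen_fair_enough = percentage(10, 22)

# Invented pass/fail verdicts, only to illustrate the agreement count.
tool_a = {"F1": True, "F2": False, "A1": True}
tool_b = {"F1": True, "F2": True, "A1": True, "I1": False}
matches = agreement(tool_a, tool_b)
```

Note that mapping a partial score such as ‘1/2’ to a pass, as done in the comparison above, is itself a design choice that affects the agreement count.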

The assessment results can be downloaded from FAIR Checker in CSV format, while FAIR Enough uses JSON. Most recently, FAIR Checker also offers its results in RDF. The results of the two tools are not interoperable. The explanation of the assessment result of each metric is given as free text rather than in a structured form. None of the assessment results can be used directly to enrich the corresponding FIP without human interpretation, which limits their reuse.
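To illustrate what bridging the two formats involves, the sketch below maps a CSV export and a JSON export onto one common record shape. The column names (`metric`, `score`, `comment`) and JSON keys (`results`, `id`, `status`, `log`) are hypothetical and chosen only for this example; the actual export layouts of FAIR Checker and FAIR Enough may differ.

```python
import csv
import io
import json


def from_csv(text, tool):
    """Normalize a CSV export (hypothetical columns: metric, score, comment)."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "tool": tool,
            "metric": row["metric"],
            "passed": row["score"] != "0",   # any non-zero score counts as a pass
            "comment": row["comment"],
        })
    return records


def from_json(text, tool):
    """Normalize a JSON export (hypothetical keys: results, id, status, log)."""
    records = []
    for item in json.loads(text)["results"]:
        records.append({
            "tool": tool,
            "metric": item["id"],
            "passed": item["status"] == "success",
            "comment": item.get("log", ""),
        })
    return records


# Invented sample exports for illustration.
csv_sample = "metric,score,comment\nF1,2,identifier is persistent\nR1.1,0,no license found\n"
json_sample = '{"results": [{"id": "F1", "status": "failure", "log": "no PID detected"}]}'

combined = from_csv(csv_sample, "FAIR Checker") + from_json(json_sample, "FAIR Enough")
```

Even after such normalization the `comment` field remains free text, so the resources a tool detected still cannot be extracted without human interpretation.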

4.3. Comparing the assessment results with FIPs

Finally, for RQ3, we compare the assessment results with their corresponding FIPs. We further elaborate on our observations, analyze the differences, and discuss means to improve the FAIRness scores.

Findability. Both CLARIN and ODISSEI use globally unique, persistent identifiers such as DOIs and Handles for metadata. The two datasets use DOI and Handle, and are aligned with the community standards captured by the FIPs. CLARIN relies on CMDI and Dublin Core for metadata schemas, while ODISSEI uses DDI, DCAT, and Croissant. The use of multiple schemas in ODISSEI may enhance compatibility with different platforms. Here, the Awetí dataset lost points with both FATs, whereas the Banen dataset gained points. Metadata indexing varies: CLARIN’s metadata is indexed in the Virtual Language Observatory (VLO), while ODISSEI’s metadata is found in Zenodo and its own portal. Both are aligned with community standards: Awetí can be found in the VLO and Banen can be found in the ODISSEI Portal. However, this aspect was assessed by neither FAT.
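Whether a tool credits a persistent identifier depends on which identifier schemes it recognizes. As a minimal sketch, and not the logic of any of the FATs discussed, the function below classifies an identifier as a DOI or a Handle from its resolver host or prefix. The heuristics are simplifying assumptions made for this example.

```python
from urllib.parse import urlparse


def pid_scheme(identifier):
    """Classify an identifier as 'DOI', 'Handle', or None using simple heuristics."""
    text = identifier.strip()
    lowered = text.lower()
    if lowered.startswith("doi:"):
        return "DOI"
    if lowered.startswith("hdl:"):
        return "Handle"
    host = (urlparse(text).hostname or "").lower()
    if host in {"doi.org", "dx.doi.org"}:
        return "DOI"
    if host == "hdl.handle.net":
        return "Handle"
    return None


# Identifiers that appear in this paper's footnotes.
aweti = pid_scheme("https://hdl.handle.net/1839/74209f7d-c6f-4129-8afd-64c78f4d300e")
supplement = pid_scheme("https://doi.org/10.5281/zenodo.15261773")
portal = pid_scheme("https://portal.odissei.nl/")
```

A checker that lists only some resolvers would return `None` for a valid persistent identifier, which then counts against the dataset even though the community standard is met.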

Accessibility. Both use HTTPS for accessing metadata, ensuring security. CLARIN and ODISSEI both support OAI-PMH for metadata exchange, but ODISSEI also provides APIs (Dataverse and SPARQL), increasing accessibility. Authentication for metadata is stricter in ODISSEI, which uses SURF systems, whereas CLARIN has no authentication for metadata records. Dataset authentication varies: CLARIN supports SAML and OIDC, while ODISSEI relies on SURF-SRAM and CBS Microdata Authentication, showing differences in security approaches. The FATs diverge in their authentication assessment outcomes for both datasets, underscoring the differences in implementation.
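Since both infrastructures expose metadata over OAI-PMH, a harvester addresses them in the same way: an HTTP request carrying a `verb` parameter plus verb-specific arguments. The sketch below only builds such request URLs and performs no network access. The base URL is a placeholder, not an endpoint of CLARIN or ODISSEI; `Identify`, `ListRecords`, and the `oai_dc` metadata prefix are defined by the OAI-PMH protocol.

```python
from urllib.parse import urlencode


def oai_pmh_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Placeholder endpoint for illustration only.
BASE = "https://example.org/oai"

identify_url = oai_pmh_url(BASE, "Identify")
records_url = oai_pmh_url(BASE, "ListRecords", metadataPrefix="oai_dc")
```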

Interoperability. CLARIN and ODISSEI differ in knowledge representation languages, semantic models, and metadata schemas. Although the FATs’ assessment results agree with each other, the FERs used are not highlighted in the assessment results for comparison, which could be a barrier to the assessment of metadata interoperability.

Reusability. Both infrastructures adopt Creative Commons CC0 licenses for metadata, promoting open reuse. Dataset licensing is less uniform: in CLARIN it depends on the specific context, while ODISSEI does not specify dataset licenses, potentially limiting clarity on reuse conditions. Provenance tracking differs: CLARIN lacks provenance metadata schemas, whereas ODISSEI is planning to use a customized JSON schema, offering more explicit tracking of metadata origins. In ODISSEI, the versions of the various software components used to ingest and enrich the metadata are also provided and made available via persistent identifiers.

5. Conclusion and Future Work

In this paper, we answered our first research question by critically examining the evaluation metrics of two popular FAIR assessment tools. We highlighted the differences and addressed the compatibility issues. The second research question was answered by a detailed comparison of the FATs’ results from three aspects. The differences reflect the FATs’ evaluation metrics and implementation choices. Finally, for RQ3, we compared the FATs’ results against the FIPs. Using FATs to assess the representative datasets shows that the communities’ domain-specific solutions pose problems for these generic tools. The FATs do not understand enough of some of the technologies chosen by the infrastructures, e.g., CMDI. However, to go beyond being FAIR within one’s own community, aiming for a better score with the more generic FATs will improve the maturity of FAIRness, e.g., by also making a core part of the metadata available as linked data in the landing page of a dataset.
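One common way to expose core metadata as linked data in a landing page is an embedded JSON-LD block using the schema.org vocabulary. The sketch below is a hedged illustration rather than a recommendation from either infrastructure: the function names and all example values are placeholders, and which properties count as “core” is an assumption.

```python
import json


def dataset_jsonld(name, identifier, license_url, description):
    """Describe a dataset with a few schema.org properties."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "identifier": identifier,
        "license": license_url,
        "description": description,
    }


def as_script_tag(document):
    """Wrap a JSON-LD document in the script element used for embedding in HTML."""
    body = json.dumps(document, indent=2, ensure_ascii=False)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Placeholder values, not the metadata of any dataset discussed here.
doc = dataset_jsonld(
    name="Example dataset",
    identifier="https://example.org/pid/1234",
    license_url="https://example.org/licenses/example",
    description="Placeholder description for illustration.",
)
snippet = as_script_tag(doc)
```

The richer domain-specific record, for instance in CMDI, can remain the authoritative description; the embedded block only mirrors a subset of it in a form that generic tools can parse.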

European projects, e.g., FAIRCORE4EOSC [19] and OSTrails [17], are developing solutions based on FAIR testing execution flows that address individual tests [23]. This approach allows a community to mix and match tests from the various tools so that FAIR assessment aligns better with its FIP. It also allows a community to implement some of these tests with more specialized criteria that suit its solutions better. Both infrastructures exhibit domain-specific implementations of the FAIR principles, highlighting how different research communities tailor their metadata, authentication, and interoperability approaches. These differences suggest that achieving FAIRness is not a one-size-fits-all approach. Enhancing dataset indexing, licensing policies, and provenance metadata could improve FAIR compliance while ensuring alignment with FATs. Neither infrastructure fully meets all FAIR criteria, as dataset indexing is inconsistent, and dataset licensing and provenance tracking remain underdeveloped.

In conclusion, this paper presents a proof-of-concept towards a more complete, larger-scale evaluation and analysis of communities’ FAIR practices. Although the assessment results can be easily obtained for an individual dataset, it remains challenging for DH researchers to interpret and take advantage of these disparate results. Our approach can be adapted and applied to other domains beyond SSH and with alternative FATs. Although the assessment results and their comparison with the FIP may differ, the approach can scale once an aggregation method for assessment results (including those of various digital objects) is available.

6. Acknowledgments

This publication is part of the project Social Science and Humanities Open Cloud for the Netherlands (SSHOC-NL) with file number 184.036.020 of the research programme National Roadmap for Large-Scale Research Facilities which is (partly) financed by the Dutch Research Council (NWO). The authors would like to thank Liliana Melgar for the discussions and her assistance with proofreading.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

References

[1] Aguilar Gómez and I. Bernal. FAIR EVA: Bringing institutional multidisciplinary repositories into the FAIR picture. Sci. Data, 10(1):764, Nov. 2023.

[2] Antonioletti, Wood, N. Chue Hong, Breitmoser, Moraw, and Verburg. Comparison of tools for automated FAIR software assessment, Aug. 2024.

[3] Bonello, E. Cachia, and Alfino. AutoFAIR - a portal for automating FAIR assessments for bioinformatics resources. Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms, 1865(1):194767, 2022.

[4] Candela, Mangione, and G. Pavone. The FAIR assessment conundrum: Reflections on tools and metrics. Data Science Journal, 23(1), 2024.