=Paper=
{{Paper
|id=Vol-2042/paper11
|storemode=property
|title=Towards a FAIR Sharing of Scientific Experiments: Improving Discoverability and Reusability of Dielectric Measurements of Biological Tissues
|pdfUrl=https://ceur-ws.org/Vol-2042/paper11.pdf
|volume=Vol-2042
|authors=Md. Rezaul Karim,Matthias Heinrichs,Lars C. Gleim,Michael Cochez,Emily Porter,Alessandra La Gioia,Saqib Salahuddin,Stefan Decker,Martin O’Halloran,Oya Deniz Beyan
|dblpUrl=https://dblp.org/rec/conf/swat4ls/KarimHGCPGSDOB17
}}
==Towards a FAIR Sharing of Scientific Experiments: Improving Discoverability and Reusability of Dielectric Measurements of Biological Tissues==
<pdf width="1500px">https://ceur-ws.org/Vol-2042/paper11.pdf</pdf>
<pre>
        Towards a FAIR Sharing of Scientific
     Experiments: Improving Discoverability and
      Reusability of Dielectric Measurements of
                  Biological Tissues

 Md. Rezaul Karim1,2, Matthias Heinrichs2, Lars Christoph Gleim2, Michael
 Cochez1,2,4, Emily Porter3, Alessandra La Gioia3, Saqib Salahuddin3, Martin
                O’Halloran3, Stefan Decker1,2, and Oya Beyan1,2

                     1 Fraunhofer FIT, Sankt Augustin, Germany
             2 Informatik 5, RWTH Aachen University, Aachen, Germany
      3 Translational Medical Device Laboratory, Lambe Institute of Translational
                Research, National University of Ireland, Galway, Ireland
    4 Faculty of Information Technology, University of Jyvaskyla, Jyvaskyla, Finland


        Abstract. Experiments on the dielectric properties of biological tissues
        generate data that characterizes the interaction of human tissues with
        electromagnetic fields. This data is vital for designing electromagnetic-
        based therapeutic and diagnostic technologies, and for assessing the safety
        of wireless devices. Despite the importance of the data, poor reporting
        and lack of metadata impede its reuse and hinder interoperability.
        Recently, the minimum information model for reporting Dielectric Mea-
        surements of Biological Tissues (MINDER) has been developed as a com-
        mon framework. In this work, we have developed a metadata model and
        implemented a data sharing framework to improve findability and repro-
        ducibility of experimental data inspired by FAIR principles. We define a
        process for sharing the reported data and present tools to support rich
        metadata generation based on existing community standards. The developed
        system is evaluated against competency questions collected from data
        consumers, and is shown to help interpret and compare data across
        studies.

        Keywords: Scientific Data, Dielectric Measurements, Metadata Man-
        agement, Semantic Web, FAIR Data Principles.


1     Introduction
Data sharing is the release of research data for use by others [2]. Over the last
three decades, there have been many discussions about sharing primary research
data. Early studies emphasized the role of sharing scientific data in the practice
of open scientific inquiry for verification and refinement of original resources [4].
Within the context of data intensive science, data sharing has become a main
vehicle contributing to scientific progress by enabling interdisciplinary interpre-
tation of data, optimizing the use of resources and retaining data integrity for
long term preservation [14]. On the other hand, data driven innovation also requires
low barriers to accessing, interpreting, and using rich and widely available data.
One of the data intensive innovation areas in health care is the development of
medical devices, translating novel research findings to patient care. The design,
development, and clinical evaluation of innovative medical devices for diagnostic
and therapeutic applications is heavily dependent on findability and reusability
of accurate research data.
    One of the fastest growing areas for medical device development in Europe is
electromagnetic (EM) imaging and therapeutics. Within the context of an aging
population and exponential growth in healthcare costs, EM-based techniques
provide a very attractive solution for new therapeutics and diagnostic technolo-
gies, since they are low cost, non-ionising and largely non-invasive. Many Eu-
ropean companies have attempted to commercialize their technology. However,
over 75% of medical device companies go out of business within the first five
years [5]. Before a new medical device company is formed or a new product line
is considered, several factors should be carefully analyzed, such as the clinical
need, market size, the regulatory pathway, and, importantly, the technical risk.
While many of these factors can be easily quantified, the technical risk often
remains the most elusive. Preliminary clinical data or accurate experimental
data are often required to de-risk the technical challenge. However, gaps or un-
certainty in experimental datasets can mean that the technical risk cannot be
estimated sufficiently, and the proposed medical device is ultimately abandoned.
    In this work, we aim to bridge the experimental data gap in device develop-
ment by proposing an approach and implementing tools to improve findability
and reusability of dielectric measurement experiments. Well-maintained,
interoperable, and machine-actionable data and metadata are the main building
blocks towards this goal, and Semantic Web technologies are well suited to
providing them. The Life Sciences domain, as an early adopter of the Linked Data
approach, provides many examples of integrating data from multiple sources and
making them queryable on the web through the SPARQL query language [1, 7].
    Recently, the FAIR guiding principles have been proposed for scientific data
management and stewardship, and have had an impact on the scientific com-
munity at large. These principles, rather than prescribing a set of standards,
describe the qualities or behaviours required of data sources to achieve their
optimal discovery and reuse [10, 18]. The acronym FAIR stands for:
Findable: Enhancing the findability of a given dataset using persistent identi-
   fiers while maintaining additional metadata.
Accessible: Data and metadata should be retrievable through a standardized
   communication protocol, and the metadata should remain accessible even if
   the actual data no longer exists.
Interoperable: Applying controlled vocabularies and qualified references for
   metadata to be used for knowledge representation.
Reusable: By having a clear and accessible data license and using a well-defined
   set of accurate vocabulary, the data becomes easily re-usable.
Existing standards can be applied to fulfill the requirements of the FAIR guide-
lines in varying degrees. However, the need for developing further standards is
apparent. Our specific goal is to adapt some of these FAIR principles using Se-
mantic Web standards to improve discoverability and reusability of experimental
dielectric data sets to support EM device development.
    We follow an incremental approach to achieve optimal FAIRness of the data
sets. This paper reports our first iteration, which applies a set of Semantic Web
technologies and achieves some degree of FAIRness. The current implementation
neither covers all FAIR principles nor yet demonstrates the full benefits of
existing standards. However, this work provides a starting point and a practical
example of applying Semantic Web technologies to FAIR data sharing, as well as
a view on future directions.
    The rest of the paper is structured as follows: Section 2 gives an overview
of related works. In section 3, we discuss the modeling of Dielectric Measure-
ments of Biological Tissues (DMBT) experimental data. Section 4 discusses our
proposed pipeline for sharing DMBT data and metadata, domain specific vo-
cabulary development, and our RDFization technique. In section 5, we describe
access mechanisms to data and metadata. Finally, we provide a future outlook
and conclude the paper.


2   Related Work

Although the knowledge discovery approach was established more than 400 years
ago, knowledge is still disseminated in much the same way as it was in the
16th century [3,17]. As published articles are isolated from each other, it is
up to the reader to link their information together. This makes information
retrieval more difficult than it needs to be, and it is rarely automated.
     One of the first attempts to tackle the problem was the 5-star Open Data
scheme by Tim Berners-Lee5 . This approach provides five guidelines for data:
the data must be available on the web under an open license, must be structured
in a well-defined format, must be accessible in a non-proprietary format such as
CSV, must use URIs for denotation, and must be linked to other data to add
contextual information. Although some organizations, for example
Thomson-Reuters [13] or Springer-Nature6 , use Semantic Web approaches to
construct knowledge graphs and automate their knowledge retrieval processes, in
many other cases the data still lacks findability, making it hard to access and
reuse. It is therefore hard to link different datasets, which lowers
interoperability. As a complementary approach, the FAIR Data Principles were
proposed by the FORCE11 group7 in 2015 [18] to mitigate these issues as well as
to streamline the workflow of scientific data.

5
  http://5stardata.info/en/
6
  https://www.springernature.com/cn/researchers/scigraph
7
  https://www.force11.org/group/fairgroup/fairprinciples

3     Modeling Dielectric Measurements Experimental Data

The dielectric properties, namely, the relative permittivity (εr) and conductivity
(σ), of biological tissues quantify the interaction of electromagnetic fields with
the human body. Together, these properties characterize how EM waves are
reflected at, absorbed by, and transmitted through the body. Knowledge of the
dielectric properties of various tissues is vital to the field of dosimetry (safety
studies, such as for wireless communication devices), and for the implementation
of EM-based medical technologies, such as microwave ablation and imaging.
     Dielectric measurements are typically performed using an open-ended coax-
ial probe connected to one port of a vector network analyzer (VNA) through a
specialized cable [9]. First, the dielectric probe is calibrated through measure-
ments on materials of known dielectric properties. This enables compensation
for systematic measurement errors. Then, the calibration is validated by
measuring a further material of known dielectric properties and calculating the
accuracy of that measurement. Finally, the biological tissue is measured by
bringing the probe into direct contact with a tissue sample and recording the
dielectric data. After that, the acquired dielectric data may
be associated with the material composition within the probe sensing volume.
Numerical models may also be fitted to the dielectric data, in order to present
the results in closed form.
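    As an illustration of such a closed-form model, the sketch below evaluates a
multi-pole Cole-Cole expression, a form widely used in the dielectric literature.
The text above does not prescribe a particular model, so the choice of
Cole-Cole, the function name cole_cole, and all parameter values are assumptions
made for this example; the numbers are not measured tissue data.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def cole_cole(freq_hz, eps_inf, poles, sigma_s=0.0):
    """Evaluate a multi-pole Cole-Cole model at one frequency.

    eps(w) = eps_inf + sum_n d_eps_n / (1 + (j*w*tau_n)**(1 - alpha_n))
             + sigma_s / (j*w*EPS0)

    poles is a list of (delta_eps, tau_seconds, alpha) tuples.
    Returns (relative permittivity, conductivity in S/m).
    """
    omega = 2.0 * math.pi * freq_hz
    eps = complex(eps_inf, 0.0)
    for delta_eps, tau, alpha in poles:
        eps += delta_eps / (1.0 + (1j * omega * tau) ** (1.0 - alpha))
    eps += sigma_s / (1j * omega * EPS0)
    eps_r = eps.real                   # real part: relative permittivity
    sigma = -eps.imag * omega * EPS0   # loss term expressed as conductivity
    return eps_r, sigma


# Illustrative (assumed) parameters: one pole with alpha = 0, i.e. a Debye term.
example_poles = [(50.0, 1.0 / (2.0 * math.pi * 1e9), 0.0)]
print(cole_cole(1e9, 4.0, example_poles, sigma_s=0.5))
```

    With alpha = 0 a pole reduces to a Debye term, which makes the result easy
to check by hand at the frequency where omega * tau = 1. The number of tuples in
poles corresponds to the "number of poles" that a report would need to state.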
     Although the process of conducting a dielectric measurement on a tissue
sample appears rather straightforward, there are a multitude of confounders that
can impact the measured data. These confounders are a likely source of inconsis-
tencies in reported data. Both equipment-based measurement confounders and
clinical confounders affect the accuracy of dielectric data. Uncertainties in the
dielectric data caused by these measurement confounders have been thoroughly
investigated over the years and can now be reduced or eliminated by following
good measurement practice. However, clinical confounders have been relatively
little investigated to date and may introduce a significant level of additional
uncertainty into the dielectric data.
     The reusability of dielectric measurement data and reproducibility of exper-
iments can be improved by capturing metadata that describes the confounders
and by making this metadata a part of the data sharing practice. Moreover,
having confounders’ metadata together with the metadata about the study it-
self can help data consumers to define their data requirements and will improve
the discoverability of datasets.
     Recently, as part of our earlier work [12], a set of reporting standards, namely
the Minimum Information Model for Dielectric Measurements of Biological Tissues
(MINDER), has been proposed 8 . The developed model follows the
Investigation-Study-Assay (ISA) framework and defines rich domain metadata to
describe the aforementioned clinical confounders, such as the tissue source,
physiological parameters, in-vivo versus ex-vivo measurement, time, temperature,
and sample dehydration, as well as confounders related to dielectric data
reporting, such as model type selection, number of poles, and fitting algorithms.
The developed reporting model is also compliant with the MIRIAM guidelines [8].
8
    https://www.bio-minder.com/
    In this work, we report on the implementation of a framework to semantically
express the metadata and data reported via the MINDER reporting schema and
templates. We also developed a platform enabling the discovery of and access to
data and metadata by both individuals and machines. To demonstrate the added
value of our proposed solution, we collected a set of competency questions that
are commonly asked by data consumers: (i) is there any data for pancreas
tissue? (ii) is there any data for porcine bladder tissue? (iii) are there kidney
measurements available at 24.3 °C? (iv) are there any measurements available on
biological tissues at 18 GHz? (v) is there any liver data taken between 20 °C and
25 °C over the frequency range of 1–2 GHz? (vi) is there any tissue-mimicking
phantom data? We show that our approach enables us to find answers to these
questions.
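    Question (v) combines a tissue type, a temperature interval, and a frequency
band. A minimal sketch of that selection logic over in-memory records follows.
The record fields and the sample values are hypothetical and are not taken from
the MINDER schema, and we assume that "over the frequency range of 1–2 GHz"
means that the measured band covers the whole requested band.

```python
from dataclasses import dataclass


@dataclass
class MeasurementRecord:
    # Hypothetical metadata fields, chosen only for this illustration.
    tissue: str
    temperature_c: float
    freq_min_ghz: float
    freq_max_ghz: float


def answers_question_v(record, tissue="liver",
                       temp_range=(20.0, 25.0), freq_range=(1.0, 2.0)):
    """Return True if the record matches competency question (v)."""
    in_temperature = temp_range[0] <= record.temperature_c <= temp_range[1]
    covers_band = (record.freq_min_ghz <= freq_range[0]
                   and record.freq_max_ghz >= freq_range[1])
    return record.tissue == tissue and in_temperature and covers_band


# Invented sample records, not real measurements.
records = [
    MeasurementRecord("liver", 22.0, 0.5, 20.0),   # matches
    MeasurementRecord("liver", 37.0, 0.5, 20.0),   # too warm
    MeasurementRecord("kidney", 24.3, 0.5, 20.0),  # wrong tissue
    MeasurementRecord("liver", 21.0, 1.5, 20.0),   # band starts above 1 GHz
]
matches = [r for r in records if answers_question_v(r)]
print(len(matches))
```

    In a triple store the same conditions would be expressed as filters in a
SPARQL query; the plain-Python form is used here only to make the comparison
logic explicit.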


4     Improving the Reusability of Experimental Data

Currently there is no standard way to find and access data on dielectric mea-
surements of biological tissues (DMBT) generated in labs. In most cases, the
metadata is only partially recorded, if at all, in lab books and the data is stored
separately on hard disks. Some metadata is reported unsystematically in publi-
cations, without any controlled vocabulary, resulting in the lack of both human
and machine discoverability.


4.1   Pipeline for sharing DMBT data and metadata

In this work, we have created a pipeline to transform the semi-structured, non-
standardized DMBT data and metadata resulting from experiments in individual
labs to a machine discoverable triple store, and developed a data portal to fulfill
data access requirements of end users. Our semantic pipeline implementation
follows a subset of the FAIR data principles, namely:

 – To make the scientific data findable, we assigned a unique and persistent
   identifier to each data and metadata object and uploaded the objects to a
   publicly accessible repository. For the identifiers we used dereferenceable
   URLs through the MIRIAM registry.
 – To make our data accessible, we first reviewed the data and metadata and
   decided on the required access mechanisms. Metadata is served via the
   machine-interoperable, well-defined SPARQL protocol, whereas data can be
   accessed and downloaded as CSV files with predefined headers. We did not
   implement any access control mechanism, since all data currently residing
   on the platform is freely available.
 – To make the scientific data interoperable, we applied Semantic Web technolo-
   gies. The metadata is transformed into the Resource Description Framework
   (RDF) format utilizing shared, domain specific vocabularies and ontologies.
    – To make the scientific data reusable, we provided rich and well defined
      metadata. Moreover, reusing existing vocabularies and ontologies makes the
      shared data linkable to other data sources.


Fig. 1. Proposed architecture used for making the metadata findable and accessible


    The basic workflow that results from this work is illustrated in fig. 1. At
first, the metadata files are parsed and transformed into Resource Description
Framework (RDF) according to the vocabulary. This process is further explained
in section 4.3. The resulting triples are then transferred into a Virtuoso triple
store9 that is accessible from the web. The web-application then enables the
user to browse this data in various ways, for example, through guided drilling-
down into the experiments or the computation of statistics. Further, the user
can query the data either directly or using pre-specified query templates. The
web application is written in PHP and served by an nginx web server. Direct
programmatic access to the data is also provided to enable machine access.
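    Machine access of this kind typically goes through the SPARQL Protocol, in
which a query can be sent as the query parameter of an HTTP GET request. The
sketch below only builds such a request and does not send it. The endpoint URL,
the ex: namespace, and the ex:tissue property are placeholders assumed for the
example; they are not the deployed service or the vocabulary developed here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only to show the shape of the request.
ENDPOINT = "https://example.org/sparql"

# Hypothetical vocabulary terms; the real property names may differ.
QUERY = """\
PREFIX ex: <https://example.org/minder#>
SELECT ?study WHERE {
  ?study ex:tissue "pancreas" .
}
"""


def build_sparql_request(endpoint, query):
    """Build (without sending) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

    Passing the returned object to urllib.request.urlopen would execute the
query against a real endpoint; that step is left out so that the example runs
without network access.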

4.2     Domain specific vocabulary development
Vocabularies define the terms (concepts and relationships) used to describe and
represent a domain. In this work we developed a vocabulary with terms used
to define rich experiment metadata. The data model is based on the MINDER
minimum information model, which followed the ISA framework classification.
In order to optimize interoperability, we reuse a variety of existing terms from es-
tablished ontologies and vocabularies. Suitable candidates were discovered using
the ontology browse and search tools Ontobee [11,20], Linked Open Vocabularies
(LOV) [15] and the BioPortal Ontology Recommender [16].
9
    https://virtuoso.openlinksw.com/rdf/

    We only reuse a term when it fully reflects the intended meaning in the
context of our application. When several suitable candidates were available, we
chose definitions from the more commonly used ontologies to simplify integration
with existing datasets. This reuse enables us to develop a consistent
representation of the domain, to build on existing models, and to align them
with other related datasets. Moreover, it increases interoperability with other
ontology-based applications.


4.3     RDFization of the experiment metadata

RDF is a W3C standard for the description and modeling of data in a struc-
tured way. This standard also provides an abstract and conceptual framework for
defining and using metadata and metadata vocabularies by applying statements
that consist of subjects, predicates and objects to an ontology. Consequently,
the data model becomes easily searchable, findable, and accessible.
    The originally generated metadata for our DMBT experiments was stored
in Excel format. To transform this data into RDF, we have developed a Java
application that parses given metadata files into a data structure, which is then
converted into RDF, adhering strictly to the developed vocabulary. The developed
RDFization tool makes use of the Apache Jena framework. Metadata is stored in an
online triple store hosted on Amazon Web Services. As described above, this
endpoint can be accessed directly or through our web application10 . The web
application is designed to facilitate the upload of new data by non-technical
users; however, it accepts input only in a specific template.
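    The core of such a transformation can be sketched in a few lines. The
example below is written in Python with the standard library only, not with the
Java and Apache Jena tool described above, and it assumes that each metadata row
has already been read into a dictionary. The namespaces, the column names, and
the use of plain string literals for every value are assumptions made for the
illustration.

```python
from urllib.parse import quote

# Hypothetical namespaces; the real vocabulary uses its own URIs.
BASE = "https://example.org/minder/"
VOCAB = "https://example.org/minder/vocab#"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_ntriples(row, id_column="study_id"):
    """Turn one metadata row (a dict) into a list of N-Triples lines."""
    subject = "<" + BASE + quote(str(row[id_column]), safe="") + ">"
    lines = []
    for column, value in row.items():
        if column == id_column or value in (None, ""):
            continue  # skip the identifier column and empty cells
        predicate = "<" + VOCAB + quote(column, safe="") + ">"
        obj = '"' + escape_literal(str(value)) + '"'
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


# Invented example row with hypothetical column names.
example_row = {"study_id": "S1", "tissue": "liver",
               "temperature_c": 24.3, "notes": ""}
for line in row_to_ntriples(example_row):
    print(line)
```

    A production tool would additionally map each column to a term of the
developed vocabulary and attach datatypes to numeric values, which an RDF
library such as Jena handles directly.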


4.4     Using persistent data and metadata identifiers

To allow reliable referencing of the entities and to enhance the interoperability
between different controlled vocabularies and databases, we utilize persistent
identifiers (PIDs). PIDs enable the combination of data concerning the same
entity from multiple data sources via identifier matching. Thus they enhance the
interoperability so that more than one entity can be combined across individual
documents and data sources.
    The identifiers.org system directly employs dereferenceable URLs through
the MIRIAM registry11 which provides persistent identification for life science
data [6]. Using the MIRIAM Registry service allows for data to be referenced in
both a location-independent and a resource-dependent manner, which aids direct
resolution of the identifiers via the HTTP protocol [19]. The service’s PIDs ensure
global uniqueness, perennity, standard compliance, and resolvability while being
free of charge to use. The registry is further queryable and features an automated
link monitoring system which checks the registered resources on a daily basis for
reliability.
10
     https://datalab.rwth-aachen.de/MINDER/
11
     https://www.ebi.ac.uk/miriam/

    Overall this system provides a good basis for the persistent identification of
data and metadata in our usage scenario and is thus employed as PID provider
for all datasets in the Bio-MINDER tissue database.
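    The two ideas in this section, resolvable identifiers and combination of
records by identifier matching, can be sketched as follows. The URL pattern is
the compact prefix:accession form resolved by identifiers.org, shown here with
the public taxonomy prefix; the helper names and the two small sources are
invented for the example, and the prefix under which the tissue datasets are
registered is not assumed.

```python
def to_identifiers_org_url(prefix, accession):
    """Compose a resolvable identifiers.org URL from a prefix and an accession."""
    return f"https://identifiers.org/{prefix}:{accession}"


def merge_by_identifier(*sources):
    """Combine attribute dicts from several sources that share the same PID."""
    merged = {}
    for source in sources:
        for pid, attributes in source.items():
            merged.setdefault(pid, {}).update(attributes)
    return merged


# NCBI Taxonomy entry 9606 (Homo sapiens), used purely as a well-known example.
human = to_identifiers_org_url("taxonomy", "9606")

# Two invented sources that describe the same entity under the same PID.
source_a = {human: {"label": "Homo sapiens"}}
source_b = {human: {"rank": "species"}}
print(merge_by_identifier(source_a, source_b))
```

    Because both sources use the same persistent identifier as key, their
statements about the entity end up in a single merged record without any
string matching on names.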


5   Access to data and metadata through web services

We evaluated the effectiveness of our proposed approach through our web
application. The Semantic Web technologies used in the web application encode
the entities by assigning resolvable identifiers. The vocabulary used also
defines the purpose or interpretation of terms. Additionally, a high-level
description of the ontology is available from the web interface so that users
can reuse it. These functionalities help make the data accessible and enhance
its reusability.


       Fig. 2. Metadata for Measurements of Freshly Excised Porcine Tissues


    We provide access to the data for both human and machine consumption, through
the web application and the SPARQL query interface respectively. As part of the
web application, there is also a tab in which all of the above-mentioned
competency questions are answered by SPARQL queries. These can be executed and
the results browsed. Further, when a user selects a specific investigation ID, both
the metadata and the description can be accessed. An example is shown in fig. 2.
The concrete experimental data can then be downloaded when the user agrees
to the specified terms and conditions.


6   Conclusion and Outlook

In this paper, we demonstrated a use case of Semantic Web and a subset of
the FAIR principles to improve the repeatability of dielectric measurements of
biological tissue and the reusability of the produced data. We showed how to
adopt the MINDER specification and developed a domain specific, controlled
vocabulary which makes the data more findable and reusable, setting first steps
towards the FAIR principles. We utilized persistent identifiers (PIDs) for both
the data and metadata from experiments on the dielectric measurements of bio-
logical tissue. This allows the referencing of data in both a location-independent
and resource-dependent manner. The provision of resolvable identifiers (URLs)
fits well with the Semantic Web vision, and the Linked Data initiative. More-
over, we have reused existing ontologies for terms from our controlled vocabulary
where appropriate. Finally, we made the overall system available through a web
interface which also answers the competency questions that were gathered from
domain experts.
    In this first iteration, our approach still has several limitations. First, we
currently have only met some aspects of the FAIR principles. In future work,
we could look at whether it is reasonable and feasible to also address other
aspects. For example, we did not deal with licensing properly. We have included
a simple license for the provided files, but it is not in a machine-interpretable
form. One reason we have refrained from defining one is that there is currently
no consensus on a good way to do so. This issue is, for
example also discussed in working groups on the DCAT standard12 , and several
application profiles have chosen different ways to address this. Hence, it would
be better to wait until a consensus is reached there before reinventing the wheel
ourselves. We have currently also not specified any data access restrictions. For
the datasets currently provided, there is no such need, but for data commercially
provided, this has to be amended.
    Currently, we also chose to reuse existing vocabulary terms only when there
was an exact match. We could consider additionally adding redundant vocabulary
terms that are broader than the ones we applied, to increase potential
reusability. A related issue is that we did not make use of, for example, the
DCAT vocabulary for specifying our data catalog. The reason is that we focused
mainly on the domain vocabulary and not on the higher level from the start. This
can be amended in future work.
    This work already shows useful progress. In later work, we would like to
extend this approach beyond biological experiments to other domains such as
agriculture, finance, and marketing.


References

 1. Beyan, O.D., et al.: Querying phenotype-genotype associations across multiple
    knowledge bases using semantic web technologies. In: Bioinformatics and Bioengi-
    neering (BIBE), IEEE 13th International Conference on. pp. 1–5. IEEE (2013)
 2. Borgman, C.L.: The conundrum of sharing research data. Journal of the Associa-
    tion for Information Science and Technology 63(6), 1059–1078 (2012)

12
     https://www.w3.org/TR/vocab-dcat/

 3. Decker, S.: Rethinking access to scientific knowledge: Knowledge graphs. LinkedIn
    Pulse (2017), https://www.linkedin.com/pulse/rethinking-scientific-
    knowledge-graphs-stefan-decker/
 4. Fienberg, S.E., Martin, M.E., Straf, M.L.: Sharing research data. National
    Academy Press (1985)
 5. Gage, D.: The Venture Capital Secret: 3 Out of 4 Start-Ups Fail. The Wall
    Street Journal (2012), http://www.wsj.com/articles/
    SB10000872396390443720204578004980476429190
 6. Juty, N., Le Novère, N., Laibe, C.: Identifiers.org and MIRIAM registry: community
    resources to provide persistent identification. Nucleic acids research 40(D1),
    D580–D586 (2011)
 7. Kazemzadeh, L., Kamdar, M.R., et al.: LinkedPPI: enabling intuitive, integrative
    protein-protein interaction discovery. In: Proceedings of the 4th International
    Conference on Linked Science-Volume 1282. pp. 48–59. CEUR-WS.org (2014)
 8. Le Novère, N., Finney, A., Hucka, M., Bhalla, U.S., Campagne, F., Collado-Vides,
    J., et al.: Minimum information requested in the annotation of biochemical models
    (MIRIAM). Nature biotechnology 23(12), 1509 (2005)
 9. Meaney, P.M., Gregory, A.P., Seppälä, J., Lahtinen, T.: Open-ended coaxial di-
    electric probe effective penetration depth determination. IEEE transactions on
    microwave theory and techniques 64(3), 915–923 (2016)
10. Mons, B., Neylon, C., Velterop, J., Dumontier, M., et al.: Cloudy, increasingly
    FAIR; revisiting the FAIR data guiding principles for the european open science
    cloud. Information Services & Use (Preprint), 1–8 (2017)
11. Ong, E., Xiang, Z., Zhao, B., Liu, Y., Lin, Y., Zheng, J., et al.: Ontobee: A linked
    ontology data server to support ontology term dereferencing, linkage, query and
    integration. Nucleic acids research 45(D1), D347–D352 (2016)
12. Porter, E., La Gioia, A., Salahuddin, S., Decker, S., Shahzad, A., et al.: Minimum
    information for dielectric measurements of biological tissues (MINDER): A frame-
    work for repeatable and reusable data. International Journal of RF and Microwave
    Computer-Aided Engineering pp. e21201–n/a, e21201
13. Song, D., Schilder, F., Hertz, S., Saltini, G., et al.: Building and querying an
    enterprise knowledge graph. IEEE Transactions on Services Computing (2017)
14. Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A.U., Wu, L., Read, E., Manoff,
    M., Frame, M.: Data sharing by scientists: practices and perceptions. PloS one
    6(6), e21101 (2011)
15. Vandenbussche, P.Y., Atemezing, G.A., Poveda-Villalón, M., Vatant, B.: Linked
    open vocabularies (LOV): a gateway to reusable semantic vocabularies on the web.
    Semantic Web 8(3), 437–452 (2017)
16. Whetzel, P.L., Noy, N.F., et al.: BioPortal: enhanced functionality via new web ser-
    vices from the national center for biomedical ontology to access and use ontologies
    in software applications. Nucleic acids research 39(suppl 2), W541–W545 (2011)
17. Whewell, W.: History of inductive sciences (1858)
18. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M.,
    Baak, A., et al.: The FAIR guiding principles for scientific data management and
    stewardship. Scientific data 3, 160018 (2016)
19. Wimalaratne, S.M., Bolleman, J., Juty, N., Katayama, T., Dumontier, M.,
    Redaschi, N., Le Novère, N., et al.: SPARQL-enabled identifier conversion
    with identifiers.org. Bioinformatics 31(11), 1875–1877 (2015)
20. Xiang, Z., Mungall, C., Ruttenberg, A., He, Y.: Ontobee: A linked data server and
    browser for ontology terms. In: ICBO (2011)

</pre>