<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Wikidata as an intuitive resource towards semantic data modeling in data FAIRification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Annik</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>gory S. Stupp</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lynn M. S</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>hriml</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>rk Thompson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>w I. Su</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>o Roos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Human Genetics, Leiden University Medical Center</institution>
          ,
          <addr-line>Leiden</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Micelio</institution>
          ,
          <addr-line>Antwerp - Ekeren</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The Scripps Research Institute</institution>
          ,
          <addr-line>San Diego CA 92037</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Maryland School of Medicine</institution>
          ,
          <addr-line>Baltimore, MD</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data with a comprehensible structure and context is easier to reuse and integrate with other data. The guidelines for FAIR (Findable, Accessible, Interoperable, Reusable) data for humans and computers provide handles to transform data existing in silos into well-connected knowledge graphs (linked data). Semantic data models are key in this transformation and describe the logical structure of the data and the relationships between the data entities. This description is provided through IRIs (Internationalized Resource Identifiers) which link to existing ontologies and controlled vocabularies. Creating a semantic data model is a labour-intensive process, which requires a solid understanding of the selected domains and the applicable ontologies. Moreover, in order to achieve a useful degree of interoperability between datasets, either the datasets need to use the same (set of) ontologies, or the ontologies themselves need to be aligned and mapped. The former requires implementation of extensive (social) processes to achieve consensus, while the latter requires relatively advanced semantic engineering. We argue that this poses a significant obstacle for (otherwise capable) novice data modelers and even experienced data stewards. Here, we propose that Wikidata can be used as an intuitive resource for resolvable IRIs both for teaching and studying semantic data modeling. In this way Wikidata serves as a hub in the linked data cloud connecting different but similar ontologies. We elaborate on current problems and how Wikidata can be used to tackle these. As an example we describe two genetic variant models, one generated in a workshop and one generated using Wikidata. This shows how Wikidata can be instrumental in mapping similar concepts in different ontologies in a way that can benefit FAIR data stewardship processes in education and research.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic modeling</kwd>
        <kwd>FAIR mapping</kwd>
        <kwd>Wikidata</kwd>
        <kwd>ontology</kwd>
        <kwd>concept</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Opportunistic semantic data modeling</title>
      <p>
        Data integration between heterogeneous data sets can be enabled by making
them machine-readable, which formally captures their structure and context.
One common way of generating such linked data is by using the combination of
Resource Description Framework (RDF) triples and Internationalized Resource
Identifiers (IRIs), the latter providing semantics to the data. In addition it is
crucial that the context is made explicit through IRIs. This orchestration between
IRIs can be captured in a semantic data model. Generating linked, interoperable
data by using semantic modeling is central to making data FAIR (Findable,
Accessible, Interoperable, and Reusable) for humans and computers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However,
creating a semantic data model is a laborious process. This process, which
requires expertise both in the field under scrutiny and in ontologies and controlled
vocabularies, is coordinated by a data steward. Further, given the relative
novelty of the field, the availability of data stewards does not scale to the demand
in the life sciences field.
      </p>
      <p>
        To disseminate expertise in FAIRification, so-called Bring Your Own Data
(BYOD) workshops have been conducted for the last five years, where FAIR
and domain experts work together to FAIRify heterogeneous data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A large
part of this process consists of creating the underlying semantic data model. An
example of such a model, generated in a BYOD held 6-8 June 2017 in Utrecht,
The Netherlands [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], can be seen in Figure S1. This model describes data
that reflect measurements in samples from whole genome sequencing
experiments, available in a variety of non-linked data formats. These models tend to
be rather opportunistic in their onset, because the BYOD participants typically
have diverse backgrounds. The different ontologies and controlled vocabularies
are often cherry-picked based on the respective preferences of the participants
and experts. Different resources exist to semantically express the same data even
within the same domain: for example, OBO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Bioschemas [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and SIO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] serve
partially overlapping and partially distinct areas of semantics in the life sciences.
Therefore, initial BYOD models often use multiple namespaces, because it is
difficult to dictate a clear guideline for selecting one single source of semantics. This is
demonstrated by the large number of results per term in several state-of-the-art
ontology search tools (Table S1). Even if the model were harmonized on a small
set of ontologies and controlled vocabularies, the numbers in Table
S1 suggest that different data modeling groups would still end up using different
harmonized sets. This raises the question of how interoperable and reusable the
resulting linked, FAIR data really is. For linked data that uses distinct sets of
ontologies and vocabularies to be interoperable, it is essential to have mappings
between their vocabulary terms and ontological concepts; otherwise the resulting
linked data effectively remains a data silo.
      </p>
      <p>
        In this paper, we propose that Wikidata [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] may be used as a source for IRIs
and serve as a potential hub linking different opportunistic semantic data models
both for education and research. Wikidata is a linked database to which
both humans and machines contribute. We first describe an opportunistic semantic data
model of genetic variants generated in a BYOD and then show how Wikidata can be
used for data model construction and ontology mappings.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Semantic Data modeling with Wikidata IRIs</title>
      <p>
        Wikidata is a linked database and a sister Wikimedia project of Wikipedia [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
What Wikipedia is to text, Wikidata is to data: anyone, both humans and
machines, can contribute to Wikidata as long as the primary source of the
contribution is available under a public license. Wikidata has an RDF representation
using Wikidata namespaces, which enables Wikidata concepts (items) to be
embedded into RDF knowledge graphs. Creating Wikidata items and statement
values is open to all. Properties, on the other hand, which link items in the Wikidata
namespace (e.g. molecular function in Figure S2), are predefined, and new
properties need to go through a proposal process before they can be created.
Although Wikidata is not limited to a constrained set of domains, there are
various active initiatives in the biomedical domain to synchronize Wikidata with
knowledge from authoritative biomedical resources [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], such as the Disease
Ontology [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or Gene Ontology [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where respective items have mappings to
the original ontologies. Instead of having to sort through a wide variety of
suggestions provided by the different ontology search tools, Wikidata can act as a
single entry point for IRIs to create a semantic data model for data
FAIRification. Wikidata provides three different ways that make it a viable source for IRIs
to be used in the initial steps of transforming unstructured research data into FAIR
data. Firstly, the various (language) labels and descriptions associated with an
item may be modified or extended by Wikidata users to enrich the definition
of an item. The direct link to related Wikipedia articles helps to disambiguate
items so as not to (unintentionally) change the intended semantics of an item.
Secondly, one could use the Wikidata items with mappings to existing
ontologies. Finally, one could choose to mint a new Wikidata item to reflect a specific
concept that does not exist yet as a Wikidata item.
      </p>
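      <p>As a minimal sketch of what embedding Wikidata items into an RDF graph can look like, the Python fragment below serializes one statement about a local data record as an N-Triples line. The namespace IRIs are those published by Wikidata and P31 is its "instance of" property; the record IRI and the item identifier are hypothetical placeholders introduced only for illustration, and a real Q-number would have to be looked up on Wikidata.</p>
      <preformat><![CDATA[
```python
# Wikidata's published RDF namespaces.
WD = "http://www.wikidata.org/entity/"        # items (Q...)
WDT = "http://www.wikidata.org/prop/direct/"  # direct-claim properties (P...)

# Hypothetical IRI of one record in a local dataset (not a real resource).
RECORD = "http://example.org/dataset/variant/1"

# Placeholder item identifier: look up the real Q-number on Wikidata before use.
VARIANT_ITEM = WD + "Q_PLACEHOLDER_VARIANT"


def ntriple(subject, predicate, obj):
    """Serialize one triple of three IRIs in N-Triples syntax."""
    return "<{}> <{}> <{}> .".format(subject, predicate, obj)


# P31 is Wikidata's "instance of" property.
triples = [ntriple(RECORD, WDT + "P31", VARIANT_ITEM)]

for line in triples:
    print(line)
```
]]></preformat>
      <p>Because subject, predicate and object are all plain IRIs, the same pattern may be reused for any other Wikidata property once its identifier is known.</p>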
      <p>Using Wikidata properties and items we expressed a semantic data model
for genetic variants (see Figure 1). This is the same use case as illustrated in
Figure S1, but here only Wikidata was used as a controlled vocabulary, and we
added the identified IRIs from the previously used ontologies as mappings to this
Wikidata model, thus increasing the potential for semantic interoperability. Creating
mappings was easily achieved using the Wikidata community edit interface by
attaching a Wikidata exact match property to the relevant items to connect them
to external ontology IRIs. If one group's opportunistic model used, for example,
an OBO ontology term where a second group used an SIO term, then
data integration may be challenging. However, since both of those terms can be
reconciled through Wikidata using the available mapping properties, the
introduction of Wikidata facilitates automated or semi-automated harmonization of
independently authored opportunistic models.</p>
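      <p>The reconciliation step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used for Figure 1: the mapping table stands in for exact match statements exported from Wikidata, and every item, ontology term and record IRI in it is a hypothetical placeholder.</p>
      <preformat><![CDATA[
```python
WD = "http://www.wikidata.org/entity/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical export of exact match statements:
# Wikidata item IRI -> external ontology IRIs (all placeholders, not real terms).
EXACT_MATCH = {
    WD + "Q_PLACEHOLDER_VARIANT": {
        "http://purl.obolibrary.org/obo/EXAMPLE_0000001",
        "http://semanticscience.org/resource/SIO_EXAMPLE_1",
    },
}


def build_lookup(exact_match):
    """Invert item -> external IRIs into external IRI -> item."""
    lookup = {}
    for item, external_iris in exact_match.items():
        for iri in external_iris:
            lookup[iri] = item
    return lookup


def harmonize(triples, lookup):
    """Replace every mapped IRI in (s, p, o) triples by its Wikidata item."""
    return {tuple(lookup.get(term, term) for term in triple) for triple in triples}


# Two independently authored models typing their records with different ontologies.
group_a = {("http://example.org/a/record1", RDF_TYPE,
            "http://purl.obolibrary.org/obo/EXAMPLE_0000001")}
group_b = {("http://example.org/b/record7", RDF_TYPE,
            "http://semanticscience.org/resource/SIO_EXAMPLE_1")}

lookup = build_lookup(EXACT_MATCH)
merged = harmonize(group_a, lookup) | harmonize(group_b, lookup)

# After harmonization both records share a single type IRI.
types = {o for (_, p, o) in merged if p == RDF_TYPE}
```
]]></preformat>
      <p>Terms without a mapping are left unchanged, so the sketch degrades gracefully when only part of a model has been mapped to Wikidata.</p>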
    </sec>
    <sec id="sec-3">
      <title>Conclusion and future work</title>
      <p>In this paper, we put forward the position that Wikidata is a well-suited
controlled vocabulary for data FAIRification in the life sciences, especially in the initial
stages, where it is either too early to adhere to specific domain descriptions,
or specific data or project constraints and requirements are yet to emerge. As
noted above, this leads to opportunistic data models where parts of different ontologies
are cherry-picked. By using Wikidata namespaces, the choice of ontologies can be
postponed or (partially) obviated. Once applicable ontologies become apparent, one
could update Wikidata with mappings, thus linking the Wikidata namespace
to external ones. For many data managers and data stewards, linked data
approaches and technologies have a relatively steep learning curve, and dealing with
the wide variety of available ontologies is a major contributing factor. We argue
that Wikidata is a good starting point to create initial models while
maintaining the flexibility to evolve those models in the future. Additionally, it is easy
to add new concepts and to create mappings using Wikidata so that resolvable IRIs
are instantly available for representing data as linked data. In comparison,
extending other commonly used ontologies requires engaging with their curators
or maintainers, which is not always possible or easy.</p>
      <p>In conclusion, we argue that Wikidata is a viable source for IRIs in the process
of making data FAIR. Wikidata is open to everyone to add terms, properties and
mappings to external ontologies. This, together with the fact that every Wikidata
item and every Wikidata property has a resolvable IRI, makes data expressed with
Wikidata items and properties interoperable. Wikidata is thus a useful resource in any
FAIRification process. As part of future work, we would like to investigate how
different semantic data models representing the same data compare to each other,
what their respective limitations are and how Wikidata can be used to map from
one to the other in a more extensive interoperability use case. We plan to do
such a modelling exercise in the foreseeable future and welcome collaborations
in doing so.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Wilkinson</surname>
            <given-names>MD</given-names>
          </string-name>
          , et al.
          <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title>
          .
          <source>Sci Data</source>
          .
          <year>2016</year>
          ,
          <volume>3</volume>
          :
          <fpage>160018</fpage>
          . doi:10.1038/sdata.2016.18.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Bring Your Own Data Workshops, http://www.dtls.nl/fair-data/byod/, Last accessed 1 Oct
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. UBEC FAIR datapoint, http://www.ubec.nl/data/fair-data-point/, Last accessed 1 Oct
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Bring Your Own Data Workshop - OncoXL, https://www.dtls.nl/wp-content/uploads/2017/09/BYOD-OncoXL-June-2017-report.pdf, Last accessed 1 Oct
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Smith</surname>
            <given-names>B</given-names>
          </string-name>
          , et al.
          <article-title>The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration</article-title>
          .
          <source>Nat Biotechnology</source>
          .
          <year>2007</year>
          ,
          <volume>25</volume>
          ,
          <fpage>1251</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1038/nbt1346
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Franck</given-names>
            <surname>Michel</surname>
          </string-name>
          and The Bioschemas Community.
          <article-title>Bioschemas and Schema.org: a Lightweight Semantic Layer for Life Sciences Websites</article-title>
          .
          <source>Biodiversity Information Science and Standards</source>
          .
          <year>2018</year>
          ,
          <volume>2</volume>
          , e25836. doi:10.3897/biss.2.25836
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dumontier</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          <article-title>The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery</article-title>
          .
          <source>J Biomed Semantics</source>
          .
          <year>2014</year>
          ,
          <volume>5</volume>
          , 14. doi:10.1186/2041-1480-5-14.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vrandecic</surname>
            <given-names>D.</given-names>
          </string-name>
          and Krotzsch M.
          <article-title>Wikidata: A Free Collaborative Knowledgebase</article-title>
          .
          <source>Commun ACM</source>
          .
          <year>2014</year>
          ,
          <volume>57</volume>
          ,
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Burgstaller-Muehlbacher</surname>
            <given-names>S</given-names>
          </string-name>
          , et al.
          <article-title>Wikidata as a semantic framework for the Gene Wiki initiative</article-title>
          .
          <source>Database (Oxford)</source>
          .
          <year>2016</year>
          , pii: baw015. doi:10.1093/database/baw015.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mitraka</surname>
            <given-names>E</given-names>
          </string-name>
          , et al.
          <article-title>Wikidata: A platform for data integration and dissemination for the life sciences and beyond</article-title>
          . bioRxiv.
          <year>2015</year>
          . doi:10.1101/031971.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kibbe</surname>
            <given-names>WA</given-names>
          </string-name>
          , et al.
          <article-title>Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <year>2015</year>
          ,
          <volume>43</volume>
          (Database issue),
          <fpage>D1071</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1093/nar/gku1011.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Gene Ontology Consortium.
          <article-title>Gene Ontology Consortium: going forward</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <year>2015</year>
          ,
          <volume>43</volume>
          (Database issue),
          <fpage>D1049</fpage>
          -
          <lpage>56</lpage>
          . doi:10.1093/nar/gku1179.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>