<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bioschemas:</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alasdair J G Gray</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carole Goble</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael C Jimenez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>The Bioschemas Community</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ELIXIR-Hub, Hinxton Genome Campus</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heriot-Watt University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The life sciences have a wealth of data resources with a wide range of overlapping content. Key repositories, such as UniProt for protein data or Entrez Gene for gene data, are well known and their content is easily discovered through search engines. However, there is a long tail of bespoke datasets with important content that are not so prominent in search results. Building on the success of Schema.org for making a wide range of structured web content more discoverable and interpretable, e.g. food recipes, the Bioschemas community (http://bioschemas.org) aim to make life sciences datasets more findable by encouraging data providers to embed Schema.org markup in their resources.</p>
      </abstract>
      <kwd-group>
        <kwd>Schema.org</kwd>
        <kwd>metadata</kwd>
        <kwd>dataset descriptions</kwd>
        <kwd>data discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Schema.org provides a way to add semantic markup to web pages to enable those
web pages to become more interpretable by the search engines that index them,
and therefore to improve search results [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Schema.org markup describes types
of information, which then have properties. For example, Recipe is a type for
representing cooking recipes that has properties like cookTime, nutrition, and
recipeIngredient for marking up the characteristics of the recipe. Schema.org
markup is increasingly being applied to web pages as it boosts a site's ranking
in search results [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Schema.org markup in web pages also enhances the search
experience for end users, enabling them to make more informed decisions when
choosing between search results. For example, when searching for a recipe
for potato salad, the result snippets contain information such as cooking time
and the number of calories (see Fig. 1). These have been extracted from the
Schema.org markup of the underlying web pages and enable the user to make a
decision without reading the whole web page. Another example of a service
built over Schema.org markup is the knowledge graphs of
the search engines (also shown in Fig. 1).
      </p>
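      <p>As a rough illustration of the Recipe example above, the following Python sketch builds a description using the Recipe type and the cookTime, nutrition, and recipeIngredient properties named in the text, and serialises it as JSON-LD. The recipe name, ingredients, and values are invented for illustration, and JSON-LD is assumed here as only one of the syntaxes a page might use to carry Schema.org markup.</p>
      <preformat preformat-type="code">
```python
import json

# Illustrative values only: the recipe details below are invented.
recipe_markup = {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "name": "Potato salad",
    "cookTime": "PT20M",  # ISO 8601 duration: 20 minutes
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "250 calories",
    },
    "recipeIngredient": ["potatoes", "mayonnaise", "spring onions"],
}


def to_json_ld(markup):
    """Serialise the markup as the JSON-LD text a page could embed."""
    return json.dumps(markup, indent=2)


print(to_json_ld(recipe_markup))
```
      </preformat>
      <p>A search engine that reads such markup can pick out the cooking time and calorie count directly, without having to interpret the surrounding page text.</p>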
      <p>
        The life sciences community have a wealth of data resources with a wide range
of overlapping content. When gathering data about a particular gene or protein,
scientists want the data to be aggregated from all available sources. Currently
data from key repositories, such as UniProt (SIB/EBI) for information about
proteins [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or Gene (NIH) for information about genes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], are well known and
easily gathered. However, there is a long tail of bespoke datasets whose important
content is not so readily available. The Bioschemas community
(http://bioschemas.org, accessed Sept 2017) aim to make life sciences datasets
more findable by encouraging data providers to embed Schema.org markup in their
resources, thus enabling aggregation of content through a common approach that
will enable novel applications. Previous work has added biomedical terms
(https://health-lifesci.schema.org/, accessed Sept 2017), but these are not
sufficient for the breadth of the life sciences community. The Bioschemas
community are working in conjunction with the wider Schema.org community.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Bioschemas</title>
      <p>Within the life sciences there is demand to discover more than just generic types
like Dataset and Event. The Bioschemas community have identified a wide
range of discovery use cases that search for different types of biological resources.</p>
      <p>These include searching for data about a specific biological entity such as a
particular gene or protein, discovering a data repository in which to deposit
experimental results, and identifying the storage location of specific biological
samples. Links to documents containing these use cases can be found on the
Bioschemas website (http://bioschemas.org/groups/, accessed Sept 2017).
Currently, biological types like genes, proteins, and samples are not represented
in Schema.org. Bioschemas aims to engage with life science communities,
relying on existing community agreements, to bring forward new biological types to
Schema.org.</p>
      <p>
        For any given entity type in Schema.org there are a large number of
properties available, many inherited from parent types. For example, Dataset has two
properties of its own (distribution and includedInDataCatalog) but inherits 78
properties from CreativeWork and 11 from Thing. This is far more properties than
resource providers can realistically be expected to supply, cf. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Additionally, this
wide range of choice makes it difficult to develop tools to consume markup. To
support tools developed to exploit Schema.org markup in web resources, it is
beneficial if the markup is done in a consistent way, i.e. all resources describing
a particular type of entity provide the same set of properties.
      </p>
      <p>The Bioschemas specifications are being developed in an example-driven
manner in a short timeframe: the ELIXIR Implementation Study runs for just one
calendar year (2017). The Bioschemas specifications go beyond simply extending
Schema.org with new types and properties for biological entities. As shown in
Fig. 2, the Bioschemas specifications layer provides additional constraints over
the Schema.org model. These constraints capture (i) the minimal information
properties agreed by the community, each marked as mandatory (M), recommended
(R), or optional (O), (ii) the cardinality of each property, i.e. whether it is
expected to occur once or many times, and (iii) associated controlled vocabulary
terms drawn from existing ontologies. Following from the experience of the wider
Schema.org community [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the Bioschemas specifications aim to require just six
properties for any resource type. These properties are being selected based on
their ability to support indexing and snippet generation, so that a consumer
of the search results can discover and distinguish between resources.
      </p>
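      <p>The first two kinds of constraint described above, marginality and cardinality, can be sketched as a small profile check in Python. The profile below is hypothetical: the property names are real Schema.org Dataset properties, but the marginality and cardinality assigned to each are assumptions made for illustration and are not taken from a Bioschemas specification. Controlled vocabulary checks are left out to keep the sketch short.</p>
      <preformat preformat-type="code">
```python
# Hypothetical profile: the M/R/O marginality and ONE/MANY cardinality
# values are assumptions for illustration, not an agreed specification.
DATASET_PROFILE = {
    "name": {"marginality": "M", "cardinality": "ONE"},
    "description": {"marginality": "M", "cardinality": "ONE"},
    "url": {"marginality": "M", "cardinality": "ONE"},
    "keywords": {"marginality": "R", "cardinality": "MANY"},
    "distribution": {"marginality": "O", "cardinality": "MANY"},
}


def check_markup(markup, profile):
    """Return a list of human-readable problems found in the markup."""
    problems = []
    for prop, rule in profile.items():
        value = markup.get(prop)
        if value is None:
            # Optional properties may be absent without comment.
            if rule["marginality"] == "M":
                problems.append("missing mandatory property: " + prop)
            elif rule["marginality"] == "R":
                problems.append("missing recommended property: " + prop)
            continue
        # A list with more than one item means the property occurs many times.
        many = isinstance(value, list) and len(value) > 1
        if rule["cardinality"] == "ONE" and many:
            problems.append("property should occur once: " + prop)
    return problems


# Invented example markup with one missing and one repeated property.
example = {
    "@type": "Dataset",
    "name": "Example protein annotations",
    "url": ["http://example.org/a", "http://example.org/b"],
}
for problem in check_markup(example, DATASET_PROFILE):
    print(problem)
```
      </preformat>
      <p>Keeping the profile as data separate from the checking function means that the same check could be reused for other resource types by supplying a different profile.</p>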
    </sec>
    <sec id="sec-3">
      <title>Future Work</title>
      <p>The Bioschemas Implementation Study is currently in the development/testing
phase of its lifecycle. To ensure the viability of the specifications from the
resource providers' perspective, example deployments are being developed. At the
same time, tools for consuming and exploiting the markup are also being
developed. The outcome of both these development processes will feed into the final
revisions of the specifications and proposed extensions to the core Schema.org
vocabulary.</p>
      <p>While the Bioschemas community has a primary focus on life sciences data,
prominent members of the community are involved with the European Open
Science Cloud project (https://eoscpilot.eu/, accessed Sept 2017), with the aim of
adopting the Bioschemas approach of defining community-agreed Schema.org markup
profiles in other scientific disciplines.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>The work presented here represents the contributions of the whole Bioschemas
Community (http://bioschemas.org/people/). The current work is funded
through an ELIXIR Implementation Study
(https://www.elixir-europe.org/activities/bioschemas) and the EU ELIXIR-EXCELERATE grant within the
Research Infrastructures programme of Horizon 2020, grant agreement number
676559.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          , et al.:
          <article-title>Gene: a gene-centered information resource at NCBI</article-title>
          .
          <source>NAR</source>
          <volume>43</volume>
          (
          <issue>D1</issue>
          ),
          <fpage>D36</fpage>–<lpage>D42</lpage>
          (
          <year>2015</year>
          ), http://dx.doi.org/10.1093/nar/gku1055
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>R.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brickley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macbeth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Big data makes common schemas even more necessary</article-title>
          .
          <source>CACM</source>
          <volume>59</volume>
          (
          <issue>2</issue>
          ) (
          <year>2016</year>
          ), http://dx.doi.org/10.1145/2844544
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. The UniProt Consortium:
          <article-title>UniProt: the universal protein knowledgebase</article-title>
          .
          <source>NAR</source>
          <volume>45</volume>
          (
          <issue>D1</issue>
          ),
          <fpage>D158</fpage>–<lpage>D169</lpage>
          (
          <year>2017</year>
          ), http://dx.doi.org/10.1093/nar/gkw1099
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>