<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Semantics and NLP in the SMART Protocols Repository</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olga Giraldo</string-name>
          <email>ogiraldo@fi.upm.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Garcia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Corcho</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Linkingdata I/O LLC</institution>
          ,
          <addr-line>Fort Collins, Colorado</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ontology Engineering Group, Universidad Politécnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this poster we present the semantic and NLP layers in the development of our repository for experimental protocols. We have studied existing repositories for experimental protocols as well as the experimental protocols themselves. We have identified end-user features across existing repositories; we have also structured the semantics for these documents, defined by an ontology and a Minimal Information model for experimental protocols. In addition, we have built an NLP layer that makes extensive use of semantics. Our integrative approach focuses on facilitating search, retrieval and socialization of experimental protocols. We also focus on facilitating the generation of documents that are born semantics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        Experimental protocols are fundamental information
structures that support the description of the processes by
means of which results are generated in experimental
research. Well-structured and accurately described protocols
(processable by humans and machines) should facilitate
experimental reproducibility. In this poster we present the
semantic and NLP infrastructure that we are putting together
for machine-processable protocols; we emphasize the
integration of key components of this infrastructure during
the implementation of a repository for experimental
protocols. Our components include: i) The SMART
Protocols (SP) Ontology: this ontology results from the
analysis of over 200 experimental protocols in various
domains (molecular biology, cell and developmental
biology, and others). Domain experts also participated in the
development of the SP ontology
        <xref ref-type="bibr" rid="ref4">(Giraldo, García, &amp;
Corcho, 2014)</xref>
        . Using the SP ontology allows us to annotate
and generate Linked Open Data (LOD) for existing and de
novo protocols, i.e. protocols that are to be born semantics. ii) The
Sample Instrument Reagent Objective (SIRO) model.
This is a twofold model: on the one hand, it defines an
extended layer of metadata for this kind of document; on
the other hand, SIRO is a Minimal Information (MI) model
conceived in the same realm as PICO
        <xref ref-type="bibr" rid="ref1">(Booth &amp; Brice,
2004)</xref>
        , supporting search, retrieval and classification.
SIRO is based on an exhaustive study of over
200 protocols in biochemistry, molecular biology, cell and
developmental biology, and health care, as well as on interviews
with end users. SIRO includes information elements that
were identified as central for describing, searching and
sharing protocols. Furthermore, as SIRO is rooted in the
content of the document, it defines a score of completeness
and reproducibility for experimental protocols. iii) The
NLP engine. The semantics defined by the SP ontology,
SIRO, and several domain ontologies are used by our NLP
engine, GATE<sup>1</sup>, thus facilitating search, retrieval and
socialization (SeReSo) over experimental protocols. We
have generated rules based on the content of protocols; these
rules allow us to identify meaningful parts of speech (PoS).
      </p>
      <p>We have reviewed proposed standards for representing
experimental protocols, investigations, experiments,
scientific documents, rhetorical structures and annotations.
In addition, we have analyzed existing repositories for
protocols. Interestingly, we have found that there are
numerous similarities across these repositories, e.g.
business model, end-user features and document management;
by the same token, the lack of semantics for experimental
protocols and the lack of specific features for this particular
type of document may be seen as a common deficiency in
these repositories. This document is organized as follows:
in section 2 the semantic components are presented, together
with the use of semantics by our NLP
engine. Some issues and final remarks are presented in
section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>2 SEMANTICS PLUS NLP</title>
      <p>The combination of semantics and NLP makes it possible to
deliver a tool that facilitates the generation of experimental
protocols that are to be born semantics: fully annotated,
linked to the web of data, with fully identified PoS, and
processable by machines as well as by humans. In the same
vein, a similar process is supported for existing experimental
protocols in formats such as PDF. Furthermore, it becomes
possible to answer queries such as: “What bacteria have been
used in protocols for persister cells isolation?”, “What
imaging analysis software is used for quantitative analysis
of locomotor movements, buccal pumping and cardiac
activity on X. tropicalis?”, and “How to prepare the stock
solutions of the H2DCF and DHE dyes?”.</p>
      <p>We are using the SP ontology; SP aims to formalize the
description of experimental protocols, which we understand
as domain-specific workflows embedded within documents.</p>
      <p>
        SP delivers a structured workflow, document and domain
knowledge representation written in OWL DL. For the
representation of document aspects we are extending the
Information Artifact Ontology (IAO).<sup>2</sup> The representation of
executable aspects of a protocol is captured with concepts
from the P-Plan Ontology (P-Plan)
        <xref ref-type="bibr" rid="ref3">(Garijo &amp; Gil, 2012)</xref>
        ; we are
also reusing EXPO
        <xref ref-type="bibr" rid="ref6">(Soldatova &amp; King, 2006)</xref>
        ,
EXACT
        <xref ref-type="bibr" rid="ref5">(Soldatova, Aubrey, King, &amp; Clare, 2008)</xref>
        and OBI
        <xref ref-type="bibr" rid="ref2">(Courtot et al., 2008)</xref>
        . For domain knowledge, we
rely on existing biomedical ontologies. Our ontology-based
representation for experimental protocols is composed of
two modules, namely SP-document<sup>3</sup> and SP-workflow.<sup>4</sup> In
this way, we represent the workflow, document and domain
knowledge implicit in experimental protocols. By
combining both modules we are delivering a born-semantics,
self-describing document.
      </p>
      <p><sup>1</sup> http://gate.ac.uk/</p>
      <p>Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      <p>We are also working with the SIRO model; our model
breaks down the protocol into key elements that are common
to “all” laboratory protocols: i) Sample/Specimen (S), ii)
Instruments (I), iii) Reagents (R) and iv) Objective (O).</p>
      <p>SIRO is motivated by minimal information models as well
as by the Patient/Population/Problem,
Intervention/Prognostic Factor/Exposure, Comparison,
Outcome (PICO) model. For the sample, SIRO considers the
strain, line or genotype, developmental stage, organism part,
growth conditions, pre-treatment of the sample, and
volume/mass of sample. For the instruments, it
considers the commercial name, manufacturer and
identification number. For the reagents, it considers the
commercial name, manufacturer and identification number;
it is also important to know the storage conditions for the
reagents in the protocol. Identifying the objective or goal of
the protocol helps readers decide on the
suitability of the protocol for their experimental problem.</p>
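      <p>As an illustration only, the SIRO elements listed above can be sketched as a small Python data structure. The class and field names are our own naming for this sketch, and the completeness function is a hypothetical, equally weighted coverage measure; it is not the score defined by SIRO, whose exact formulation is not given here. The example values are made up.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Sample:
    # Descriptors follow the sample attributes listed in the text.
    strain_line_or_genotype: Optional[str] = None
    developmental_stage: Optional[str] = None
    organism_part: Optional[str] = None
    growth_conditions: Optional[str] = None
    pre_treatment: Optional[str] = None
    volume_or_mass: Optional[str] = None


@dataclass
class Instrument:
    commercial_name: Optional[str] = None
    manufacturer: Optional[str] = None
    identification_number: Optional[str] = None


@dataclass
class Reagent:
    commercial_name: Optional[str] = None
    manufacturer: Optional[str] = None
    identification_number: Optional[str] = None
    storage_conditions: Optional[str] = None


@dataclass
class SiroRecord:
    objective: Optional[str] = None
    samples: List[Sample] = field(default_factory=list)
    instruments: List[Instrument] = field(default_factory=list)
    reagents: List[Reagent] = field(default_factory=list)


def filled_fraction(item) -> float:
    """Fraction of an item's descriptor fields that are non-empty."""
    values = [getattr(item, f.name) for f in fields(item)]
    return sum(1 for v in values if v) / len(values)


def completeness(record: SiroRecord) -> float:
    """Hypothetical score: mean coverage over the four SIRO elements.

    An element with no entries scores 0; otherwise it scores the mean
    filled fraction of its entries. Equal weighting is an assumption.
    """
    def group_score(items) -> float:
        if not items:
            return 0.0
        return sum(filled_fraction(i) for i in items) / len(items)

    parts = [
        1.0 if record.objective else 0.0,
        group_score(record.samples),
        group_score(record.instruments),
        group_score(record.reagents),
    ]
    return sum(parts) / len(parts)


# Made-up example record, loosely based on the leaf tissue example.
record = SiroRecord(
    objective="Extraction of genomic DNA from leaf tissue",
    samples=[Sample(organism_part="leaf", volume_or_mass="1 g")],
    instruments=[Instrument(commercial_name="mortar and pestle")],
    reagents=[],
)
score = completeness(record)
print(round(score, 3))
```
]]></preformat>
      <p>In this sketch a protocol that names no reagents is penalised as heavily as one that omits its objective; a real score would probably weight the elements differently.</p>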
      <p>The four elements are also automatically annotated with
existing ontologies and exposed as LOD.</p>
      <p>The NLP engine, GATE, uses the semantics defined by
the SP ontology and SIRO. We have classified our corpus of
protocols according to purpose/objective (e.g. extraction of
nucleic acids, DNA amplification and visualization of
nucleic acids) and then we transformed them to text. For
each protocol, the available metadata, reagents, instruments,
samples, actions and instructions were manually identified.</p>
      <p>We worked with full sentences to characterize PoS,
relations, actions (verbs) and full instructions. Gazetteers
and rules were thus generated. The results from our NLP
workflow are very granular; for instance, we are able to
identify DNA purification reagents, digest reaction reagents,
cell disruption instruments, etc. Text like “plant species” is
identified as sample, so are organisms and parts of
organisms. The sentences and PoS where the vocabulary is
located are also identified and characterized. For instance,
PoS such as “leaf tissue finely ground using a mortar and
pestle, then aliquoted (1 g) for each extraction” are
identified, characterized and annotated; in this example, the
sample, the action and the cell disruption instrument are identified
and characterized. We are using ANNIE (A Nearly-New
Information Extraction System) as our information extraction
system and JAPE (Java Annotation Patterns Engine) for coding rules.</p>
      <p><sup>2</sup> https://code.google.com/p/information-artifact-ontology/</p>
      <p><sup>3</sup> http://vocab.linkeddata.es/SMARTProtocols/sp-documentV2.0.htm</p>
      <p><sup>4</sup> http://vocab.linkeddata.es/SMARTProtocols/sp-workflowV2.0.htm</p>
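      <p>The gazetteer lookup described above can be approximated in a few lines of plain Python. This is a minimal stand-in and not GATE, ANNIE or JAPE code: the gazetteer entries, labels and the annotate function are hypothetical, and matching is a simple case-insensitive phrase search rather than a rule grammar.</p>
      <preformat><![CDATA[
```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteers; the real lists are derived from the corpus.
GAZETTEERS: Dict[str, List[str]] = {
    "sample": ["leaf tissue", "plant species"],
    "cell_disruption_instrument": ["mortar and pestle"],
    "action": ["ground", "aliquoted"],
}


def annotate(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, label, matched_text) for every gazetteer hit."""
    hits = []
    for label, terms in GAZETTEERS.items():
        for term in terms:
            # Whole-phrase, case-insensitive match on word boundaries.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.start(), m.end(), label, m.group(0)))
    # Sort by position so annotations follow reading order.
    return sorted(hits)


sentence = ("leaf tissue finely ground using a mortar and pestle, "
            "then aliquoted (1 g) for each extraction")
annotations = annotate(sentence)
for start, end, label, matched in annotations:
    print(f"{start:3d}-{end:3d}  {label:28s} {matched}")
```
]]></preformat>
      <p>A lookup of this kind only finds known vocabulary; relating the hits to each other (which action applies to which sample, with which instrument) is the part that the rules are meant to cover.</p>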
    </sec>
    <sec id="sec-3">
      <title>3 FINAL REMARKS</title>
      <p>We have presented the integration of three modules in the
development of a repository for experimental protocols.</p>
      <p>Unlike existing repositories, the SP repository focuses on
facilitating the production of semantic protocols, intelligent
search and retrieval and social activity over experimental
protocols. We have extensively studied existing
experimental protocols; key functionalities from these will
also be included in our repository. We have also
presented the SP ontology, the SIRO model for MI and the
use of GATE in our architecture. Our workflow addresses
scenarios with PDFs and de novo protocols, i.e. those born
semantics based on the SP ontology. For de novo documents
we are using the ontology as a template; the resulting
instantiated RDF is annotated and the conventional
document metadata is extracted. For PDFs we are tuning the
NLP workflow for extracting SIRO automatically.</p>
      <p>Extracting the Objective has proven to be a challenging task.</p>
      <p>Actions, e.g. “grind the sample”, usually have well-defined
grammatical structures, but the Objective of the
experimental protocol is usually hidden in complex prose.</p>
      <p>We are constantly improving the rules; new documents
pertaining to other subdomains in biomedical sciences are
added to the corpus; then, the rules are tested. Results are
manually evaluated and the rules and gazetteers are
consequently enriched.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Booth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Brice</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Formulating answerable questions</article-title>
          . In A. Booth &amp; A. Brice (Eds.),
          <source>Evidence Based Practice for Information Professionals: A Handbook</source>
          (pp.
          <fpage>61</fpage>
          -
          <lpage>70</lpage>
          ). London: Facet Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Courtot</surname>
            ,
            <given-names>Mélanie.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bug</surname>
          </string-name>
          , William.,
          <string-name>
            <surname>Gibson</surname>
          </string-name>
          , Frank.,
          <string-name>
            <surname>Lister</surname>
          </string-name>
          , Allyson L.,
          <string-name>
            <surname>Malone</surname>
          </string-name>
          , James., Schober, Daniel., . . .
          <string-name>
            <surname>Ruttenberg</surname>
          </string-name>
          , Alan. (
          <year>2008</year>
          ).
          <article-title>The OWL of Biomedical Investigations</article-title>
          .
          <source>Paper presented at the OWLED workshop in the International Semantic Web Conference (ISWC)</source>
          , Karlsruhe, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Garijo</surname>
          </string-name>
          , Daniel., &amp;
          <string-name>
            <surname>Gil</surname>
            ,
            <given-names>Yolanda.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data</article-title>
          .
          <source>Paper presented at the 2nd International Workshop on Linked Science 2012</source>
          , Boston.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Giraldo</surname>
            ,
            <given-names>Olga.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García</surname>
            ,
            <given-names>Alexander.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>Oscar.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>SMART Protocols: SeMAntic RepresenTation for Experimental Protocols</article-title>
          .
          <source>Paper presented at the 4th Workshop on Linked Science 2014 - Making Sense Out of Data (LISC2014)</source>
          , Riva del Garda, Trentino, Italy. http://ceur-ws.org/Vol-1282/lisc2014_submission_2.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Soldatova</surname>
            ,
            <given-names>L. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aubrey</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>King</surname>
            ,
            <given-names>R. D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Clare</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>The EXACT description of biomedical protocols</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>24</volume>
          (
          <issue>13</issue>
          ),
          <fpage>i295</fpage>
          -
          <lpage>i303</lpage>
          . doi: 10.1093/bioinformatics/btn156
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Soldatova</surname>
            ,
            <given-names>Larisa N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>King</surname>
            ,
            <given-names>Ross D.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>An ontology of scientific experiments</article-title>
          .
          <source>Journal of the Royal Society Interface</source>
          ,
          <volume>3</volume>
          (
          <issue>11</issue>
          ),
          <fpage>795</fpage>
          -
          <lpage>803</lpage>
          . doi: 10.1098/rsif.2006.0134
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>