Using Semantics and NLP in the SMART Protocols Repository

Olga Giraldo¹,*, Alexander Garcia¹,² and Oscar Corcho¹
¹ Ontology Engineering Group, Universidad Politécnica de Madrid, Spain
² Linkingdata I/O LLC, Fort Collins, Colorado, USA
* To whom correspondence should be addressed: ogiraldo@fi.upm.es

ABSTRACT
In this poster we present the semantic and NLP layers in the development of our repository for experimental protocols. We have studied existing repositories for experimental protocols as well as the experimental protocols themselves. We have identified end-user features across existing repositories; we have also structured the semantics for these documents, defined by an ontology and a Minimal Information model for experimental protocols. In addition, we have built an NLP layer that makes extensive use of semantics. Our integrative approach focuses on facilitating search, retrieval and socialization of experimental protocols. We also focus on facilitating the generation of documents that are born semantic.

1 INTRODUCTION
Experimental protocols are fundamental information structures that support the description of the processes by means of which results are generated in experimental research. Well-structured and accurately described protocols (processable by humans and machines) should facilitate experimental reproducibility. In this poster we present the semantic and NLP infrastructure that we are putting together for machine-processable protocols; we emphasize the integration of key components of this infrastructure during the implementation of a repository for experimental protocols. Our components include:

i) The SMART Protocols (SP) Ontology: this ontology results from the analysis of over 200 experimental protocols in various domains (molecular biology, cell and developmental biology and others). Domain experts also participated in the development of the SP ontology (Giraldo, García, & Corcho, 2014). Using the SP ontology allows us to annotate and generate Linked Open Data (LOD) for existing and de novo protocols, that is, protocols that are born semantic.

ii) The Sample Instrument Reagent Objective (SIRO) model. This is a twofold model. On the one hand, it defines an extended layer of metadata for this kind of document. On the other hand, SIRO is a Minimal Information (MI) model conceived in the same realm as PICO (Booth & Brice, 2004), supporting search, retrieval and classification purposes. SIRO is based on an exhaustive study of over 200 protocols in biochemistry, molecular biology, cell and developmental biology and health care, as well as interviews with end users. SIRO includes information elements that were identified as central for describing, searching and sharing protocols. Furthermore, as SIRO is rooted in the content of the document, it defines a score of completeness and reproducibility for experimental protocols.

iii) The NLP engine. The semantics defined by the SP ontology, SIRO and several domain ontologies are used by our NLP engine, GATE¹, thus facilitating search, retrieval and socialization (SeReSo) over experimental protocols. We have generated rules based on the content of protocols; these rules allow us to identify meaningful parts of speech (PoS).

We have reviewed proposed standards for representing experimental protocols, investigations, experiments, scientific documents, rhetorical structures and annotations. In addition, we have analyzed existing repositories for protocols. Interestingly, we have found numerous similarities across these repositories, e.g. business model, end-user features and document management. By the same token, the lack of semantics for experimental protocols and the lack of specific features for this particular type of document may be seen as a common deficiency in these repositories. This document is organized as follows: in section 2 the semantic components are presented; in that section we also report on the use of semantics by our NLP engine. Some issues and final remarks are presented in section 3.

2 SEMANTICS PLUS NLP
The combination of semantics and NLP makes it possible to deliver a tool that facilitates the generation of experimental protocols that are born semantic: fully annotated, linked to the web of data, with fully identified PoS, and processable by machines as well as by humans. In the same vein, a similar process is supported for existing experimental protocols in formats such as PDF. Furthermore, it becomes possible to answer queries such as: “What bacteria have been used in protocols for persister cells isolation?”, “What imaging analysis software is used for quantitative analysis of locomotor movements, buccal pumping and cardiac activity on X. tropicalis?” and “How to prepare the stock solutions of the H2DCF and DHE dyes?”.

We are using the SP ontology. SP aims to formalize the description of experimental protocols, which we understand as domain-specific workflows embedded within documents. SP delivers a structured workflow, document and domain knowledge representation written in OWL DL. For the representation of document aspects we are extending the Information Artifact Ontology (IAO)². The representation of executable aspects of a protocol is captured with concepts from the P-Plan Ontology (P-Plan) (Garijo & Gil, 2012); we are also reusing EXPO (Soldatova & King, 2006), EXACT (Soldatova, Aubrey, King, & Clare, 2008) and OBI (Courtot et al., 2008). For domain knowledge, we rely on existing biomedical ontologies. Our ontology-based representation for experimental protocols is composed of two modules, namely SP-document³ and SP-workflow⁴. In this way, we represent the workflow, document and domain knowledge implicit in experimental protocols. By combining both modules we are delivering a born-semantic, self-describing document.

We are also working with the SIRO model. Our model breaks down the protocol into key elements that are common to “all” laboratory protocols: i) Sample/Specimen (S), ii) Instruments (I), iii) Reagents (R) and iv) Objective (O). SIRO is motivated by minimal information models as well as by the Patient/Population/Problem, Intervention/Prognostic Factor/Exposure, Comparison, Outcome (PICO) model. For the sample, SIRO considers the strain, line or genotype; developmental stage; organism part; growth conditions; pre-treatment of the sample; and volume/mass of sample. For the instruments, it considers the commercial name, manufacturer and identification number. For the reagents, it considers the commercial name, manufacturer and identification number; it is also important to know the storage conditions for the reagents in the protocol. Identifying the objective or goal of the protocol helps readers decide whether the protocol is suitable for their experimental problem. The four elements are also automatically annotated with existing ontologies and exposed as LOD.

The NLP engine, GATE, uses the semantics defined by the SP ontology and SIRO. We classified our corpus of protocols according to purpose/objective (e.g. extraction of nucleic acids, DNA amplification and visualization of nucleic acids) and then transformed the protocols to text. For each protocol, the available metadata, reagents, instruments, samples, actions and instructions were manually identified. We worked with full sentences to characterize PoS, relations, actions (verbs) and full instructions. Gazetteers and rules were thus generated. The results from our NLP workflow are very granular; for instance, we are able to identify DNA purification reagents, digest reaction reagents, cell disruption instruments, etc. Text like “plant species” is identified as sample, and so are organisms and parts of organisms. The sentences and PoS where the vocabulary is located are also identified and characterized. For instance, PoS such as “leaf tissue finely ground using a mortar and pestle, then aliquoted (1 g) for each extraction” are identified, characterized and annotated; in this example, the sample, the action and the cell disruption instrument are identified and characterized. We are using ANNIE (A Nearly-New Information Extraction System) as our information extraction system and JAPE for coding rules.

3 FINAL REMARKS
We have presented the integration of three modules in the development of a repository for experimental protocols. Unlike existing repositories, the SP repository focuses on facilitating the production of semantic protocols, intelligent search and retrieval, and social activity over experimental protocols. We have extensively studied existing experimental protocols; key functionalities from these will also be included in our repository. We have also presented the SP ontology, the SIRO model for MI and the use of GATE in our architecture. Our workflow addresses scenarios with PDFs and with de novo protocols, that is, those born semantic based on the SP ontology. For de novo documents we are using the ontology as a template; the resulting instantiated RDF is annotated and the conventional document metadata is extracted. For PDFs we are tuning the NLP workflow for extracting SIRO automatically. Extracting the Objective has proven to be a challenging task. Actions, e.g. “grind the sample”, usually have well-defined grammatical structures, but the Objective of the experimental protocol is usually hidden in complex prose. We are constantly improving the rules: new documents pertaining to other subdomains in biomedical sciences are added to the corpus, and then the rules are tested. Results are manually evaluated and the rules and gazetteers are enriched accordingly.

REFERENCES
Booth, A., & Brice, A. (2004). Formulating answerable questions. In A. Booth & A. Brice (Eds.), Evidence Based Practice for Information Professionals: A Handbook (pp. 61-70). London: Facet Publishing.
Courtot, M., Bug, W., Gibson, F., Lister, A. L., Malone, J., Schober, D., . . . Ruttenberg, A. (2008). The OWL of Biomedical Investigations. Paper presented at the OWLED workshop in the International Semantic Web Conference (ISWC), Karlsruhe, Germany.
Garijo, D., & Gil, Y. (2012). Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data. Paper presented at the 2nd International Workshop on Linked Science 2012, Boston.
Giraldo, O., García, A., & Corcho, O. (2014). SMART Protocols: SeMAntic RepresenTation for Experimental Protocols. Paper presented at the 4th Workshop on Linked Science 2014 - Making Sense Out of Data (LISC2014), Riva del Garda, Trentino, Italy. http://ceur-ws.org/Vol-1282/lisc2014_submission_2.pdf
Soldatova, L. N., Aubrey, W., King, R. D., & Clare, A. (2008). The EXACT description of biomedical protocols. Bioinformatics, 24(13), i295-303. doi: 10.1093/bioinformatics/btn156
Soldatova, L. N., & King, R. D. (2006). An ontology of scientific experiments. Journal of the Royal Society Interface, 3(11), 795-803. doi: 10.1098/rsif.2006.0134

¹ http://gate.ac.uk/
² https://code.google.com/p/information-artifact-ontology/
³ http://vocab.linkeddata.es/SMARTProtocols/sp-documentV2.0.htm
⁴ http://vocab.linkeddata.es/SMARTProtocols/sp-workflowV2.0.htm

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.
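
The gazetteer lookup described in section 2 can be illustrated with a minimal Python sketch. It is not the GATE/ANNIE pipeline and it is not a JAPE rule; it is a plain dictionary-and-regular-expression stand-in. The `GAZETTEERS` table, the `annotate` function and the exact category labels are assumptions made for illustration only. The terms come from the example sentence quoted in the text, not from the actual SP gazetteers.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteers: category -> surface terms. The terms are taken from
# the example sentence in the text; the real SP gazetteers are not shown there.
GAZETTEERS: Dict[str, List[str]] = {
    "sample": ["leaf tissue", "plant species"],
    "cell disruption instrument": ["mortar and pestle"],
    "action": ["ground", "aliquoted"],
}

# (start offset, end offset, category, matched text)
Annotation = Tuple[int, int, str, str]


def annotate(text: str,
             gazetteers: Dict[str, List[str]] = GAZETTEERS) -> List[Annotation]:
    """Return case-insensitive whole-word gazetteer matches, sorted by offset."""
    annotations: List[Annotation] = []
    for category, terms in gazetteers.items():
        for term in terms:
            # \b keeps "ground" from matching inside e.g. "background".
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                annotations.append(
                    (match.start(), match.end(), category, match.group(0))
                )
    return sorted(annotations)


sentence = ("leaf tissue finely ground using a mortar and pestle, "
            "then aliquoted (1 g) for each extraction")

for start, end, category, matched in annotate(sentence):
    print(f"{start}-{end} {category}: {matched}")
```

On the example sentence, this sketch tags "leaf tissue" as sample, "ground" and "aliquoted" as actions, and "mortar and pestle" as cell disruption instrument. A pure lookup like this cannot recover the Objective, which the text notes is usually hidden in complex prose; that is where grammar-aware rules are needed.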