Penn Medicine Biobank Informatics
                                           OBI Influenced Software Design


                                                Heather Williams, David Birtwell
                                                     Penn Medicine BioBank
                                                    University of Pennsylvania
                                                      Philadelphia, PA USA
                                              hwilli@upenn.edu, birtwell@upenn.edu


    Abstract— We present a use case of the Ontology for                  information must be easily accessible, discoverable, and
Biomedical Investigations [1] (OBI) informing the software               query-able, and data provenance must be maintained.
design of a suite of biobanking applications. We describe how
OBI has influenced the design of the Penn Medicine BioBank               Since 2013, the PMBB informatics infrastructure has been
applications that support the collection, processing, and storage
                                                                         implemented by a suite of biobanking applications collectively
of biobank specimens and our work in creating a robust search
system over data produced by BioBank applications and other              called Squash that are founded on OBI concepts with the aim
sources. We show that applications that have been designed with          of presenting and interacting with biobank information in a
the tenets of OBI in mind, particularly those of being reality           semantically rich ontology adherent manner. We designed our
based and modeling events as OBI style processes, have proven to         data model to follow patterns and conventions established by
effectively express richly interconnected data and be easily             OBI, its higher order ontology Basic Formal Ontology [8]
extendable.                                                              (BFO), and the OBO Relation Ontology [9]. BFO is a theory
                                                                         of the basic structures of reality currently being developed at
   Keywords— BFO; Biobanking; OBI; Ontology; Process;                    the Institute for Formal Ontology and Medical Information
Search; Software Design
                                                                         Science (IFOMIS) at the University of Leipzig [11]. The
                                                                         OBO Relation Ontology provides guidelines for creating
                      I. INTRODUCTION                                    ontologies with consistent relational assertions.
Bio-specimens and the data gained by their analysis are
valuable resources for bio-medical investigators. Biobanks,              To date, we have implemented a web based specimen
collections of bio-specimens (specimens) made available for              collection and processing application named Pumpkin that
research, are of extreme importance to investigators, because            makes heavy use of the concept of a process [2] and have
they can provide a large enough sample size to perform robust            prototyped a query system that searches over OBI annotated
statistical analysis and can be used to find specimens with rare         data. We have found that modeling events such as pre-storage
genotypes or phenotypes of interest. Information associated              specimen processes like aliquoting, centrifugation, and
with specimens in biobanks and the subjects from whom the                freezing as processes with specific end-points, inputs, and
specimens were collected is frequently as important to                   outputs, has led to a powerful application with an expressive
research as the information gleaned from specimen analysis.              data model that reflects reality and is easily transformable to
Information technology such as databases and web application             an ontology friendly format. We also found that keeping our
frameworks provide basic support for the storage and retrieval           data model reality based following the example set by OBI has
of biobank information. However, these technologies do not               resulted in a data model over which it is easy to reason and
provide models for complex bio-medical data. Modeling such               that facilitates organic extension.
rich interconnected data remains a challenge for bio-medical
investigators and informaticians, one that must be overcome
for specimen based research to reach its full potential.
                                                                                                II. METHODS
The Penn Medicine BioBank (PMBB) enables biomedical
                                                                         A. OBI Driven Software Design
research by providing centralized access to a large number of
annotated blood and tissue specimens. The Penn Medicine                  Pumpkin was developed using the web application framework
BioBank Informatics Team has been tasked with supporting                 Grails [3] and is written in Groovy [4] and Java [5] using
this initiative by creating the informatics infrastructure to            MySql [6] as the relational database backend. Pumpkin
enable the collection, processing, and storage of specimens              supports the specimen collection process from the initial
and associated subject data, and making the biomedical and               creation of specimen collection packets, through the
demographic information associated with its subjects and                 processing and ultimate storage of specimens.
specimens readily available to the research community. The


  57	
  
Grails incorporates an object-relational mapping (ORM)
powered by Hibernate [7] that provides an abstraction layer
over relational databases. Instead of creating tables with fields                                                           Specimen
                                                                                                                            Extraction
and foreign keys, one creates inter-related domain classes that                     is_input_specimen

specify the database schema. Data are written and read from                         has_output_speicmen

the database at the Groovy object level rather than via SQL.                                                                   S11
                                                                                                                              Blood
Using the rich Grails ORM, we were able to model our
persistent data in a manner very similar to the way classes are
defined in an ontology like OBI -- reality based with class                                                                   Spin
inheritance.                                                                                                                 FICOLL


When designing our data model we considered the concepts                                                             S21       S22               S23
represented in OBI and the relationships between them and                                                           Buffy     Buffy            Plasma

theorized how new concepts would be represented as a guide
to designing persistent domain objects. In this way, OBI                                                                              Dilute
                                                                                                          Aliquot                                       Annotate
informs both the software architecture and the structure of                                                                            PBS

data that is created by the application. Some examples of OBI
terms that were modeled as domain objects are the concepts of                                     S31                S32
                                                                                                 Buffy              Buffy
protocol, specimen collection, containers, and specimens.


                                                                                   Fig. 2. An example of a typical specimen workflow showing specimen
                                                                                processes and their input and output specimens as modeled in Pumpkin
                                                                                following OBI guidelines. S11 is the primogenitor specimen. Because
                                                                                specimen processes are explicitly modeled, information can be directly
                                                                                associated with processes and specimens or inferred via the graph. For
                                                                                example, information pertaining to the specimen extraction of S11, like the
                                                                                study subject, is directly associated only with S11 and discoverable for
                                                                                derivative specimens by graph traversal.


                                                                                B. OBI Annotated Data Search
                                                                                We have developed a prototype search system that implements
                                                                                a natural language query interface over OBI annotated data.
                                                                                Ontology experts analyzed several small existing biomedical
                                                                                data sets and created a mapping between the data and concepts
                                                                                in the OBI ontology. D2RQ [10] was used to present these
                                                                                annotated data as a SPARQL endpoint.

                                                                                To enable natural-language-like queries (NLQ), a pipeline
 Fig. 1. The specimen process architecture expressed here in an informal
graph exemplifies how OBI concepts influenced the design of Pumpkin.            following the standard programming language compilation
Specimens, SpecimenContainers, and SpecimenProcesses are all modeled as         process was created. An NLQ is first parsed as per a fully
persistent domain objects.                                                      specified context free grammar. The resulting parse tree is fed
                                                                                to an interpreter that creates a logical query representation. A
The concept of a process heavily influenced our design. From                    query generator takes as input this logical query representation
the BFO concept of a process, we included both start and end                    and generates a SPARQL query that is run against the NLQ
times in our process domain classes. The OBI relationships                      query endpoint.
has_specified_input      and      has_specified_output        are
implemented as well. For example, we modeled a domain
super-class SpecimenProcess that includes input specimens,                                                          III. RESULTS
output specimens, start and end times, and a user (participant).                Pumpkin has been in production since June 2013 and to date
Subclasses of SpecimenProcess include common specimen                           has stored over 90,000 specimens from over 8,000 collections.
processes like aliquot, spin (centrifugation), dilute, and trash.               Its design has proven to be adequate to handle our initial
Given specimen processes modeled in this way, each                              collection specifications and be easily extendable to additional
specimen is part of a directed specimen process graph that                      processes and concepts, such as new specimen and collection
starts with a specimen extraction process and terminates with                   attributes.   Since the data model is reality-based and
processes that have output specimens bound for storage.                         expressive as it is in graph form, it provides a common
                                                                                representation for all biobank related data, independent of
                                                                                individual lab nomenclature and idiosyncrasies. Because the


  58	
  
data are stored in a harmonized data model, no transformation            computational load of traversing the process graph for
is required to query across these data.                                  common tasks, each specimen was assigned a direct pointer to
                                                                         its primogenitor specimen allowing single database queries
                                                                         rather than recursive searches.
                       IV. DISCUSSION
Early in the requirements gathering and design process, it               In the hopes of finding a more general solution to efficient
became clear that one of the primary difficulties of biobanking          data retrieval, we are experimenting with mirroring our data in
informatics is the heterogeneity and interconnectedness of the           a graph database. Graph databases are designed to store and
information involved. Application developers are mostly                  operate efficiently over data in graph format and may provide
unaccustomed to modeling entities and processes as diverse               a mechanism to perform efficient reads of our data.
and complex as those found in biology and biobanking. While
the volume of data is small in modern terms, the complexity              In addition to specimen processes, we have loosely modeled
and fragility is great. In order to remain useful, each bit of           the concept of a ‘task’ to follow the OBI methodology as a
information concerning a biological process must be richly               time-based process with inputs and outputs. Currently, we
explained, which often means complex links to other bits of              model specimen intake as a task. In the future, other tasks will
information and semantic definitions. Our development team,              be included. As with specimen processes, task data will be
staffed with computer science and math majors, found itself ill          expressed in graph form and so the same efficiency
equipped to meet the challenge of modeling the information of            considerations will exist and methods used.
a robust biobanking informatics landscape. Traditional data
modeling techniques as they apply to relational and document             Still to be developed is Carnival, the system that will tie
databases fall short. It was only after several months                   together the subject and specimen data generated by Squash
acquainting ourselves with OBI and ontology concepts in                  applications with data from other sources and present them in
general that we were able to see a path to an informatics                a discoverable and query-able format. This will be an
system that would provide data expressivity equal to the task.           expansion of the prototype natural language query tool. The
What ensued was the implementation of a web based                        data generated by and stored in Squash applications exist at
biobanking application designed from the ground-up to be                 rest in a form that is compatible with OBI. We plan to
OBI compliant.                                                           annotate any additional data from sources outside Squash with
                                                                         OBI terms and provenance information in order to create a
Two tenets of OBI stand out as particularly significant. The             unified search endpoint.
first is the dedication to remaining reality based. It is often
more convenient to model data for a given requirement in a               Through our experiences attempting to create ontology
way that satisfies that requirement only, usually following the          adherent database applications, we have gained an
path of least resistance of the implementation technology, than          appreciation for the valuable work that has been and continues
it is to adhere to a reality based model. From the outset, we            to be done in ontology development. We suspect that the
committed ourselves to a reality based data model following              perceived value of ontologies within the biomedical research
the example set by OBI. While this commitment did prove                  community will increase over time as those outside the
difficult and seemed unnecessarily so at times, inevitably it led        immediate ontology community learn the contributions that
to an understandable and often surprisingly easily extendable            ontologies like OBI and BFO can make towards their efforts.
data model. The second is our choice to model events as BFO              It remains to be seen whether ontology influenced software
style processes, occurrents with temporal boundaries,                    design will be adopted by the broader software development
following the OBI convention of including process inputs and             community, but if there is continued success of the Penn
outputs. It was unclear at the outset that this approach would           Medicine Biobank, it will be due in large part to the influence
lead to an improved data model. We found however that                    ontologies have had on our software development team.
much like our commitment to remaining reality based,
modeling our processes in this way resulted in an
understandable and easily extendable data model.

This approach has not been without challenges. One notable
difficulty arose around efficient information retrieval from the
database. To get the full data for a particular specimen, the
specimen process graph must be generated, which in our
initial implementation required recursive domain class
traversals resulting in an explosion of computationally
expensive database calls. We addressed this issue via shortcut
pointers in the database. In most instances the data needed for
a particular specimen are associated with either the specimen
itself or its primogenitor specimen.         To alleviate the


  59	
  
                                                                                [5]  Java. [http://www.java.com].
                             REFERENCES                                         [6]  MySQL. [http://www.mysql.com].
[1]   Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone          [7]  Hibernate. [http://hibernate.org].
      J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA,        [8]  Grenon P, Smith B, Goldberg L. Ontologies in Medicine. IOS Press;
      Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J; OBI consortium.             2004. Biodynamic Ontology: Applying BFO in the Biomedical Domain.
      (2010) Modeling biomedical experimental processes with OBI. J                  pp.20–32.
      Biomed Semantics. 2010 Jun 22;1 Suppl 1:S7.PMID: 20626927
                                                                                [9] Smith B, Ceusters W, Klagges B. Relations in Biomedical Ontologies.
[2]   Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T,              Genome Biology. 2005;6:R46. doi: 10.1186/gb-2005-6-5-r46.
      Musen MA. BioPortal: enhanced functionality via new Web services
                                                                                [10] D2RQ [http://d2rq.org/].
      from the National Center for Biomedical Ontology to access and use
      ontologies in software applications. Nucleic Acids Res. 2011              [11] Grenon, P. and Smith, B. (2004) “SNAP and SPAN: Towards Dynamic
      Jul;39(Web Server issue):W541-5. Epub 2011 Jun 14.                             Spatial Ontology”, Spatial Cognition and Computation, 4:1, 69-103.
[3]   Grails. [https://grails.org].
[4]   Groovy. [http://groovy.codehaus.org].


  60