=Paper=
{{Paper
|id=Vol-1309/paper6
|storemode=property
|title=Penn Medicine biobank informatics
|pdfUrl=https://ceur-ws.org/Vol-1309/paper6.pdf
|volume=Vol-1309
}}
==Penn Medicine biobank informatics==
Penn Medicine Biobank Informatics
OBI Influenced Software Design
Heather Williams, David Birtwell
Penn Medicine BioBank
University of Pennsylvania
Philadelphia, PA USA
hwilli@upenn.edu, birtwell@upenn.edu
Abstract: We present a use case of the Ontology for Biomedical Investigations [1] (OBI) informing the software design of a suite of biobanking applications. We describe how OBI has influenced the design of the Penn Medicine BioBank applications that support the collection, processing, and storage of biobank specimens and our work in creating a robust search system over data produced by BioBank applications and other sources. We show that applications that have been designed with the tenets of OBI in mind, particularly those of being reality based and modeling events as OBI style processes, have proven to effectively express richly interconnected data and be easily extendable.

Keywords: BFO; Biobanking; OBI; Ontology; Process; Search; Software Design

I. INTRODUCTION

Bio-specimens and the data gained by their analysis are valuable resources for bio-medical investigators. Biobanks, collections of bio-specimens (specimens) made available for research, are of extreme importance to investigators, because they can provide a large enough sample size to perform robust statistical analysis and can be used to find specimens with rare genotypes or phenotypes of interest. Information associated with specimens in biobanks and the subjects from whom the specimens were collected is frequently as important to research as the information gleaned from specimen analysis. Information technology such as databases and web application frameworks provide basic support for the storage and retrieval of biobank information. However, these technologies do not provide models for complex bio-medical data. Modeling such rich interconnected data remains a challenge for bio-medical investigators and informaticians, one that must be overcome for specimen based research to reach its full potential.

The Penn Medicine BioBank (PMBB) enables biomedical research by providing centralized access to a large number of annotated blood and tissue specimens. The Penn Medicine BioBank Informatics Team has been tasked with supporting this initiative by creating the informatics infrastructure to enable the collection, processing, and storage of specimens and associated subject data, and making the biomedical and demographic information associated with its subjects and specimens readily available to the research community. The information must be easily accessible, discoverable, and query-able, and data provenance must be maintained.

Since 2013, the PMBB informatics infrastructure has been implemented by a suite of biobanking applications collectively called Squash that are founded on OBI concepts with the aim of presenting and interacting with biobank information in a semantically rich ontology adherent manner. We designed our data model to follow patterns and conventions established by OBI, its higher order ontology Basic Formal Ontology [8] (BFO), and the OBO Relation Ontology [9]. BFO is a theory of the basic structures of reality currently being developed at the Institute for Formal Ontology and Medical Information Science (IFOMIS) at the University of Leipzig [11]. The OBO Relation Ontology provides guidelines for creating ontologies with consistent relational assertions.

To date, we have implemented a web based specimen collection and processing application named Pumpkin that makes heavy use of the concept of a process [2] and have prototyped a query system that searches over OBI annotated data. We have found that modeling events such as pre-storage specimen processes like aliquoting, centrifugation, and freezing as processes with specific end-points, inputs, and outputs, has led to a powerful application with an expressive data model that reflects reality and is easily transformable to an ontology friendly format. We also found that keeping our data model reality based following the example set by OBI has resulted in a data model over which it is easy to reason and that facilitates organic extension.

II. METHODS

A. OBI Driven Software Design

Pumpkin was developed using the web application framework Grails [3] and is written in Groovy [4] and Java [5] using MySQL [6] as the relational database backend. Pumpkin supports the specimen collection process from the initial creation of specimen collection packets, through the processing and ultimate storage of specimens.

Grails incorporates an object-relational mapping (ORM) powered by Hibernate [7] that provides an abstraction layer over relational databases. Instead of creating tables with fields and foreign keys, one creates inter-related domain classes that specify the database schema. Data are written and read from the database at the Groovy object level rather than via SQL.

Using the rich Grails ORM, we were able to model our persistent data in a manner very similar to the way classes are defined in an ontology like OBI -- reality based with class inheritance.

When designing our data model we considered the concepts represented in OBI and the relationships between them and theorized how new concepts would be represented as a guide to designing persistent domain objects. In this way, OBI informs both the software architecture and the structure of data that is created by the application. Some examples of OBI terms that were modeled as domain objects are the concepts of protocol, specimen collection, containers, and specimens.

Fig. 1. The specimen process architecture expressed here in an informal graph exemplifies how OBI concepts influenced the design of Pumpkin. Specimens, SpecimenContainers, and SpecimenProcesses are all modeled as persistent domain objects.

The concept of a process heavily influenced our design. From the BFO concept of a process, we included both start and end times in our process domain classes. The OBI relationships has_specified_input and has_specified_output are implemented as well. For example, we modeled a domain super-class SpecimenProcess that includes input specimens, output specimens, start and end times, and a user (participant). Subclasses of SpecimenProcess include common specimen processes like aliquot, spin (centrifugation), dilute, and trash. Given specimen processes modeled in this way, each specimen is part of a directed specimen process graph that starts with a specimen extraction process and terminates with processes that have output specimens bound for storage.

[Fig. 2 graphic. Labels recoverable from the figure: processes Specimen Extraction, Spin FICOLL, Dilute PBS, Aliquot, Annotate; specimens S11 (Blood), S21 (Buffy), S22 (Buffy), S23 (Plasma), S31 (Buffy), S32 (Buffy); edge legend is_input_specimen, has_output_specimen.]

Fig. 2. An example of a typical specimen workflow showing specimen processes and their input and output specimens as modeled in Pumpkin following OBI guidelines. S11 is the primogenitor specimen. Because specimen processes are explicitly modeled, information can be directly associated with processes and specimens or inferred via the graph. For example, information pertaining to the specimen extraction of S11, like the study subject, is directly associated only with S11 and discoverable for derivative specimens by graph traversal.

B. OBI Annotated Data Search

We have developed a prototype search system that implements a natural language query interface over OBI annotated data. Ontology experts analyzed several small existing biomedical data sets and created a mapping between the data and concepts in the OBI ontology. D2RQ [10] was used to present these annotated data as a SPARQL endpoint.

To enable natural-language-like queries (NLQ), a pipeline following the standard programming language compilation process was created. An NLQ is first parsed as per a fully specified context free grammar. The resulting parse tree is fed to an interpreter that creates a logical query representation. A query generator takes as input this logical query representation and generates a SPARQL query that is run against the NLQ query endpoint.

III. RESULTS

Pumpkin has been in production since June 2013 and to date has stored over 90,000 specimens from over 8,000 collections. Its design has proven to be adequate to handle our initial collection specifications and be easily extendable to additional processes and concepts, such as new specimen and collection attributes. Since the data model is reality-based and expressive as it is in graph form, it provides a common representation for all biobank related data, independent of individual lab nomenclature and idiosyncrasies. Because the
data are stored in a harmonized data model, no transformation is required to query across these data.

IV. DISCUSSION

Early in the requirements gathering and design process, it became clear that one of the primary difficulties of biobanking informatics is the heterogeneity and interconnectedness of the information involved. Application developers are mostly unaccustomed to modeling entities and processes as diverse and complex as those found in biology and biobanking. While the volume of data is small in modern terms, the complexity and fragility are great. In order to remain useful, each bit of information concerning a biological process must be richly explained, which often means complex links to other bits of information and semantic definitions. Our development team, staffed with computer science and math majors, found itself ill equipped to meet the challenge of modeling the information of a robust biobanking informatics landscape. Traditional data modeling techniques as they apply to relational and document databases fall short. It was only after several months acquainting ourselves with OBI and ontology concepts in general that we were able to see a path to an informatics system that would provide data expressivity equal to the task. What ensued was the implementation of a web based biobanking application designed from the ground up to be OBI compliant.

Two tenets of OBI stand out as particularly significant. The first is the dedication to remaining reality based. It is often more convenient to model data for a given requirement in a way that satisfies that requirement only, usually following the path of least resistance of the implementation technology, than it is to adhere to a reality based model. From the outset, we committed ourselves to a reality based data model following the example set by OBI. While this commitment did prove difficult and seemed unnecessarily so at times, inevitably it led to an understandable and often surprisingly easily extendable data model. The second is our choice to model events as BFO style processes, occurrents with temporal boundaries, following the OBI convention of including process inputs and outputs. It was unclear at the outset that this approach would lead to an improved data model. We found, however, that much like our commitment to remaining reality based, modeling our processes in this way resulted in an understandable and easily extendable data model.

This approach has not been without challenges. One notable difficulty arose around efficient information retrieval from the database. To get the full data for a particular specimen, the specimen process graph must be generated, which in our initial implementation required recursive domain class traversals resulting in an explosion of computationally expensive database calls. We addressed this issue via shortcut pointers in the database. In most instances the data needed for a particular specimen are associated with either the specimen itself or its primogenitor specimen. To alleviate the computational load of traversing the process graph for common tasks, each specimen was assigned a direct pointer to its primogenitor specimen, allowing single database queries rather than recursive searches.

In the hopes of finding a more general solution to efficient data retrieval, we are experimenting with mirroring our data in a graph database. Graph databases are designed to store and operate efficiently over data in graph format and may provide a mechanism to perform efficient reads of our data.

In addition to specimen processes, we have loosely modeled the concept of a ‘task’ to follow the OBI methodology as a time-based process with inputs and outputs. Currently, we model specimen intake as a task. In the future, other tasks will be included. As with specimen processes, task data will be expressed in graph form, so the same efficiency considerations will apply and the same methods will be used.

Still to be developed is Carnival, the system that will tie together the subject and specimen data generated by Squash applications with data from other sources and present them in a discoverable and query-able format. This will be an expansion of the prototype natural language query tool. The data generated by and stored in Squash applications exist at rest in a form that is compatible with OBI. We plan to annotate any additional data from sources outside Squash with OBI terms and provenance information in order to create a unified search endpoint.

Through our experiences attempting to create ontology adherent database applications, we have gained an appreciation for the valuable work that has been and continues to be done in ontology development. We suspect that the perceived value of ontologies within the biomedical research community will increase over time as those outside the immediate ontology community learn the contributions that ontologies like OBI and BFO can make towards their efforts. It remains to be seen whether ontology influenced software design will be adopted by the broader software development community, but if there is continued success of the Penn Medicine BioBank, it will be due in large part to the influence ontologies have had on our software development team.
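
The specimen process model of Section II.A and the primogenitor shortcut pointer of Section IV can be illustrated with a minimal sketch. It is written in Python for brevity rather than as the Grails/Groovy domain classes Pumpkin uses, and it is not the authors' code. All class, field, and function names are hypothetical, as are the dates, the user name, and the subject identifier. The lineage loosely follows Fig. 2, but deriving S31 and S32 from S21 by an aliquot process is an assumption made only for the example. In memory each upstream hop is cheap; in the system described above each hop would correspond to a database call.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(eq=False)
class Specimen:
    label: str
    kind: str
    produced_by: Optional["SpecimenProcess"] = None  # process that output this specimen
    primogenitor: Optional["Specimen"] = None        # shortcut pointer to the root specimen
    subject_id: Optional[str] = None                 # stored only on the primogenitor


@dataclass(eq=False)
class SpecimenProcess:
    """Process with temporal boundaries, a participant, inputs, and outputs."""
    start: datetime
    end: datetime
    participant: str
    inputs: list[Specimen] = field(default_factory=list)
    outputs: list[Specimen] = field(default_factory=list)

    def add_output(self, label: str, kind: str) -> Specimen:
        specimen = Specimen(label, kind, produced_by=self)
        # Assumption: a single lineage, so the first input determines the root.
        # A specimen produced without inputs (extraction) is its own primogenitor.
        root = self.inputs[0].primogenitor if self.inputs else None
        specimen.primogenitor = root if root is not None else specimen
        self.outputs.append(specimen)
        return specimen


class SpecimenExtraction(SpecimenProcess):
    pass


class Spin(SpecimenProcess):
    pass


class Aliquot(SpecimenProcess):
    pass


def find_primogenitor_by_traversal(specimen: Specimen) -> tuple[Specimen, int]:
    """Walk upstream through the process graph; return (root, number of hops)."""
    hops = 0
    current = specimen
    while current.produced_by is not None and current.produced_by.inputs:
        current = current.produced_by.inputs[0]
        hops += 1
    return current, hops


def subject_of(specimen: Specimen) -> Optional[str]:
    """Single lookup through the shortcut pointer instead of a graph walk."""
    return specimen.primogenitor.subject_id


# Hypothetical workflow loosely following Fig. 2.
t0, t1 = datetime(2014, 1, 1, 9, 0), datetime(2014, 1, 1, 9, 30)
extraction = SpecimenExtraction(t0, t1, participant="tech1")
s11 = extraction.add_output("S11", "Blood")
s11.subject_id = "SUBJ-001"

spin = Spin(t0, t1, participant="tech1", inputs=[s11])
s21 = spin.add_output("S21", "Buffy")
s22 = spin.add_output("S22", "Buffy")
s23 = spin.add_output("S23", "Plasma")

aliquot = Aliquot(t0, t1, participant="tech1", inputs=[s21])
s31 = aliquot.add_output("S31", "Buffy")
s32 = aliquot.add_output("S32", "Buffy")
```

In this sketch, `find_primogenitor_by_traversal(s32)` reaches S11 after two upstream hops, while `subject_of(s32)` reads the same information through one pointer dereference. The trade-off is that the pointer must be set whenever a process creates an output specimen.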
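
The natural language query pipeline of Section II.B (parse, interpret, generate SPARQL) can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: the three-rule grammar, the `LogicalQuery` fields, the `ex:` prefix with its example.org namespace, and the property names `ex:specimenType` and `ex:collectedFrom` are all hypothetical, and no real OBI term identifiers are used. The sketch stops at producing the query string; sending it to a SPARQL endpoint is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Toy context free grammar (hypothetical):
#   query   := "find" "specimens" [type] [subject]
#   type    := "of" "type" WORD
#   subject := "from" "subject" WORD


def parse(nlq: str) -> dict:
    """Stage 1: parse the query text into a small parse tree."""
    tokens = nlq.split()
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos].lower() if pos < len(tokens) else None

    def expect(keyword: str) -> None:
        nonlocal pos
        if peek() != keyword:
            raise ValueError(f"expected {keyword!r} at token {pos}")
        pos += 1

    def word() -> str:
        nonlocal pos
        if pos >= len(tokens) or not re.fullmatch(r"[\w-]+", tokens[pos]):
            raise ValueError(f"expected a plain word at token {pos}")
        pos += 1
        return tokens[pos - 1]

    tree = {"node": "query", "children": []}
    expect("find")
    expect("specimens")
    if peek() == "of":
        expect("of")
        expect("type")
        tree["children"].append({"node": "type", "value": word()})
    if peek() == "from":
        expect("from")
        expect("subject")
        tree["children"].append({"node": "subject", "value": word()})
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


@dataclass(frozen=True)
class LogicalQuery:
    specimen_type: Optional[str] = None
    subject: Optional[str] = None


def interpret(tree: dict) -> LogicalQuery:
    """Stage 2: turn the parse tree into a logical query representation."""
    values = {child["node"]: child["value"] for child in tree["children"]}
    return LogicalQuery(values.get("type"), values.get("subject"))


def to_sparql(query: LogicalQuery) -> str:
    """Stage 3: generate a SPARQL query string from the logical query."""
    lines = [
        "PREFIX ex: <http://example.org/biobank#>",
        "SELECT ?specimen WHERE {",
        "  ?specimen a ex:Specimen .",
    ]
    if query.specimen_type is not None:
        lines.append(f'  ?specimen ex:specimenType "{query.specimen_type}" .')
    if query.subject is not None:
        lines.append(f'  ?specimen ex:collectedFrom "{query.subject}" .')
    lines.append("}")
    return "\n".join(lines)


example = to_sparql(interpret(parse("find specimens of type plasma from subject S-001")))
```

Keeping the logical query as its own stage means the grammar and the query generator can change independently, which mirrors the compiler-style separation described in Section II.B.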

REFERENCES

[1] Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA, Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J; OBI consortium. (2010) Modeling biomedical experimental processes with OBI. J Biomed Semantics. 2010 Jun 22;1 Suppl 1:S7. PMID: 20626927
[2] Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, Musen MA. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W541-5. Epub 2011 Jun 14.
[3] Grails. [https://grails.org].
[4] Groovy. [http://groovy.codehaus.org].
[5] Java. [http://www.java.com].
[6] MySQL. [http://www.mysql.com].
[7] Hibernate. [http://hibernate.org].
[8] Grenon P, Smith B, Goldberg L. Ontologies in Medicine. IOS Press; 2004. Biodynamic Ontology: Applying BFO in the Biomedical Domain. pp.20–32.
[9] Smith B, Ceusters W, Klagges B. Relations in Biomedical Ontologies. Genome Biology. 2005;6:R46. doi: 10.1186/gb-2005-6-5-r46.
[10] D2RQ [http://d2rq.org/].
[11] Grenon, P. and Smith, B. (2004) “SNAP and SPAN: Towards Dynamic Spatial Ontology”, Spatial Cognition and Computation, 4:1, 69-103.