Semantic Digitization of Experimental Data in Biological Sciences

                                                      Saurabh Raghuvanshi
                                             Department of Plant Molecular Biology
                                               University of Delhi South Campus
                                                       New Delhi, India
                                               Email: saurabh@genomeindia.org


Abstract— The bulk of published experimental data, referred to as ‘Gold Standard’ data, is
available in a format that cannot be easily accessed by computers unless effectively curated.
Most curation techniques bank on mining the text for information. Here we propose data models
for curating the experimental data itself and demonstrate their efficacy. The data models
facilitate digitization of every aspect of the information associated with the experimental
data. The models utilize several universally accepted ontologies as well as in-house developed
alphanumeric notations for digitizing different aspects of the data. The data models have
sufficient flexibility to address the extensive variability in experimental data. They have a
very generic nature and can be used to curate and digitize experimental data from any organism.
The digitized data is easily stored in a relational database management system and can thus be
rapidly searched and integrated. These models have been successfully used to digitize data from
over 20,000 experiments spanning over 500 research articles on rice biology. The entire dataset
is available as a database entitled ‘Manually Curated Database of Rice Proteins’ at
www.genomeindia.org/biocuration.

Keywords—Digitization, Ontologies, Rice, Gold standard data

I. INTRODUCTION

A ‘Systems’ level understanding of any organism requires integration and analysis of
multi-dimensional experimental data [1]. By its very nature, the underlying experimental data is
extremely diverse and complex. Thus, integration of such data is never straightforward. There
are several issues that impede seamless integration of experimental data. The foremost is the
fact that there is no standard data representation format, especially for the ‘Gold standard’ or
published experimental data [1]. Experimental data coming out of different techniques is
presented in very different formats and thus cannot be easily correlated and integrated.
Moreover, the description of the experimental conditions, biological material etc. is extremely
variable in terms of details, and thus integration of such poorly explained data might not be
very useful. Further, published experimental data is generally presented in a pictorial format,
as an image or a graph, and is thus not amenable to computerized search, let alone seamless
integration. Several of the above-mentioned issues could be partially resolved by extensive
manual curation, but this is very slow and labor intensive and cannot keep up with the huge
amount of experimental data that gets published every year.

II. DESCRIPTION OF THE DATA PRESENTATION FORMATS

Here we address these issues by proposing formats, or data models, for digitization and semantic
representation of the experimental data in biological sciences. These formats have a very
generic nature and are flexible enough to be used for digitization of diverse experimental data.
In essence, these models attempt to represent every aspect of the data in terms of an
alphanumeric notation, which can be an ontology term or a custom notation. As depicted in Fig. 1,
every experimental dataset is composed of several data-units, each of which is usually
represented by a single ‘bar’ of a bar chart or a ‘band’ of a gel profile. Each of these
data-units is associated with two types of information. The first is the actual value of the
data-unit (e.g. the height of a bar), while the other is an array of information such as gene ID,
plant type, tissue, growth condition etc. Depending on the complexity of the experiment, each
data-unit may be associated with up to 10-15 different types of information. As per the data
models, each of these information types is represented by either a custom alphanumeric notation
or an ontology term. Thus, every data-unit is now represented as a collection or group of
alphanumeric terms. This collection can be easily visualized as an array or equation, as
depicted in Fig. 1. Every term of this array is actually stored in a specific relational
database table. Consequently, since experimental data consists of several data-units, there will
be as many groups of alphanumeric notations, which can be easily stored in a relational database
table.

III. ALPHANUMERIC NOTATIONS FOR DATA DIGITIZATION

The data models extensively utilize several ontologies, such as plant ontology, environment
ontology, trait ontology and gene ontology, to digitize various aspects of the information
associated with every data-unit [2][3]. Plant ontology terms are used to describe the plant
part/tissue or developmental stage, whereas environment ontology terms are used to digitize the
growth conditions or any other treatment (chemical or physical) that may have been administered.
Similarly, trait ontology terms define any trait (molecular, biochemical or phenotypic) that has
been studied in the experiment. Besides these ontologies, other notations have also been used
extensively. These include GenBank accessions or locus identifiers, taxon database IDs as well
as several in-house developed notations to represent
information such as promoter type, mutant type, transgenic line etc. Several of the ontology
terms need to be qualified in order to precisely digitize the information. For example, in order
to represent a plant developmental stage, the age of the plant (in days) needs to be added.
Similarly, the concentration or duration of a treatment is appended to environment ontology
terms.

Figure 1. The figure depicts the basic principle of the data model for digitization of
experimental data. Experimental data is usually represented in published research articles as a
graph or image. In the present example, digitization of experimental data from a gene expression
graph is shown. Every bar of the graph is considered a ‘data-unit’. Each data-unit is associated
with several types of information/data such as gene type, transgene, plant type etc. As per the
digitization model, each of these data types can be represented as an alphanumeric notation.
These notations might be an ontology term or any custom notation. Thus, every data-unit can be
represented by a structured array of such notations, while the whole experiment can ultimately
be represented as a collection of many such arrays. Since the notations are alphanumeric, they
can be easily stored in a relational database table.

IV. IMPLEMENTATION OF THE DATA MODELS

Based on the aforementioned principles, manual data curation and digitization portals were
created. These portals were used to digitize experimental data from over 500 published research
articles on rice biology. The entire dataset has been organized as a database entitled ‘Manually
Curated Database of Rice Proteins’ (MCDRP) [4]. The database contains digitized experimental
data pertaining to over 2300 rice proteins, curated from over 20,000 different experiments.
Altogether, these experiments comprise over 80,000 data-units for which the information has been
digitized by manual curation. The digitization of the data has been done by utilizing >600 plant
ontology, >350 environment ontology, >800 trait ontology and >350 gene ontology terms. The
database is updated twice a year.

Since every aspect of the information associated with each data-unit of an experiment has been
digitized as alphanumeric notations, it is possible to rapidly retrieve a single data-unit from
a collection of over 80,000 data-units spread over >500 different research articles. Further,
the same information can be retrieved from several different perspectives. For example, in the
browse page one can easily retrieve all the genes that have been studied in a particular plant
part or developmental stage. Similarly, all genes that have been associated with or studied for
a particular trait can be easily accessed.

V. ISSUES THAT NEED TO BE ADDRESSED

The process of digitization is primarily dependent on the depth of the various ontologies.
During the process of digitization, it was realized that there is an extensive amount of
untapped variability in all aspects of the data that is not completely described by the current
ontologies. Consequently, a significant number of the terms used in the curation endeavor had to
be coined anew (www.genomeindia.org/biocuration). While several new terms had to be coined since
there was no equivalent term, many were formulated since the existing terms had a different
perspective. Thus, there is a requirement for consistent efforts to enrich and modify the
existing ontologies by defining more and more terms. Further, several new ontologies might also
be required, such as an ontology describing different methodologies. This is important since the
interpretation of the data can only be done in light of precise information about the
experimental methodology. It was also observed that the description of the experimental data in
published literature is often not very precise. This has major consequences in terms of data
integration. Thus, usage of precise ontology terms to describe different aspects of the
experimental data must be encouraged.

VI. FUTURE

The data digitization formats for the experimental data in biological sciences enable a very
precise digitization of the data. This makes the experimental data (Gold standard data) amenable
to computerized search and integration. Thus, consistent curation efforts must be made to
digitize the already published experimental data. Further, while the data models are currently
being used for digitization of published experimental data, the main aim is to develop data
curation and exchange formats that can be implemented in every lab itself, even before
publication. Thus, at the time of publication all the experimental data would be in a
pre-defined digitized format that can be easily integrated in any database or public repository.

VII. DISCUSSION

Seamless availability and semantic integration of experimental data is essential to comprehend
the complex behavior of biological systems. The data digitization fundamentals briefly explained
in this article facilitate digitization of almost every aspect of the experimental data and thus
should prove instrumental in achieving a higher understanding of any biological system, if
implemented universally. The basic principle is very generic in nature and can be applied to
data from any biological system. In essence, the data models facilitate digitization of every
data-unit of the experimental data in terms of an organized array of alphanumeric notations. The
structure of the array is important since the same notation can be used in different positions
with different meanings. For example, a plant ontology term can be used to describe the tissue
wherein expression of a particular gene has been studied. It can also be used to qualify a gene
ontology term to associate a particular ‘molecular activity’ with a specialized tissue or
developmental stage. Thus, the same notation can be used to signify different aspects of the
information. This gives flexibility as well as universality to the concept.

One of the basic aims of this endeavor is to develop data digitization, archiving and exchange
formats that can be used at a very early stage (pre-publication) to organize the experimental
data. The idea is to implement these data models as lab data management portals/software. This
will greatly facilitate rapid and seamless access to the experimental data, since there would be
no or very minimal need for post-publication data curation [5]. Consequently, curators can
address the meta-analysis of such data instead of basic archiving.

We regard our study as a step in a much wider area of research, since experimental data has
several dimensions of interpretation. So far we have only addressed the initial digitization and
archiving of the data. Models for correlating the digitized data from different studies need to
be worked out. Nevertheless, it was possible to demonstrate an effective digitization of
experimental data in biological sciences.

ACKNOWLEDGMENT

The author acknowledges financial support from the Department of Biotechnology, Government of
India.

REFERENCES

[1] Rhee, S.Y. and Mutwil, M., “Towards revealing the functions of all genes in plants”,
    Trends Plant Sci., 19: 212–221, 2014.
[2] Jaiswal, P., Ware, D., Ni, J., Chang, K., Zhao, W., Schmidt, S., Pan, X., Clark, K.,
    Teytelman, L., Cartinhour, S., Stein, L., and McCouch, S., “Gramene: Development and
    integration of trait and gene ontologies for rice”, Comp. Funct. Genomics, 3: 132–136, 2002.
[3] Gene Ontology Consortium, “Gene Ontology Consortium: going forward”, Nucleic Acids Res.,
    43: D1049–56, 2014.
[4] Gour, P., Garg, P., Jain, R., Joseph, S.V., Tyagi, A.K., and Raghuvanshi, S., “Manually
    curated database of rice proteins”, Nucleic Acids Res., 42, 2014.
[5] Baumgartner, W.A., Cohen, K.B., Fox, L.M., Acquaah-Mensah, G., and Hunter, L., “Manual
    curation is not sufficient for annotation of genomic databases”, Bioinformatics,
    23: 41–48, 2007.
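
As a rough illustration of the data model described in Section II, the Python sketch below
represents each data-unit as a fixed-order array of alphanumeric notations and stores the arrays
in a relational table. All identifiers here (the DataUnit fields, the qualify helper, the '|'
separator and the PO:/EO:/TO: placeholder terms) are assumptions made for this example and are
not the notations or schema used in MCDRP. A single table is used for brevity, whereas the text
describes each term of the array being stored in its own table.

```python
import sqlite3
from dataclasses import dataclass, astuple


def qualify(term: str, qualifier: str = "") -> str:
    """Append a qualifier (e.g. plant age or treatment dose) to a term.

    The '|' separator is an assumption made for this sketch.
    """
    return f"{term}|{qualifier}" if qualifier else term


@dataclass(frozen=True)
class DataUnit:
    """One bar or band of a published figure, as an ordered array of notations."""
    experiment_id: str
    gene_id: str      # locus identifier or accession
    plant_part: str   # plant ontology term
    dev_stage: str    # plant ontology term qualified with age
    treatment: str    # environment ontology term qualified with dose/duration
    trait: str        # trait ontology term
    value: float      # measured value, e.g. bar height


def build_db(units):
    """Store every data-unit as one row of an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE data_unit (
               experiment_id TEXT, gene_id TEXT, plant_part TEXT,
               dev_stage TEXT, treatment TEXT, trait TEXT, value REAL)"""
    )
    conn.executemany(
        "INSERT INTO data_unit VALUES (?, ?, ?, ?, ?, ?, ?)",
        [astuple(u) for u in units],
    )
    conn.commit()
    return conn


def genes_studied_in(conn, plant_part):
    """Return the distinct genes that have a data-unit for the given plant part."""
    rows = conn.execute(
        "SELECT DISTINCT gene_id FROM data_unit "
        "WHERE plant_part = ? ORDER BY gene_id",
        (plant_part,),
    )
    return [row[0] for row in rows]


# Placeholder notations only; real curation would use actual ontology IDs.
units = [
    DataUnit("EXP_1", "GENE_A", "PO:EXAMPLE_LEAF",
             qualify("PO:EXAMPLE_SEEDLING", "14d"),
             qualify("EO:EXAMPLE_SALT", "200mM;6h"),
             "TO:EXAMPLE_EXPRESSION", 3.2),
    DataUnit("EXP_1", "GENE_A", "PO:EXAMPLE_ROOT",
             qualify("PO:EXAMPLE_SEEDLING", "14d"),
             qualify("EO:EXAMPLE_SALT", "200mM;6h"),
             "TO:EXAMPLE_EXPRESSION", 1.1),
    DataUnit("EXP_2", "GENE_B", "PO:EXAMPLE_LEAF",
             qualify("PO:EXAMPLE_SEEDLING", "21d"),
             "EO:EXAMPLE_CONTROL",
             "TO:EXAMPLE_EXPRESSION", 0.7),
]

conn = build_db(units)
print(genes_studied_in(conn, "PO:EXAMPLE_LEAF"))
```

Because each position of the array maps to a named column, a query of the kind offered by the
browse page (all genes studied in a given plant part) reduces to a simple indexed lookup. The
same pattern would extend to lookups by trait or treatment.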