Semantic Digitization of Experimental Data in Biological Sciences

Saurabh Raghuvanshi
Department of Plant Molecular Biology
University of Delhi South Campus
New Delhi, India
Email: saurabh@genomeindia.org

Abstract— A major bulk of published experimental data, referred to as ‘Gold Standard’ data, is available in a format that cannot be easily accessed by computers unless effectively curated. Most curation techniques rely on mining the text for information. Here we propose and demonstrate the efficacy of curating the experimental data itself. The data models facilitate digitization of every aspect of the information associated with the experimental data. The models utilize several universally accepted ontologies as well as in-house developed alphanumeric notations for digitizing different aspects of the data. The data models have sufficient flexibility to address the extensive variability in experimental data. They are generic in nature and can be used to curate and digitize experimental data from any organism. The digitized data is easily stored in a relational database management system and can thus be rapidly searched and integrated. These models have been successfully used to digitize data from over 20,000 experiments spanning over 500 research articles on rice biology. The entire dataset is available as a database entitled ‘Manually Curated Database of Rice Proteins’ at www.genomeindia.org/biocuration.

Keywords—Digitization, Ontologies, Rice, Gold standard data

I. INTRODUCTION

A ‘Systems’ level understanding of any organism requires integration and analysis of multi-dimensional experimental data [1]. By its very nature, the underlying experimental data is extremely diverse and complex. Thus, integration of such data is never straightforward. There are several issues that impede seamless integration of experimental data. The foremost is the fact that there is no standard data representation format, especially for the ‘Gold standard’ or published experimental data [1]. Experimental data coming out of different techniques is presented in very different formats and thus cannot be easily correlated and integrated. Moreover, the description of the experimental conditions, biological material etc. is extremely variable in terms of detail, and thus integration of such poorly explained data might not be very useful. Further, in general, the published experimental data is presented in pictorial format, either as an image or a graph, and is thus not amenable to computerized search, let alone seamless integration. Several of the above-mentioned issues could be partially resolved by extensive manual curation, which is very slow and labor intensive and cannot keep up with the huge amount of experimental data that gets published every year.

II. DESCRIPTION OF THE DATA PRESENTATION FORMATS

Here we address these issues by proposing formats or data-models for digitization and semantic representation of the experimental data in biological sciences. These formats are generic in nature and are flexible enough to be used for digitization of diverse experimental data. In essence, these models attempt to represent every aspect of the data in terms of an alphanumeric notation, which can be an ontology term or a custom notation. As depicted in Fig. 1, each set of experimental data is composed of several data-units, each of which is usually represented by a single ‘bar’ of a bar-chart or a ‘band’ of a gel profile. Each of these data-units is associated with two types of information. The first is the actual value of the data-unit (height of bar), while the other is an array of information such as gene id, plant type, tissue, growth condition etc. Depending on the complexity of the experiment, each data-unit may be associated with up to 10-15 different types of information. As per the data-models, each of these information types is represented by either a custom alphanumeric notation or an ontology term. Thus, every data-unit is now represented as a collection or group of alphanumeric terms. This collection can be easily visualized as an array or equation, as depicted in Fig. 1. Every term of this array is stored in a specific relational database table. Consequently, since experimental data consists of several data-units, there will be as many groups of alphanumeric notations, which can be easily stored in a relational database table.

III. ALPHANUMERIC NOTATIONS FOR DATA DIGITIZATION

The data models extensively utilize several ontologies, such as plant ontology, environment ontology, trait ontology as well as gene ontology terms, to digitize various aspects of the information associated with every data-unit [2][3]. Plant ontology terms are used to describe the plant part/tissue or developmental stage, whereas environment ontology terms are used to digitize the growth conditions or any other treatment (chemical or physical) that may have been administered. Similarly, trait ontology terms define any trait (molecular, biochemical or phenotypic) that has been studied in the experiment. Besides these ontologies, other notations have also been used extensively. Some of these include GenBank accessions or locus identifiers, taxon db ids, as well as several in-house developed notations to represent information such as promoter type, mutant type, transgenic line etc. Several of the ontology terms need to be qualified in order to precisely digitize the information. For example, in order to represent plant developmental stage, the age of the plant (in days) needs to be added. Similarly, the concentration or duration of treatment is appended to environment ontology terms.

Figure 1. The figure depicts the basic principle of the data model for digitization of experimental data. Usually experimental data is represented in published research articles as a graph or image. In the present example, digitization of experimental data from a gene expression graph is shown. Every bar of the graph is considered as a ‘data-unit’. Each data-unit is associated with several types of information/data such as gene type, transgene, plant type etc. As per the digitization model, each of these data types can be represented as an alphanumeric notation. These notations might be an ontology term or any custom notation. Thus, every data-unit can be represented by a structured array of such notations, while the whole experiment can be ultimately represented as a collection of many such arrays. Since the notations are alphanumeric, they can be easily stored in a relational database table.

IV. IMPLEMENTATION OF THE DATA MODELS

Based on the aforementioned principles, manual data curation and digitization portals were created. These portals were used to digitize experimental data from over 500 published research articles on rice biology. The entire data set has been organized as a database entitled ‘Manually Curated Database of Rice Proteins’ (MCDRP) [4]. The database contains digitized experimental data pertaining to over 2300 rice proteins that has been curated from over 20,000 different experiments. Altogether, these experiments have over 80,000 data-units for which the information has been digitized by manual curation. The digitization of the data has been done by utilizing >600 plant ontology, >350 environment ontology, >800 trait ontology and >350 gene ontology terms. The database is updated twice a year.

Since every aspect of information associated with a data-unit of an experiment has been digitized as alphanumeric notations, it is possible to rapidly retrieve a single data-unit from a collection of over 80,000 data-units spread over >500 different research articles. Further, the same information can be retrieved from several different perspectives. For example, in the browse page one can easily retrieve all the genes that have been studied in a particular plant part or developmental stage. Similarly, all genes that have been associated with or studied for a particular trait can be easily accessed.

V. ISSUES THAT NEED TO BE ADDRESSED

The process of digitization is primarily dependent on the depth of various ontologies. During the process of digitization, it was realized that there is an extensive amount of untapped variability in all aspects of data that is not completely described by the current ontologies. Consequently, a significant number of terms used in the curation endeavor had to be coined anew (www.genomeindia.org/biocuration). While several new terms had to be coined since there was no equivalent term, many were formulated since the existing terms had a different perspective. Thus, there is a requirement for consistent efforts to enrich and modify the existing ontologies by defining more and more terms. Further, several new ontologies might also be required, such as an ontology describing different methodologies. This is important since the interpretation of the data can only be done in light of precise information about the experimental methodology.

It was also observed that, many times, the description of the experimental data in published literature is not very precise. This has major consequences in terms of data integration. Thus, usage of precise ontology terms to describe different aspects of the experimental data must be encouraged.

VI. FUTURE

The data digitization formats for the experimental data in biological sciences enable a very precise digitization of the data. This makes the experimental data (Gold standard data) amenable to computerized search and integration. Thus, consistent curation efforts must be made to digitize the already published experimental data. Further, while the data models are currently being used for digitization of published experimental data, the main aim is to develop data curation and exchange formats that can be implemented in every lab itself, even before publication. Thus, at the time of publication all the experimental data would be in a pre-defined digitized format that can be easily integrated in any database or public repository.

VII. DISCUSSION

Seamless availability and semantic integration of experimental data is essential to comprehend the complex behavior of biological systems. The data digitization principles briefly explained in this article facilitate digitization of almost every aspect of the experimental data and thus should prove instrumental in achieving a higher understanding of any biological system, if implemented universally. The basic principle is very generic in nature and can be applied to data from any biological system. In essence, the data models facilitate digitization of every data-unit of the experimental data in terms of an organized array of alphanumeric notations. The structure of the array is important since the same notation can be used in different positions with different meanings. For example, a plant ontology term can be used to describe the tissue wherein expression of a particular gene has been studied. It can also be used to qualify a gene ontology term to associate a particular ‘molecular activity’ with a specialized tissue or developmental stage. Thus, the same notation can be used to signify different aspects of the information. This gives flexibility as well as universality to the concept.

One of the basic aims of this endeavor is to develop data digitization, archiving and exchange formats that can be used at a very early stage (pre-publication) to organize the experimental data. The idea is to implement these data models as lab data management portals/software. This will greatly facilitate rapid and seamless access to the experimental data since there would be no or very minimal need for post-publication data curation [5]. Consequently, curators can address the meta-analysis of such data instead of basic archiving.

We regard our study as one step in a much wider area of research, since experimental data has several dimensions of interpretation. So far we have only addressed the initial digitization and archiving of the data. Models for correlating the digitized data from different studies need to be worked out. Nevertheless, it was possible to demonstrate an effective digitization of experimental data in biological sciences.

ACKNOWLEDGMENT

The author acknowledges financial support from the Department of Biotechnology, Government of India.

REFERENCES

[1] Rhee, S.Y. and Mutwil, M., “Towards revealing the functions of all genes in plants”, Trends Plant Sci., 19: 212–221, 2014.
[2] Jaiswal, P., Ware, D., Ni, J., Chang, K., Zhao, W., Schmidt, S., Pan, X., Clark, K., Teytelman, L., Cartinhour, S., Stein, L., and McCouch, S., “Gramene: Development and integration of trait and gene ontologies for rice”, Comp. Funct. Genomics, 3: 132–136, 2002.
[3] “Gene Ontology Consortium: going forward”, Nucleic Acids Res., 43: D1049–56, 2014.
[4] Gour, P., Garg, P., Jain, R., Joseph, S. V., Tyagi, A.K., and Raghuvanshi, S., “Manually curated database of rice proteins”, Nucleic Acids Res., 42, 2014.
[5] Baumgartner, W.A., Cohen, K.B., Fox, L.M., Acquaah-mensah, G., and Hunter, L., “Manual curation is not sufficient for annotation of genomic databases”, 23: 41–48, 2007.
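
The storage scheme described in Section II, in which each data-unit is an array of alphanumeric notations held in relational tables, can be illustrated with a minimal sketch. This is not the MCDRP schema: the table names, column names, and every identifier used here (`GENE-A`, `PO-ROOT`, `ENV-SALT`, `EXP-0001` and so on) are hypothetical placeholders chosen for illustration, and SQLite merely stands in for any relational database management system. The sketch stores a value per data-unit, attaches notations by position with an optional qualifier, and then retrieves all genes studied in a given plant part.

```python
import sqlite3

# Hypothetical schema (an assumption, not the MCDRP design):
# one row per data-unit, and one row per notation attached to it.
SCHEMA = """
CREATE TABLE data_unit (
    unit_id       INTEGER PRIMARY KEY,
    experiment_id TEXT NOT NULL,
    value         REAL NOT NULL      -- e.g. the height of one bar
);
CREATE TABLE annotation (
    unit_id   INTEGER NOT NULL REFERENCES data_unit(unit_id),
    position  TEXT NOT NULL,         -- role of the notation in the array
    notation  TEXT NOT NULL,         -- ontology term or custom notation
    qualifier TEXT                   -- e.g. plant age or treatment dose
);
"""


def build_demo_db():
    """Create an in-memory database holding three placeholder data-units."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO data_unit VALUES (?, ?, ?)",
        [(1, "EXP-0001", 1.0), (2, "EXP-0001", 4.2), (3, "EXP-0002", 0.7)],
    )
    # All identifiers are placeholders, not real ontology terms or loci.
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?)",
        [
            (1, "gene", "GENE-A", None),
            (1, "tissue", "PO-ROOT", "14 days"),
            (1, "treatment", "ENV-SALT", "200 mM"),
            (2, "gene", "GENE-A", None),
            (2, "tissue", "PO-LEAF", "14 days"),
            (3, "gene", "GENE-B", None),
            (3, "tissue", "PO-ROOT", "21 days"),
        ],
    )
    return conn


def genes_studied_in(conn, tissue_term):
    """Return the sorted, distinct gene notations of data-units in a tissue."""
    rows = conn.execute(
        """
        SELECT DISTINCT g.notation
        FROM annotation AS g
        JOIN annotation AS t ON t.unit_id = g.unit_id
        WHERE g.position = 'gene'
          AND t.position = 'tissue'
          AND t.notation = ?
        ORDER BY g.notation
        """,
        (tissue_term,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(genes_studied_in(db, "PO-ROOT"))
```

Keeping the `position` column separate from the `notation` column reflects the point made in the Discussion that the same notation can carry a different meaning depending on where it sits in the array.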