=Paper= {{Paper |id=Vol-2994/paper5 |storemode=property |title=snpStorage: a database to store and share experimental SNPs data sets |pdfUrl=https://ceur-ws.org/Vol-2994/paper5.pdf |volume=Vol-2994 |authors=Giuseppe Agapito,Mario Cannataro |dblpUrl=https://dblp.org/rec/conf/sebd/AgapitoC21 }} ==snpStorage: a database to store and share experimental SNPs data sets== https://ceur-ws.org/Vol-2994/paper5.pdf
snpStorage: a database to store and share
experimental SNPs data sets
Giuseppe Agapito1,3 , Mario Cannataro2,3
1
  Dep. of Legal, Economic and Social Sciences, University Magna Græcia,Catanzaro, 88100, Italy
2
  Dep. of Medical and Surgical Sciences, University Magna Græcia, Catanzaro, 88100, Italy
3
  Data Analytics Research Center, University Magna Græcia, Catanzaro, 88100, Italy


                                         Abstract
                                         The last years have seen a continuous improvement in sequencing and genotyping technologies. These
                                         improvements have spawned a rapid advancement in bioinformatics databases and software relating
                                         to the collection and analysis of genetic data. Genetic data holds several types of variations such as:
                                         indels, microsatellites, copy number variants (CNV), and Single Nucleotide Polymorphisms (SNPs). SNPs are
                                         the most abundant type of genetic variation and are now the principals raw material underlying most
                                         genetic studies and databases. SNPs can be used as potential biomarkers to investigate cancer growth
                                         or progression, and planning more effective treatment regimens. To make data quickly accessible to
                                         the users, e.g., to face the burden related to the relation of several experiments and to allow an easy
                                         and controlled sharing of genotyping data among different labs, the need for specialized SNP databases
                                         arises. For these reasons, we developed snpStorage (and related querying service), a database to collect
                                         experimental SNPs data integrated with clinical data if available. snpStorage collects experimental SNPs
                                         data in pharmacogenomics studies, obtained by using DMET (Drug Metabolism Enzymes and Transporters)
                                         platform and subsequently enriched with clinical data. The main contributions of snpStorage are i)
                                         automatic translation, normalization and integration of SNP data sets with clinical data and mapping in
                                         relational schema. ii) A repository where research laboratories can store data (anonymously), merge
                                         and handle SNP data sets for more specific analysis; iii) To provide medical professionals a software
                                         instrument to facilitate clinical decisions based on the individual’s genome, and tailoring health care
                                         services to the patient’s genotype.

                                         Keywords
                                         Databases, Single Nucleotide Polymorphisms (SNPs), SQL, NoSQL, Genomics, Integrative Bioinformatics




1. Introduction
The last years have seen exponential growth in the investigation of genetic and genomic varia-
tions as more genomes have been sequenced, corresponding with the continuous improvement
in sequencing and genotyping technologies.
   These improvements have spawned a rapidly advanced in bioinformatics databases and
softwares relating to the collection, management, and analysis of genomics data.
   Genomics data include the variations that are very common and useful in genomics studies,

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
Envelope-Open agapito@unicz.it (G. Agapito); cannataro@unicz.it (M. Cannataro)
GLOBE https://kmitd.github.io/ilaria/ (M. Cannataro)
Orcid 0000-0002-0877-7063 (G. Agapito); 0000-0001-7116-9338 (M. Cannataro)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
including indels, microsatellites, copy number variants (CNV), and Single Nucleotide Polymor-
phisms (SNPs). SNPs are the most abundant type of genetic variations, and are now the principal
raw material underlying most genetic studies. SNPs are common mutations in the population;
the most common and frequent SNP type is the substitution of a single DNA’s bases. The
DNA’s bases are Adenine, Cytosine, Guanine, and Thymine (while Thymine is replaced from
the Uracil into the RNA) in short the nucleotides are (𝐴, 𝐶, 𝐺, {𝑇 /𝑈 }). Among the other types
of variations SNPs are mainly the easiest to observe and the most useful and widely applied
markers in genetic and epigenetic studies. The continue development and improvement of the
high-throughput assays such as Next Generation Sequencing (NGS), gene-expression and SNP
microarrays, enable the production of massive quantities of data that have to be opportunely
stored to perform further analysis, such as discriminative statistical analysis [1], or data mining
by using association rules [2, 3, 4] to find multiple correlations among SNPs.
   Thus, researchers and clinician-researchers need databases able to store, and retrieve SNPs
data. In addition, the integration of SNP data sets with clinical data, can be used for assessing
disease risks, predict susceptibility, early diagnosing, and planning treatment regimens. Data
collection, arrangement and storage are essential steps in the execution of daily activities
ranging from the individual research level as well as in research areas. Hence, it is mandatory
to permanently store and make data quickly accessible to the users through the employment of
apposite containers such as databases. A database is a collection of data, used to manage and
store information for any domain of interest.
   Currently, genomics studies results are stored in flat textual files (not structured), and memo-
rized locally into the research storage-system. The use of flat text files put some limitations such
as: it makes difficult to integrate data referring to the same genomic study, to share data among
researchers and research laboratories, excluding or making challenging the production of bigger
and more informative datasets obtained by sharing and merging multiple SNP data sets. Thus,
the need to represent not structured experimental genomics data in structured format arises.
Goal that can be accomplished by mapping the experimental genomics information contained
into the flat files in a structured model by means of a relational databases management system
(RDBMS).
   Thus, in this scenario where the data to handle are of different types, alongside the classic
relational database management systems (RDBMS) [5], have been developed several different
kinds of database management systems (DBMS) known as NoSQL (No Structured Query Lan-
guage) [6]. RDBMS are relational data management systems based primarily on tables and use
the SQL (Structured Query Language) language for data manipulation. Conversely, NoSQL
databases are not relational data management systems, they are not based on tables to represent
the association among data, and do not use SQL for data manipulation. Examples of NoSQL
databases are graph databases, employed to handle data coming from social networks, biological
networks (proteins interaction networks and biological pathways), where data do not require a
relational model to be handled.
   Databases have become essential for genetics, proteomics and interactomics studies.
   The available public online SNP databases are a source of information from which researchers
can readily retrieve and use data to design in silico studies. In silico studies allow researchers to
improve the way to perform wet-lab experiments. Because, in silico studies make it possible
to obtain in advance the most likely genomic regions affected by the SNPs to investigate. For
example, the knowledge of the genomic region to investigate avoids to the researchers to
proceed by attempts saving in this way time and money, since a wet lab experiment may cost
thousands of dollars.
   There are different databases that provide information on SNPs data such as dbSNP [7],
HGBASE [8], GBD [9], OMIM [10], and HapMap [11]. All the listed SNP databases represent
a comprehensive, general and reliable catalogue of human SNPs, from which browsing and
searching the core human genome SNP data. The listed databases contain relevant information
for coding regions, including transcription structure, genome maps, bibliography, protein data,
sequences, and the links to the supporting data, to facilitate large-scale studies in association
genetics, pharmaco-genomics etc. On the other hand, the listed databases lack in the capability
to handle experimental SNPs data obtained from clinical studies and enriched with clinical
data. In this paper we present snpStorage a database able to collect experimental SNPs data
obtained by using DMET (Drug Metabolism Enzymes and Transporters) [12] platform and
enriched with clinical information. With respect to the above listed databases, snpStorage is
not a comprehensive catalog of SNPs, but it is an attempt to produce a SNPs repository where
research laboratories can store data (anonymously), merge and handle them, with which to
yield more accurate SNP data sets annotated with clinical data when available, to perform more
in-depth analysis. The goal pursuit by snpStorage is to make the analysis of integrated multiple
SNP datasets more accurate w.r.t small SNP datasets, allowing users to extend and collect all their
data in an unique repository. Current DMET SNPs outputs are arranged in a tabular form with
1931 rows (probes) and a number of column related to the number of patients enrolled in the
study (commonly, in a number less than 100). Thus, snpStorage allows researchers of a research
center or from different research centers, to store and query SNP data sets as well as, to merge
together outcomes from several studies, making it possible to extend a small experimental study
in a more informative one. Enlarging the size of an experimental study, allows to produce more
informative data sets, providing a more accurate description of the study under-investigation,
than analyzing data sets individually.
   The rest of the paper is arranged as follows. Section 2 describes the major popular SNP
databases, Section 3 describes the requirements analysis of snpStorage along with the input
data. Section 4 illustrates how snpStorage has been implemented, highlighting the efficiency in
store and retrieve information. Finally, Section 5 outlines the future directions of snpStorage
and concludes the paper.


2. Related Work
Over the years, many public and private data repositories have been developed to facilitate the
researcher’s studies by freely collecting and sharing curated and inferred SNP data worldwide.
Below, we report the list of some popular SNP databases.

    • dbSNP 1 [7] is a comprehensive and reliable catalogue of human SNP variability with an
      efficient system to cross-reference multiple submissions of the same SNPs from centers
      outside NCB.

   1
       https://www.ncbi.nlm.nih.gov/projects/SNP/
    • The Human Gene Mutation Database (HGMD)2 [13] is a catalog of disease-associated mu-
      tation. In particular, HGMD comprises the following information: SNPs in coding regions,
      regulatory and splicing of relevant regions in human, insertions/deletions, invariant
      regions and complex rearrangements.
    • The Human Genic Bi-Allelic SEquences (HGBASE)3 database [8] collects information
      about SNPs along with the connection among SNP and the affected gene. Data in HGBASE
      are obtained by collecting data from all the available public sources.
    • The Genome Database (GDB)4 [9] is a public data repository and catalogue of human’s
      SNPs, genes and clones. GDB’s data are obtained from the scientific literature and
      collecting submissions from genotyping centers.
    • GenBank5 [14] is a nucleotide sequence database available in NCBI. GenBank is a collec-
      tion of sequences obtained from several different species updated daily. GenBank can
      provide to the users interested in SNP analysis specific context sequence of nucleotides
      to design a genotyping study.
    • HapMap6 [15] provides a wide range of data, but it is focused on SNPs data. HapMap
      allows users to graphically browse the genome to visualize the SNP variability currently
      available. Finally, SNPs characterized by HapMap are linked to their public databases
      from which retrieve more insight related to the specific entry.

   All the listed databases pursue the goal of providing an online catalog containing all the
identified genetic variations, which can be used to guide research in genomics, pharmacoge-
nomics studies with information about the association of genetic variations and phenotypic
traits. Conversely, from the listed database, snpStorage is not only a human SNP catalog. It
provides users with a sandbox where they can store his/her data in an anonymous and secure
place, encouraging collaboration among researchers who belong to the same research centers
and affiliates to different research centers.
   snpStorage’s primary mission is to provide a collection of intuitive procedures that can
automatize complex tasks like data harmonization, integration, and annotation, responsible for
the difficulties encountered by the researcher in sharing their data to start new collaborations. In
addition, to make it even more straightforward, using snpStorage for users unfamiliar with query
languages, e.g., SQL, for data querying, and performing complex tasks such as harmonization,
integration, and annotation, snpStorage will provide all these features interactively through a
customized Graphical User Interface (GUI). In this manner, snpStorage could serve as a valuable
resource for creating more accurate and helpful models to investigate complex conditions from
a broader perspective, contributing to gain more knowledge than analyzing a single data set at
the time.
   Briefly, snpStorage is more than a repository where researchers can only store their exper-
imental SNPs data sets obtained using genomics platforms (i.e., DMET microarrays). Still, it
provides an automatic panel of graphical operations to quickly and intuitively enrich SNP with
   2
     http://www.hgmd.cf.ac.uk/ac/index.php
   3
     http://hgbase.cgr.ki.se
   4
     http://www.gdb.org
   5
     https://www.ncbi.nlm.nih.gov/genbank/
   6
     https://www.genome.gov/10001688/international-hapmap-project
clinical information, joining several experimental SNP data sets, store the newly built data
models, query new data set as well as old data sets, encouraging data sharing.


3. Requirement Analysis
Experimental SNPs data obtained from pharmacogenomics studies using DMET microarrays
are arranged in a tabular format and saved as flat textual files (without a structure).
   The current analysis practice poses some limitations, w.r.t. the possibility to increase the
size of an experiment by merging more files regarding equivalent pharmacogenomics studies.
Moreover, this merging operation has to be done manually by the researchers limiting at the
same time the possibility to share data to obtain more informative data sets. Indeed, statistical
analysis and machine learning techniques produce more accurate and relevant results when
dealing with a huge amount of data than smaller ones. The current DMET outcomes are arranged
in the form of a textual table saved in flat text files. In a generic DMET SNP data set, each row
represents a probe (a well-defined genomic region known as chromosome containing several
genes). In the current version of DMET microarray, there are 1931 probes able to investigate
233 genes involved in pharmacogenomics ADME (Absorption, Distribution, Metabolism, and
Excretion). Instead, the number of columns is variable. The columns represent the number
of patients enrolled in the experiments, the (i-th, j-th) cell contains the SNP detected in the
i-th probes belonging to the j-th subject. Also, the subjects are memorized into the table using
generic labels, making it impossible to harm the subjects’ privacy. An example of DMET SNP
dataset is depicted in Figure 1.
   The possible operations that can be carried out on flat textual files are limited and complex
because, researchers have to know how the data are related among them into the file to manually
extract the information of interest to perform the next analysis. To figure out how the SNPs
are distributed in a specific gene, for example the gene C Y P 7 B 1 that consists of the following
probes: ’ A M _ 1 5 0 5 3 ’ , ’ A M _ 1 5 0 8 5 ’ , ’ A M _ 1 5 0 8 6 ’ , a n d ’ A M _ 1 5 0 9 0 ’ , researchers have to extract for
each subject the values related to those probes, to figure out the presence/absence of a specific
SNP could be used as biomarkers for the condition under investigation. Gene extraction is not
an easy operation due to the flat data representation of the DMET SNP data sets, so it must be
performed manually. In addition, clinical data are stored in different formats and into separate
flat text files, for which it is a not trivial operation to merge properly clinical and SNP data
sets. The current structure and organization of SNP and clinical data sets highlights many
limitations: i) do not allow to represent relationships among data; ii) make it challenging to
share data among researchers into the same research center, as well among different research
center; iii) limit the investigation power of the analysis. Mapping SNP data sets in a relational
database system allows to overcome the actual limitation of flat files representation in particular:
i) it makes it convenient to store and merge more experimental SNP and clinical data sets; ii)
it promotes file sharing and consequently data integration; iii) it allows to perform further
analysis other than classic case/control studies, such as it will enable to discriminate SNPs
and clinical data in all the population. iv) Finally, it allows the extraction of complex relations
among SNPs, diseases, and clinical condition quickly through queries.
4. snpStorage
To promote storing, merging, and sharing of experimental SNPs data sets with the possibility to
integrate SNP data using clinical information, we implemented and developed snpStorage, a
database using the relational database management systems (RDBMS) [5] MySQL server 8.0.13.
snpStorage allows handling SNP data sets obtained by performing experimental investigation
using the DMET microarray platform, and to integrate SNP data with clinical information if
available. All the information related to subjects enrolled in a pharmacogenomics case-control




Figure 1: Illustration of the outcomes obtained form a DMET microarray analysis.


study and stored in snpStorage are anonymized, making it impossible to harm patients’ privacy.
This is possible because snpStorage does not store sensitive information such as name, surname,
address, etc., that could provide some clues on the identity of the subjects.
   Analyzing Figure 1 it is worthy to note that the columns are named using generic labels, i.e.,
”CEU_1, CEU_2 …” making it challenging to associate them with a real subject (except the data
owner).
   To store the SNP data sets permanently and to make it possible and convenient to extend
the stored SNP data sets, it is necessary to map the flat textual SNP data sets into a database
management system, along with all the additional, i.e., clinical information. In this way, we
make it convenient and quick investigate, integrate, and share multiple SNP data sets. The tables
used in snpStorage to map SNP data sets are: Subjects table contains the essential information
necessary to link a subject with the healthcare center of belonging, the pathology under
investigation, the detected SNPs, and the collected clinical data. HealthcareCenters table stores
information about the center, the performed experiments, and the enrolled subjects in every
study. ClinicalInformation table contains the clinical data, the pathology under investigation,
and the information to link clinical data with the subject to which they belong. The pathology
is stored and identified using the International Classification of Disease7 (ICD). To overcome

    7
        https://www.who.int/standards/classifications/classification-of-diseases
the main limitation imposed from the textual files, it is not enough to map the SNP data sets
into the DBMS. It is necessary to relate the relationships between SNPs, diseases, subjects
and laboratories to support complex queries. A possible solution is to group the 1931 probes
using the genes and to link each gene to the Subjects table (let see Figure 2 gene table ”PPP”).
To accomplish the mapping task we used the annotations information provided by DMET
microarray vendor Affymetrix 8 . As a result, we obtained 233 genes table (i.e., one table for
each supported ADME gene from DMET platform), linking all the gene tables to the Subject
table. Finally, the ProbesGenesMapping table contains the mapping between the probes and
the gene of belonging, in the current version it contains only the DMET microarray mapping.
Figure 2 shows the Entity-Relation schema (E/R) implemented to store and collect experimental
DMET data enriched with clinical information. It is worthy to note that it is a simplified schema
because we visualized 1 gene table (e.g., the PPP gene table) to simplify the visualization and
readability of the diagram.




Figure 2: The E/R schema used to implement the data representation model.


  snpStorage stores and efficiently handles SNPs and clinical information. Thus, it can help
    8
     http://www.Affymetrix.com/support/
technical/byproduct.affx?product=dmet_2
researchers to quickly share and integrate data from experimental pharmacogenomics SNP
data sets possibly integrated with clinical data. Also, the integration of multiple SNP data sets,
contribute to increasetypes of analysis to be performed, as well as the accuracy of the results.
More data allow to describe the problem under investigation more accurately, opening new
perspective not reachable by analyzing single small SNP data sets individually.

4.1. Performance evaluation
snpStorage is a relational database composed of 243 small tables. This design choice is due to
the necessity to relate all the SNPs data with the correct probes, the diseases under investigation,
the subjects and the laboratory/ies to which they belong. The relation schema allows obtaining
narrow tables, with the advantage that they fit into the main memory. Thus the querying
operation is faster than retrieving data from the secondary memory. Moreover, the SQL’s
operations such as count, avg, sum as well as group by, order, and so on, are available and can
be computed easily through join operations among tables. The only limitation of this solution
could be that join operations are CPU and RAM-intensive tasks. However, it is worthy to note
that selecting some columns from specific tables does not require all the 243 sub-tables to be
joined. Indeed, we will join only the sub-tables columns referenced in a given query.
    An alternative solution to the multiple small tables could be the use of object-values columns.
The object column type memorizes objects as values, allowing to obtain an enormous number
of virtual columns into a single table. The object-value solution will enable tables to benefit
from the compactness of regular columns while still allowing huge tables to exist, albeit less
efficiently. Object columns require an extra layer of elaboration. The retrieved data have to
be converted into the native data type, introducing extra CPU work to parse the object in the
native type object. The object approach’s advantage is that joining operations among tables is
unnecessary since whole data can be held into a single table.
    To compare the two possible implementations, we created small and object tables, evaluat-
ing the query’s average response time by simulating multiple parallel queries using multiple
parallel threads. Both tables contain the same number of tuples equals to 1, 000. The average
response times refer to the time necessary to perform the queries on both implementation the
small and object tables, respectively. As a benchmark, we used two queries, a simple Select
query and a Count query, monitoring even if the number of simultaneously performed queries
influences the average query time. The select queries are: S E L E C T * F R O M S u b j e c t s _ s m a l l T a b l e
and S E L E C T * F R O M S u b j e c t s _ o b j e c t C o l u m n . The count queries are: S E L E C T C O U N T ( A l l e l e ) F R O M
S u b j e c t s _ s m a l l T a b l e W H E R E A l l e l e = ” A / A ” ; and S E L E C T * F R O M S u b j e c t s _ o b j e c t C o l u m n . To
count in the objectColumn table we need to select all the value and after convert the ob-
ject in the native domain and counting it by using the additional computational layer. The
query times on both representation are reported in Table 1 and Table 2. Analyzing Figure 3
we can see that until the queries retrieve values without being necessary to manipulate it, the
trend querying between small and object tables is quite similar. Running a simple count query
on both models, we can see a slight increase in the response query time in small table querying
(since it added a little extra work-load due to the count function). Conversely, the response
query’s time increase in some order of magnitude when necessary to use the additional layer
of computation to extract the further essential information that returns the count value of the
submitted query.
   In conclusion, snpStorage allows to convert unstrutured SNP data sets in structured data sets,
as well as to take advantage of the native SQL functions to handle the stored data. In this way,
further complex data analysis can be performed faster and in a shorter time since it can handle
directly relevant data without using an additional computational layer to extract the necessary
information.

Table 1
The measured average time in second necessary to perform the Select and Count query on the database
containing small tables implementation, varying the number of threads used.
                               #Th=1    #Th=5    #Th=10    #Th=15    #Th=20
                      Select   4.218     6.36     6.571     8.24      8.105
                      Count    4.383    5.957     6.29      8.009     9.751




Figure 3: Average times of multiple Select and Count queries.



Table 2
The measured average time in second necessary to perform the Select and Count query on the database
based on object column tables implementation, varying the number of threads used.
                               #Th=1    #Th=5    #Th=10    #Th=15    #Th=20
                      Select    4.319    6.277     7.203     8.070    7.477
                      Count    11.147   15.809    16.721    18.259   21.1344
5. Conclusion
SNPs are a relevant source of information spanning from genomics to pharmacogenomics and
personalized medicine. Linking SNP information with clinical data allows spurring light on the
possible hidden relation between the healthy/diseased status of subjects and the arrangement
of the patient’s genome. In this way, SNPs may help scientists assume better the susceptibility
of an individual to a disease or the adverse drug reaction. Since snpStorage stores and effi-
ciently handles SNPs and clinical information, it can help researchers to quickly discriminate
information among SNPs and clinical data into the population under investigation through
simple queries. Future works will be focused on developing and implementing the web interface
of snpStorage, through which analysis pipelines will be made available to help researchers
quickly analyze their data stored into snpStorage without being necessary to use additional
software tools. snpStorage pursues the target to provide an environment where researchers can
perform queries built in a graphical way, without to be necessary to use SQL, on specific genes
a group of genes, or consider detailed clinical information. Thus, the unification of the SNPs
and clinical data represents a step toward defining individual and personalized treatments of
complex diseases such as psychiatric disorders, diabetes, hypertension, and cancer.
   Finally, snpStorage is more than a repository where researchers can only store their exper-
imental SNPs data sets obtained using genomics platforms (i.e., DMET microarrays). Still, it
provides an automatic panel of graphical operations to quickly and intuitively enrich SNP with
clinical information, joining several experimental SNP data sets, store the newly built data
models, query new data set as well as old data sets, encouraging data sharing.


References
 [1] P. H. Guzzi, G. Agapito, M. T. Di Martino, M. Arbitrio, P. Tassone, P. Tagliaferri, M. Can-
     nataro, Dmet-analyzer: automatic analysis of affymetrix dmet data, BMC Bioinformatics 13
     (2012) 258. URL: https://doi.org/10.1186/1471-2105-13-258. doi:1 0 . 1 1 8 6 / 1 4 7 1 - 2 1 0 5 - 1 3 - 2 5 8 .
 [2] G. Agapito, P. H. Guzzi, M. Cannataro, Dmet-miner: Efficient discovery of association
     rules from pharmacogenomic data, Journal of Biomedical Informatics 56 (2015) 273 –
     283. URL: http://www.sciencedirect.com/science/article/pii/S153204641500115X. doi:h t t p s :
     //doi.org/10.1016/j.jbi.2015.06.005.
 [3] G. Agapito, P. H. Guzzi, M. Cannataro, Parallel extraction of association rules from
     genomics data, Applied Mathematics and Computation 350 (2019) 434–446.
 [4] G. Agapito, P. H. Guzzi, M. Cannataro, Parallel and distributed association rule mining
     in life science: A novel parallel algorithm to mine genomics data, Information Sciences
     (2018).
 [5] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, T. G. Price, Access path
     selection in a relational database management system, in: Proceedings of the 1979
     ACM SIGMOD International Conference on Management of Data, SIGMOD ’79, ACM,
     New York, NY, USA, 1979, pp. 23–34. URL: http://doi.acm.org/10.1145/582095.582099.
     doi:1 0 . 1 1 4 5 / 5 8 2 0 9 5 . 5 8 2 0 9 9 .
 [6] J. Han, H. E, G. Le, J. Du, Survey on nosql database, in: 2011 6th International Conference
     on Pervasive Computing and Applications, 2011, pp. 363–366. doi:1 0 . 1 1 0 9 / I C P C A . 2 0 1 1 .
     6106531.
 [7] S. T. Sherry, M.-H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, K. Sirotkin,
     dbsnp: the ncbi database of genetic variation, Nucleic Acids Research 29 (2001) 308–311.
     URL: http://dx.doi.org/10.1093/nar/29.1.308. doi:1 0 . 1 0 9 3 / n a r / 2 9 . 1 . 3 0 8 .
 [8] A. J. Brookes, H. Lehväslaiho, M. Siegfried, J. G. Boehm, Y. P. Yuan, C. M. Sarkar, P. Bork,
     F. Ortigao, Hgbase: a database of snps and other variations in and around human genes,
     Nucleic Acids Research 28 (2000) 356–360. URL: http://dx.doi.org/10.1093/nar/28.1.356.
     doi:1 0 . 1 0 9 3 / n a r / 2 8 . 1 . 3 5 6 .
 [9] S. I. Letovsky, R. W. Cottingham, C. J. Porter, P. W. D. Li, Gdb: The human genome database,
     Nucleic Acids Research 26 (1998) 94–99. URL: http://dx.doi.org/10.1093/nar/26.1.94. doi:1 0 .
     1093/nar/26.1.94.
[10] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, V. A. McKusick, Online mendelian
     inheritance in man (omim), a knowledgebase of human genes and genetic disorders,
     Nucleic Acids Research 33 (2005) D514–D517. URL: http://dx.doi.org/10.1093/nar/gki033.
     doi:1 0 . 1 0 9 3 / n a r / g k i 0 3 3 .
[11] G. A. Thorisson, A. V. Smith, L. Krishnan, L. D. Stein,                                   The international
     hapmap project web site,                             Genome Research 15 (2005) 1592–1593. URL:
     http://genome.cshlp.org/content/15/11/1592.abstract.                                doi:1 0 . 1 1 0 1 / g r . 4 4 1 3 1 0 5 .
     arXiv:http://genome.cshlp.org/content/15/11/1592.full.pdf+html.
[12] M. Arbitrio, M. T. Di Martino, F. Scionti, G. Agapito, P. H. Guzzi, M. Cannataro, P. Tassone,
     P. Tagliaferri, Dmet™(drug metabolism enzymes and transporters): a pharmacogenomic
     platform for precision medicine, Oncotarget 7 (2016) 54028–54050. URL: https://www.ncbi.
     nlm.nih.gov/pubmed/27304055. doi:1 0 . 1 8 6 3 2 / o n c o t a r g e t . 9 9 2 7 .
[13] P. D. Stenson, M. Mort, E. V. Ball, K. Howells, A. D. Phillips, N. S. Thomas, D. N. Cooper,
     The human gene mutation database: 2008 update, Genome Medicine 1 (2009) 13. URL:
     https://doi.org/10.1186/gm13. doi:1 0 . 1 1 8 6 / g m 1 3 .
[14] D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman,
     J. Ostell, E. W. Sayers,                          GenBank,        Nucleic Acids Research 41 (2012)
     D36–D42.               URL:            https://doi.org/10.1093/nar/gks1195.        doi:1 0 . 1 0 9 3 / n a r / g k s 1 1 9 5 .
     arXiv:https://academic.oup.com/nar/article-pdf/41/D1/D36/3680750/gks1195.pdf.
[15] I. H. Consortium, et al., A second generation human haplotype map of over 3.1 million
     snps, Nature 449 (2007) 851.