=Paper=
{{Paper
|id=None
|storemode=property
|title=Weekend Triple Billionaire
|pdfUrl=https://ceur-ws.org/Vol-559/HighlightPoster1.pdf
|volume=Vol-559
|dblpUrl=https://dblp.org/rec/conf/swat4ls/BollemanK09
}}
==Weekend Triple Billionaire==
Maintaining a Large RDF Data Set in the Life Sciences
Jerven Bolleman, Thomas Kappler, and the UniProt Consortium
Swiss Institute of Bioinformatics
jerven.bolleman@isb-sib.ch, thomas.kappler@isb-sib.ch
Abstract. The UniProt Knowledgebase offers both manually curated
and automatically generated information on proteins, and is one of the
leading biological databases. Although it is one of the largest free data
sets available in RDF, our infrastructure and website are not based
on RDF. We present numbers on the volume and growth of UniProt
and show why this volume of data prevents us from using RDF triple stores
and SPARQL with currently available tools.
1 UniProt Data: Nature and Volume
The UniProt Knowledgebase (UniProtKB) [1] consists of two parts:
UniProtKB/Swiss-Prot, containing manually annotated records describing
proteins with information from literature and curator-evaluated
computational analysis, and UniProtKB/TrEMBL, with automatically annotated
records. The UniProt consortium provides several additional data sets:
UniRef [2] with clustered sets of sequences from UniProt, the amino acid
sequence archive UniParc [3], and supporting data sets such as taxonomy
and keywords.
UniProt consists of almost three billion triples (see Table 1), making it one
of the largest freely available RDF data sets. The data is maintained at three
consortium member sites, and must be kept in sync and integrated into one
combined public release.¹
This number of triples consumes considerable hard disk space. Table 2 compares
disk consumption estimates for different RDF stores, based on published
LUBM 8000² results, with the disk consumption of our own solution.
2 Performance
UniProt is released every three weeks. In each such period, there is a time window
of only six days to prepare all source data for publication. When the curators
freeze their work, we need to retrieve all related data from our global
partners. We then convert all source data (42 GB as of release 15.8) to RDF/XML
(180 GB, 18 GB compressed), while validating it. This conversion took around
16 hours for release 15.8 of September 2009.

¹ Available for download at http://www.uniprot.org/downloads.
² LUBM benchmark, http://swat.cse.lehigh.edu/projects/lubm/.

Table 1. Number of triples in UniProt, releases 15.8 and 15.9
{| class="wikitable"
! Data set !! Triples in 15.8 !! Triples in 15.9 !! Difference !! Projection for September 2010
|-
| Citations || 8,612,297 || 8,616,640 || +0.05% ||
|-
| Enzyme || 36,461 || 36,461 || 0% ||
|-
| GO || 195,249 || 195,249 || 0% ||
|-
| Keywords || 7,641 || 7,649 || +0.10% ||
|-
| Mapped citations (a) || 84,308,128 || 82,605,680 || −2.02% ||
|-
| Locations || 4,537 || 4,532 || −0.11% ||
|-
| Pathways || 8,661 || 8,673 || +0.14% ||
|-
| Taxonomy || 3,493,485 || 3,520,871 || +0.78% ||
|-
| Tissues || 6,551 || 6,551 || 0% ||
|-
| UniParc || 640,191,073 || 691,432,967 || +8.00% || 2,560,000,000
|-
| UniProtKB || 1,572,097,256 || 1,610,723,778 || +2.46% || 2,433,000,000
|-
| UniRef || 487,209,989 || 499,947,698 || +2.61% || 775,000,000
|-
| Total || 2,796,164,777 || 2,897,100,198 || +3.61% || 5,768,000,000
|}
(a) Automatically generated, not public.
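The Difference column of Table 1 is the relative change in triple count between releases 15.8 and 15.9. A minimal Python sketch that recomputes it for the largest data sets follows; the triple counts are copied from Table 1, while the function and variable names are illustrative and not part of the UniProt tooling.

```python
# Triple counts (release 15.8, release 15.9) copied from Table 1.
TRIPLES = {
    "UniParc": (640_191_073, 691_432_967),
    "UniProtKB": (1_572_097_256, 1_610_723_778),
    "UniRef": (487_209_989, 499_947_698),
    "Total": (2_796_164_777, 2_897_100_198),
}


def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def growth_table(counts: dict) -> dict:
    """Map each data set name to its percentage change, rounded to two decimals."""
    return {
        name: round(percent_change(old, new), 2)
        for name, (old, new) in counts.items()
    }


if __name__ == "__main__":
    for name, pct in growth_table(TRIPLES).items():
        print(f"{name:10s} {pct:+.2f}%")
```

The same function applies to the smaller data sets; only the four largest rows are included here to keep the sketch short.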
The UniProt.org [4] website and its query engine run on an internally
developed solution using BDB/je³ and Lucene⁴. On a release, data is loaded into
its store in RDF form; the store and query engine themselves, however, are not
RDF-based. Loading and full-text indexing all UniProt data sets took 29 hours
for release 15.8, with 2 GB of memory on a dual-core 2.8 GHz Xeon™ and a
single hard disk. More than 20,000 unique users visit UniProt.org each workday,
averaging 130,000 queries and 1,710,000 direct lookups, while consuming just
under a terabyte of bandwidth a month. On average, 99.9% of queries finish in
less than 0.8 seconds, including the transfer over the internet.
UniProt.org runs on three mirrors, each of which needs to be provisioned with
all data in time for a release. Data size is a major factor in our deployment as we
are limited by upload speed. UniProt 15.8 as needed for the website consumes
39 GB when gzip compressed, and takes four hours to upload. Larger data sizes
increase the risk of transfer failure.
We also have an internal website for our curators and internal software
tools. Because it needs to reflect current work, we rebuild the manually
curated UniProtKB/Swiss-Prot every night with a time window of 3.5 hours for
converting, loading and validating about 160 million triples.
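To make the nightly time window concrete, the following Python sketch computes the sustained throughput a pipeline must reach to process a given number of triples within a fixed window. The 160 million triples and the 3.5 hour window come from the text; the function name is illustrative, and a constant processing rate over the whole window is an assumption.

```python
def required_triples_per_second(triples: int, window_hours: float) -> float:
    """Average rate needed to process `triples` within `window_hours`.

    Assumes a constant rate over the whole window (a simplification).
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return triples / (window_hours * 3600.0)


if __name__ == "__main__":
    # Nightly UniProtKB/Swiss-Prot rebuild: about 160 million triples in 3.5 hours.
    nightly = required_triples_per_second(160_000_000, 3.5)
    print(f"nightly rebuild: {nightly:,.0f} triples/s")
```

Under these assumptions the nightly rebuild has to sustain roughly 12,700 triples per second across conversion, loading and validation combined.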
UniProtKB/TrEMBL with supporting data is converted, loaded, and validated
every Sunday. On this occasion we generate about 1.4 billion triples while
checking the data against our validation rules, making us indeed weekend triple
billionaires. Current generic RDF stores are unable to handle this amount of
data on the limited hardware budget available.

³ http://www.oracle.com/database/berkeley-db/je/index.html
⁴ http://lucene.apache.org/java/docs/index.html
Table 2. Rough disk consumption estimates for UniProt release 15.8. Each estimate
scales the published disk usage per billion triples to the roughly 2.8 billion
triples of release 15.8 (Table 1). The factor 1.4 accounts for the larger number
of long literals in UniProt compared with a LUBM 8000 data set.
{| class="wikitable"
! RDF Store !! Data provided !! Estimate !! Calculation !! Source
|-
| Allegro Graph || 155 GB / 1.1B, LUBM 8000 || 552 GB || 155/1.1 × 2.8 × 1.4 || (a)
|-
| OWLIM || 92 GB / 1.85B, LUBM 8000 || 195 GB || 92/1.85 × 2.8 × 1.4 || (b)
|-
| Oracle 11G || 154 GB / 1.1B, LUBM 8000 || 549 GB || 154/1.1 × 2.8 × 1.4 || (c)
|-
| Virtuoso || 120 GB / 1.1B, various || 305 GB || 120/1.1 × 2.8 (d) || (e)
|-
| UniProt.org (f) || 36 GB store, 36 GB text index || 72 GB (real) || || Internal
|}
(a) www.franz.com/agraph/allegrograph/
(b) www.ontotext.com/owlim/OWLIMPres.pdf
(c) www.oracle.com/technology/tech/semantic_technologies/htdocs/performance.html
(d) The storage cost of Virtuoso was not multiplied by 1.4, as it was measured on a
data set with a higher number of triples with large literals than LUBM 8000.
(e) virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StorageCostPerTriple
(f) Our current custom solution, described in Section 2; not a SPARQL engine.
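The estimates in Table 2 follow one scaling rule: published disk usage per billion triples, multiplied by the size of UniProt in billions of triples and, where applicable, by the long-literal factor. A minimal Python sketch of that arithmetic follows. The input figures are taken from Table 2; the class and function names are illustrative assumptions, not part of any of the listed products.

```python
from dataclasses import dataclass

UNIPROT_TRIPLES_BILLIONS = 2.8  # release 15.8, rounded from Table 1
LONG_LITERAL_FACTOR = 1.4       # correction factor from the Table 2 caption


@dataclass(frozen=True)
class Benchmark:
    store: str
    disk_gb: float                      # published disk usage of the benchmark load
    triples_billions: float             # size of the benchmark data set
    apply_literal_factor: bool = True   # False where the benchmark already had long literals


BENCHMARKS = [
    Benchmark("Allegro Graph", 155, 1.1),
    Benchmark("OWLIM", 92, 1.85),
    Benchmark("Oracle 11G", 154, 1.1),
    Benchmark("Virtuoso", 120, 1.1, apply_literal_factor=False),
]


def estimate_gb(benchmark: Benchmark) -> float:
    """Scale the published disk usage to the size of UniProt release 15.8."""
    gb_per_billion = benchmark.disk_gb / benchmark.triples_billions
    estimate = gb_per_billion * UNIPROT_TRIPLES_BILLIONS
    if benchmark.apply_literal_factor:
        estimate *= LONG_LITERAL_FACTOR
    return estimate


if __name__ == "__main__":
    for b in BENCHMARKS:
        print(f"{b.store:15s} {estimate_gb(b):6.0f} GB")
```

Rounded to whole gigabytes, the sketch reproduces the Estimate column for the four RDF stores; the UniProt.org row is a measured value and is therefore not computed.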
3 Data Growth
We have to deal with ever-growing data volumes (Fig. 1). When evaluating tools,
we need to take into account not just performance on today’s data, but also on
the data expected in five years. The yearly growth rate of the core UniProtKB
data is currently 51%. This corresponds to a doubling time of 17 months, faster
than Moore’s Law⁵, which predicts a doubling time of 24 months.
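The relation between a yearly growth rate and a doubling time can be sketched in Python as follows. The sketch assumes constant compound (exponential) growth, which the text does not state explicitly, and also includes the common rule-of-72 shortcut for comparison; both function names are illustrative.

```python
import math


def doubling_time_months(yearly_growth_percent: float) -> float:
    """Doubling time under constant compound growth: 12 * ln 2 / ln(1 + r)."""
    if yearly_growth_percent <= 0:
        raise ValueError("growth rate must be positive")
    rate = yearly_growth_percent / 100.0
    return 12.0 * math.log(2.0) / math.log(1.0 + rate)


def rule_of_72_months(yearly_growth_percent: float) -> float:
    """Rule-of-72 shortcut: doubling time in years is about 72 / growth percent."""
    if yearly_growth_percent <= 0:
        raise ValueError("growth rate must be positive")
    return 12.0 * 72.0 / yearly_growth_percent


if __name__ == "__main__":
    growth = 51.0  # yearly growth of the core UniProtKB data, from the text
    print(f"compound growth: {doubling_time_months(growth):.1f} months")
    print(f"rule of 72:      {rule_of_72_months(growth):.1f} months")
```

For 51% a year the shortcut gives about 17 months, in line with the figure quoted above, while the compound-growth formula gives about 20 months. Either value is below the 24 months attributed to Moore's Law, so the comparison holds under both readings.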
Assuming that the growth rate of UniParc does not accelerate beyond
the current 270% a year (doubling every 7 months), we will see at least 10 billion
sequences in UniParc in five years. This estimate translates to 3 × 10¹¹ triples,
around 470 times the current size.
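The size ratio in this projection can be checked against Table 1. The Python sketch below takes the projected figures from the text and assumes that "current size" refers to the UniParc triple count of release 15.8; the constant and function names are illustrative.

```python
CURRENT_UNIPARC_TRIPLES = 640_191_073   # release 15.8, Table 1 (assumed baseline)
PROJECTED_SEQUENCES = 10_000_000_000    # at least 10 billion sequences in five years
PROJECTED_TRIPLES = 3e11                # projected triples, from the text


def size_ratio(projected: float, current: float) -> float:
    """How many times larger the projected data set is than the current one."""
    return projected / current


def triples_per_sequence(triples: float, sequences: float) -> float:
    """Average number of triples per sequence implied by the projection."""
    return triples / sequences


if __name__ == "__main__":
    ratio = size_ratio(PROJECTED_TRIPLES, CURRENT_UNIPARC_TRIPLES)
    per_seq = triples_per_sequence(PROJECTED_TRIPLES, PROJECTED_SEQUENCES)
    print(f"projected / current: about {ratio:.0f}x")
    print(f"implied triples per sequence: {per_seq:.0f}")
```

With the release 15.8 baseline the ratio rounds to 470, and the projection implies about 30 triples per UniParc sequence.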
4 Summary
We have met the requirements for data processing speed and query performance
for the near future, unfortunately on a non-RDF technology stack. Currently the
available tools do not meet our performance needs for UniProt.org, especially
since any RDF solution must preserve the current user interface and full text
search. We do however aim to deploy a public SPARQL endpoint once it becomes
feasible.
⁵ http://en.wikipedia.org/wiki/Moores_law
Fig. 1. UniProtKB is growing faster over time. This graph does not show the growth
of existing entries as more information becomes available.
5 Acknowledgments
UniProt is mainly supported by the National Institutes of Health (NIH) grant
2 U01 HG02712-04. Additional support for the EBI’s involvement in UniProt
comes from the European Commission (EC)’s FELICS grant (021902RII3) and
from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are
supported by the Swiss Federal Government through the Federal Office of
Education and Science and the European Commission contracts FELICS (021902RII3)
and SLING (226073). PIR activities are also supported by the NIH grants and
contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the
Department of Defense grant W81XWH0720112. This work has benefited from
Oracle funding support.
References
1. The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids
Res. 37:D169-D174 (2009).
2. Suzek B.E., Huang H., McGarvey P., Mazumder R., and Wu C.H. UniRef:
Comprehensive and Non-Redundant UniProt Reference Clusters. Bioinformatics
23:1282-1288 (2007).
3. Apweiler R., Bairoch A., and Wu C.H. Protein sequence databases. Curr. Opin.
Chem. Biol. 8:76-80 (2004).
4. Jain E., Bairoch A., Duvaud S., Phan I., Redaschi N., Suzek B.E., Martin M.J.,
McGarvey P., and Gasteiger E. Infrastructure for the life sciences: design and
implementation of the UniProt website. BMC Bioinformatics 10:136 (2009).