Weekend Triple Billionaire
    Maintaining a Large RDF Data Set in the Life Sciences

        Jerven Bolleman, Thomas Kappler, and the UniProt Consortium

                         Swiss Institute of Bioinformatics
             jerven.bolleman@isb-sib.ch, thomas.kappler@isb-sib.ch


        Abstract. The UniProt Knowledgebase offers both manually curated
        and automatically generated information on proteins, and is one of the
        leading biological databases. While it is one of the largest free data sets
        that is available in RDF, our infrastructure and website are not based
        on RDF. We present numbers about the volume and growth of UniProt
        and show why this volume of data prevents using RDF triple stores and
        SPARQL with currently available tools.


1     UniProt Data: Nature and Volume

The UniProt Knowledgebase (UniProtKB) [1] consists of two parts: UniProtKB/
Swiss-Prot, containing manually annotated records describing proteins with in-
formation from literature and curator-evaluated computational analysis, and
UniProtKB/TrEMBL, with automatically annotated records. The UniProt con-
sortium provides several additional data sets: UniRef [2] with clustered sets of
sequences from UniProt, the amino acid sequence archive UniParc [3], and sup-
porting data sets such as taxonomy and keywords.
    UniProt consists of almost three billion triples (see Table 1), making it one
of the largest freely available RDF data sets. The data is maintained at three
consortium member sites, and must be kept in sync and integrated into one
combined public release.¹
    This number of triples consumes considerable hard disk space. Table 2 compares
rough disk usage estimates for different RDF stores, based on published
LUBM 8000² results, with the actual disk usage of our solution.


2     Performance

UniProt is released every three weeks. In each such period, there is a time window
of only six days to prepare all source data for publication. When the curators
freeze their work, we need to retrieve all related data from our global partners.
We then convert all source data (42 GB as of release 15.8) to RDF/XML
(180 GB, 18 GB compressed), while validating it. This conversion took around
16 hours for release 15.8 of September 2009.

             Table 1. Number of triples in UniProt, releases 15.8 and 15.9

 Data set                Triples in 15.8   Triples in 15.9   Difference   Projection for
                                                                          September 2010
 Citations                     8,612,297         8,616,640       +0.05%
 Enzyme                           36,461            36,461           0%
 GO                              195,249           195,249           0%
 Keywords                          7,641             7,649       +0.10%
 Mapped citations (a)         84,308,128        82,605,680       −2.02%
 Locations                         4,537             4,532       −0.11%
 Pathways                          8,661             8,673       +0.14%
 Taxonomy                      3,493,485         3,520,871       +0.78%
 Tissues                           6,551             6,551           0%
 UniParc                     640,191,073       691,432,967       +8.00%    2,560,000,000
 UniProtKB                 1,572,097,256     1,610,723,778       +2.46%    2,433,000,000
 UniRef                      487,209,989       499,947,698       +2.61%      775,000,000
 Total                     2,796,164,777     2,897,100,198       +3.61%    5,768,000,000

 (a) Automatically generated, not public.

¹ Available for download at http://www.uniprot.org/downloads.
² LUBM benchmark, http://swat.cse.lehigh.edu/projects/lubm/.
    The UniProt.org [4] website and its query engine run on an internally
developed solution using BDB/je³ and Lucene.⁴ On a release, data is loaded into
its store in RDF form; the store and query engine themselves, however, are not
RDF-based. Loading and full-text indexing of all UniProt data sets took 29 hours
for release 15.8, with 2 GB of memory on a dual-core 2.8 GHz Xeon™ and a
single hard disk. More than 20,000 unique users visit UniProt.org each workday,
averaging 130,000 queries and 1,710,000 direct lookups, while consuming just
under a terabyte of bandwidth a month. On average, 99.9% of queries finish in
less than 0.8 seconds, including the transfer over the internet.
    UniProt.org runs on three mirrors, each of which needs to be provisioned with
all data in time for a release. Data size is a major factor in our deployment as we
are limited by upload speed. UniProt 15.8 as needed for the website consumes
39 GB when gzip compressed, and takes four hours to upload. Larger data sizes
increase the risk of transfer failure.
    We also have an internal website for our curators and internal software
tools. Because it needs to reflect current work, we rebuild the manually cu-
rated UniProtKB/Swiss-Prot every night with a time window of 3.5 hours for
converting, loading and validating about 160 million triples.
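
As a rough illustration, the time windows quoted in this section can be turned into average sustained rates. The sketch below assumes that the conversion and loading steps each process all triples counted in Table 1 for release 15.8, and that 1 GB equals 10^9 bytes; the function and variable names are illustrative only and do not describe our actual tooling.

```python
# Back-of-the-envelope rates implied by the figures quoted in this section.
# Assumption: conversion and loading each handle every triple counted in Table 1.

RELEASE_15_8_TRIPLES = 2_796_164_777   # total for release 15.8 (Table 1)
NIGHTLY_TRIPLES = 160_000_000          # nightly UniProtKB/Swiss-Prot rebuild, approximate


def triples_per_second(triples, hours):
    """Average rate needed to process `triples` within `hours`."""
    return triples / (hours * 3600)


def upload_mbit_per_second(gigabytes, hours):
    """Average upload rate in Mbit/s, assuming 1 GB = 10**9 bytes."""
    return gigabytes * 1e9 * 8 / (hours * 3600) / 1e6


conversion_rate = triples_per_second(RELEASE_15_8_TRIPLES, 16)  # RDF/XML conversion
loading_rate = triples_per_second(RELEASE_15_8_TRIPLES, 29)     # loading and full-text indexing
nightly_rate = triples_per_second(NIGHTLY_TRIPLES, 3.5)         # nightly rebuild window
mirror_upload = upload_mbit_per_second(39, 4)                   # gzip-compressed upload to a mirror

print(f"conversion:    {conversion_rate:,.0f} triples/s")
print(f"loading:       {loading_rate:,.0f} triples/s")
print(f"nightly build: {nightly_rate:,.0f} triples/s")
print(f"mirror upload: {mirror_upload:.1f} Mbit/s")
```

Under these assumptions the conversion runs at roughly 48,500 triples per second, loading at roughly 26,800, the nightly rebuild at roughly 12,700, and the mirror upload at about 22 Mbit/s. Any replacement store would have to sustain comparable rates on the same hardware.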
    UniProtKB/TrEMBL with supporting data is converted, loaded, and validated
every Sunday. On this occasion we generate about 1.4 billion triples while
checking the data against our validation rules, making us indeed weekend triple
billionaires. Current generic RDF stores are unable to handle this amount of
data on the limited hardware budget available.

³ http://www.oracle.com/database/berkeley-db/je/index.html
⁴ http://lucene.apache.org/java/docs/index.html


Table 2. Rough disk consumption estimates for UniProt release 15.8. Each estimate
scales the published disk usage per billion triples to UniProt; the factor 2.8 is
the number of triples in release 15.8, in billions (Table 1). The factor 1.4
accounts for the larger number of long literals in UniProt compared to a
LUBM 8000 data set.

RDF Store        Data provided                    Estimate       Calculation              Source
AllegroGraph     155 GB / 1.1B LUBM 8000          552 GB         155/1.1 × 2.8 × 1.4      (a)
OWLIM            92 GB / 1.85B LUBM 8000          195 GB         92/1.85 × 2.8 × 1.4      (b)
Oracle 11G       154 GB / 1.1B LUBM 8000          549 GB         154/1.1 × 2.8 × 1.4      (c)
Virtuoso         120 GB / 1.1B various            305 GB         120/1.1 × 2.8 (d)        (e)
UniProt.org (f)  36 GB store, 36 GB text index    72 GB (real)                            Internal

(a) www.franz.com/agraph/allegrograph/
(b) www.ontotext.com/owlim/OWLIMPres.pdf
(c) www.oracle.com/technology/tech/semantic_technologies/htdocs/performance.html
(d) Storage cost of Virtuoso was not multiplied by 1.4, as they used a data set with a
    higher number of triples with large literals in comparison to LUBM 8000.
(e) virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StorageCostPerTriple
(f) Our current custom solution, described in Section 2; not a SPARQL engine.
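
The estimates in Table 2 follow one scaling rule, which the short sketch below reproduces. It assumes that disk usage grows linearly with the number of triples, which is a simplification; the function and dictionary names are illustrative only.

```python
# Sketch of the Table 2 scaling rule: published GB per billion triples,
# scaled to UniProt 15.8 and corrected for its larger share of long literals.

UNIPROT_BILLION_TRIPLES = 2.8   # release 15.8 total from Table 1, rounded
LONG_LITERAL_FACTOR = 1.4       # correction factor from the Table 2 caption


def estimate_gb(store_gb, benchmark_billion_triples, literal_factor):
    """Scale a published store size linearly to the UniProt triple count."""
    gb_per_billion = store_gb / benchmark_billion_triples
    return gb_per_billion * UNIPROT_BILLION_TRIPLES * literal_factor


# (published size in GB, billions of triples loaded, literal factor applied)
published = {
    "AllegroGraph": (155, 1.1, LONG_LITERAL_FACTOR),
    "OWLIM": (92, 1.85, LONG_LITERAL_FACTOR),
    "Oracle 11G": (154, 1.1, LONG_LITERAL_FACTOR),
    "Virtuoso": (120, 1.1, 1.0),   # not multiplied, see note (d)
}

estimates = {
    name: round(estimate_gb(gb, billions, factor))
    for name, (gb, billions, factor) in published.items()
}

for name, gb in estimates.items():
    print(f"{name:13s} {gb:4d} GB")
```

The rounded results agree with the Estimate column of Table 2. All of them are several times larger than the 72 GB actually used by our store and text index.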


3     Data Growth
We have to deal with ever-growing data volumes (Fig. 1). When evaluating tools,
we need to take into account not just performance on today’s data, but also on
the expected data in five years. The yearly growth rate of the core UniProtKB
data is currently 51%. This corresponds to a doubling time of 17 months, faster
than Moore’s Law,⁵ which predicts a doubling time of 24 months.
   Assuming that the growth rate of UniParc does not accelerate beyond the
current 270% a year (doubling every 7 months), we will see at least 10 billion
sequences in UniParc in five years. This estimate translates to 3 × 10^11 triples,
around 470 times the current size.
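
The growth figures in this section can be related to the per-release differences in Table 1 with a simple compound-growth calculation. The sketch below assumes 17 releases per year (one every three weeks) and constant growth per release; both are simplifications, and the function names are illustrative only.

```python
import math

RELEASES_PER_YEAR = 17   # assumption: one release every three weeks


def annual_growth(per_release_growth, releases_per_year=RELEASES_PER_YEAR):
    """Compound a per-release growth fraction into a yearly growth fraction."""
    return (1 + per_release_growth) ** releases_per_year - 1


def doubling_time_months(yearly_growth):
    """Months needed to double at a constant yearly growth fraction."""
    return 12 * math.log(2) / math.log(1 + yearly_growth)


uniprotkb_yearly = annual_growth(0.0246)   # +2.46% between 15.8 and 15.9 (Table 1)
uniparc_yearly = annual_growth(0.08)       # +8.00% between 15.8 and 15.9 (Table 1)

# Ratio of the projected triple count to the UniParc triples in release 15.8.
uniparc_ratio = 3e11 / 640_191_073

print(f"UniProtKB: {uniprotkb_yearly:.0%} a year, "
      f"doubling in {doubling_time_months(0.51):.1f} months")
print(f"UniParc:   {uniparc_yearly:.0%} a year, "
      f"doubling in {doubling_time_months(2.70):.1f} months")
print(f"Projected UniParc size: {uniparc_ratio:.0f} times release 15.8")
```

Under these assumptions, the per-release growth of 2.46% and 8.00% compounds to about 51% and 270% a year, matching the rates given above. Strict annual compounding gives doubling times of roughly 20 and 6 months, close to but not identical with the 17 and 7 months quoted above, which may rest on a different averaging window. In either case UniProtKB doubles faster than the 24 months of Moore's Law.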


4     Summary
We have met the requirements for data processing speed and query performance
for the near future, unfortunately on a non-RDF technology stack. Currently
available tools do not meet our performance needs for UniProt.org, especially
since any RDF solution must preserve the current user interface and full-text
search. We do, however, aim to deploy a public SPARQL endpoint once it becomes
feasible.

⁵ http://en.wikipedia.org/wiki/Moores_law
Fig. 1. UniProtKB is growing faster over time. This graph does not show the growth
of existing entries as more information becomes available.




5    Acknowledgments
UniProt is mainly supported by the National Institutes of Health (NIH) grant
2 U01 HG02712-04. Additional support for the EBI’s involvement in UniProt
comes from the European Commission (EC)’s FELICS grant (021902RII3) and
from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are sup-
ported by the Swiss Federal Government through the Federal Office of Educa-
tion and Science and the European Commission contracts FELICS (021902RII3)
and SLING (226073). PIR activities are also supported by the NIH grants and
contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the
Department of Defense grant W81XWH0720112. This work has benefited from
Oracle funding support.


References
1. The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids
  Res. 37:D169-D174 (2009).
2. Suzek B.E., Huang H., McGarvey P., Mazumder R., and Wu C.H. UniRef:
  Comprehensive and Non-Redundant UniProt Reference Clusters. Bioinformatics
  23:1282-1288 (2007).
3. Apweiler R., Bairoch A., and Wu C.H. Protein sequence databases. Curr. Opin.
  Chem. Biol. 8:76-80 (2004).
4. Jain E., Bairoch A., Duvaud S., Phan I., Redaschi N., Suzek B.E., Martin M.J.,
  McGarvey P., and Gasteiger E. Infrastructure for the life sciences: design and
  implementation of the UniProt website. BMC Bioinformatics 10:136 (2009).