Weekend Triple Billionaire Maintaining a Large RDF Data Set in the Life Sciences

September 2010

Jerven Bolleman (jerven.bolleman@isb-sib.ch), Thomas Kappler (thomas.kappler@isb-sib.ch), UniProt Consortium
Swiss Institute of Bioinformatics

The UniProt Knowledgebase offers both manually curated and automatically generated information on proteins, and is one of the leading biological databases. While it is one of the largest free data sets available in RDF, our infrastructure and website are not based on RDF. We present numbers about the volume and growth of UniProt and show why this volume of data prevents using RDF triple stores and SPARQL with currently available tools.

Table 1. Number of triples in UniProt, releases 15.8 and 15.9

Data set           Triples in 15.8    Triples in 15.9    Difference    Projection for
(label missing)          8,612,297          8,616,640        +0.05%
(label missing)         84,308,128         82,605,680        −2.02%
Enzyme                      36,461             36,461            0%
GO                         195,249            195,249            0%
Keywords                     7,641              7,649        +0.10%
Locations                    4,537              4,532        −0.11%
(label missing)              8,661              8,673        +0.14%
Taxonomy                 3,493,485          3,520,871        +0.78%
Tissues                      6,551              6,551            0%
UniParc                640,191,073        691,432,967        +8.00%     2,560,000,000
UniProtKB            1,572,097,256      1,610,723,778        +2.46%     2,433,000,000
UniRef                 487,209,989        499,947,698        +2.61%       775,000,000
Total                2,796,164,777      2,897,100,198        +3.61%     5,768,000,000

1 UniProt Data: Nature and Volume

The UniProt Knowledgebase (UniProtKB) [1] consists of two parts: UniProtKB/Swiss-Prot, containing manually annotated records describing proteins with information from literature and curator-evaluated computational analysis, and UniProtKB/TrEMBL, with automatically annotated records. The UniProt consortium provides several additional data sets: UniRef [2], with clustered sets of sequences from UniProt; the amino acid sequence archive UniParc [3]; and supporting data sets such as taxonomy and keywords.

UniProt consists of almost three billion triples (see Table 1), making it one of the largest freely available RDF data sets. The data is maintained at three consortium member sites and must be kept in sync and integrated into one combined public release¹.

This number of triples consumes considerable hard disk space. Table 2 shows estimates that compare different RDF stores, based on published LUBM 8000² results, with our solution.
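The scaling behind these disk estimates can be sketched in a few lines. This is only an illustration of the arithmetic given in Table 2: each store's reported disk use per billion triples is scaled to the roughly 2.8 billion triples of release 15.8 and, except for Virtuoso, multiplied by the long-literal factor 1.4. The function and constant names are ours and do not come from any of the benchmarked products.

```python
# Sketch of the scaling arithmetic in Table 2 (names are illustrative).
UNIPROT_TRIPLES_BILLIONS = 2.8  # release 15.8 total from Table 1, rounded
LONG_LITERAL_FACTOR = 1.4       # UniProt vs. LUBM 8000, from the Table 2 caption

# store -> (reported disk use in GB, reported triples in billions,
#           whether the long-literal factor is applied)
REPORTED = {
    "AllegroGraph": (155, 1.1, True),
    "OWLIM": (92, 1.85, True),
    "Oracle 11G": (154, 1.1, True),
    "Virtuoso": (120, 1.1, False),  # footnote d: factor not applied
}


def estimate_gb(disk_gb, triples_billions, apply_literal_factor):
    """Scale a reported disk footprint to the UniProt triple count."""
    estimate = disk_gb / triples_billions * UNIPROT_TRIPLES_BILLIONS
    if apply_literal_factor:
        estimate *= LONG_LITERAL_FACTOR
    return estimate


def estimates():
    """Return the rounded estimate in GB for every store in REPORTED."""
    return {name: round(estimate_gb(*row)) for name, row in REPORTED.items()}


if __name__ == "__main__":
    for name, gb in estimates().items():
        print(f"{name}: about {gb} GB")
```

Run as a script, this reproduces the rounded estimates listed in Table 2, which can be compared with the 72 GB actually used by the custom UniProt.org solution.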

Performance

UniProt is released every three weeks. In each such period, there is a time window of only six days to prepare all source data for publication. When the curators freeze their work, we need to retrieve all related data from our global partners. We then convert all source data (42 GB as of release 15.8) to RDF/XML. Current generic RDF stores are unable to handle this amount of data on the limited hardware budget available.

Data Growth

We have to deal with ever-growing data volumes (Fig. 1). When evaluating tools, we need to take into account not just performance on today's data, but also on the data expected in five years. The yearly growth rate of the core UniProtKB data is currently 51%. This corresponds to a doubling time of 17 months, faster than Moore's Law⁵, which predicts a doubling time of 24 months.
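The relation between a yearly growth rate and a doubling time can be checked with a short calculation. The text does not state which convention underlies the 51% figure, so the sketch below treats both common readings as assumptions: growth compounded once per year, and a continuous exponential rate. The function names are illustrative.

```python
import math

MOORE_DOUBLING_MONTHS = 24  # doubling time quoted in the text for Moore's Law


def doubling_months_compound(annual_growth):
    """Doubling time if the size is multiplied by (1 + annual_growth) each year."""
    return 12 * math.log(2) / math.log(1 + annual_growth)


def doubling_months_continuous(annual_rate):
    """Doubling time if annual_rate is a continuous exponential growth rate."""
    return 12 * math.log(2) / annual_rate


if __name__ == "__main__":
    growth = 0.51  # yearly UniProtKB growth rate from the text
    print(f"compound:   {doubling_months_compound(growth):.1f} months")
    print(f"continuous: {doubling_months_continuous(growth):.1f} months")
```

Under these assumptions the two readings give roughly 20 and 16 months respectively. Either way, the doubling time stays below the 24 months quoted for Moore's Law.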

Assuming that the growth rate of UniParc does not accelerate beyond the current 270% a year (doubling every 7 months), we will see at least 10 billion sequences in UniParc in five years. This estimate translates to 3 × 10^11 triples, around 470 times the current size.
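The factor of about 470 can be retraced from the figures already given. The sketch below assumes that "current size" refers to the UniParc triple count of release 15.8 in Table 1. That reading is our assumption, chosen because it reproduces the quoted factor, and the constant and function names are illustrative.

```python
UNIPARC_TRIPLES_15_8 = 640_191_073    # Table 1, release 15.8 (assumed baseline)
PROJECTED_SEQUENCES = 10_000_000_000  # "at least 10 billion sequences"
PROJECTED_TRIPLES = 3 * 10**11        # projected triple count from the text


def growth_factor(projected, current):
    """How many times larger the projected size is than the current one."""
    return projected / current


def triples_per_sequence(triples, sequences):
    """Average number of triples per sequence implied by the projection."""
    return triples / sequences


if __name__ == "__main__":
    factor = growth_factor(PROJECTED_TRIPLES, UNIPARC_TRIPLES_15_8)
    per_seq = triples_per_sequence(PROJECTED_TRIPLES, PROJECTED_SEQUENCES)
    print(f"growth factor: about {factor:.0f}x")
    print(f"implied triples per sequence: {per_seq:.0f}")
```

The projection also implies an average of 30 triples per UniParc sequence, a figure that follows from the two projected numbers rather than being stated in the text.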

Summary

We have met the requirements for data processing speed and query performance for the near future, unfortunately on a non-RDF technology stack. Currently, the available tools do not meet our performance needs for UniProt.org, especially since any RDF solution must preserve the current user interface and full-text search. We do, however, aim to deploy a public SPARQL endpoint once it becomes feasible.

Fig. 1. UniProtKB is growing faster over time. This graph does not show the growth of existing entries as more information becomes available.

Table 2. Rough disk consumption estimates for UniProt release 15.8. The factor 1.4 reflects the larger number of long literals in UniProt compared with a LUBM 8000 data set.

RDF Store          Data provided                    Estimate                               Source
AllegroGraph       155 GB / 1.1B                    155/1.1 × 2.8 × 1.4 = 552 GB (a)       LUBM 8000
OWLIM              92 GB / 1.85B                    92/1.85 × 2.8 × 1.4 = 195 GB (b)       LUBM 8000
Oracle 11G         154 GB / 1.1B                    154/1.1 × 2.8 × 1.4 = 549 GB (c)       LUBM 8000
Virtuoso           120 GB / 1.1B                    120 (e) / 1.1 × 2.8 = 305 GB (d)       various
UniProt.org (f)    36 GB store, 36 GB text index    72 GB (real)                           Internal

a www.franz.com/agraph/allegrograph/
b www.ontotext.com/owlim/OWLIMPres.pdf
c www.oracle.com/technology/tech/semantic_technologies/htdocs/performance.html
d The storage cost of Virtuoso was not multiplied by the factor 1.4, as they used a data set with a higher number of triples with large literals than LUBM 8000.
e virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StorageCostPerTriple
f Our current custom solution, described in Section 2; not a SPARQL engine.

¹ Available for download at http://www.uniprot.org/downloads.
² LUBM benchmark, http://swat.cse.lehigh.edu/projects/lubm/.
⁵ http://en.wikipedia.org/wiki/Moores_law

Acknowledgments

UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support for the EBI's involvement in UniProt comes from the European Commission (EC)'s FELICS grant (021902RII3) and from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts FELICS (021902RII3) and SLING (226073). PIR activities are also supported by the NIH grants and contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the Department of Defense grant W81XWH0720112. This work has benefited from Oracle funding support.

References

[1] The UniProt Consortium: The Universal Protein Resource (UniProt). Nucl. Acids Res. 37 (2009)
[2] Suzek, B.E., Huang, H., McGarvey, P., Mazumder, R., Wu, C.H.: UniRef: Comprehensive and Non-Redundant UniProt Reference Clusters. Bioinformatics 23 (2007)
[3] Apweiler, R., Bairoch, A., Wu, C.H.: Protein sequence databases. Curr. Opin. Chem. Biol. 8 (2004)
[4] Jain, E., Bairoch, A., Duvaud, S., Phan, I., Redaschi, N., Suzek, B.E., Martin, M.J., McGarvey, P., Gasteiger, E.: Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 10, 136 (2009)