<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weekend Triple Billionaire</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jerven Bolleman</string-name>
          <email>jerven.bolleman@isb-sib.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Kappler</string-name>
          <email>thomas.kappler@isb-sib.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>the UniProt Consortium</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Swiss Institute of Bioinformatics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The UniProt Knowledgebase offers both manually curated and automatically generated information on proteins, and is one of the leading biological databases. While it is one of the largest free data sets that is available in RDF, our infrastructure and website are not based on RDF. We present numbers about the volume and growth of UniProt and show why this volume of data prevents using RDF triple stores and SPARQL with currently available tools.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>UniProt is released every three weeks. In each such period, there is a time window
of only six days to prepare all source data for publication. When the curators
freeze their work, we need to retrieve all related data from our global
partners. We then convert all source data (42 GB as of release 15.8) to RDF/XML.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Available for download at http://www.uniprot.org/downloads.</title>
    </sec>
    <sec id="sec-3">
      <title>2 LUBM benchmark, http://swat.cse.lehigh.edu/projects/lubm/.</title>
      <sec id="sec-3-1">
        <title>Data set</title>
      </sec>
      <sec id="sec-3-2">
        <title>Triples in 15.8</title>
      </sec>
      <sec id="sec-3-3">
        <title>Triples in 15.9</title>
      </sec>
      <sec id="sec-3-4">
        <title>Difference</title>
      </sec>
      <sec id="sec-3-5">
        <title>Citations</title>
      </sec>
      <sec id="sec-3-6">
        <title>Enzyme GO</title>
      </sec>
      <sec id="sec-3-7">
        <title>Keywords</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Mapped citations<sup>a</sup></title>
      <sec id="sec-4-1">
        <title>Locations</title>
      </sec>
      <sec id="sec-4-2">
        <title>Pathways</title>
      </sec>
      <sec id="sec-4-3">
        <title>Taxonomy</title>
      </sec>
      <sec id="sec-4-4">
        <title>Tissues</title>
      </sec>
      <sec id="sec-4-5">
        <title>UniParc</title>
      </sec>
      <sec id="sec-4-6">
        <title>UniProtKB</title>
      </sec>
      <sec id="sec-4-7">
        <title>UniRef</title>
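        <table-wrap>
          <table>
            <thead>
              <tr><th>Data set</th><th>Triples in 15.8</th><th>Triples in 15.9</th></tr>
            </thead>
            <tbody>
              <tr><td>Citations</td><td>8,612,297</td><td>8,616,640</td></tr>
              <tr><td>Enzyme</td><td>36,461</td><td>36,461</td></tr>
              <tr><td>GO</td><td>195,249</td><td>195,249</td></tr>
              <tr><td>Keywords</td><td>7,641</td><td>7,649</td></tr>
              <tr><td>Mapped citations<sup>a</sup></td><td>84,308,128</td><td>82,605,680</td></tr>
              <tr><td>Locations</td><td>4,537</td><td>4,532</td></tr>
              <tr><td>Pathways</td><td>8,661</td><td>8,673</td></tr>
              <tr><td>Taxonomy</td><td>3,493,485</td><td>3,520,871</td></tr>
              <tr><td>Tissues</td><td>6,551</td><td>6,551</td></tr>
              <tr><td>UniParc</td><td>640,191,073</td><td>691,432,967</td></tr>
              <tr><td>UniProtKB</td><td>1,572,097,256</td><td>1,610,723,778</td></tr>
              <tr><td>UniRef</td><td>487,209,989</td><td>499,947,698</td></tr>
              <tr><td>Total</td><td>2,796,164,777</td><td>2,897,100,198</td></tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <p><sup>a</sup> Automatically generated, not public.</p>
          </table-wrap-foot>
        </table-wrap>
        <p>The resulting RDF/XML occupies 180 GB (18 GB compressed) and is validated
during conversion. This conversion took around 16 hours for release 15.8 of
September 2009.</p>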
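        <p>The change between the two releases can be recomputed from the triple counts listed above. The following is a minimal Python sketch, not part of our pipeline; the dictionary name, the function name, and the split of the merged label "Enzyme GO" into two rows are assumptions made for illustration.</p>
        <preformatted>
```python
# Triple counts per data set as (release 15.8, release 15.9).
# Values are copied from the figures above; the names are illustrative.
TRIPLES = {
    "Citations": (8_612_297, 8_616_640),
    "Enzyme": (36_461, 36_461),
    "GO": (195_249, 195_249),
    "Keywords": (7_641, 7_649),
    "Mapped citations": (84_308_128, 82_605_680),
    "Locations": (4_537, 4_532),
    "Pathways": (8_661, 8_673),
    "Taxonomy": (3_493_485, 3_520_871),
    "Tissues": (6_551, 6_551),
    "UniParc": (640_191_073, 691_432_967),
    "UniProtKB": (1_572_097_256, 1_610_723_778),
    "UniRef": (487_209_989, 499_947_698),
}


def differences(counts):
    """Return {data set: (absolute change, relative change in percent)}."""
    result = {}
    for name, (old, new) in counts.items():
        delta = new - old
        result[name] = (delta, 100.0 * delta / old)
    return result


if __name__ == "__main__":
    for name, (delta, percent) in differences(TRIPLES).items():
        print(f"{name:18s} {delta:+15,d} {percent:+7.2f}%")
```
        </preformatted>
        <p>In this sketch UniParc and UniProtKB account for most of the growth between the two releases, while the mapped citations shrink.</p>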
        <p>
          The UniProt.org [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] website and its query engine run on an internally
developed solution using BDB/je<sup>3</sup> and Lucene<sup>4</sup>. On a release, data is loaded into
its store in RDF form; the store and query engine themselves, however, are not
RDF-based. Loading and full-text indexing of all UniProt data sets took 29 hours
for release 15.8, with 2 GB of memory on a dual-core 2.8 GHz Xeon™ and a
single hard disk. More than 20,000 unique users visit UniProt.org each workday,
averaging 130,000 queries and 1,710,000 direct lookups, while consuming just
under a terabyte of bandwidth a month. On average, 99.9% of queries finish in
less than 0.8 seconds, including the transfer over the internet.
        </p>
        <p>UniProt.org runs on three mirrors, each of which needs to be provisioned with
all data in time for a release. Data size is a major factor in our deployment as we
are limited by upload speed. UniProt 15.8 as needed for the website consumes
39 GB when gzip compressed, and takes four hours to upload. Larger data sizes
increase the risk of transfer failure.</p>
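        <p>The upload figures imply an average transfer rate that can be worked out directly. The sketch below is illustrative only; the function name is hypothetical, and it assumes 1 GB = 1000 MB and a constant rate over the whole transfer.</p>
        <preformatted>
```python
def upload_rate_mb_per_s(size_gb, hours):
    """Average transfer rate in MB/s, taking 1 GB as 1000 MB."""
    return size_gb * 1000 / (hours * 3600)


def upload_hours(size_gb, rate_mb_per_s):
    """Hours needed to upload size_gb at a constant rate."""
    return size_gb * 1000 / rate_mb_per_s / 3600


if __name__ == "__main__":
    # 39 GB in four hours, as stated for release 15.8.
    rate = upload_rate_mb_per_s(39, 4)
    print(f"average rate: {rate:.2f} MB/s")
    # Hypothetical: the same link carrying twice the data volume.
    print(f"78 GB would take {upload_hours(78, rate):.1f} hours")
```
        </preformatted>
        <p>At a fixed upload speed the transfer time scales linearly with data size, which is why growth in the data set translates directly into a longer deployment window per mirror.</p>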
        <p>We also have an internal website for our curators and internal software
tools. Because it needs to reflect current work, we rebuild the manually
curated UniProtKB/Swiss-Prot every night with a time window of 3.5 hours for
converting, loading and validating about 160 million triples.</p>
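        <p>The nightly window sets a lower bound on the sustained throughput of the conversion, loading and validation steps taken together. A minimal sketch of that arithmetic follows; the function name is hypothetical and the calculation assumes the work is spread evenly over the window.</p>
        <preformatted>
```python
def required_triples_per_second(triples, window_hours):
    """Minimum sustained throughput needed to finish within the window."""
    return triples / (window_hours * 3600)


if __name__ == "__main__":
    # About 160 million triples in a 3.5 hour nightly window.
    nightly = required_triples_per_second(160_000_000, 3.5)
    print(f"{nightly:,.0f} triples/s")
```
        </preformatted>
        <p>Any candidate store would have to sustain at least this rate for the complete convert, load and validate cycle, not only for raw loading.</p>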
        <p>UniProtKB/TrEMBL with supporting data is converted, loaded, and
validated every Sunday. On this occasion we generate about 1.4 billion triples while
checking the data against our validation rules, making us indeed weekend triple</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3 http://www.oracle.com/database/berkeley-db/je/index.html</title>
    </sec>
    <sec id="sec-6">
      <title>4 http://lucene.apache.org/java/docs/index.html</title>
      <p>billionaires. Current generic RDF stores are unable to handle this amount of
data on the limited hardware budget available.</p>
      <p><sup>a</sup> www.franz.com/agraph/allegrograph/</p>
      <p><sup>b</sup> www.ontotext.com/owlim/OWLIMPres.pdf</p>
      <p><sup>c</sup> www.oracle.com/technology/tech/semantic_technologies/htdocs/performance.html</p>
      <p><sup>d</sup> Storage cost of Virtuoso was not multiplied as they used a dataset with a higher
number of triples with large literals in comparison to LUBM8000.</p>
      <p><sup>e</sup> virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StorageCostPerTriple</p>
      <p><sup>f</sup> Our current custom solution, described in Section 2, not a SPARQL engine</p>
      <sec id="sec-6-1">
        <title>Data Growth</title>
        <p>We have to deal with ever-growing data volumes (Fig. 3). When evaluating tools,
we need to take into account performance not just on today’s data, but also on
the data expected in five years. The yearly growth rate of the core UniProtKB
data is currently 51%. This corresponds to a doubling time of 17 months, faster than Moore’s
Law<sup>5</sup>, which predicts a doubling time of 24 months.</p>
        <p>Assuming that the growth rate of UniParc does not accelerate beyond
the current 270% a year (doubling every 7 months), we will see at least 10 billion
sequences in UniParc in five years. This estimate translates to 3 × 10<sup>11</sup> triples,
around 470 times the current size.</p>
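        <p>The factor of roughly 470 can be reproduced from figures given in the text, assuming the comparison is made against the UniParc triple count of release 15.8. The sketch below is illustrative; the constant names are not taken from any UniProt tool.</p>
        <preformatted>
```python
# UniParc triples in release 15.8, as reported above.
CURRENT_UNIPARC_TRIPLES = 640_191_073
# Five-year projection stated in the text.
PROJECTED_TRIPLES = 3e11
PROJECTED_SEQUENCES = 10e9

# How many times larger the projected data set is than the current one.
growth_factor = PROJECTED_TRIPLES / CURRENT_UNIPARC_TRIPLES
# Average number of triples per sequence implied by the projection.
triples_per_sequence = PROJECTED_TRIPLES / PROJECTED_SEQUENCES

if __name__ == "__main__":
    print(f"growth factor: about {growth_factor:.0f}x")
    print(f"triples per sequence: {triples_per_sequence:.0f}")
```
        </preformatted>
        <p>The projection thus corresponds to about 30 triples per UniParc sequence.</p>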
      </sec>
      <sec id="sec-6-2">
        <title>Summary</title>
        <p>We have met the requirements for data processing speed and query performance
for the near future, unfortunately on a non-RDF technology stack. The currently
available tools do not meet our performance needs for UniProt.org, especially
since any RDF solution must preserve the current user interface and full-text
search. We do, however, aim to deploy a public SPARQL endpoint once it becomes
feasible.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5 http://en.wikipedia.org/wiki/Moores_law</title>
      <sec id="sec-7-1">
        <title>Acknowledgments</title>
        <p>UniProt is mainly supported by the National Institutes of Health (NIH) grant
2 U01 HG02712-04. Additional support for the EBI’s involvement in UniProt
comes from the European Commission (EC)’s FELICS grant (021902RII3) and
from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are
supported by the Swiss Federal Government through the Federal Office of
Education and Science and the European Commission contracts FELICS (021902RII3)
and SLING (226073). PIR activities are also supported by the NIH grants and
contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the
Department of Defense grant W81XWH0720112. This work has benefited from
Oracle funding support.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. The UniProt Consortium.
          <article-title>The Universal Protein Resource (UniProt)</article-title>
          .
          <source>Nucl. Acids Res</source>
          .
          <volume>37</volume>
          :
          <fpage>D169</fpage>
          -
          <lpage>D174</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Suzek</surname>
            <given-names>B.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGarvey</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazumder</surname>
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wu</surname>
            <given-names>C.H.</given-names>
          </string-name>
          <article-title>UniRef: Comprehensive and Non-Redundant UniProt Reference Clusters</article-title>
          .
          <source>Bioinformatics</source>
          <volume>23</volume>
          :
          <fpage>1282</fpage>
          -
          <lpage>1288</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Apweiler</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bairoch</surname>
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wu</surname>
            <given-names>C.H.</given-names>
          </string-name>
          <article-title>Protein sequence databases</article-title>
          .
          <source>Curr. Opin. Chem. Biol</source>
          .
          <volume>8</volume>
          :
          <fpage>76</fpage>
          -
          <lpage>80</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jain</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bairoch</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duvaud</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phan</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redaschi</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suzek</surname>
            <given-names>B.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            <given-names>M.J.</given-names>
          </string-name>
          ,
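          <string-name>
            <surname>McGarvey</surname>
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gasteiger</surname>
            <given-names>E.</given-names>
          </string-name>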
          <article-title>Infrastructure for the life sciences: design and implementation of the UniProt website</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>10</volume>
          :136 (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>