<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Weekend Triple Billionaire Maintaining a Large RDF Data Set in the Life Sciences</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
				<date type="published" when="2010-09">September 2010</date>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jerven</forename><surname>Bolleman</surname></persName>
							<email>jerven.bolleman@isb-sib.ch</email>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Kappler</surname></persName>
							<email>thomas.kappler@isb-sib.ch</email>
						</author>
						<author>
							<persName><forename type="first">UniProt</forename><surname>Consortium</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Swiss Institute of Bioinformatics</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Weekend Triple Billionaire Maintaining a Large RDF Data Set in the Life Sciences</title>
					</analytic>
					<monogr>
						<imprint>
							<date type="published" when="2010-09">September 2010</date>
						</imprint>
					</monogr>
					<idno type="MD5">688066775A7EBC3B89C50D5E4B70C9F7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T02:24+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The UniProt Knowledgebase offers both manually curated and automatically generated information on proteins, and is one of the leading biological databases. While it is one of the largest free data sets that is available in RDF, our infrastructure and website are not based on RDF. We present numbers about the volume and growth of UniProt and show why this volume of data prevents using RDF triple stores and SPARQL with currently available tools.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Number of triples in UniProt, releases 15.8 and 15.9</figDesc><table><row><cell>Data set</cell><cell>Triples in 15.8</cell><cell>Triples in 15.9</cell><cell>Difference</cell><cell>Projection for …</cell></row><row><cell>…</cell><cell>8,612,297</cell><cell>8,616,640</cell><cell>+0.05%</cell><cell/></row><row><cell>Enzyme</cell><cell>36,461</cell><cell>36,461</cell><cell>0%</cell><cell/></row><row><cell>GO</cell><cell>195,249</cell><cell>195,249</cell><cell>0%</cell><cell/></row><row><cell>Keywords</cell><cell>7,641</cell><cell>7,649</cell><cell>…</cell><cell/></row><row><cell>…</cell><cell>84,308,128</cell><cell>82,605,680</cell><cell>−2.02%</cell><cell/></row><row><cell>Locations</cell><cell>4,537</cell><cell>4,532</cell><cell>−0.11%</cell><cell/></row><row><cell>…</cell><cell>8,661</cell><cell>8,673</cell><cell>+0.14%</cell><cell/></row><row><cell>Taxonomy</cell><cell>3,493,485</cell><cell>3,520,871</cell><cell>+0.78%</cell><cell/></row><row><cell>Tissues</cell><cell>6,551</cell><cell>6,551</cell><cell>0%</cell><cell/></row><row><cell>UniParc</cell><cell>640,191,073</cell><cell>691,432,967</cell><cell>+8.00%</cell><cell>2,560,000,000</cell></row><row><cell>UniProtKB</cell><cell>1,572,097,256</cell><cell>1,610,723,778</cell><cell>+2.46%</cell><cell>2,433,000,000</cell></row><row><cell>UniRef</cell><cell>487,209,989</cell><cell>499,947,698</cell><cell>+2.61%</cell><cell>775,000,000</cell></row><row><cell>Total</cell><cell>2,796,164,777</cell><cell>2,897,100,198</cell><cell>+3.61%</cell><cell>5,768,000,000</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">UniProt Data: Nature and Volume</head><p>The UniProt Knowledgebase (UniProtKB) <ref type="bibr" target="#b0">[1]</ref> consists of two parts: UniProtKB/Swiss-Prot, which contains manually annotated records describing proteins with information from the literature and curator-evaluated computational analysis, and UniProtKB/TrEMBL, which contains automatically annotated records. The UniProt consortium provides several additional data sets: UniRef <ref type="bibr" target="#b1">[2]</ref>, with clustered sets of sequences from UniProt; the amino acid sequence archive UniParc <ref type="bibr" target="#b2">[3]</ref>; and supporting data sets such as taxonomy and keywords.</p><p>UniProt consists of almost three billion triples (see Table <ref type="table">1</ref>), making it one of the largest freely available RDF data sets. The data is maintained at three consortium member sites and must be kept in sync and integrated into one combined public release.<ref type="foot" target="#foot_0">1</ref></p><p>This number of triples consumes considerable hard disk space. Table <ref type="table" target="#tab_0">2</ref> shows rough estimates comparing different RDF stores, based on published LUBM 8000<ref type="foot" target="#foot_1">2</ref> results, with our solution.</p></div>
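<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustrative cross-check of the "almost three billion" figure, the following Python sketch recomputes the release-to-release differences for the three largest data sets from the triple counts given in Table 1. The names TRIPLES, percent_change and share_of_total are hypothetical, chosen for this example only; they are not part of any UniProt tooling.</p>
<p>
```python
# Triple counts for the three largest data sets, copied from Table 1.
# Each tuple is (triples in release 15.8, triples in release 15.9).
TRIPLES = {
    "UniParc": (640_191_073, 691_432_967),
    "UniProtKB": (1_572_097_256, 1_610_723_778),
    "UniRef": (487_209_989, 499_947_698),
}

# Totals over all data sets, also from Table 1.
TOTAL_15_8 = 2_796_164_777
TOTAL_15_9 = 2_897_100_198


def percent_change(old: int, new: int) -> float:
    """Relative change from one release to the next, in percent."""
    return (new - old) / old * 100.0


def share_of_total(part: int, total: int) -> float:
    """Share of all triples contributed by one data set, in percent."""
    return part / total * 100.0


if __name__ == "__main__":
    for name, (old, new) in TRIPLES.items():
        print(f"{name:10s} {percent_change(old, new):+.2f}%  "
              f"{share_of_total(new, TOTAL_15_9):.1f}% of release 15.9")
    print(f"{'Total':10s} {percent_change(TOTAL_15_8, TOTAL_15_9):+.2f}%")
```
</p>
<p>Under these numbers, UniParc, UniProtKB and UniRef together account for well over nine tenths of all triples in release 15.9, so the smaller supporting data sets barely affect the total.</p></div>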
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Performance</head><p>UniProt is released every three weeks. In each such period, there is a time window of only six days to prepare all source data for publication. When the curators freeze their work, we need to retrieve all related data from our global partners. We then convert all source data (42 GB as of release 15.8) to RDF/XML. Current generic RDF stores are unable to handle this amount of data on the limited hardware budget available.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data Growth</head><p>We have to deal with ever-growing data volumes (Fig. <ref type="figure" target="#fig_0">1</ref>). When evaluating tools, we need to take into account not just performance on today's data, but also on the data expected in five years. The yearly growth rate of the core UniProtKB data is currently 51%. This corresponds to a doubling time of 17 months, faster than Moore's Law,<ref type="foot" target="#foot_2">5</ref> which predicts a doubling time of 24 months.</p><p>Assuming that the growth rate of UniParc does not accelerate beyond the current 270% a year (doubling every 7 months), we will see at least 10 billion sequences in UniParc in five years. This estimate translates to 3 × 10¹¹ triples, around 470 times the current size.</p></div>
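<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relation between a yearly growth rate and a doubling time is simple arithmetic, sketched below in Python. Which convention underlies the doubling times quoted in this section is not stated, so the sketch shows two common readings: yearly compounding and a continuous exponential rate. The function names are hypothetical and serve only this illustration.</p>
<p>
```python
import math


def doubling_time_months(annual_growth: float) -> float:
    """Doubling time in months under yearly compounding.

    annual_growth is a fraction, e.g. 0.51 for 51% growth per year.
    """
    return 12.0 * math.log(2.0) / math.log(1.0 + annual_growth)


def doubling_time_months_continuous(annual_rate: float) -> float:
    """Doubling time in months if the rate is read as a continuous rate."""
    return 12.0 * math.log(2.0) / annual_rate


def growth_factor(doubling_months: float, horizon_months: float) -> float:
    """Total growth factor over a horizon, given a fixed doubling time."""
    return 2.0 ** (horizon_months / doubling_months)


if __name__ == "__main__":
    # Core UniProtKB growth rate quoted in the text: 51% per year.
    print(f"compounded: {doubling_time_months(0.51):.1f} months")
    print(f"continuous: {doubling_time_months_continuous(0.51):.1f} months")
    # Five-year (60-month) horizon for the two doubling times in the text.
    print(f"7-month doubling:  x{growth_factor(7.0, 60.0):.0f}")
    print(f"24-month doubling: x{growth_factor(24.0, 60.0):.1f}")
```
</p>
<p>For a 51% yearly rate, yearly compounding gives roughly 20 months and the continuous reading roughly 16 months. A 7-month doubling time sustained for five years corresponds to a growth factor of roughly 380, the same order of magnitude as the factor of 470 quoted for UniParc, whereas a 24-month doubling time yields less than a sixfold increase over the same period.</p></div>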
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Summary</head><p>We have met the requirements for data processing speed and query performance for the near future, but unfortunately on a non-RDF technology stack. Currently available tools do not meet our performance needs for UniProt.org, especially since any RDF solution must preserve the current user interface and full-text search. We do, however, aim to deploy a public SPARQL endpoint once it becomes feasible.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. UniProtKB is growing faster over time. This graph does not show the growth of existing entries as more information becomes available.</figDesc><graphic coords="4,147.32,145.72,318.17,159.09" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2 .</head><label>2</label><figDesc>Rough disk consumption estimates for UniProt release 15.8. Each estimate scales a store's published disk use per billion triples to the roughly 2.8 billion triples in UniProt. The factor 1.4 accounts for the larger number of long literals in UniProt compared with a LUBM 8000 data set.</figDesc><table><row><cell>RDF Store</cell><cell>Data provided</cell><cell>Estimate</cell><cell>Source</cell></row><row><cell>Allegro Graph</cell><cell>155 GB / 1.1B, LUBM 8000</cell><cell>552 GB</cell><cell>155/1.1 × 2.8 × 1.4 (a)</cell></row><row><cell>OWLIM</cell><cell>92 GB / 1.85B, LUBM 8000</cell><cell>195 GB</cell><cell>92/1.85 × 2.8 × 1.4 (b)</cell></row><row><cell>Oracle 11G</cell><cell>154 GB / 1.1B, LUBM 8000</cell><cell>549 GB</cell><cell>154/1.1 × 2.8 × 1.4 (c)</cell></row><row><cell>Virtuoso</cell><cell>120 GB / 1.1B, various</cell><cell>305 GB</cell><cell>120/1.1 × 2.8 (d) (e)</cell></row><row><cell>UniProt.org (f)</cell><cell>36 GB store, 36 GB text index</cell><cell>72 GB (real)</cell><cell>Internal</cell></row></table><note>(a) www.franz.com/agraph/allegrograph/ (b) www.ontotext.com/owlim/OWLIMPres.pdf (c) www.oracle.com/technology/tech/semantic_technologies/htdocs/performance.html (d) The storage cost of Virtuoso was not multiplied by 1.4, as they used a data set with a higher number of triples with large literals than LUBM 8000. (e) virtuoso.openlinksw.com/Whitepapers/html/Virt6FAQ.html#StorageCostPerTriple (f) Our current custom solution, described in Section 2; not a SPARQL engine.</note></figure>
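<div xmlns="http://www.tei-c.org/ns/1.0"><p>The estimates in Table 2 appear to follow one rule: divide the published disk size by the published number of triples (in billions), multiply by the roughly 2.8 billion triples of UniProt release 15.8, and apply the factor 1.4 for long literals, except for Virtuoso (note d). The Python sketch below reproduces the table under that assumption. The function estimate_gb and the constant names are hypothetical, introduced only for this illustration.</p>
<p>
```python
# Billions of triples in UniProt release 15.8 (rounded, see Table 1).
UNIPROT_TRIPLES_BILLIONS = 2.8
# Factor from the Table 2 caption for UniProt's larger share of long literals.
LONG_LITERAL_FACTOR = 1.4


def estimate_gb(reported_gb: float, reported_triples_billions: float,
                literal_factor: float = LONG_LITERAL_FACTOR) -> float:
    """Scale a published store size to UniProt's triple count."""
    gb_per_billion = reported_gb / reported_triples_billions
    return gb_per_billion * UNIPROT_TRIPLES_BILLIONS * literal_factor


# store -> (published size in GB, triples in billions, literal factor applied)
PUBLISHED = {
    "Allegro Graph": (155.0, 1.1, LONG_LITERAL_FACTOR),
    "OWLIM": (92.0, 1.85, LONG_LITERAL_FACTOR),
    "Oracle 11G": (154.0, 1.1, LONG_LITERAL_FACTOR),
    "Virtuoso": (120.0, 1.1, 1.0),  # note d: not multiplied by 1.4
}

if __name__ == "__main__":
    for store, (gb, triples, factor) in PUBLISHED.items():
        print(f"{store:14s} {estimate_gb(gb, triples, factor):6.0f} GB")
```
</p>
<p>Rounded to whole gigabytes, this yields 552, 195, 549 and 305 GB, matching the Estimate column. The measured 72 GB of the custom UniProt.org store is not derived this way, since it is a real figure rather than an extrapolation.</p></div>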
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Available for download at http://www.uniprot.org/downloads.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">LUBM benchmark, http://swat.cse.lehigh.edu/projects/lubm/.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">http://en.wikipedia.org/wiki/Moores_law</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Acknowledgments</head><p>UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support for the EBI's involvement in UniProt comes from the European Commission (EC)'s FELICS grant (021902RII3) and from the NIH grant 1R01HG02273-01. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts FELICS (021902RII3) and SLING (226073). PIR activities are also supported by the NIH grants and contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the Department of Defense grant W81XWH0720112. This work has benefited from Oracle funding support.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The Universal Protein Resource (UniProt)</title>
	</analytic>
	<monogr>
		<title level="j">Nucl. Acids Res</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="D169" to="D174" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
	<note>The UniProt Consortium</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">UniRef: Comprehensive and Non-Redundant UniProt Reference Clusters</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">E</forename><surname>Suzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>McGarvey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mazumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="1282" to="1288" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Protein sequence databases</title>
		<author>
			<persName><forename type="first">R</forename><surname>Apweiler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bairoch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Curr. Opin. Chem. Biol</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="76" to="80" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Infrastructure for the life sciences: design and implementation of the UniProt website</title>
		<author>
			<persName><forename type="first">E</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bairoch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Duvaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Phan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Redaschi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">E</forename><surname>Suzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>McGarvey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gasteiger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">136</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
