-

Estimating effective DNA database size via compression ⋆

Martina Viˇsnˇovsk´a

Michal N´an´asi

Tom´aˇs Vinaˇr

Bronˇa Brejov´a

1 0 Department of Applied Informatics, Faculty of Mathematics , Physics , and Informatics Comenius University , Mlynsk ́a Dolina, 842 48 Bratislava , Slovakia 1 Department of Computer Science, Faculty of Mathematics , Physics , and Informatics Comenius University , Mlynsk ́a Dolina, 842 48 Bratislava , Slovakia

63 70

Search for sequence similarity in large-scale and deletion of nucleotides also invokes penalty. The databases of DNA and protein sequences is one of the goal is then to find a region in the query and a region of essential problems in bioinformatics. To distinguish ran- the sequence database with the highest possible score. dom matches from biologically relevant similarities, it is This task can be accomplished either by a dynamic customary to compute statistical P-value of each discov- programming algorithm [21], or by fast heuristic methered match. In this context, P-value is the probability that ods, such as BLAST [1]. Regardless of the method, the cahsainmcielairnitay cwoimthpaarigsiovnenofscaorreanodrohmigqhuererwyoaunldd aaprpaenadrobmy result is a set of high scoring pairs (substring of a query database. Note that P-value is a function of the database matched to a substring of a database) together with size, since a high-scoring similarity is more likely to exist their similarity scores. by chance in a larger database. Typical genomic databases are large: human Biological databases often contain redundant, identical, or genome alone has approx. 3.2 GB, and the GenBank very similar sequences. This fact is not taken into account traditional database [4] contains more than 100 GB in P-value estimation, resulting in pessimistic estimates. of DNA sequences as of April 2010. In these large One way to address this problem is to use a lower effective databases, many query sequences will have high database size instead of its real size. In this work, we pro- scoring matches purely by chance. To distinguish real pose to estimate the effective size of a database by its com- matches from spurious ones, we need to assess their fpercetsisveedlysiszteo.reAnonalpypraopsriniagtlee ccoompyproesfseioanchalrgeopreiathtemd wstirllinegf-, statistical significance. resulting in a file whose size roughly corresponds to the Traditionally, the statistical significance of a high amount of unique sequence in the database. We evaluate scoring pair is assessed by P -value. If the high scorour approach on real and simulated databases. ing pair has score s, its P -value is the probability of a match with score s or better occurring in homology search of a randomly generated query in a randomly 1 Introduction generated database. The P -value obviously depends on the sizes of the query and the database, but to Recent progress in genome sequencing technologies led achieve more realistic P -values, we can also take into to rapid increase in the size of DNA sequence data- account other properties of the sequences, such as frebases. Perhaps the most common way of accessing quencies of individual nucleotides [17]. these databases is through sequence homology search, Consider an illustration in Fig.1. For a given where users search in the database for sequences sim- score s, we can visualize the sequence database D as ilar to their query of interest [1]. Highly similar se- a set of points in the sequence space with the neighquences have often evolved from a common ances- bourhoods that include all sequences with similarity tor and may also share the same function. Homology score s or higher. In this sequence space, the query Q search is thus the first step in elucidating the function is located at the boundary of one of these neighbourand evolutionary history of a newly sequenced genome. hoods. If the neighbourhoods cover a large fraction of To formally define sequence homology search as the sequence space, the P-value is high, because a rana computational problem, we consider a query Q and domly generated query will have a high probability of a database D, which are both strings composed of falling into one of those neighbourhoods, resulting in nucleotides (symbols from the alphabet fA; C; G; T g). score larger than s. We typically introduce a scoring function that assigns While for a given database the P -values could be a positive score to matching nucleotides in the two se- computed exactly, such a computation would be imquences, and negative score to mismatches. Insertion practical. Instead, one uses various approximations as⋆ This research is funded by European Community suming that the sequence in the database is randomly FP7 grants IRG-224885 and IRG-231025, VEGA grant generated by an i.i.d. process or a Markov chain 1/0210/10, and Comenius University grant number (e.g., [9]). These assumptions are often violated by real UK/151/2010. databases.

new database will still be n′ = n, even though its real size is 2n.

Note that the notion of effective database size has been previously used to adjust for border effects in case of short queries and databases [ 17 ], and an option for setting it to an arbitrary value is included in most software tools for homology search. Analogously, statistical models in population genetics use population size as a parameter, but instead of the actual number of individuals, one typically uses effective population size to compensate for various effects that are not considered by the model, such as population size changing over time [11]. 2

Kolmogorov complexity of DNA sequences

For instance, the database may contain several se- For our purposes, the effective database size should quences that are evolutionarily related. Such se- be an estimate of the amount of unique sequence in quences will often contain only a small number of the database D, taking into account substrings that changes: for example, DNA sequences of human and may be present in D in many exact or approximate chimpanzee are identical in 99% of nucleotides over copies. One way of describing the information content most of their lengths. In addition, recently introduced of a database is its Kolmogorov complexity [ 16 ]. next generation sequencing technologies made it fi- Kolmogorov complexity K(D) of a sequence D is nancially viable to sequence multiple individuals from the bit length of the shortest program P for a fixed the same species, leading to various personal human universal Turing machine that outputs sequence D. genome projects [ 23 ]. Sequences of different individ- Kolmogorov complexity can be understood as a lower uals from the same species may differ as little as one bound (up to a constant additive term) of compression in thousand nucleotides. Consequently, databases may achievable by any general-purpose algorithm. A string contain many highly similar sequences, and it is not of length n sampled uniformly at random from a fixed appropriate to model them as completely random. alphabet is on average almost incompressible [16, Sec

Consider again the example in Fig.1. The illustra- tion 2.8.1]. In particular, a string over a four-letter tion on the right shows a database with clusters of sim- alphabet requires on average approximately 2n bits ilar sequences that have overlapping neighbourhoods, (with up to an O(log n) additive term). Thus if we covering a much smaller fraction of the sequence space believe that the databases with the same Kolmogorov than the database on the left. Consequently, the esti- complexity will behave similarly, we should use n′ = mate based on the assumption of random distribution K(D)=2 as an estimate of the effective database size of sequences in the database will necessarily overes- of database D. timate the P -value, potentially leading to rejection of At a first glance, Kolmogorov complexity seems high scoring pairs as matches likely to occur by chance. to be an ideal estimator to use in this context. It

One of the solutions to this problem proposed in accounts for possible major differences between the the literature is to remove the redundancies, such as real database and a randomly generated one. In parclosely related sequences, from the sequence data- ticular, using Kolmogorov complexity would compenbase [ 22 ]. New query can be first searched against such sate for differences in frequencies of individual nunon-redundant database, the statistical significance of cleotides (database containing only a long stretch of resulting matches can be evaluated, and then a sec- As will have a small effective size), and redundant ondary search can be launched against the whole data- sequence content (sequences that have only few difbase. In this paper, we propose a different solution. ferences can be described very efficiently in a small One of the main parameters used in P -value computa- space). Moreover, the concept of Kolmogorov comtion is the database size n. Instead of the real database plexity has been successfully used in similar contexts size, we propose to use an effective database size n′ before, in applications such as computing distance be(n′ < n) which will account for redundancies in the tween genomes [ 15 ] (see also [ 10, 18 ] for overview). database. For example, if we take a random database The exact Kolmogorov complexity of database D of size n, and double its contents by including second is not computable. Yet in practice, we can use variexact copy of each sequence, the effective size of such ous compression algorithms instead of computing the Kolmogorov complexity. For a fixed compression algo- be conservative, i.e., the estimates should be strictly rithm, the compressed size c(D) of database D is an higher than the exact P -values, so that the estimated upper bound on the Kolmogorov complexity K(D), P -value represents an upper bound on the probability and we can use value n′ = c(D)=2 as an estimate of of a particular match being a false positive. the effective database size. Several efficient algorithms We have decided to experimentally evaluate the specifically tuned to compression of DNA sequences accuracy of the P -values obtained by the compression are available (see [ 6, 7, 14, 3 ]). method in a simple scenario motivated by the next

As we will see in the next section, the upper bounds generation sequencing technologies. In this scenario, on P -values (or conservative estimates) are generally a sequencing machine generates many short sequences desirable in cases where exact P -values cannot be com- (reads). These reads need to be mapped as substrings puted. Since the P -values increase monotonically with to previously known reference genomes. In most cases, the database size, using an upper bound on the effec- the reads will match the reference sequence exactly, tive database size should not by itself lead to non- but sometimes we will see one or two mismatches. conservative bounds. These mismatches can be either due to sequencing er

Unfortunately, using Kolmogorov complexity may rors, or (more interestingly) due to differences between not necessarily lead to conservative bounds in all in- individuals. stances. As an extreme case, base-4 expansion of many fundamental constants, such as , can be generated In our simplified scenario, the database D is a sinby a constant-size program. First n bits generated by gle string of length n, the query Q is a string of such a program can be used as a sequence database of length m, and we are searching in D for a substring of size n, replacing digits 0; : : : ; 3 with nucleotides. Kol- length m with the smallest Hamming distance from Q. mogorov complexity of such database is O(log n) (we In contrast to the full homology search problem, we alneed a constant number of bits to represent the pro- ways consider the whole query (each read has to map gram, and log n bits to represent the real size of the completely to the reference), and we also disallow indatabase), yet for all practical purposes, this database sertions and deletions. behaves as a random database of size n [2].

Thus using a Kolmogorov complexity and compression-based estimates of effective database size does not necessarily lead to conservative estimates of P -values in homology search. In the next section, we explore practical issues of using these compression-based estimates in an experimental setting.

We will say that the distance of query Q from database D is the minimum Hamming distance between Q and some substring of length m from database D. This Hamming distance will represent our score. P -value for a particular Hamming distance h is then the number of all m-tuples that are at a distance of at most h from D, divided by 4m (the number of all possible m-tuples). Note that in this definition of P -value, we keep the database fixed, and only the query is random, chosen uniformly from all possible queries of length m. In contrast, most methods consider both query and database as random.

Estimating effective database size through compression 3

Here, motivated by the discussion in the previous section, we explore the use of compression software for es- The main advantage of this simplified model is that timating the effective database size. The methodology we can compute these P -values exactly in a reasonable is very simple. First, we compress the database and amount of time, and compare them to the P -values esmeasure the size of the resulting file in bytes. Then, timated based on the randomly generated database of we multiply this size by four to account for the fact corresponding effective size, thus evaluating the accuthat in a uniformly random database, we need two racy of our concept. The rest of this section is orgabits to encode each nucleotide. In this way, we obtain nized as follows. First, we introduce the algorithm for an estimate of the effective database size which can exact computation of P -values in our simplified scebe used in any formula or algorithm for estimating nario. Then we explore several cases of generated and P -values on uniformly distributed databases. real databases.

The P -values are used to reject high scoring pairs The file compression algorithms typically detect that have a high probability of occurring by chance. In symbols or small groups of symbols that occur in the a typical search there will be many spurious matches text more frequently than the others, and encode them with high P -values, and only a few top-scoring by shorter codewords. Thus, the size of the compressed matches will represent genuine similarities with com- database is related to the entropy of the source genermon evolutionary origins. Since our main task is to ating the database. In addition to that, some methods separate these few good matches from many spuri- try to detect longer strings that are repeated exactly ous matches, we would like the P -value estimator to or with mismatches multiple times, and to store only one copy of each such string as well as differences be- in the given sequence. GC-content varies widely in tween the approximate copies. different genomes, or even between segments of the

In our experiments, we first try to separate these same genome. An i.i.d. database with GC-content g two phenomena by using artificially generated data- has entropy H(g) = g lg(g=2) (1 g) lg((1 g)=2), bases, and in the end, we apply compression software and therefore can be encoded by approximately H(g)n to real DNA sequences, where both of these issues are bits, for example by the arithmetic encoding [ 20 ]. at play. Following our general approach, we can try to use the formulas for the uniform nucleotide frequencies 3.1 Algorithm to compute exact P -values tainmdateeffetchteivPe -dvaatluabesasfeorsitzheeEa(cntu;agl) d=atHab(ags)en=o2f stiozeesnWe have implemented an algorithm that simulta- and GC-content g. neously computes P -values for a given database D We have tested the performance of such estimates and all of its prefixes, and for all distances h = on randomly generated databases with GC-contents 0; 1; : : : ; hmax. In experiments we use values hmax = 3 75% and 90%, averaging values for five randomly genand m = 15. erated databases. Table 1 shows the real P -values,

The algorithm proceeds along the database D and P -values predicted by our compression method, and at each position updates two arrays M and H. P -values obtained by the simple method of considArray M of size 4m stores for each m-tuple Q′ its ering a database of length n and GC-content 50% distance to the current prefix of the database, pro- (disregarding the real GC-content in the P -value esvided that this distance is at most hmax (otherwise it timate). All estimates were computed by our algouses a special 1 value). The second array H stores rithm from Section 3.1. For real P-values the algorithm for each distance h hmax the number of m-tuples was applied to the generated database with a skewed with distance exactly h from the current prefix of the GC-content, for simple and predicted estimates to five database D. random databases of appropriate size with

When we read a new nucleotide of the database, GC-content 50%. we need to consider the last m-tuple of the current For small P -values the real and simple estimates prefix. We enumerate all m-tuples at a distance of at are quite similar, which is expected since in a short most hmax from this new m-tuple, and for each we database very few m-tuples occur multiple times, even update arrays M and H as appropriate. Values in H if the composition of the database is skewed. As a recan be easily converted to the desired P -values for the sult, the compression method gives non-conservative current prefix at different distance thresholds. Note estimates, because it uses a much smaller database that this algorithm is feasible only for small values size. For larger P -values, the compression estimates of m, because it requires (4m) memory. become conservative, and are closer to the true

The algorithm can be used to compute exact P -value than the simple estimates. This is because P -value for a given real sequence database, which we a database with high GC-content is less likely to conconsider as a reference value. It can also be applied to tain m-tuples with low GC-content, and thus a larger randomly generated databases, leading to estimates database size is required to achieve the same P -value. of P -values that would be obtained in a model, where For example, among estimates for GC-content 75% both the database and the query are random. To em- shown in Table 1, compression estimates become conulate the proposed method, we compress a sequence servative for databases larger than 107 and 105 nudatabase, compute its effective size, and then use the cleotides for h = 0 and h = 2, respectively. P -value estimates obtained from random databases of We can study the situation analytically for the case the matching size. when the query appears in the database without a mis

Finally, we also consider a simple method, where match. Let Xi be the event that query Q has an ocwe use random databases of the same size as the orig- currence with distance at most h at position i in the inal database, without compression. P -values com- random database of GC-content g, and let X be the puted in this way correspond to the commonly used total number of occurrences of query Q with at most techniques for P -value estimation without using the h mismatches in the whole database. For g = 0:5, we effective database size mechanism. can compute the probability of Xi by summing over the number of mismatches 3.2

Entropy The four nucleotides do not occur in genomes equally frequently. A commonly used measure of DNA composition is GC-content: the percentage of Cs and Gs

h (m) ( 3 )k ( 1 )m−k P (XijQ; g = 0:5) = ∑ : k=0 k 4 4

For general g we have to distinguish between mismatches on G/C positions and mismatches on A/T

Database size 104 Real (GC 75%) 9:3 10 6 Predicted (GC 75%) 8:4 10 6 Simple (GC 75%) 9:3 10 6 Real (GC 90%) 9:2 10 6 Predicted (GC 90%) 6:5 10 6 Simple (GC 90%) 9:3 10 6 105

The expected number of occurrences can be computed by linearity of expectation (for simplicity, we ignore the edge effect at positions n m + 2; : : : ; n, which is negligible for large n):

E(XjQ; g) = n ∑ E(XijQ) = nP (XijQ; g): i=1

Therefore for small P -values, where the approximation

of 1 e−x is sufficiently accurate, the estimate Pest is lower by approximately a factor of H(g)=2 than the real P-value. This implies that no correction for entropy is in fact necessary for small P -values.

On the other hand, when h = 0 but n is sufficiently large, the compression estimates become conservative.

In particular, let us assume that n >

H(g)2−m−1 m2−m ln(2) (1 g)m = 1:386 H(g)

m4m + o(m4m): This implies e−E(n;g)4 m < 2−me−n2 m(1−g)m . The right-hand side is one of the terms of the sum m (m) ∑ aTWhPeioswicsiosllomanmpdpoirnsotlxryiibmuusaettideonatphwpeirdtohixsittmrhieabtumitoienoan[n1o7f] dv=aisrEriea(gbXalerjdQXs;dgbe)y-. z=0 z 2−me−n2 mgz(1−g)m z ; pendencies between occurrences at adjacent positions and therefore the left-hand side is upper-bounded by Ianndoaulrsosiamssuulmateiosntshaitt nleids ltaorgveearnydgooredlaetsivtiemlyastmesalolf. the Twhhioslesi msupmle absowuneldl. iTshniostimveprlyieisntPeersetstinPgr,easli.nce it P -values (data not shown). If X is from Poisson dis- works only for very large n, where P-values are very tribution, the probability of at least one occurrence of close to one. Nonetheless, it agrees with our observaa given query Q is P (X > 0jQ; g) = 1 e− . To obtain tion that the compression estimate is appropriate for the final P -value, we have to consider this expression sufficiently large n. Perhaps a tighter bound on n can for different queries, or more precisely, for groups of be obtained by considering additional elements of the queries with the same number Gs and Cs, of which Qz sum. is one representative:

m (m) Preal = P (X > 0jg) = ∑ z=0 z 2−mP (X > 0jQz; g): 3.3

Redundancy

∑zm=0 (mz)2−mn2−mgz(1 The compression estimate Pest uses the same formula Next, we consider artificial databases that are conbut g = 0:5 and E(n; g) instead of n. For h = 0, catenation of many mutually similar sequences of the E(XjQ; g) simplifies to n2−mgz(1 g)m−z, and in par- same length k. In the experiments we use k = 104. ticular E(XjQ; 0:5) = n4−m does not depend on z, To generate the database, we first sample a string faUunsndicntgtihotnehrie1sfoarpeep−rPoxexsitcma=natbi1oen,wweel−elEoa(bpntp;agr)io4nximm.aFteodr bsymaxll[8x],. sqStur=einnsgc1esws.2il:l: :bsek tohfelecnegnttherkoufntihfoercml ulystaetr roafnsdiommil.arThseisThe i-th sequence in the concatenated database is Pest = E(n; g)4−m obtained from S by randomly mutating several nuPreal g)m−z = H(g)=2: cleotides of S so that the nucleotide j is the same

Database size 104 Real 9:3 10 6 GenCompress 9:3 10 6 bzip2 1:0 10 5 Simple 9:3 10 6 105 as sj with the probability of 90%, and with the probability of 10% it changes to another nucleotide chosen randomly from the remaining three. This way, we get a clustered database of sequences that differ from the center of the cluster on 10% positions on average.

In this experiment, estimates based on data compression are mostly conservative and better than the estimates obtained by the simple methods.

3.4

Real data

For clustered databases of various sizes, we compute the real P -value simple estimate, which uses effective size equal to the real size of the database without compression, and two estimates based on two different lossless compression programs GenCompress [6] and bzip2. The results are shown in Table 2.

Finally, we have applied the compression method of

estimating the effective database size to real genomic data from human, chimpanzee, and rhesus macaque.

Our set consisted of a portion of human chromosome 22 and corresponding portions of genome from

Bzip2 is based on Burrows–Wheeler transform [5] the two other primates. We have omitted larger blocks which tends to create blocks of identical symbols if the that did not have counterparts in neither chimpanzee, input contains repeated substrings. The transformed nor macaque. The sequence data and the genome text is then encoded by other compression techniques, alignments were obtained from the UCSC genome such as Huffman encoding. To save memory, bzip2 di- browser [13]. vides a file into blocks and processes each block sep- The database contains a short block of the human arately, which may have negative effect on the com- sequence followed by the corresponding block in the pressed size. two other genomes, followed by another block in human, etc. Therefore, similar sequences are close to each

GenCompress [6] is an algorithm developed specifi- other which improves compression with algorithms cally for compressing DNA sequences. It finds approx- such as bzip2 that always consider only a block of the imate repeats in the compressed sequence and encodes whole file. them by a sequence of edit operations. GenCompress The results of the test for different input sizes are is a single-pass algorithm that proceed along the in- shown in Table 3. As we can see, with GenCompress put sequence and in each step it finds the best prefix we often underestimate real P -values, while bzip2 that can be encoded as an approximate repeat of some achieves only a very small improvement compared to substring of already encoded input sequence. the simple method without compression.

Earlier, we have reached a conclusion that skewed similarity common in DNA sequence databases, GC-content should not automatically imply lower ef- one could use fast methods for identifying such fective database sizes since this would lead to under- blocks [ 12, 19 ] to speed up the algorithms developed estimation of small P -values. However, compression specifically for compression of DNA sequences [6, 7, algorithms estimate the effective database size based 14, 3]. on both sequence redundancy and lower sequence en- Finally, we would like to extend our work to more tropy in case of locally high or low GC-contents. That complex scenarios of homology search. Longer query could be the reason why GenCompress estimates are sequences will require handling of insertions, deletions, non-conservative. and matches that involve only portions of the query

We have attempted to further correct for this issue sequence. More complex scoring schemes on both nuby computing an average database entropy H′. The ef- cleotide and protein sequences also need to be examfective database size estimated by GenCompress was ined. then multiplied by the correction factor of 2=H′. This approach leads to surprisingly good P -value estimates References that were also conservative in our experiments (Table 3, line GC corrected).

To estimate the value H′ for this experiment, we have computed entropy separately for non-overlapping windows of size 1000 to capture different properties of individual genomic regions. We have estimated a Markov chain of second order from each window of the sequence and computed an entropy of this Markov chain, that is, entropy where the probability of each nucleotide is conditioned on the two previous nucleotides to capture local dependencies in DNA sequences.

The average entropy H′ of the whole database was then computed as an average of entropy values from all windows. 1. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and

D.J. Lipman: Basic local alignment search tool. Journal of Molecular Biology, 215 (3), 1990, 403–410. 2. D. Bailey and R. Crandall: Random generators and normal numbers. Experimental Mathematics, 11 (4), 2002, 527–546. 3. B. Behzadi and F. Le Fessant: DNA compression challenge revisited: a dynamic programming approach.

In: Combinatorial Pattern Matching (CPM 2005),

Springer, 2005, 190–200. 4. D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and E.W. Sayers: GenBank. Nucleic Acids Research, 37 (Database issue), 2009, D26–31. 5. M. Burrows and D.J. Wheeler: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 4 Conclusion 1994. 6. X. Chen, S. Kwong, and M. Li: A compression In this paper, we have considered methods for more algorithm for DNA sequences and its applications accurate estimation of P -values in the context of se- in genome comparison. In: Genome Informatics quence homology search. In particular, we propose to 7. X(G. ICWh’e9n9,)M,U.nLiiv,eBr.saMl aA,caanddemJ.yTProremssp,: 1D9N99A, C51o–m6p1r.ess: adjust the size of the database to compensate for the fast and effective DNA sequence compression. Bioinstructure present in the database due to the fact that formatics, 18 (12), 2002, 1696–1698. individual sequences are related by evolution. 8. T.H. Cormen, C.E. Leiserson, and R.L. Rivest: Intro

We have explored the idea of using compression duction to algorithms. MIT Press, 1990. to estimate the effective database size, and we have 9. A. Dembo, S. Karlin, and O. Zeitouni: (1994). Limit demonstrated by experiments that the use of the com- distribution of maximal non-aligned two-sequence segpression algorithms leads to non-conservative P -value mental score. The Annals of Probability, 22 (4), 1994, estimates for small P -values. This is at least partially 10. R20.22G–i2a0n3c9a.rlo, D. Scaturro, and F. Utro: Textual caused by the fact that besides identifying longer re- data compression in computational biology: a synoppeated substrings, compression algorithms also com- sis. Bioinformatics, 25 (13), 2009, 1575–1576. pensate for sequences with low entropy. We have 11. D.L. Hartl and A.G. Clark: Principles of population shown that such compensation should not be consid- genetics, Fourth Edition. Sinauer Associates, 2006. ered, at least in the case of small P -values. We have 12. W.J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and suggested a simple way to disentangle the portion of D. Haussler: D. Evolution’s cauldron: duplication, the compression coming from locally low entropy and deletion, and rearrangement in the mouse and human tsihmowatnest.hat the correction leads to better P -value es- eg1ne1nc4eo8sm4–oe9fs.t.hPerUocneietdedinSgstaotfetshoef NAamtieornicaal,A1c0a0de(m20y),o2f 0S0c3i-,

The compression would be a fast and efficient way 13. W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, of estimating the effective database size. Even though T.H. Pringle, A.M. Zahler, and D. Haussler: The humost of the general purpose compression algorithms man genome browser at UCSC. Genome Research, 12 (such as bzip2) cannot handle distant large blocks of (6), 2002, 996–1006.

14. G. Korodi and I. Tabus: An efficient normalized maximum likelihood algorithm for dna sequence compression . ACM Transactions on Information Systems , 23 ( 1 ), 2005 , 3 - 34 .

15. M. Li , J.H.

Badger , C.

Xin , S.

Kwong , P.

Kearney , and H. Zhang:

An information based sequence distance and its application to whole mitochondrial genome phylogeny . Bioinformatics , 17 ( 2 ), 2001 , 149 - 154 .

16.

Li and

Vit ´anyi: An introduction to Kolmogorov complexity and its applications . Springer, 2008 .

17.

A.Y.

Mitrophanov and M. Borodovsky

: (

2006 ). Statistical significance in biological sequence analysis . Briefings in Bioinformatics , 7 ( 1 ), 2006 , 2 - 24 .

18.

O.U.

Nalbantoglu ,

D.J.

Russell , and K. Sayood: Data compression concepts and algorithms and their applications to bioinformatics . Entropy (Basel, Switzerland), 12 ( 1 ), 2010 , 34 .

19.

Paten ,

Herrero ,

Fitzgerald ,

Beal ,

Flicek , I. Holmes , and

Birney : Genome-wide nucleotidelevel mammalian ancestor reconstruction . Genome Research , 18 ( 11 ), 2008 , 1829 - 1833 .

20. Rissanen , J. and Langdon , G. ( 1979 ). Arithmetic coding . IBM Journal of Research and Development , 23 ( 2 ): 149 - 162 .

21.

T.F.

Smith and M.S. Waterman: Identification of common molecular subsequences . Journal of Molecular Biology , 147 ( 1 ), 1981 , 195 - 197 .

22.

B.E.

Suzek ,

Huang ,

McGarvey ,

Mazumder , and C.H. Wu: UniRef: comprehensive and nonredundant UniProt reference clusters . Bioinformatics , 23 ( 10 ), 2007 , 1282 - 1288 .

23. J.C. Venter: Multiple personal genomes await . Nature , 464 ( 7289 ), 2010 , 676 - 677 .