Estimating effective DNA database size via compression⋆

Martina Višňovská¹, Michal Nánási², Tomáš Vinař¹, and Broňa Brejová²

¹ Department of Applied Informatics, Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská Dolina, 842 48 Bratislava, Slovakia
² Department of Computer Science, Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská Dolina, 842 48 Bratislava, Slovakia

⋆ This research is funded by European Community FP7 grants IRG-224885 and IRG-231025, VEGA grant 1/0210/10, and Comenius University grant UK/151/2010.

Abstract. Search for sequence similarity in large-scale databases of DNA and protein sequences is one of the essential problems in bioinformatics. To distinguish random matches from biologically relevant similarities, it is customary to compute the statistical P-value of each discovered match. In this context, the P-value is the probability that a similarity with a given score or higher would appear by chance in a comparison of a random query and a random database. Note that the P-value is a function of the database size, since a high-scoring similarity is more likely to exist by chance in a larger database. Biological databases often contain redundant, identical, or very similar sequences. This fact is not taken into account in P-value estimation, resulting in pessimistic estimates. One way to address this problem is to use a lower effective database size instead of the real size. In this work, we propose to estimate the effective size of a database by its compressed size. An appropriate compression algorithm will effectively store only a single copy of each repeated string, resulting in a file whose size roughly corresponds to the amount of unique sequence in the database. We evaluate our approach on real and simulated databases.

1 Introduction

Recent progress in genome sequencing technologies has led to a rapid increase in the size of DNA sequence databases. Perhaps the most common way of accessing these databases is through sequence homology search, where users search the database for sequences similar to their query of interest [1]. Highly similar sequences have often evolved from a common ancestor and may also share the same function. Homology search is thus the first step in elucidating the function and evolutionary history of a newly sequenced genome.

To formally define sequence homology search as a computational problem, we consider a query Q and a database D, which are both strings composed of nucleotides (symbols from the alphabet {A, C, G, T}). We typically introduce a scoring function that assigns a positive score to matching nucleotides in the two sequences and a negative score to mismatches. Insertions and deletions of nucleotides also incur a penalty. The goal is then to find a region in the query and a region of the sequence database with the highest possible score.

This task can be accomplished either by a dynamic programming algorithm [21], or by fast heuristic methods, such as BLAST [1]. Regardless of the method, the result is a set of high scoring pairs (a substring of the query matched to a substring of the database) together with their similarity scores.
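To make the problem statement concrete, the following is a minimal sketch of local alignment scoring by dynamic programming in the spirit of Smith and Waterman [21]. The scoring parameters (+1 match, −1 mismatch, −1 indel) are illustrative assumptions only, not values used in this paper.

```python
# A minimal sketch of local alignment scoring by dynamic programming.
# Returns the score of the best high scoring pair between query and db.

def local_alignment_score(query: str, db: str,
                          match: int = 1, mismatch: int = -1,
                          indel: int = -1) -> int:
    prev = [0] * (len(db) + 1)   # previous row of the DP table
    best = 0
    for q in query:
        curr = [0] * (len(db) + 1)
        for j, d in enumerate(db, start=1):
            diag = prev[j - 1] + (match if q == d else mismatch)
            # A local alignment may restart with score 0 at any point.
            curr[j] = max(0, diag, prev[j] + indel, curr[j - 1] + indel)
            best = max(best, curr[j])
        prev = curr
    return best

print(local_alignment_score("ACGTACGT", "TTACGTAA"))  # 5: shared ACGTA
```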
Typical genomic databases are large: the human genome alone comprises approximately 3.2 billion nucleotides, and the traditional GenBank database [4] contains more than 100 GB of DNA sequence as of April 2010. In these large databases, many query sequences will have high scoring matches purely by chance. To distinguish real matches from spurious ones, we need to assess their statistical significance.

Traditionally, the statistical significance of a high scoring pair is assessed by its P-value. If the high scoring pair has score s, its P-value is the probability of a match with score s or better occurring in a homology search of a randomly generated query in a randomly generated database. The P-value obviously depends on the sizes of the query and the database, but to achieve more realistic P-values, we can also take into account other properties of the sequences, such as the frequencies of individual nucleotides [17].

Consider the illustration in Fig. 1. For a given score s, we can visualize the sequence database D as a set of points in the sequence space, with neighbourhoods that include all sequences with similarity score s or higher. In this sequence space, the query Q is located at the boundary of one of these neighbourhoods. If the neighbourhoods cover a large fraction of the sequence space, the P-value is high, because a randomly generated query will have a high probability of falling into one of those neighbourhoods, resulting in a score larger than s.

Fig. 1. Illustration of P-value computation. The points represent sequences in database D and the shaded areas show their neighbourhoods with scores that are at least as good as the similarity score of query Q. Left: Database with randomly distributed sequences. Right: Database with clustered sequences.

While for a given database the P-values could be computed exactly, such a computation would be impractical. Instead, one uses various approximations assuming that the sequence in the database is randomly generated by an i.i.d. process or a Markov chain (e.g., [9]). These assumptions are often violated by real databases.

For instance, the database may contain several sequences that are evolutionarily related. Such sequences will often differ only in a small number of changes: for example, the DNA sequences of human and chimpanzee are identical in 99% of nucleotides over most of their lengths. In addition, recently introduced next generation sequencing technologies have made it financially viable to sequence multiple individuals from the same species, leading to various personal human genome projects [23]. Sequences of different individuals from the same species may differ in as little as one nucleotide in a thousand. Consequently, databases may contain many highly similar sequences, and it is not appropriate to model them as completely random.
Consider again the example in Fig. 1. The illustration on the right shows a database with clusters of similar sequences whose neighbourhoods overlap, covering a much smaller fraction of the sequence space than the database on the left. Consequently, an estimate based on the assumption of a random distribution of sequences in the database will necessarily overestimate the P-value, potentially leading to the rejection of high scoring pairs as matches likely to occur by chance.

One of the solutions to this problem proposed in the literature is to remove redundancies, such as closely related sequences, from the sequence database [22]. A new query can first be searched against such a non-redundant database, the statistical significance of the resulting matches can be evaluated, and a secondary search can then be launched against the whole database. In this paper, we propose a different solution. One of the main parameters used in P-value computation is the database size n. Instead of the real database size, we propose to use an effective database size n′ (n′ < n) which accounts for redundancies in the database. For example, if we take a random database of size n and double its contents by including a second exact copy of each sequence, the effective size of the new database will still be n′ = n, even though its real size is 2n.

Note that the notion of effective database size has been previously used to adjust for border effects in the case of short queries and databases [17], and an option for setting it to an arbitrary value is included in most software tools for homology search. Analogously, statistical models in population genetics use population size as a parameter, but instead of the actual number of individuals, one typically uses an effective population size to compensate for various effects that are not considered by the model, such as population size changing over time [11].

2 Kolmogorov complexity of DNA sequences

For our purposes, the effective database size should be an estimate of the amount of unique sequence in the database D, taking into account substrings that may be present in D in many exact or approximate copies. One way of describing the information content of a database is its Kolmogorov complexity [16].

The Kolmogorov complexity K(D) of a sequence D is the bit length of the shortest program P for a fixed universal Turing machine that outputs sequence D. Kolmogorov complexity can be understood as a lower bound (up to a constant additive term) on the compression achievable by any general-purpose algorithm. A string of length n sampled uniformly at random from a fixed alphabet is on average almost incompressible [16, Section 2.8.1]. In particular, a string over a four-letter alphabet requires on average approximately 2n bits (with up to an O(log n) additive term). Thus, if we believe that databases with the same Kolmogorov complexity will behave similarly, we should use n′ = K(D)/2 as an estimate of the effective database size of database D.

At first glance, Kolmogorov complexity seems to be an ideal estimator to use in this context. It accounts for possible major differences between the real database and a randomly generated one. In particular, using Kolmogorov complexity would compensate for differences in the frequencies of individual nucleotides (a database containing only a long stretch of As will have a small effective size) and for redundant sequence content (sequences that have only a few differences can be described very efficiently in a small space). Moreover, the concept of Kolmogorov complexity has been successfully used in similar contexts before, in applications such as computing distances between genomes [15] (see also [10, 18] for an overview).

The exact Kolmogorov complexity of database D is not computable. Yet in practice, we can use various compression algorithms instead of computing the Kolmogorov complexity. For a fixed compression algorithm, the compressed size c(D) of database D is an upper bound on the Kolmogorov complexity K(D), and we can use the value n′ = c(D)/2 as an estimate of the effective database size.
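The following toy sketch illustrates both the estimator n′ = c(D)/2 and the doubling example above. We use bzip2 (via Python's bz2 module) purely as a convenient stand-in for a compressor; the exact numbers depend on the compressor, and a DNA-specific tool would track the amount of unique sequence more closely.

```python
# Toy illustration of n' = c(D)/2 bits, i.e. 4 * (compressed size in bytes).
import bz2
import random

random.seed(1)
db = "".join(random.choice("ACGT") for _ in range(100_000))

def effective_size(seq: str) -> int:
    # 1 byte = 8 bits; a random nucleotide needs 2 bits, hence the factor 4.
    return 4 * len(bz2.compress(seq.encode()))

print(effective_size(db))       # close to n = 100,000 for a random database
print(effective_size(db + db))  # well below 2n: the second copy is redundant
                                # (ideally it would stay near n)
```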
Several efficient algorithms specifically tuned to the compression of DNA sequences are available (see [3, 6, 7, 14]).

As we will see in the next section, upper bounds on P-values (or conservative estimates) are generally desirable in cases where exact P-values cannot be computed. Since P-values increase monotonically with the database size, using an upper bound on the effective database size should not by itself lead to non-conservative bounds.

Unfortunately, using Kolmogorov complexity may not necessarily lead to conservative bounds in all instances. As an extreme case, the base-4 expansion of many fundamental constants, such as π, can be generated by a constant-size program. The first n digits generated by such a program can be used as a sequence database of size n, replacing digits 0, . . . , 3 with nucleotides. The Kolmogorov complexity of such a database is O(log n) (we need a constant number of bits to represent the program and log n bits to represent the real size of the database), yet for all practical purposes, this database behaves as a random database of size n [2].

Thus, using Kolmogorov complexity and compression-based estimates of the effective database size does not necessarily lead to conservative estimates of P-values in homology search. In the next section, we explore practical issues of using these compression-based estimates in an experimental setting.

3 Estimating effective database size through compression

Here, motivated by the discussion in the previous section, we explore the use of compression software for estimating the effective database size. The methodology is very simple. First, we compress the database and measure the size of the resulting file in bytes. Then, we multiply this size by four to account for the fact that in a uniformly random database, we need two bits to encode each nucleotide. In this way, we obtain an estimate of the effective database size which can be used in any formula or algorithm for estimating P-values on uniformly distributed databases.

The P-values are used to reject high scoring pairs that have a high probability of occurring by chance. In a typical search there will be many spurious matches with high P-values, and only a few top-scoring matches will represent genuine similarities with common evolutionary origins. Since our main task is to separate these few good matches from many spurious ones, we would like the P-value estimator to be conservative, i.e., the estimates should be strictly higher than the exact P-values, so that the estimated P-value represents an upper bound on the probability of a particular match being a false positive.

We have decided to experimentally evaluate the accuracy of the P-values obtained by the compression method in a simple scenario motivated by next generation sequencing technologies. In this scenario, a sequencing machine generates many short sequences (reads). These reads need to be mapped as substrings to previously known reference genomes. In most cases, the reads will match the reference sequence exactly, but sometimes we will see one or two mismatches. These mismatches can be either due to sequencing errors, or (more interestingly) due to differences between individuals.

In our simplified scenario, the database D is a single string of length n, the query Q is a string of length m, and we are searching in D for a substring of length m with the smallest Hamming distance from Q. In contrast to the full homology search problem, we always consider the whole query (each read has to map completely to the reference), and we also disallow insertions and deletions.

We will say that the distance of query Q from database D is the minimum Hamming distance between Q and some substring of length m of database D. This Hamming distance will represent our score. The P-value for a particular Hamming distance h is then the number of all m-tuples that are at a distance of at most h from D, divided by 4^m (the number of all possible m-tuples). Note that in this definition of the P-value, we keep the database fixed, and only the query is random, chosen uniformly from all possible queries of length m. In contrast, most methods consider both the query and the database as random.
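This definition can be turned directly into a brute-force reference implementation: enumerate all 4^m possible queries and record the distance of each from D. The sketch below is meant only to pin down the definition; the algorithm of Section 3.1 computes the same quantities incrementally for all prefixes of D.

```python
# Direct (exponential-time) computation of the P-values just defined.
# Feasible only for tiny m.
from itertools import product

def exact_pvalues(db: str, m: int, hmax: int) -> list:
    windows = [db[i:i + m] for i in range(len(db) - m + 1)]
    counts = [0] * (hmax + 1)  # counts[h] = number of queries at distance <= h
    for q in product("ACGT", repeat=m):
        dist = min(sum(a != b for a, b in zip(q, w)) for w in windows)
        for h in range(dist, hmax + 1):
            counts[h] += 1
    return [c / 4 ** m for c in counts]

print(exact_pvalues("ACGTTGCAACGT", m=5, hmax=1))  # P-values for h = 0 and h = 1
```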
The main advantage of this simplified model is that we can compute these P-values exactly in a reasonable amount of time and compare them to the P-values estimated from a randomly generated database of the corresponding effective size, thus evaluating the accuracy of our concept. The rest of this section is organized as follows. First, we introduce the algorithm for the exact computation of P-values in our simplified scenario. Then we explore several cases of generated and real databases.

File compression algorithms typically detect symbols or small groups of symbols that occur in the text more frequently than others and encode them by shorter codewords. Thus, the size of the compressed database is related to the entropy of the source generating the database. In addition, some methods try to detect longer strings that are repeated, exactly or with mismatches, multiple times, and to store only one copy of each such string as well as the differences between the approximate copies. In our experiments, we first try to separate these two phenomena by using artificially generated databases, and in the end, we apply compression software to real DNA sequences, where both of these issues are at play.

3.1 Algorithm to compute exact P-values

We have implemented an algorithm that simultaneously computes P-values for a given database D and all of its prefixes, and for all distances h = 0, 1, . . . , h_max. In the experiments we use the values h_max = 3 and m = 15.

The algorithm proceeds along the database D and at each position updates two arrays M and H. Array M of size 4^m stores for each m-tuple Q′ its distance to the current prefix of the database, provided that this distance is at most h_max (otherwise it uses a special ∞ value). The second array H stores for each distance h ≤ h_max the number of m-tuples with distance exactly h from the current prefix of the database D.

When we read a new nucleotide of the database, we need to consider the last m-tuple of the current prefix. We enumerate all m-tuples at a distance of at most h_max from this new m-tuple, and for each we update the arrays M and H as appropriate. Values in H can be easily converted to the desired P-values for the current prefix at different distance thresholds. Note that this algorithm is feasible only for small values of m, because it requires Θ(4^m) memory.
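A sketch of this algorithm follows, with m-tuples encoded as base-4 integers. The encoding and the order of enumerating the Hamming neighbourhood are our implementation choices; the paper fixes only the arrays M and H and the per-position update.

```python
# Sketch of the algorithm of Section 3.1: arrays M and H are updated
# incrementally while scanning the database, so P-values are obtained
# for every prefix. Memory is Theta(4^m), so only small m is practical.
from itertools import combinations, product

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}
INF = float("inf")

def prefix_pvalues(db: str, m: int, hmax: int):
    M = [INF] * (4 ** m)   # M[q] = distance of m-tuple q to the current prefix
    H = [0] * (hmax + 1)   # H[h] = number of m-tuples at distance exactly h
    for i in range(m - 1, len(db)):
        code = 0
        for c in db[i - m + 1:i + 1]:   # encode the last m-tuple in base 4
            code = code * 4 + BASE[c]
        # Enumerate all m-tuples within distance hmax of this window:
        # choose k mismatch positions and a substitution at each.
        for k in range(hmax + 1):
            for pos in combinations(range(m), k):
                for subs in product(range(1, 4), repeat=k):
                    q = code
                    for p, s in zip(pos, subs):
                        shift = 4 ** (m - 1 - p)
                        digit = (q // shift) % 4
                        q += ((digit + s) % 4 - digit) * shift
                    if k < M[q]:                 # distance improved
                        if M[q] != INF:
                            H[int(M[q])] -= 1
                        M[q] = k
                        H[k] += 1
        # P-values for the prefix db[:i+1] at each distance threshold h.
        yield [sum(H[:h + 1]) / 4 ** m for h in range(hmax + 1)]

for pv in prefix_pvalues("ACGTTGCA", m=3, hmax=1):
    pass
print(pv)  # P-values for the whole database at h = 0 and h = 1
```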
The algorithm can be used to compute the exact P-value for a given real sequence database, which we consider as the reference value. It can also be applied to randomly generated databases, leading to estimates of the P-values that would be obtained in a model where both the database and the query are random. To emulate the proposed method, we compress a sequence database, compute its effective size, and then use the P-value estimates obtained from random databases of the matching size.

Finally, we also consider a simple method, where we use random databases of the same size as the original database, without compression. P-values computed in this way correspond to the commonly used techniques for P-value estimation without the effective database size mechanism.

3.2 Entropy

The four nucleotides do not occur in genomes equally frequently. A commonly used measure of DNA composition is GC-content: the percentage of Cs and Gs in the given sequence. GC-content varies widely between different genomes, or even between segments of the same genome. An i.i.d. database with GC-content g has entropy

H(g) = -g \lg(g/2) - (1-g) \lg((1-g)/2),

and can therefore be encoded in approximately H(g)n bits, for example by arithmetic coding [20]. Following our general approach, we can try to use the formulas for uniform nucleotide frequencies with the effective database size E(n, g) = H(g)n/2 to estimate the P-values for the actual database of size n and GC-content g.
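In code, the entropy-based effective size is a one-liner; a small sketch (function names are ours):

```python
# Entropy-based effective size of Section 3.2: an i.i.d. database with
# GC-content g needs about H(g) bits per nucleotide, so it is treated as
# a uniform database of size E(n, g) = H(g) * n / 2.
from math import log2

def H(g: float) -> float:
    """Entropy per nucleotide, in bits, of an i.i.d. source with GC-content g."""
    return -g * log2(g / 2) - (1 - g) * log2((1 - g) / 2)

def entropy_effective_size(n: int, g: float) -> float:
    return H(g) * n / 2

print(H(0.5))                              # 2.0 bits for a uniform database
print(entropy_effective_size(10**6, 0.75)) # about 0.91 * 10**6
```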
We have tested the performance of such estimates on randomly generated databases with GC-contents of 75% and 90%, averaging the values over five randomly generated databases. Table 1 shows the real P-values, the P-values predicted by our compression method, and the P-values obtained by the simple method of considering a database of length n and GC-content 50% (disregarding the real GC-content in the P-value estimate). All estimates were computed by our algorithm from Section 3.1. For the real P-values, the algorithm was applied to the generated database with a skewed GC-content; for the simple and predicted estimates, to five random databases of the appropriate size with GC-content 50%.

                            h = 0                                          h = 2
Database size       10^4       10^5       10^6       10^7       10^4       10^5       10^6       10^7
Real (GC 75%)       9.3·10^-6  9.3·10^-5  9.2·10^-4  8.4·10^-3  8.8·10^-3  7.0·10^-2  3.2·10^-1  7.2·10^-1
Predicted (GC 75%)  8.4·10^-6  8.5·10^-5  8.4·10^-4  8.4·10^-3  8.3·10^-3  8.1·10^-2  5.7·10^-1  1.0
Simple (GC 75%)     9.3·10^-6  9.3·10^-5  9.3·10^-4  9.3·10^-3  9.2·10^-3  8.8·10^-2  6.0·10^-1  1.0
Real (GC 90%)       9.2·10^-6  8.7·10^-5  6.7·10^-4  3.9·10^-3  6.5·10^-3  3.3·10^-2  1.1·10^-1  2.6·10^-1
Predicted (GC 90%)  6.5·10^-6  6.8·10^-5  6.8·10^-4  6.8·10^-3  6.4·10^-3  6.5·10^-2  4.9·10^-1  1.0
Simple (GC 90%)     9.3·10^-6  9.3·10^-5  9.3·10^-4  9.3·10^-3  9.2·10^-3  8.8·10^-2  6.0·10^-1  1.0

Table 1. P-values for random databases of various lengths n, GC-content 75% or 90%, and query distance h = 0 or h = 2. Real P-values are computed by the algorithm described in Section 3.1. Predicted P-values take into account the entropy of the database, using the effective database size H(g)n/2. The simple method disregards the GC-content and considers a random database of size n and 50% GC-content.

For small P-values, the real and simple estimates are quite similar, which is expected, since in a short database very few m-tuples occur multiple times, even if the composition of the database is skewed. As a result, the compression method gives non-conservative estimates, because it uses a much smaller database size. For larger P-values, the compression estimates become conservative and are closer to the true P-value than the simple estimates. This is because a database with high GC-content is less likely to contain m-tuples with low GC-content, and thus a larger database size is required to achieve the same P-value. For example, among the estimates for GC-content 75% shown in Table 1, the compression estimates become conservative for databases larger than 10^7 and 10^5 nucleotides for h = 0 and h = 2, respectively.

We can study the situation analytically for the case when the query appears in the database without a mismatch. Let X_i be the event that query Q has an occurrence with distance at most h at position i in a random database of GC-content g, and let X be the total number of occurrences of query Q with at most h mismatches in the whole database. For g = 0.5, we can compute the probability of X_i by summing over the number of mismatches:

P(X_i \mid Q, g = 0.5) = \sum_{k=0}^{h} \binom{m}{k} \left(\frac{3}{4}\right)^k \left(\frac{1}{4}\right)^{m-k}.

For general g, we have to distinguish between mismatches at G/C positions and mismatches at A/T positions in the query. Let z be the number of Gs and Cs in Q; then

P(X_i \mid Q, g) = \sum_{i=0}^{h} \sum_{j=0}^{h-i} \binom{z}{i} \left(1 - \frac{g}{2}\right)^i \left(\frac{g}{2}\right)^{z-i} \binom{m-z}{j} \left(\frac{1+g}{2}\right)^j \left(\frac{1-g}{2}\right)^{m-z-j}.

The expected number of occurrences can be computed by the linearity of expectation (for simplicity, we ignore the edge effect at positions n - m + 2, . . . , n, which is negligible for large n):

E(X \mid Q, g) = \sum_{i=1}^{n} E(X_i \mid Q) = n P(X_i \mid Q, g).

We will approximate the distribution of the variable X by a Poisson distribution with mean λ = E(X | Q, g). This commonly used approximation [17] disregards dependencies between occurrences at adjacent positions and also assumes that n is large and λ relatively small. In our simulations it led to very good estimates of P-values (data not shown). If X follows a Poisson distribution, the probability of at least one occurrence of a given query Q is P(X > 0 | Q, g) = 1 - e^{-λ}. To obtain the final P-value, we have to average this expression over different queries, or more precisely, over groups of queries with the same number of Gs and Cs, of which Q_z is one representative:

P_{real} = P(X > 0 \mid g) = \sum_{z=0}^{m} \binom{m}{z} 2^{-m} P(X > 0 \mid Q_z, g).

The compression estimate P_est uses the same formula, but with g = 0.5 and E(n, g) instead of n. For h = 0, E(X | Q, g) simplifies to n 2^{-m} g^z (1-g)^{m-z}, and in particular E(X | Q, 0.5) = n 4^{-m} does not depend on z; therefore P_est = 1 - e^{-E(n,g) 4^{-m}}. For small x, the function 1 - e^{-x} can be well approximated by x [8]. Using this approximation, we obtain

\frac{P_{est}}{P_{real}} = \frac{E(n, g)\, 4^{-m}}{\sum_{z=0}^{m} \binom{m}{z} 2^{-m}\, n\, 2^{-m} g^z (1-g)^{m-z}} = \frac{H(g)}{2}.

Therefore, for small P-values, where the approximation of 1 - e^{-x} is sufficiently accurate, the estimate P_est is lower than the real P-value by approximately a factor of H(g)/2. This implies that no correction for entropy is in fact necessary for small P-values.

On the other hand, when h = 0 but n is sufficiently large, the compression estimates become conservative. In particular, let us assume that

n > \frac{m\, 2^m \ln 2}{H(g)\, 2^{-m-1} - (1-g)^m} = \frac{1.386}{H(g)}\, m\, 4^m + o(m 4^m).

This implies e^{-E(n,g) 4^{-m}} < 2^{-m} e^{-n 2^{-m} (1-g)^m}. The right-hand side is one of the terms of the sum

\sum_{z=0}^{m} \binom{m}{z} 2^{-m} e^{-n 2^{-m} g^z (1-g)^{m-z}},

and therefore the left-hand side is upper-bounded by the whole sum as well. This implies P_est ≥ P_real.

This simple bound is not very interesting, since it works only for very large n, where P-values are very close to one. Nonetheless, it agrees with our observation that the compression estimate is appropriate for sufficiently large n. Perhaps a tighter bound on n could be obtained by considering additional elements of the sum.
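The following sketch implements these formulas directly and can be used to reproduce the H(g)/2 ratio numerically; function names and the example parameters are ours.

```python
# Analytic Poisson approximation of P_real and P_est from Section 3.2.
from math import comb, exp, log2

def p_occurrence(m: int, z: int, h: int, g: float) -> float:
    """P(X_i | Q, g) for a query with z G/C positions and up to h mismatches."""
    total = 0.0
    for i in range(h + 1):           # mismatches at G/C positions
        for j in range(h - i + 1):   # mismatches at A/T positions
            total += (comb(z, i) * (1 - g / 2) ** i * (g / 2) ** (z - i)
                      * comb(m - z, j) * ((1 + g) / 2) ** j
                      * ((1 - g) / 2) ** (m - z - j))
    return total

def p_real(n: int, m: int, h: int, g: float) -> float:
    # Average 1 - e^{-lambda_z} over query groups with z G/C positions.
    return sum(comb(m, z) * 2 ** -m * (1 - exp(-n * p_occurrence(m, z, h, g)))
               for z in range(m + 1))

def p_est(n: int, m: int, h: int, g: float) -> float:
    # Same formula with g = 0.5, applied to the effective size H(g)n/2.
    Hg = -g * log2(g / 2) - (1 - g) * log2((1 - g) / 2)
    return p_real(round(Hg * n / 2), m, h, 0.5)

# For small P-values the ratio approaches H(g)/2 (about 0.906 for g = 0.75).
print(p_est(10**4, 15, 0, 0.75) / p_real(10**4, 15, 0, 0.75))
```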
3.3 Redundancy

Next, we consider artificial databases that are concatenations of many mutually similar sequences of the same length k. In the experiments we use k = 10^4.

To generate a database, we first sample a string S = s_1 s_2 . . . s_k of length k uniformly at random. This string will be the center of the cluster of similar sequences. The i-th sequence in the concatenated database is obtained from S by randomly mutating several nucleotides of S, so that nucleotide j stays the same as s_j with probability 90%, and with probability 10% it changes to another nucleotide chosen uniformly from the remaining three. In this way, we get a clustered database of sequences that differ from the center of the cluster at 10% of positions on average.
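A sketch of this generation procedure (the seed and the function signature are our choices):

```python
# Generating the clustered databases of Section 3.3: every sequence is an
# independently mutated copy of a random center S of length k, with each
# position changed with probability 10%.
import random

def clustered_database(n: int, k: int = 10_000, p_mut: float = 0.10,
                       seed: int = 42) -> str:
    rng = random.Random(seed)
    center = "".join(rng.choice("ACGT") for _ in range(k))
    copies = []
    for _ in range(n // k):
        seq = [c if rng.random() > p_mut
               else rng.choice([x for x in "ACGT" if x != c])
               for c in center]
        copies.append("".join(seq))
    return "".join(copies)

db = clustered_database(100_000)  # ten mutated copies of one center
```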
For clustered databases of various sizes, we compute the real P-value; the simple estimate, which uses an effective size equal to the real size of the database, without compression; and two estimates based on two different lossless compression programs, GenCompress [6] and bzip2. The results are shown in Table 2.

Bzip2 is based on the Burrows–Wheeler transform [5], which tends to create blocks of identical symbols if the input contains repeated substrings. The transformed text is then encoded by other compression techniques, such as Huffman coding. To save memory, bzip2 divides a file into blocks and processes each block separately, which may have a negative effect on the compressed size.

GenCompress [6] is an algorithm developed specifically for compressing DNA sequences. It finds approximate repeats in the compressed sequence and encodes them by a sequence of edit operations. GenCompress is a single-pass algorithm that proceeds along the input sequence, and in each step it finds the best prefix that can be encoded as an approximate repeat of some substring of the already encoded part of the input.

In this experiment, the estimates based on data compression are mostly conservative and better than the estimates obtained by the simple method.

                    h = 0                                          h = 2
Database size  10^4       10^5       10^6       10^7       10^4       10^5       10^6       10^7
Real           9.3·10^-6  8.1·10^-5  6.5·10^-4  4.3·10^-3  9.2·10^-3  6.8·10^-2  3.4·10^-1  8.8·10^-1
GenCompress    9.3·10^-6  7.7·10^-5  6.8·10^-4  6.5·10^-3  9.2·10^-3  7.3·10^-2  4.9·10^-1  1.0
bzip2          1.0·10^-5  9.2·10^-5  8.1·10^-4  8.0·10^-3  1.0·10^-2  8.7·10^-2  5.5·10^-1  1.0
Simple         9.3·10^-6  9.3·10^-5  9.3·10^-4  9.3·10^-3  9.2·10^-3  8.8·10^-2  6.0·10^-1  1.0

Table 2. P-values for the artificial clustered database. Real P-values were computed by the algorithm from Section 3.1 applied directly to the clustered database. GenCompress and bzip2 estimates use the respective compression tools to compute the effective database size. The simple method considers a uniformly generated random database of size n.

3.4 Real data

Finally, we have applied the compression method of estimating the effective database size to real genomic data from human, chimpanzee, and rhesus macaque. Our set consisted of a portion of human chromosome 22 and the corresponding portions of the genomes of the two other primates. We have omitted larger blocks that did not have counterparts in either chimpanzee or macaque. The sequence data and the genome alignments were obtained from the UCSC genome browser [13].

The database contains a short block of the human sequence followed by the corresponding block in the two other genomes, followed by another block in human, etc. Therefore, similar sequences are close to each other, which improves compression with algorithms such as bzip2 that always consider only a block of the whole file.

                    h = 0                                          h = 2
Database size  1.6·10^4   2.1·10^5   1.5·10^6   1.1·10^7   1.6·10^4   2.1·10^5   1.5·10^6   1.1·10^7
Real           1.1·10^-5  1.4·10^-4  9.4·10^-4  6.0·10^-3  1.0·10^-2  1.1·10^-1  4.6·10^-1  9.1·10^-1
GenCompress    9.3·10^-6  1.2·10^-4  8.5·10^-4  6.0·10^-3  9.2·10^-3  1.1·10^-1  5.7·10^-1  1.0
GC corrected   1.1·10^-5  1.4·10^-4  9.6·10^-4  6.8·10^-3  1.1·10^-2  1.3·10^-1  6.1·10^-1  1.0
bzip2          1.5·10^-5  1.9·10^-4  1.3·10^-3  9.4·10^-3  1.5·10^-2  1.7·10^-1  7.2·10^-1  1.0
Simple         1.5·10^-5  1.9·10^-4  1.4·10^-3  1.1·10^-2  1.5·10^-2  1.8·10^-1  7.4·10^-1  1.0

Table 3. P-values for genomic data from human, chimpanzee, and rhesus macaque. Real P-values were computed by the algorithm from Section 3.1 applied directly to the genomic sequences. GenCompress and bzip2 estimates use the respective compression tools to compute the effective database size. GC corrected shows the GenCompress estimate corrected for the average database entropy. The simple method considers a uniformly generated random database of size n.

The results of the test for different input sizes are shown in Table 3. As we can see, with GenCompress we often underestimate the real P-values, while bzip2 achieves only a very small improvement compared to the simple method without compression.

Earlier, we reached the conclusion that a skewed GC-content should not automatically imply a lower effective database size, since this would lead to underestimation of small P-values. However, compression algorithms estimate the effective database size based on both sequence redundancy and lower sequence entropy in the case of locally high or low GC-content. That could be the reason why the GenCompress estimates are non-conservative.

We have attempted to further correct for this issue by computing an average database entropy H′. The effective database size estimated by GenCompress was then multiplied by the correction factor 2/H′. This approach leads to surprisingly good P-value estimates that were also conservative in our experiments (Table 3, line GC corrected).

To estimate the value H′ for this experiment, we have computed the entropy separately for non-overlapping windows of size 1000, to capture the different properties of individual genomic regions. We have estimated a second-order Markov chain from each window of the sequence and computed the entropy of this Markov chain, that is, the entropy where the probability of each nucleotide is conditioned on the two previous nucleotides, to capture local dependencies in DNA sequences. The average entropy H′ of the whole database was then computed as the average of the entropy values from all windows.
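A sketch of this correction follows. The paper does not specify how the Markov chain probabilities were estimated, so the simple maximum-likelihood counts below are our assumption.

```python
# GC correction of Section 3.4: fit a 2nd-order Markov chain in each
# 1000-nucleotide window, average the per-window entropies into H', and
# rescale the compression-based effective size by 2/H'.
from collections import Counter
from math import log2

def window_entropy(w: str) -> float:
    """Entropy (bits/nucleotide) of a 2nd-order Markov chain fit to w."""
    ctx, trans = Counter(), Counter()
    for i in range(2, len(w)):
        ctx[w[i-2:i]] += 1            # 2-nucleotide context counts
        trans[w[i-2:i], w[i]] += 1    # (context, next nucleotide) counts
    total = sum(ctx.values())
    # -sum over (c, x) of P(c) * P(x|c) * log2 P(x|c); zero counts drop out.
    return -sum((cnt / total) * log2(cnt / ctx[c])
                for (c, x), cnt in trans.items())

def average_entropy(db: str, win: int = 1000) -> float:
    hs = [window_entropy(db[i:i+win])
          for i in range(0, len(db) - win + 1, win)]
    return sum(hs) / len(hs)

def corrected_effective_size(compressed_bytes: int, db: str) -> float:
    return 4 * compressed_bytes * 2 / average_entropy(db)
```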
4 Conclusion

In this paper, we have considered methods for more accurate estimation of P-values in the context of sequence homology search. In particular, we propose to adjust the size of the database to compensate for the structure present in the database due to the fact that individual sequences are related by evolution.

We have explored the idea of using compression to estimate the effective database size, and we have demonstrated by experiments that the use of compression algorithms leads to non-conservative P-value estimates for small P-values. This is at least partially caused by the fact that besides identifying longer repeated substrings, compression algorithms also compensate for sequences with low entropy. We have shown that such compensation should not be applied, at least in the case of small P-values. We have suggested a simple way to disentangle the portion of the compression coming from locally low entropy and shown that this correction leads to better P-value estimates.

Compression would be a fast and efficient way of estimating the effective database size. Even though most general-purpose compression algorithms (such as bzip2) cannot handle distant large blocks of similarity common in DNA sequence databases, one could use fast methods for identifying such blocks [12, 19] to speed up the algorithms developed specifically for the compression of DNA sequences [3, 6, 7, 14].

Finally, we would like to extend our work to more complex scenarios of homology search. Longer query sequences will require handling of insertions, deletions, and matches that involve only portions of the query sequence. More complex scoring schemes on both nucleotide and protein sequences also need to be examined.

References

1. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman: Basic local alignment search tool. Journal of Molecular Biology, 215 (3), 1990, 403–410.
2. D. Bailey and R. Crandall: Random generators and normal numbers. Experimental Mathematics, 11 (4), 2002, 527–546.
3. B. Behzadi and F. Le Fessant: DNA compression challenge revisited: a dynamic programming approach. In: Combinatorial Pattern Matching (CPM 2005), Springer, 2005, 190–200.
4. D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and E.W. Sayers: GenBank. Nucleic Acids Research, 37 (Database issue), 2009, D26–D31.
5. M. Burrows and D.J. Wheeler: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
6. X. Chen, S. Kwong, and M. Li: A compression algorithm for DNA sequences and its applications in genome comparison. In: Genome Informatics (GIW'99), Universal Academy Press, 1999, 51–61.
7. X. Chen, M. Li, B. Ma, and J. Tromp: DNACompress: fast and effective DNA sequence compression. Bioinformatics, 18 (12), 2002, 1696–1698.
8. T.H. Cormen, C.E. Leiserson, and R.L. Rivest: Introduction to algorithms. MIT Press, 1990.
9. A. Dembo, S. Karlin, and O. Zeitouni: Limit distribution of maximal non-aligned two-sequence segmental score. The Annals of Probability, 22 (4), 1994, 2022–2039.
10. R. Giancarlo, D. Scaturro, and F. Utro: Textual data compression in computational biology: a synopsis. Bioinformatics, 25 (13), 2009, 1575–1576.
11. D.L. Hartl and A.G. Clark: Principles of population genetics, Fourth Edition. Sinauer Associates, 2006.
12. W.J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler: Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences of the United States of America, 100 (20), 2003, 11484–11489.
13. W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler, and D. Haussler: The human genome browser at UCSC. Genome Research, 12 (6), 2002, 996–1006.
14. G. Korodi and I. Tabus: An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Transactions on Information Systems, 23 (1), 2005, 3–34.
15. M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17 (2), 2001, 149–154.
16. M. Li and P. Vitányi: An introduction to Kolmogorov complexity and its applications. Springer, 2008.
17. A.Y. Mitrophanov and M. Borodovsky: Statistical significance in biological sequence analysis. Briefings in Bioinformatics, 7 (1), 2006, 2–24.
18. O.U. Nalbantoglu, D.J. Russell, and K. Sayood: Data compression concepts and algorithms and their applications to bioinformatics. Entropy, 12 (1), 2010, 34.
19. B. Paten, J. Herrero, S. Fitzgerald, K. Beal, P. Flicek, I. Holmes, and E. Birney: Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Research, 18 (11), 2008, 1829–1833.
20. J. Rissanen and G. Langdon: Arithmetic coding. IBM Journal of Research and Development, 23 (2), 1979, 149–162.
21. T.F. Smith and M.S. Waterman: Identification of common molecular subsequences. Journal of Molecular Biology, 147 (1), 1981, 195–197.
22. B.E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C.H. Wu: UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23 (10), 2007, 1282–1288.
23. J.C. Venter: Multiple personal genomes await. Nature, 464 (7289), 2010, 676–677.