Estimating effective DNA database size via compression⋆

Martina Višňovská¹, Michal Nánási², Tomáš Vinař¹, and Broňa Brejová²

¹ Department of Applied Informatics, Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská Dolina, 842 48 Bratislava, Slovakia
² Department of Computer Science, Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská Dolina, 842 48 Bratislava, Slovakia

⋆ This research is funded by European Community FP7 grants IRG-224885 and IRG-231025, VEGA grant 1/0210/10, and Comenius University grant UK/151/2010.

Abstract. Search for sequence similarity in large-scale databases of DNA and protein sequences is one of the essential problems in bioinformatics. To distinguish random matches from biologically relevant similarities, it is customary to compute the statistical P-value of each discovered match. In this context, the P-value is the probability that a similarity with a given score or higher would appear by chance in a comparison of a random query and a random database. Note that the P-value is a function of the database size, since a high-scoring similarity is more likely to exist by chance in a larger database. Biological databases often contain redundant, identical, or very similar sequences. This fact is not taken into account in P-value estimation, resulting in pessimistic estimates. One way to address this problem is to use a lower effective database size instead of the real size. In this work, we propose to estimate the effective size of a database by its compressed size. An appropriate compression algorithm will effectively store only a single copy of each repeated string, resulting in a file whose size roughly corresponds to the amount of unique sequence in the database. We evaluate our approach on real and simulated databases.

1 Introduction

Recent progress in genome sequencing technologies has led to a rapid increase in the size of DNA sequence databases. Perhaps the most common way of accessing these databases is through sequence homology search, where users search the database for sequences similar to their query of interest [1]. Highly similar sequences have often evolved from a common ancestor and may also share the same function. Homology search is thus the first step in elucidating the function and evolutionary history of a newly sequenced genome.

To formally define sequence homology search as a computational problem, we consider a query Q and a database D, which are both strings composed of nucleotides (symbols from the alphabet {A, C, G, T}). We typically introduce a scoring function that assigns a positive score to matching nucleotides in the two sequences and a negative score to mismatches. Insertions and deletions of nucleotides also incur a penalty. The goal is then to find a region in the query and a region of the sequence database with the highest possible score.

This task can be accomplished either by a dynamic programming algorithm [21], or by fast heuristic methods, such as BLAST [1]. Regardless of the method, the result is a set of high scoring pairs (a substring of the query matched to a substring of the database) together with their similarity scores.
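To make the problem statement concrete, the following is a minimal sketch of local alignment scoring by dynamic programming in the spirit of Smith and Waterman [21]. The scoring parameters (+1 match, −1 mismatch, −1 indel) are illustrative assumptions only, not values used in this paper.

```python
# A minimal sketch of local alignment scoring by dynamic programming.
# Returns the score of the best high scoring pair between query and db.

def local_alignment_score(query: str, db: str,
                          match: int = 1, mismatch: int = -1,
                          indel: int = -1) -> int:
    prev = [0] * (len(db) + 1)   # previous row of the DP table
    best = 0
    for q in query:
        curr = [0] * (len(db) + 1)
        for j, d in enumerate(db, start=1):
            diag = prev[j - 1] + (match if q == d else mismatch)
            # A local alignment may restart with score 0 at any point.
            curr[j] = max(0, diag, prev[j] + indel, curr[j - 1] + indel)
            best = max(best, curr[j])
        prev = curr
    return best

print(local_alignment_score("ACGTACGT", "TTACGTAA"))  # 5: shared ACGTA
```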
Typical genomic databases are large: the human genome alone comprises approximately 3.2 billion nucleotides, and the traditional GenBank database [4] contains more than 100 GB of DNA sequence as of April 2010. In these large databases, many query sequences will have high scoring matches purely by chance. To distinguish real matches from spurious ones, we need to assess their statistical significance.

Traditionally, the statistical significance of a high scoring pair is assessed by its P-value. If the high scoring pair has score s, its P-value is the probability of a match with score s or better occurring in a homology search of a randomly generated query in a randomly generated database. The P-value obviously depends on the sizes of the query and the database, but to achieve more realistic P-values, we can also take into account other properties of the sequences, such as the frequencies of individual nucleotides [17].

Consider the illustration in Fig. 1. For a given score s, we can visualize the sequence database D as a set of points in the sequence space, with neighbourhoods that include all sequences with similarity score s or higher. In this sequence space, the query Q is located at the boundary of one of these neighbourhoods. If the neighbourhoods cover a large fraction of the sequence space, the P-value is high, because a randomly generated query will have a high probability of falling into one of those neighbourhoods, resulting in a score larger than s.

Fig. 1. Illustration of P-value computation. The points represent sequences in database D and the shaded areas show their neighbourhoods with scores that are at least as good as the similarity score of query Q. Left: Database with randomly distributed sequences. Right: Database with clustered sequences.

While for a given database the P-values could be computed exactly, such a computation would be impractical. Instead, one uses various approximations assuming that the sequence in the database is randomly generated by an i.i.d. process or a Markov chain (e.g., [9]). These assumptions are often violated by real databases.

For instance, the database may contain several sequences that are evolutionarily related. Such sequences will often differ only in a small number of changes: for example, the DNA sequences of human and chimpanzee are identical in 99% of nucleotides over most of their lengths. In addition, recently introduced next generation sequencing technologies have made it financially viable to sequence multiple individuals from the same species, leading to various personal human genome projects [23]. Sequences of different individuals from the same species may differ in as little as one nucleotide in a thousand. Consequently, databases may contain many highly similar sequences, and it is not appropriate to model them as completely random.
Consider again the example in Fig. 1. The illustration on the right shows a database with clusters of similar sequences whose neighbourhoods overlap, covering a much smaller fraction of the sequence space than the database on the left. Consequently, an estimate based on the assumption of a random distribution of sequences in the database will necessarily overestimate the P-value, potentially leading to the rejection of high scoring pairs as matches likely to occur by chance.

One of the solutions to this problem proposed in the literature is to remove redundancies, such as closely related sequences, from the sequence database [22]. A new query can first be searched against such a non-redundant database, the statistical significance of the resulting matches can be evaluated, and a secondary search can then be launched against the whole database. In this paper, we propose a different solution. One of the main parameters used in P-value computation is the database size n. Instead of the real database size, we propose to use an effective database size n′ (n′ < n) which accounts for redundancies in the database. For example, if we take a random database of size n and double its contents by including a second exact copy of each sequence, the effective size of the new database will still be n′ = n, even though its real size is 2n.

Note that the notion of effective database size has been previously used to adjust for border effects in the case of short queries and databases [17], and an option for setting it to an arbitrary value is included in most software tools for homology search. Analogously, statistical models in population genetics use population size as a parameter, but instead of the actual number of individuals, one typically uses an effective population size to compensate for various effects that are not considered by the model, such as population size changing over time [11].

2 Kolmogorov complexity of DNA sequences

For our purposes, the effective database size should be an estimate of the amount of unique sequence in the database D, taking into account substrings that may be present in D in many exact or approximate copies. One way of describing the information content of a database is its Kolmogorov complexity [16].

The Kolmogorov complexity K(D) of a sequence D is the bit length of the shortest program P for a fixed universal Turing machine that outputs sequence D. Kolmogorov complexity can be understood as a lower bound (up to a constant additive term) on the compression achievable by any general-purpose algorithm. A string of length n sampled uniformly at random from a fixed alphabet is on average almost incompressible [16, Section 2.8.1]. In particular, a string over a four-letter alphabet requires on average approximately 2n bits (with up to an O(log n) additive term). Thus, if we believe that databases with the same Kolmogorov complexity will behave similarly, we should use n′ = K(D)/2 as an estimate of the effective database size of database D.

At first glance, Kolmogorov complexity seems to be an ideal estimator to use in this context. It accounts for possible major differences between the real database and a randomly generated one. In particular, using Kolmogorov complexity would compensate for differences in the frequencies of individual nucleotides (a database containing only a long stretch of As will have a small effective size) and for redundant sequence content (sequences that have only a few differences can be described very efficiently in a small space). Moreover, the concept of Kolmogorov complexity has been successfully used in similar contexts before, in applications such as computing distances between genomes [15] (see also [10, 18] for an overview).

The exact Kolmogorov complexity of database D is not computable. Yet in practice, we can use various compression algorithms instead of computing the Kolmogorov complexity. For a fixed compression algorithm, the compressed size c(D) of database D is an upper bound on the Kolmogorov complexity K(D), and we can use the value n′ = c(D)/2 as an estimate of the effective database size.
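The following toy sketch illustrates both the estimator n′ = c(D)/2 and the doubling example above. We use bzip2 (via Python's bz2 module) purely as a convenient stand-in for a compressor; the exact numbers depend on the compressor, and a DNA-specific tool would track the amount of unique sequence more closely.

```python
# Toy illustration of n' = c(D)/2 bits, i.e. 4 * (compressed size in bytes).
import bz2
import random

random.seed(1)
db = "".join(random.choice("ACGT") for _ in range(100_000))

def effective_size(seq: str) -> int:
    # 1 byte = 8 bits; a random nucleotide needs 2 bits, hence the factor 4.
    return 4 * len(bz2.compress(seq.encode()))

print(effective_size(db))       # close to n = 100,000 for a random database
print(effective_size(db + db))  # well below 2n: the second copy is redundant
                                # (ideally it would stay near n)
```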
Several efficient algorithms specifically tuned to the compression of DNA sequences are available (see [3, 6, 7, 14]).

As we will see in the next section, upper bounds on P-values (or conservative estimates) are generally desirable in cases where exact P-values cannot be computed. Since P-values increase monotonically with the database size, using an upper bound on the effective database size should not by itself lead to non-conservative bounds.

Unfortunately, using Kolmogorov complexity may not necessarily lead to conservative bounds in all instances. As an extreme case, the base-4 expansion of many fundamental constants, such as π, can be generated by a constant-size program. The first n digits generated by such a program can be used as a sequence database of size n, replacing digits 0, . . . , 3 with nucleotides. The Kolmogorov complexity of such a database is O(log n) (we need a constant number of bits to represent the program and log n bits to represent the real size of the database), yet for all practical purposes, this database behaves as a random database of size n [2].

Thus, using Kolmogorov complexity and compression-based estimates of the effective database size does not necessarily lead to conservative estimates of P-values in homology search. In the next section, we explore practical issues of using these compression-based estimates in an experimental setting.

3 Estimating effective database size through compression

Here, motivated by the discussion in the previous section, we explore the use of compression software for estimating the effective database size. The methodology is very simple. First, we compress the database and measure the size of the resulting file in bytes. Then, we multiply this size by four to account for the fact that in a uniformly random database, we need two bits to encode each nucleotide. In this way, we obtain an estimate of the effective database size which can be used in any formula or algorithm for estimating P-values on uniformly distributed databases.

The P-values are used to reject high scoring pairs that have a high probability of occurring by chance. In a typical search there will be many spurious matches with high P-values, and only a few top-scoring matches will represent genuine similarities with common evolutionary origins. Since our main task is to separate these few good matches from many spurious ones, we would like the P-value estimator to be conservative, i.e., the estimates should be strictly higher than the exact P-values, so that the estimated P-value represents an upper bound on the probability of a particular match being a false positive.

We have decided to experimentally evaluate the accuracy of the P-values obtained by the compression method in a simple scenario motivated by next generation sequencing technologies. In this scenario, a sequencing machine generates many short sequences (reads). These reads need to be mapped as substrings to previously known reference genomes. In most cases, the reads will match the reference sequence exactly, but sometimes we will see one or two mismatches. These mismatches can be either due to sequencing errors, or (more interestingly) due to differences between individuals.

In our simplified scenario, the database D is a single string of length n, the query Q is a string of length m, and we are searching in D for a substring of length m with the smallest Hamming distance from Q. In contrast to the full homology search problem, we always consider the whole query (each read has to map completely to the reference), and we also disallow insertions and deletions.

We will say that the distance of query Q from database D is the minimum Hamming distance between Q and some substring of length m of database D. This Hamming distance will represent our score. The P-value for a particular Hamming distance h is then the number of all m-tuples that are at a distance of at most h from D, divided by 4^m (the number of all possible m-tuples). Note that in this definition of the P-value, we keep the database fixed, and only the query is random, chosen uniformly from all possible queries of length m. In contrast, most methods consider both the query and the database as random.
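This definition can be turned directly into a brute-force reference implementation: enumerate all 4^m possible queries and record the distance of each from D. The sketch below is meant only to pin down the definition; the algorithm of Section 3.1 computes the same quantities incrementally for all prefixes of D.

```python
# Direct (exponential-time) computation of the P-values just defined.
# Feasible only for tiny m.
from itertools import product

def exact_pvalues(db: str, m: int, hmax: int) -> list:
    windows = [db[i:i + m] for i in range(len(db) - m + 1)]
    counts = [0] * (hmax + 1)  # counts[h] = number of queries at distance <= h
    for q in product("ACGT", repeat=m):
        dist = min(sum(a != b for a, b in zip(q, w)) for w in windows)
        for h in range(dist, hmax + 1):
            counts[h] += 1
    return [c / 4 ** m for c in counts]

print(exact_pvalues("ACGTTGCAACGT", m=5, hmax=1))  # P-values for h = 0 and h = 1
```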
The main advantage of this simplified model is that we can compute these P-values exactly in a reasonable amount of time and compare them to the P-values estimated from a randomly generated database of the corresponding effective size, thus evaluating the accuracy of our concept. The rest of this section is organized as follows. First, we introduce the algorithm for the exact computation of P-values in our simplified scenario. Then we explore several cases of generated and real databases.

File compression algorithms typically detect symbols or small groups of symbols that occur in the text more frequently than others and encode them by shorter codewords. Thus, the size of the compressed database is related to the entropy of the source generating the database. In addition, some methods try to detect longer strings that are repeated, exactly or with mismatches, multiple times, and to store only one copy of each such string as well as the differences between the approximate copies. In our experiments, we first try to separate these two phenomena by using artificially generated databases, and in the end, we apply compression software to real DNA sequences, where both of these issues are at play.

3.1 Algorithm to compute exact P-values

We have implemented an algorithm that simultaneously computes P-values for a given database D and all of its prefixes, and for all distances h = 0, 1, . . . , h_max. In the experiments we use the values h_max = 3 and m = 15.

The algorithm proceeds along the database D and at each position updates two arrays M and H. Array M of size 4^m stores for each m-tuple Q′ its distance to the current prefix of the database, provided that this distance is at most h_max (otherwise it uses a special ∞ value). The second array H stores for each distance h ≤ h_max the number of m-tuples with distance exactly h from the current prefix of the database D.

When we read a new nucleotide of the database, we need to consider the last m-tuple of the current prefix. We enumerate all m-tuples at a distance of at most h_max from this new m-tuple, and for each we update the arrays M and H as appropriate. Values in H can be easily converted to the desired P-values for the current prefix at different distance thresholds. Note that this algorithm is feasible only for small values of m, because it requires Θ(4^m) memory.
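A sketch of this algorithm follows, with m-tuples encoded as base-4 integers. The encoding and the order of enumerating the Hamming neighbourhood are our implementation choices; the paper fixes only the arrays M and H and the per-position update.

```python
# Sketch of the algorithm of Section 3.1: arrays M and H are updated
# incrementally while scanning the database, so P-values are obtained
# for every prefix. Memory is Theta(4^m), so only small m is practical.
from itertools import combinations, product

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}
INF = float("inf")

def prefix_pvalues(db: str, m: int, hmax: int):
    M = [INF] * (4 ** m)   # M[q] = distance of m-tuple q to the current prefix
    H = [0] * (hmax + 1)   # H[h] = number of m-tuples at distance exactly h
    for i in range(m - 1, len(db)):
        code = 0
        for c in db[i - m + 1:i + 1]:   # encode the last m-tuple in base 4
            code = code * 4 + BASE[c]
        # Enumerate all m-tuples within distance hmax of this window:
        # choose k mismatch positions and a substitution at each.
        for k in range(hmax + 1):
            for pos in combinations(range(m), k):
                for subs in product(range(1, 4), repeat=k):
                    q = code
                    for p, s in zip(pos, subs):
                        shift = 4 ** (m - 1 - p)
                        digit = (q // shift) % 4
                        q += ((digit + s) % 4 - digit) * shift
                    if k < M[q]:                 # distance improved
                        if M[q] != INF:
                            H[int(M[q])] -= 1
                        M[q] = k
                        H[k] += 1
        # P-values for the prefix db[:i+1] at each distance threshold h.
        yield [sum(H[:h + 1]) / 4 ** m for h in range(hmax + 1)]

for pv in prefix_pvalues("ACGTTGCA", m=3, hmax=1):
    pass
print(pv)  # P-values for the whole database at h = 0 and h = 1
```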
The algorithm can be used to compute the exact P-value for a given real sequence database, which we consider as the reference value. It can also be applied to randomly generated databases, leading to estimates of the P-values that would be obtained in a model where both the database and the query are random. To emulate the proposed method, we compress a sequence database, compute its effective size, and then use the P-value estimates obtained from random databases of the matching size.

Finally, we also consider a simple method, where we use random databases of the same size as the original database, without compression. P-values computed in this way correspond to the commonly used techniques for P-value estimation without the effective database size mechanism.

3.2 Entropy

The four nucleotides do not occur in genomes equally frequently. A commonly used measure of DNA composition is GC-content: the percentage of Cs and Gs in the given sequence. GC-content varies widely between different genomes, or even between segments of the same genome. An i.i.d. database with GC-content g has entropy

H(g) = -g \lg(g/2) - (1-g) \lg((1-g)/2),

and can therefore be encoded in approximately H(g)n bits, for example by arithmetic coding [20]. Following our general approach, we can try to use the formulas for uniform nucleotide frequencies with the effective database size E(n, g) = H(g)n/2 to estimate the P-values for the actual database of size n and GC-content g.
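In code, the entropy-based effective size is a one-liner; a small sketch (function names are ours):

```python
# Entropy-based effective size of Section 3.2: an i.i.d. database with
# GC-content g needs about H(g) bits per nucleotide, so it is treated as
# a uniform database of size E(n, g) = H(g) * n / 2.
from math import log2

def H(g: float) -> float:
    """Entropy per nucleotide, in bits, of an i.i.d. source with GC-content g."""
    return -g * log2(g / 2) - (1 - g) * log2((1 - g) / 2)

def entropy_effective_size(n: int, g: float) -> float:
    return H(g) * n / 2

print(H(0.5))                              # 2.0 bits for a uniform database
print(entropy_effective_size(10**6, 0.75)) # about 0.91 * 10**6
```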
We have tested the performance of such estimates on randomly generated databases with GC-contents of 75% and 90%, averaging the values over five randomly generated databases. Table 1 shows the real P-values, the P-values predicted by our compression method, and the P-values obtained by the simple method of considering a database of length n and GC-content 50% (disregarding the real GC-content in the P-value estimate). All estimates were computed by our algorithm from Section 3.1. For the real P-values, the algorithm was applied to the generated database with a skewed GC-content; for the simple and predicted estimates, to five random databases of the appropriate size with GC-content 50%.

                            h = 0                                          h = 2
Database size       10^4       10^5       10^6       10^7       10^4       10^5       10^6       10^7
Real (GC 75%)       9.3·10^-6  9.3·10^-5  9.2·10^-4  8.4·10^-3  8.8·10^-3  7.0·10^-2  3.2·10^-1  7.2·10^-1
Predicted (GC 75%)  8.4·10^-6  8.5·10^-5  8.4·10^-4  8.4·10^-3  8.3·10^-3  8.1·10^-2  5.7·10^-1  1.0
Simple (GC 75%)     9.3·10^-6  9.3·10^-5  9.3·10^-4  9.3·10^-3  9.2·10^-3  8.8·10^-2  6.0·10^-1  1.0
Real (GC 90%)       9.2·10^-6  8.7·10^-5  6.7·10^-4  3.9·10^-3  6.5·10^-3  3.3·10^-2  1.1·10^-1  2.6·10^-1
Predicted (GC 90%)  6.5·10^-6  6.8·10^-5  6.8·10^-4  6.8·10^-3  6.4·10^-3  6.5·10^-2  4.9·10^-1  1.0
Simple (GC 90%)     9.3·10^-6  9.3·10^-5  9.3·10^-4  9.3·10^-3  9.2·10^-3  8.8·10^-2  6.0·10^-1  1.0

Table 1. P-values for random databases of various lengths n, GC-content 75% or 90%, and query distance h = 0 or h = 2. Real P-values are computed by the algorithm described in Section 3.1. Predicted P-values take into account the entropy of the database, using the effective database size H(g)n/2. The simple method disregards the GC-content and considers a random database of size n and 50% GC-content.

For small P-values, the real and simple estimates are quite similar, which is expected, since in a short database very few m-tuples occur multiple times, even if the composition of the database is skewed. As a result, the compression method gives non-conservative estimates, because it uses a much smaller database size. For larger P-values, the compression estimates become conservative and are closer to the true P-value than the simple estimates. This is because a database with high GC-content is less likely to contain m-tuples with low GC-content, and thus a larger database size is required to achieve the same P-value. For example, among the estimates for GC-content 75% shown in Table 1, the compression estimates become conservative for databases larger than 10^7 and 10^5 nucleotides for h = 0 and h = 2, respectively.

We can study the situation analytically for the case when the query appears in the database without a mismatch. Let X_i be the event that query Q has an occurrence with distance at most h at position i in a random database of GC-content g, and let X be the total number of occurrences of query Q with at most h mismatches in the whole database. For g = 0.5, we can compute the probability of X_i by summing over the number of mismatches:

P(X_i \mid Q, g = 0.5) = \sum_{k=0}^{h} \binom{m}{k} \left(\frac{3}{4}\right)^k \left(\frac{1}{4}\right)^{m-k}.

For general g, we have to distinguish between mismatches at G/C positions and mismatches at A/T positions in the query. Let z be the number of Gs and Cs in Q; then

P(X_i \mid Q, g) = \sum_{i=0}^{h} \sum_{j=0}^{h-i} \binom{z}{i} \left(1 - \frac{g}{2}\right)^i \left(\frac{g}{2}\right)^{z-i} \binom{m-z}{j} \left(\frac{1+g}{2}\right)^j \left(\frac{1-g}{2}\right)^{m-z-j}.

The expected number of occurrences can be computed by the linearity of expectation (for simplicity, we ignore the edge effect at positions n - m + 2, . . . , n, which is negligible for large n):

E(X \mid Q, g) = \sum_{i=1}^{n} E(X_i \mid Q) = n P(X_i \mid Q, g).

We will approximate the distribution of the variable X by a Poisson distribution with mean λ = E(X | Q, g). This commonly used approximation [17] disregards dependencies between occurrences at adjacent positions and also assumes that n is large and λ relatively small. In our simulations it led to very good estimates of P-values (data not shown). If X follows a Poisson distribution, the probability of at least one occurrence of a given query Q is P(X > 0 | Q, g) = 1 - e^{-λ}. To obtain the final P-value, we have to average this expression over different queries, or more precisely, over groups of queries with the same number of Gs and Cs, of which Q_z is one representative:

P_{real} = P(X > 0 \mid g) = \sum_{z=0}^{m} \binom{m}{z} 2^{-m} P(X > 0 \mid Q_z, g).

The compression estimate P_est uses the same formula, but with g = 0.5 and E(n, g) instead of n. For h = 0, E(X | Q, g) simplifies to n 2^{-m} g^z (1-g)^{m-z}, and in particular E(X | Q, 0.5) = n 4^{-m} does not depend on z; therefore P_est = 1 - e^{-E(n,g) 4^{-m}}. For small x, the function 1 - e^{-x} can be well approximated by x [8]. Using this approximation, we obtain

\frac{P_{est}}{P_{real}} = \frac{E(n, g)\, 4^{-m}}{\sum_{z=0}^{m} \binom{m}{z} 2^{-m}\, n\, 2^{-m} g^z (1-g)^{m-z}} = \frac{H(g)}{2}.

Therefore, for small P-values, where the approximation of 1 - e^{-x} is sufficiently accurate, the estimate P_est is lower than the real P-value by approximately a factor of H(g)/2. This implies that no correction for entropy is in fact necessary for small P-values.

On the other hand, when h = 0 but n is sufficiently large, the compression estimates become conservative. In particular, let us assume that

n > \frac{m\, 2^m \ln 2}{H(g)\, 2^{-m-1} - (1-g)^m} = \frac{1.386}{H(g)}\, m\, 4^m + o(m 4^m).

This implies e^{-E(n,g) 4^{-m}} < 2^{-m} e^{-n 2^{-m} (1-g)^m}. The right-hand side is one of the terms of the sum

\sum_{z=0}^{m} \binom{m}{z} 2^{-m} e^{-n 2^{-m} g^z (1-g)^{m-z}},

and therefore the left-hand side is upper-bounded by the whole sum as well. This implies P_est ≥ P_real.

This simple bound is not very interesting, since it works only for very large n, where P-values are very close to one. Nonetheless, it agrees with our observation that the compression estimate is appropriate for sufficiently large n. Perhaps a tighter bound on n could be obtained by considering additional elements of the sum.
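The following sketch implements these formulas directly and can be used to reproduce the H(g)/2 ratio numerically; function names and the example parameters are ours.

```python
# Analytic Poisson approximation of P_real and P_est from Section 3.2.
from math import comb, exp, log2

def p_occurrence(m: int, z: int, h: int, g: float) -> float:
    """P(X_i | Q, g) for a query with z G/C positions and up to h mismatches."""
    total = 0.0
    for i in range(h + 1):           # mismatches at G/C positions
        for j in range(h - i + 1):   # mismatches at A/T positions
            total += (comb(z, i) * (1 - g / 2) ** i * (g / 2) ** (z - i)
                      * comb(m - z, j) * ((1 + g) / 2) ** j
                      * ((1 - g) / 2) ** (m - z - j))
    return total

def p_real(n: int, m: int, h: int, g: float) -> float:
    # Average 1 - e^{-lambda_z} over query groups with z G/C positions.
    return sum(comb(m, z) * 2 ** -m * (1 - exp(-n * p_occurrence(m, z, h, g)))
               for z in range(m + 1))

def p_est(n: int, m: int, h: int, g: float) -> float:
    # Same formula with g = 0.5, applied to the effective size H(g)n/2.
    Hg = -g * log2(g / 2) - (1 - g) * log2((1 - g) / 2)
    return p_real(round(Hg * n / 2), m, h, 0.5)

# For small P-values the ratio approaches H(g)/2 (about 0.906 for g = 0.75).
print(p_est(10**4, 15, 0, 0.75) / p_real(10**4, 15, 0, 0.75))
```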
3.3 Redundancy

Next, we consider artificial databases that are concatenations of many mutually similar sequences of the same length k. In the experiments we use k = 10^4.

To generate a database, we first sample a string S = s_1 s_2 . . . s_k of length k uniformly at random. This string will be the center of the cluster of similar sequences. The i-th sequence in the concatenated database is obtained from S by randomly mutating several nucleotides of S, so that nucleotide j stays the same as s_j with probability 90%, and with probability 10% it changes to another nucleotide chosen uniformly from the remaining three. In this way, we get a clustered database of sequences that differ from the center of the cluster at 10% of positions on average.
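A sketch of this generation procedure (the seed and the function signature are our choices):

```python
# Generating the clustered databases of Section 3.3: every sequence is an
# independently mutated copy of a random center S of length k, with each
# position changed with probability 10%.
import random

def clustered_database(n: int, k: int = 10_000, p_mut: float = 0.10,
                       seed: int = 42) -> str:
    rng = random.Random(seed)
    center = "".join(rng.choice("ACGT") for _ in range(k))
    copies = []
    for _ in range(n // k):
        seq = [c if rng.random() > p_mut
               else rng.choice([x for x in "ACGT" if x != c])
               for c in center]
        copies.append("".join(seq))
    return "".join(copies)

db = clustered_database(100_000)  # ten mutated copies of one center
```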
For clustered databases of various sizes, we compute the real P-value; the simple estimate, which uses an effective size equal to the real size of the database, without compression; and two estimates based on two different lossless compression programs, GenCompress [6] and bzip2. The results are shown in Table 2.

Bzip2 is based on the Burrows–Wheeler transform [5], which tends to create blocks of identical symbols if the input contains repeated substrings. The transformed text is then encoded by other compression techniques, such as Huffman coding. To save memory, bzip2 divides a file into blocks and processes each block separately, which may have a negative effect on the compressed size.

GenCompress [6] is an algorithm developed specifically for compressing DNA sequences. It finds approximate repeats in the compressed sequence and encodes them by a sequence of edit operations. GenCompress is a single-pass algorithm that proceeds along the input sequence, and in each step it finds the best prefix that can be encoded as an approximate repeat of some substring of the already encoded part of the input.

In this experiment, the estimates based on data compression are mostly conservative and better than the estimates obtained by the simple method.

                    h = 0                                          h = 2
Database size  10^4       10^5       10^6       10^7       10^4       10^5       10^6       10^7
Real           9.3·10^-6  8.1·10^-5  6.5·10^-4  4.3·10^-3  9.2·10^-3  6.8·10^-2  3.4·10^-1  8.8·10^-1
GenCompress    9.3·10^-6  7.7·10^-5  6.8·10^-4  6.5·10^-3  9.2·10^-3  7.3·10^-2  4.9·10^-1  1.0
bzip2          1.0·10^-5  9.2·10^-5  8.1·10^-4  8.0·10^-3  1.0·10^-2  8.7·10^-2  5.5·10^-1  1.0
Simple         9.3·10^-6  9.3·10^-5  9.3·10^-4  9.3·10^-3  9.2·10^-3  8.8·10^-2  6.0·10^-1  1.0

Table 2. P-values for the artificial clustered database. Real P-values were computed by the algorithm from Section 3.1 applied directly to the clustered database. GenCompress and bzip2 estimates use the respective compression tools to compute the effective database size. The simple method considers a uniformly generated random database of size n.

3.4 Real data

Finally, we have applied the compression method of estimating the effective database size to real genomic data from human, chimpanzee, and rhesus macaque. Our set consisted of a portion of human chromosome 22 and the corresponding portions of the genomes of the two other primates. We have omitted larger blocks that did not have counterparts in either chimpanzee or macaque. The sequence data and the genome alignments were obtained from the UCSC genome browser [13].

The database contains a short block of the human sequence followed by the corresponding block in the two other genomes, followed by another block in human, etc. Therefore, similar sequences are close to each other, which improves compression with algorithms such as bzip2 that always consider only a block of the whole file.

                    h = 0                                          h = 2
Database size  1.6·10^4   2.1·10^5   1.5·10^6   1.1·10^7   1.6·10^4   2.1·10^5   1.5·10^6   1.1·10^7
Real           1.1·10^-5  1.4·10^-4  9.4·10^-4  6.0·10^-3  1.0·10^-2  1.1·10^-1  4.6·10^-1  9.1·10^-1
GenCompress    9.3·10^-6  1.2·10^-4  8.5·10^-4  6.0·10^-3  9.2·10^-3  1.1·10^-1  5.7·10^-1  1.0
GC corrected   1.1·10^-5  1.4·10^-4  9.6·10^-4  6.8·10^-3  1.1·10^-2  1.3·10^-1  6.1·10^-1  1.0
bzip2          1.5·10^-5  1.9·10^-4  1.3·10^-3  9.4·10^-3  1.5·10^-2  1.7·10^-1  7.2·10^-1  1.0
Simple         1.5·10^-5  1.9·10^-4  1.4·10^-3  1.1·10^-2  1.5·10^-2  1.8·10^-1  7.4·10^-1  1.0

Table 3. P-values for genomic data from human, chimpanzee, and rhesus macaque. Real P-values were computed by the algorithm from Section 3.1 applied directly to the genomic sequences. GenCompress and bzip2 estimates use the respective compression tools to compute the effective database size. GC corrected shows the GenCompress estimate corrected for the average database entropy. The simple method considers a uniformly generated random database of size n.

The results of the test for different input sizes are shown in Table 3. As we can see, with GenCompress we often underestimate the real P-values, while bzip2 achieves only a very small improvement compared to the simple method without compression.

Earlier, we reached the conclusion that a skewed GC-content should not automatically imply a lower effective database size, since this would lead to underestimation of small P-values. However, compression algorithms estimate the effective database size based on both sequence redundancy and lower sequence entropy in the case of locally high or low GC-content. That could be the reason why the GenCompress estimates are non-conservative.

We have attempted to further correct for this issue by computing an average database entropy H′. The effective database size estimated by GenCompress was then multiplied by the correction factor 2/H′. This approach leads to surprisingly good P-value estimates that were also conservative in our experiments (Table 3, line GC corrected).

To estimate the value H′ for this experiment, we have computed the entropy separately for non-overlapping windows of size 1000, to capture the different properties of individual genomic regions. We have estimated a second-order Markov chain from each window of the sequence and computed the entropy of this Markov chain, that is, the entropy where the probability of each nucleotide is conditioned on the two previous nucleotides, to capture local dependencies in DNA sequences. The average entropy H′ of the whole database was then computed as the average of the entropy values from all windows.
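A sketch of this correction follows. The paper does not specify how the Markov chain probabilities were estimated, so the simple maximum-likelihood counts below are our assumption.

```python
# GC correction of Section 3.4: fit a 2nd-order Markov chain in each
# 1000-nucleotide window, average the per-window entropies into H', and
# rescale the compression-based effective size by 2/H'.
from collections import Counter
from math import log2

def window_entropy(w: str) -> float:
    """Entropy (bits/nucleotide) of a 2nd-order Markov chain fit to w."""
    ctx, trans = Counter(), Counter()
    for i in range(2, len(w)):
        ctx[w[i-2:i]] += 1            # 2-nucleotide context counts
        trans[w[i-2:i], w[i]] += 1    # (context, next nucleotide) counts
    total = sum(ctx.values())
    # -sum over (c, x) of P(c) * P(x|c) * log2 P(x|c); zero counts drop out.
    return -sum((cnt / total) * log2(cnt / ctx[c])
                for (c, x), cnt in trans.items())

def average_entropy(db: str, win: int = 1000) -> float:
    hs = [window_entropy(db[i:i+win])
          for i in range(0, len(db) - win + 1, win)]
    return sum(hs) / len(hs)

def corrected_effective_size(compressed_bytes: int, db: str) -> float:
    return 4 * compressed_bytes * 2 / average_entropy(db)
```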
4 Conclusion

In this paper, we have considered methods for more accurate estimation of P-values in the context of sequence homology search. In particular, we propose to adjust the size of the database to compensate for the structure present in the database due to the fact that individual sequences are related by evolution.

We have explored the idea of using compression to estimate the effective database size, and we have demonstrated by experiments that the use of compression algorithms leads to non-conservative P-value estimates for small P-values. This is at least partially caused by the fact that besides identifying longer repeated substrings, compression algorithms also compensate for sequences with low entropy. We have shown that such compensation should not be applied, at least in the case of small P-values. We have suggested a simple way to disentangle the portion of the compression coming from locally low entropy and shown that this correction leads to better P-value estimates.

Compression would be a fast and efficient way of estimating the effective database size. Even though most general-purpose compression algorithms (such as bzip2) cannot handle distant large blocks of similarity common in DNA sequence databases, one could use fast methods for identifying such blocks [12, 19] to speed up the algorithms developed specifically for the compression of DNA sequences [3, 6, 7, 14].

Finally, we would like to extend our work to more complex scenarios of homology search. Longer query sequences will require handling of insertions, deletions, and matches that involve only portions of the query sequence. More complex scoring schemes on both nucleotide and protein sequences also need to be examined.

References

1. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman: Basic local alignment search tool. Journal of Molecular Biology, 215 (3), 1990, 403–410.
2. D. Bailey and R. Crandall: Random generators and normal numbers. Experimental Mathematics, 11 (4), 2002, 527–546.
3. B. Behzadi and F. Le Fessant: DNA compression challenge revisited: a dynamic programming approach. In: Combinatorial Pattern Matching (CPM 2005), Springer, 2005, 190–200.
4. D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and E.W. Sayers: GenBank. Nucleic Acids Research, 37 (Database issue), 2009, D26–D31.
5. M. Burrows and D.J. Wheeler: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
6. X. Chen, S. Kwong, and M. Li: A compression algorithm for DNA sequences and its applications in genome comparison. In: Genome Informatics (GIW'99), Universal Academy Press, 1999, 51–61.
7. X. Chen, M. Li, B. Ma, and J. Tromp: DNACompress: fast and effective DNA sequence compression. Bioinformatics, 18 (12), 2002, 1696–1698.
8. T.H. Cormen, C.E. Leiserson, and R.L. Rivest: Introduction to algorithms. MIT Press, 1990.
9. A. Dembo, S. Karlin, and O. Zeitouni: Limit distribution of maximal non-aligned two-sequence segmental score. The Annals of Probability, 22 (4), 1994, 2022–2039.
10. R. Giancarlo, D. Scaturro, and F. Utro: Textual data compression in computational biology: a synopsis. Bioinformatics, 25 (13), 2009, 1575–1576.
11. D.L. Hartl and A.G. Clark: Principles of population genetics, Fourth Edition. Sinauer Associates, 2006.
12. W.J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler: Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences of the United States of America, 100 (20), 2003, 11484–11489.
13. W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler, and D. Haussler: The human genome browser at UCSC. Genome Research, 12 (6), 2002, 996–1006.
14. G. Korodi and I. Tabus: An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Transactions on Information Systems, 23 (1), 2005, 3–34.
15. M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17 (2), 2001, 149–154.
16. M. Li and P. Vitányi: An introduction to Kolmogorov complexity and its applications. Springer, 2008.
17. A.Y. Mitrophanov and M. Borodovsky: Statistical significance in biological sequence analysis. Briefings in Bioinformatics, 7 (1), 2006, 2–24.
18. O.U. Nalbantoglu, D.J. Russell, and K. Sayood: Data compression concepts and algorithms and their applications to bioinformatics. Entropy, 12 (1), 2010, 34.
19. B. Paten, J. Herrero, S. Fitzgerald, K. Beal, P. Flicek, I. Holmes, and E. Birney: Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Research, 18 (11), 2008, 1829–1833.
20. J. Rissanen and G. Langdon: Arithmetic coding. IBM Journal of Research and Development, 23 (2), 1979, 149–162.
21. T.F. Smith and M.S. Waterman: Identification of common molecular subsequences. Journal of Molecular Biology, 147 (1), 1981, 195–197.
22. B.E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C.H. Wu: UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23 (10), 2007, 1282–1288.
23. J.C. Venter: Multiple personal genomes await. Nature, 464 (7289), 2010, 676–677.