<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimating effective DNA database size via compression ⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martina Viˇsnˇovsk´a</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal N´an´asi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom´aˇs Vinaˇr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bronˇa Brejov´a</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Applied Informatics, Faculty of Mathematics</institution>
          ,
          <addr-line>Physics</addr-line>
          ,
          <institution>and Informatics Comenius University</institution>
          ,
          <addr-line>Mlynsk ́a Dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Faculty of Mathematics</institution>
          ,
          <addr-line>Physics</addr-line>
          ,
          <institution>and Informatics Comenius University</institution>
          ,
          <addr-line>Mlynsk ́a Dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <fpage>63</fpage>
      <lpage>70</lpage>
      <abstract>
        <p>Search for sequence similarity in large-scale and deletion of nucleotides also invokes penalty. The databases of DNA and protein sequences is one of the goal is then to find a region in the query and a region of essential problems in bioinformatics. To distinguish ran- the sequence database with the highest possible score. dom matches from biologically relevant similarities, it is This task can be accomplished either by a dynamic customary to compute statistical P-value of each discov- programming algorithm [21], or by fast heuristic methered match. In this context, P-value is the probability that ods, such as BLAST [1]. Regardless of the method, the cahsainmcielairnitay cwoimthpaarigsiovnenofscaorreanodrohmigqhuererwyoaunldd aaprpaenadrobmy result is a set of high scoring pairs (substring of a query database. Note that P-value is a function of the database matched to a substring of a database) together with size, since a high-scoring similarity is more likely to exist their similarity scores. by chance in a larger database. Typical genomic databases are large: human Biological databases often contain redundant, identical, or genome alone has approx. 3.2 GB, and the GenBank very similar sequences. This fact is not taken into account traditional database [4] contains more than 100 GB in P-value estimation, resulting in pessimistic estimates. of DNA sequences as of April 2010. In these large One way to address this problem is to use a lower effective databases, many query sequences will have high database size instead of its real size. In this work, we pro- scoring matches purely by chance. To distinguish real pose to estimate the effective size of a database by its com- matches from spurious ones, we need to assess their fpercetsisveedlysiszteo.reAnonalpypraopsriniagtlee ccoompyproesfseioanchalrgeopreiathtemd wstirllinegf-, statistical significance. resulting in a file whose size roughly corresponds to the Traditionally, the statistical significance of a high amount of unique sequence in the database. We evaluate scoring pair is assessed by P -value. If the high scorour approach on real and simulated databases. ing pair has score s, its P -value is the probability of a match with score s or better occurring in homology search of a randomly generated query in a randomly 1 Introduction generated database. The P -value obviously depends on the sizes of the query and the database, but to Recent progress in genome sequencing technologies led achieve more realistic P -values, we can also take into to rapid increase in the size of DNA sequence data- account other properties of the sequences, such as frebases. Perhaps the most common way of accessing quencies of individual nucleotides [17]. these databases is through sequence homology search, Consider an illustration in Fig.1. For a given where users search in the database for sequences sim- score s, we can visualize the sequence database D as ilar to their query of interest [1]. Highly similar se- a set of points in the sequence space with the neighquences have often evolved from a common ances- bourhoods that include all sequences with similarity tor and may also share the same function. Homology score s or higher. In this sequence space, the query Q search is thus the first step in elucidating the function is located at the boundary of one of these neighbourand evolutionary history of a newly sequenced genome. hoods. If the neighbourhoods cover a large fraction of To formally define sequence homology search as the sequence space, the P-value is high, because a rana computational problem, we consider a query Q and domly generated query will have a high probability of a database D, which are both strings composed of falling into one of those neighbourhoods, resulting in nucleotides (symbols from the alphabet fA; C; G; T g). score larger than s. We typically introduce a scoring function that assigns While for a given database the P -values could be a positive score to matching nucleotides in the two se- computed exactly, such a computation would be imquences, and negative score to mismatches. Insertion practical. Instead, one uses various approximations as⋆ This research is funded by European Community suming that the sequence in the database is randomly FP7 grants IRG-224885 and IRG-231025, VEGA grant generated by an i.i.d. process or a Markov chain 1/0210/10, and Comenius University grant number (e.g., [9]). These assumptions are often violated by real UK/151/2010. databases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>new database will still be n′ = n, even though its real
size is 2n.</p>
      <p>
        Note that the notion of effective database size has
been previously used to adjust for border effects in
case of short queries and databases [
        <xref ref-type="bibr" rid="ref4">17</xref>
        ], and an option
for setting it to an arbitrary value is included in most
software tools for homology search. Analogously,
statistical models in population genetics use population
size as a parameter, but instead of the actual number
of individuals, one typically uses effective population
size to compensate for various effects that are not
considered by the model, such as population size changing
over time [11].
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Kolmogorov complexity of DNA sequences</title>
      <p>
        For instance, the database may contain several se- For our purposes, the effective database size should
quences that are evolutionarily related. Such se- be an estimate of the amount of unique sequence in
quences will often contain only a small number of the database D, taking into account substrings that
changes: for example, DNA sequences of human and may be present in D in many exact or approximate
chimpanzee are identical in 99% of nucleotides over copies. One way of describing the information content
most of their lengths. In addition, recently introduced of a database is its Kolmogorov complexity [
        <xref ref-type="bibr" rid="ref3">16</xref>
        ].
next generation sequencing technologies made it fi- Kolmogorov complexity K(D) of a sequence D is
nancially viable to sequence multiple individuals from the bit length of the shortest program P for a fixed
the same species, leading to various personal human universal Turing machine that outputs sequence D.
genome projects [
        <xref ref-type="bibr" rid="ref10">23</xref>
        ]. Sequences of different individ- Kolmogorov complexity can be understood as a lower
uals from the same species may differ as little as one bound (up to a constant additive term) of compression
in thousand nucleotides. Consequently, databases may achievable by any general-purpose algorithm. A string
contain many highly similar sequences, and it is not of length n sampled uniformly at random from a fixed
appropriate to model them as completely random. alphabet is on average almost incompressible [16,
Sec
      </p>
      <p>Consider again the example in Fig.1. The illustra- tion 2.8.1]. In particular, a string over a four-letter
tion on the right shows a database with clusters of sim- alphabet requires on average approximately 2n bits
ilar sequences that have overlapping neighbourhoods, (with up to an O(log n) additive term). Thus if we
covering a much smaller fraction of the sequence space believe that the databases with the same Kolmogorov
than the database on the left. Consequently, the esti- complexity will behave similarly, we should use n′ =
mate based on the assumption of random distribution K(D)=2 as an estimate of the effective database size
of sequences in the database will necessarily overes- of database D.
timate the P -value, potentially leading to rejection of At a first glance, Kolmogorov complexity seems
high scoring pairs as matches likely to occur by chance. to be an ideal estimator to use in this context. It</p>
      <p>
        One of the solutions to this problem proposed in accounts for possible major differences between the
the literature is to remove the redundancies, such as real database and a randomly generated one. In
parclosely related sequences, from the sequence data- ticular, using Kolmogorov complexity would
compenbase [
        <xref ref-type="bibr" rid="ref9">22</xref>
        ]. New query can be first searched against such sate for differences in frequencies of individual
nunon-redundant database, the statistical significance of cleotides (database containing only a long stretch of
resulting matches can be evaluated, and then a sec- As will have a small effective size), and redundant
ondary search can be launched against the whole data- sequence content (sequences that have only few
difbase. In this paper, we propose a different solution. ferences can be described very efficiently in a small
One of the main parameters used in P -value computa- space). Moreover, the concept of Kolmogorov
comtion is the database size n. Instead of the real database plexity has been successfully used in similar contexts
size, we propose to use an effective database size n′ before, in applications such as computing distance
be(n′ &lt; n) which will account for redundancies in the tween genomes [
        <xref ref-type="bibr" rid="ref2">15</xref>
        ] (see also [
        <xref ref-type="bibr" rid="ref5">10, 18</xref>
        ] for overview).
database. For example, if we take a random database The exact Kolmogorov complexity of database D
of size n, and double its contents by including second is not computable. Yet in practice, we can use
variexact copy of each sequence, the effective size of such ous compression algorithms instead of computing the
Kolmogorov complexity. For a fixed compression algo- be conservative, i.e., the estimates should be strictly
rithm, the compressed size c(D) of database D is an higher than the exact P -values, so that the estimated
upper bound on the Kolmogorov complexity K(D), P -value represents an upper bound on the probability
and we can use value n′ = c(D)=2 as an estimate of of a particular match being a false positive.
the effective database size. Several efficient algorithms We have decided to experimentally evaluate the
specifically tuned to compression of DNA sequences accuracy of the P -values obtained by the compression
are available (see [
        <xref ref-type="bibr" rid="ref1">6, 7, 14, 3</xref>
        ]). method in a simple scenario motivated by the next
      </p>
      <p>As we will see in the next section, the upper bounds generation sequencing technologies. In this scenario,
on P -values (or conservative estimates) are generally a sequencing machine generates many short sequences
desirable in cases where exact P -values cannot be com- (reads). These reads need to be mapped as substrings
puted. Since the P -values increase monotonically with to previously known reference genomes. In most cases,
the database size, using an upper bound on the effec- the reads will match the reference sequence exactly,
tive database size should not by itself lead to non- but sometimes we will see one or two mismatches.
conservative bounds. These mismatches can be either due to sequencing
er</p>
      <p>Unfortunately, using Kolmogorov complexity may rors, or (more interestingly) due to differences between
not necessarily lead to conservative bounds in all in- individuals.
stances. As an extreme case, base-4 expansion of many
fundamental constants, such as , can be generated In our simplified scenario, the database D is a
sinby a constant-size program. First n bits generated by gle string of length n, the query Q is a string of
such a program can be used as a sequence database of length m, and we are searching in D for a substring of
size n, replacing digits 0; : : : ; 3 with nucleotides. Kol- length m with the smallest Hamming distance from Q.
mogorov complexity of such database is O(log n) (we In contrast to the full homology search problem, we
alneed a constant number of bits to represent the pro- ways consider the whole query (each read has to map
gram, and log n bits to represent the real size of the completely to the reference), and we also disallow
indatabase), yet for all practical purposes, this database sertions and deletions.
behaves as a random database of size n [2].</p>
      <p>Thus using a Kolmogorov complexity and
compression-based estimates of effective database size does not
necessarily lead to conservative estimates of P -values
in homology search. In the next section, we explore
practical issues of using these compression-based
estimates in an experimental setting.</p>
      <p>We will say that the distance of query Q from
database D is the minimum Hamming distance
between Q and some substring of length m from
database D. This Hamming distance will represent our
score. P -value for a particular Hamming distance h
is then the number of all m-tuples that are at a
distance of at most h from D, divided by 4m (the number
of all possible m-tuples). Note that in this definition
of P -value, we keep the database fixed, and only the
query is random, chosen uniformly from all possible
queries of length m. In contrast, most methods
consider both query and database as random.</p>
    </sec>
    <sec id="sec-3">
      <title>Estimating effective database size through compression 3</title>
      <p>Here, motivated by the discussion in the previous
section, we explore the use of compression software for es- The main advantage of this simplified model is that
timating the effective database size. The methodology we can compute these P -values exactly in a reasonable
is very simple. First, we compress the database and amount of time, and compare them to the P -values
esmeasure the size of the resulting file in bytes. Then, timated based on the randomly generated database of
we multiply this size by four to account for the fact corresponding effective size, thus evaluating the
accuthat in a uniformly random database, we need two racy of our concept. The rest of this section is
orgabits to encode each nucleotide. In this way, we obtain nized as follows. First, we introduce the algorithm for
an estimate of the effective database size which can exact computation of P -values in our simplified
scebe used in any formula or algorithm for estimating nario. Then we explore several cases of generated and
P -values on uniformly distributed databases. real databases.</p>
      <p>The P -values are used to reject high scoring pairs The file compression algorithms typically detect
that have a high probability of occurring by chance. In symbols or small groups of symbols that occur in the
a typical search there will be many spurious matches text more frequently than the others, and encode them
with high P -values, and only a few top-scoring by shorter codewords. Thus, the size of the compressed
matches will represent genuine similarities with com- database is related to the entropy of the source
genermon evolutionary origins. Since our main task is to ating the database. In addition to that, some methods
separate these few good matches from many spuri- try to detect longer strings that are repeated exactly
ous matches, we would like the P -value estimator to or with mismatches multiple times, and to store only
one copy of each such string as well as differences be- in the given sequence. GC-content varies widely in
tween the approximate copies. different genomes, or even between segments of the</p>
      <p>
        In our experiments, we first try to separate these same genome. An i.i.d. database with GC-content g
two phenomena by using artificially generated data- has entropy H(g) = g lg(g=2) (1 g) lg((1 g)=2),
bases, and in the end, we apply compression software and therefore can be encoded by approximately H(g)n
to real DNA sequences, where both of these issues are bits, for example by the arithmetic encoding [
        <xref ref-type="bibr" rid="ref7">20</xref>
        ].
at play. Following our general approach, we can try to use
the formulas for the uniform nucleotide frequencies
3.1 Algorithm to compute exact P -values tainmdateeffetchteivPe -dvaatluabesasfeorsitzheeEa(cntu;agl) d=atHab(ags)en=o2f
stiozeesnWe have implemented an algorithm that simulta- and GC-content g.
neously computes P -values for a given database D We have tested the performance of such estimates
and all of its prefixes, and for all distances h = on randomly generated databases with GC-contents
0; 1; : : : ; hmax. In experiments we use values hmax = 3 75% and 90%, averaging values for five randomly
genand m = 15. erated databases. Table 1 shows the real P -values,
      </p>
      <p>The algorithm proceeds along the database D and P -values predicted by our compression method, and
at each position updates two arrays M and H. P -values obtained by the simple method of
considArray M of size 4m stores for each m-tuple Q′ its ering a database of length n and GC-content 50%
distance to the current prefix of the database, pro- (disregarding the real GC-content in the P -value
esvided that this distance is at most hmax (otherwise it timate). All estimates were computed by our
algouses a special 1 value). The second array H stores rithm from Section 3.1. For real P-values the algorithm
for each distance h hmax the number of m-tuples was applied to the generated database with a skewed
with distance exactly h from the current prefix of the GC-content, for simple and predicted estimates to five
database D. random databases of appropriate size with</p>
      <p>When we read a new nucleotide of the database, GC-content 50%.
we need to consider the last m-tuple of the current For small P -values the real and simple estimates
prefix. We enumerate all m-tuples at a distance of at are quite similar, which is expected since in a short
most hmax from this new m-tuple, and for each we database very few m-tuples occur multiple times, even
update arrays M and H as appropriate. Values in H if the composition of the database is skewed. As a
recan be easily converted to the desired P -values for the sult, the compression method gives non-conservative
current prefix at different distance thresholds. Note estimates, because it uses a much smaller database
that this algorithm is feasible only for small values size. For larger P -values, the compression estimates
of m, because it requires (4m) memory. become conservative, and are closer to the true</p>
      <p>The algorithm can be used to compute exact P -value than the simple estimates. This is because
P -value for a given real sequence database, which we a database with high GC-content is less likely to
conconsider as a reference value. It can also be applied to tain m-tuples with low GC-content, and thus a larger
randomly generated databases, leading to estimates database size is required to achieve the same P -value.
of P -values that would be obtained in a model, where For example, among estimates for GC-content 75%
both the database and the query are random. To em- shown in Table 1, compression estimates become
conulate the proposed method, we compress a sequence servative for databases larger than 107 and 105
nudatabase, compute its effective size, and then use the cleotides for h = 0 and h = 2, respectively.
P -value estimates obtained from random databases of We can study the situation analytically for the case
the matching size. when the query appears in the database without a
mis</p>
      <p>Finally, we also consider a simple method, where match. Let Xi be the event that query Q has an
ocwe use random databases of the same size as the orig- currence with distance at most h at position i in the
inal database, without compression. P -values com- random database of GC-content g, and let X be the
puted in this way correspond to the commonly used total number of occurrences of query Q with at most
techniques for P -value estimation without using the h mismatches in the whole database. For g = 0:5, we
effective database size mechanism. can compute the probability of Xi by summing over
the number of mismatches
3.2</p>
      <sec id="sec-3-1">
        <title>Entropy</title>
        <sec id="sec-3-1-1">
          <title>The four nucleotides do not occur in genomes equally frequently. A commonly used measure of DNA composition is GC-content: the percentage of Cs and Gs</title>
          <p>h (m) ( 3 )k ( 1 )m−k
P (XijQ; g = 0:5) = ∑
:
k=0
k
4
4</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>For general g we have to distinguish between mismatches on G/C positions and mismatches on A/T</title>
          <p>Database size 104
Real (GC 75%) 9:3 10 6
Predicted (GC 75%) 8:4 10 6
Simple (GC 75%) 9:3 10 6
Real (GC 90%) 9:2 10 6
Predicted (GC 90%) 6:5 10 6
Simple (GC 90%) 9:3 10 6
105</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>The expected number of occurrences can be computed by linearity of expectation (for simplicity, we ignore the edge effect at positions n m + 2; : : : ; n, which is negligible for large n):</title>
          <p>E(XjQ; g) =
n
∑ E(XijQ) = nP (XijQ; g):
i=1</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Therefore for small P -values, where the approximation</title>
          <p>of 1 e−x is sufficiently accurate, the estimate Pest is
lower by approximately a factor of H(g)=2 than the
real P-value. This implies that no correction for
entropy is in fact necessary for small P -values.</p>
          <p>On the other hand, when h = 0 but n is sufficiently
large, the compression estimates become conservative.</p>
          <p>In particular, let us assume that
n &gt;</p>
          <p>H(g)2−m−1
m2−m ln(2)
(1
g)m =
1:386
H(g)</p>
          <p>m4m + o(m4m):
This implies e−E(n;g)4 m &lt; 2−me−n2 m(1−g)m . The
right-hand side is one of the terms of the sum
m (m)
∑
aTWhPeioswicsiosllomanmpdpoirnsotlxryiibmuusaettideonatphwpeirdtohixsittmrhieabtumitoienoan[n1o7f] dv=aisrEriea(gbXalerjdQXs;dgbe)y-. z=0 z 2−me−n2 mgz(1−g)m z ;
pendencies between occurrences at adjacent positions and therefore the left-hand side is upper-bounded by
Ianndoaulrsosiamssuulmateiosntshaitt nleids ltaorgveearnydgooredlaetsivtiemlyastmesalolf. the Twhhioslesi msupmle absowuneldl. iTshniostimveprlyieisntPeersetstinPgr,easli.nce it
P -values (data not shown). If X is from Poisson dis- works only for very large n, where P-values are very
tribution, the probability of at least one occurrence of close to one. Nonetheless, it agrees with our
observaa given query Q is P (X &gt; 0jQ; g) = 1 e− . To obtain tion that the compression estimate is appropriate for
the final P -value, we have to consider this expression sufficiently large n. Perhaps a tighter bound on n can
for different queries, or more precisely, for groups of be obtained by considering additional elements of the
queries with the same number Gs and Cs, of which Qz sum.
is one representative:</p>
          <p>m (m)
Preal = P (X &gt; 0jg) = ∑
z=0
z
2−mP (X &gt; 0jQz; g):
3.3</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Redundancy</title>
        <p>∑zm=0 (mz)2−mn2−mgz(1
The compression estimate Pest uses the same formula Next, we consider artificial databases that are
conbut g = 0:5 and E(n; g) instead of n. For h = 0, catenation of many mutually similar sequences of the
E(XjQ; g) simplifies to n2−mgz(1 g)m−z, and in par- same length k. In the experiments we use k = 104.
ticular E(XjQ; 0:5) = n4−m does not depend on z, To generate the database, we first sample a string
faUunsndicntgtihotnehrie1sfoarpeep−rPoxexsitcma=natbi1oen,wweel−elEoa(bpntp;agr)io4nximm.aFteodr bsymaxll[8x],. sqStur=einnsgc1esws.2il:l: :bsek tohfelecnegnttherkoufntihfoercml ulystaetr
roafnsdiommil.arThseisThe i-th sequence in the concatenated database is
Pest = E(n; g)4−m obtained from S by randomly mutating several
nuPreal g)m−z = H(g)=2: cleotides of S so that the nucleotide j is the same</p>
        <p>Database size 104
Real 9:3 10 6
GenCompress 9:3 10 6
bzip2 1:0 10 5
Simple 9:3 10 6
105
as sj with the probability of 90%, and with the
probability of 10% it changes to another nucleotide chosen
randomly from the remaining three. This way, we get
a clustered database of sequences that differ from the
center of the cluster on 10% positions on average.</p>
        <sec id="sec-3-2-1">
          <title>In this experiment, estimates based on data compression are mostly conservative and better than the estimates obtained by the simple methods.</title>
          <p>3.4</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Real data</title>
        <p>For clustered databases of various sizes, we
compute the real P -value simple estimate, which uses
effective size equal to the real size of the database
without compression, and two estimates based on two
different lossless compression programs GenCompress [6]
and bzip2. The results are shown in Table 2.</p>
        <sec id="sec-3-3-1">
          <title>Finally, we have applied the compression method of</title>
          <p>estimating the effective database size to real genomic
data from human, chimpanzee, and rhesus macaque.</p>
          <p>Our set consisted of a portion of human
chromosome 22 and corresponding portions of genome from</p>
          <p>Bzip2 is based on Burrows–Wheeler transform [5] the two other primates. We have omitted larger blocks
which tends to create blocks of identical symbols if the that did not have counterparts in neither chimpanzee,
input contains repeated substrings. The transformed nor macaque. The sequence data and the genome
text is then encoded by other compression techniques, alignments were obtained from the UCSC genome
such as Huffman encoding. To save memory, bzip2 di- browser [13].
vides a file into blocks and processes each block sep- The database contains a short block of the human
arately, which may have negative effect on the com- sequence followed by the corresponding block in the
pressed size. two other genomes, followed by another block in
human, etc. Therefore, similar sequences are close to each</p>
          <p>GenCompress [6] is an algorithm developed specifi- other which improves compression with algorithms
cally for compressing DNA sequences. It finds approx- such as bzip2 that always consider only a block of the
imate repeats in the compressed sequence and encodes whole file.
them by a sequence of edit operations. GenCompress The results of the test for different input sizes are
is a single-pass algorithm that proceed along the in- shown in Table 3. As we can see, with GenCompress
put sequence and in each step it finds the best prefix we often underestimate real P -values, while bzip2
that can be encoded as an approximate repeat of some achieves only a very small improvement compared to
substring of already encoded input sequence. the simple method without compression.</p>
          <p>
            Earlier, we have reached a conclusion that skewed similarity common in DNA sequence databases,
GC-content should not automatically imply lower ef- one could use fast methods for identifying such
fective database sizes since this would lead to under- blocks [
            <xref ref-type="bibr" rid="ref6">12, 19</xref>
            ] to speed up the algorithms developed
estimation of small P -values. However, compression specifically for compression of DNA sequences [6, 7,
algorithms estimate the effective database size based 14, 3].
on both sequence redundancy and lower sequence en- Finally, we would like to extend our work to more
tropy in case of locally high or low GC-contents. That complex scenarios of homology search. Longer query
could be the reason why GenCompress estimates are sequences will require handling of insertions, deletions,
non-conservative. and matches that involve only portions of the query
          </p>
          <p>We have attempted to further correct for this issue sequence. More complex scoring schemes on both
nuby computing an average database entropy H′. The ef- cleotide and protein sequences also need to be
examfective database size estimated by GenCompress was ined.
then multiplied by the correction factor of 2=H′. This
approach leads to surprisingly good P -value estimates References
that were also conservative in our experiments
(Table 3, line GC corrected).</p>
          <p>To estimate the value H′ for this experiment, we
have computed entropy separately for non-overlapping
windows of size 1000 to capture different properties of
individual genomic regions. We have estimated
a Markov chain of second order from each window of
the sequence and computed an entropy of this Markov
chain, that is, entropy where the probability of each
nucleotide is conditioned on the two previous
nucleotides to capture local dependencies in DNA sequences.</p>
          <p>The average entropy H′ of the whole database was
then computed as an average of entropy values from
all windows.
1. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and</p>
          <p>D.J. Lipman: Basic local alignment search tool.
Journal of Molecular Biology, 215 (3), 1990, 403–410.
2. D. Bailey and R. Crandall: Random generators and
normal numbers. Experimental Mathematics, 11 (4),
2002, 527–546.
3. B. Behzadi and F. Le Fessant: DNA compression
challenge revisited: a dynamic programming approach.</p>
          <p>In: Combinatorial Pattern Matching (CPM 2005),</p>
          <p>Springer, 2005, 190–200.
4. D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J.
Ostell, and E.W. Sayers: GenBank. Nucleic Acids
Research, 37 (Database issue), 2009, D26–31.
5. M. Burrows and D.J. Wheeler: A block-sorting lossless
data compression algorithm. Technical Report 124,
Digital Equipment Corporation, Palo Alto, California,
4 Conclusion 1994.
6. X. Chen, S. Kwong, and M. Li: A compression
In this paper, we have considered methods for more algorithm for DNA sequences and its applications
accurate estimation of P -values in the context of se- in genome comparison. In: Genome Informatics
quence homology search. In particular, we propose to 7. X(G. ICWh’e9n9,)M,U.nLiiv,eBr.saMl aA,caanddemJ.yTProremssp,: 1D9N99A, C51o–m6p1r.ess:
adjust the size of the database to compensate for the fast and effective DNA sequence compression.
Bioinstructure present in the database due to the fact that formatics, 18 (12), 2002, 1696–1698.
individual sequences are related by evolution. 8. T.H. Cormen, C.E. Leiserson, and R.L. Rivest:
Intro</p>
          <p>We have explored the idea of using compression duction to algorithms. MIT Press, 1990.
to estimate the effective database size, and we have 9. A. Dembo, S. Karlin, and O. Zeitouni: (1994). Limit
demonstrated by experiments that the use of the com- distribution of maximal non-aligned two-sequence
segpression algorithms leads to non-conservative P -value mental score. The Annals of Probability, 22 (4), 1994,
estimates for small P -values. This is at least partially 10. R20.22G–i2a0n3c9a.rlo, D. Scaturro, and F. Utro: Textual
caused by the fact that besides identifying longer re- data compression in computational biology: a
synoppeated substrings, compression algorithms also com- sis. Bioinformatics, 25 (13), 2009, 1575–1576.
pensate for sequences with low entropy. We have 11. D.L. Hartl and A.G. Clark: Principles of population
shown that such compensation should not be consid- genetics, Fourth Edition. Sinauer Associates, 2006.
ered, at least in the case of small P -values. We have 12. W.J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and
suggested a simple way to disentangle the portion of D. Haussler: D. Evolution’s cauldron: duplication,
the compression coming from locally low entropy and deletion, and rearrangement in the mouse and human
tsihmowatnest.hat the correction leads to better P -value es- eg1ne1nc4eo8sm4–oe9fs.t.hPerUocneietdedinSgstaotfetshoef NAamtieornicaal,A1c0a0de(m20y),o2f 0S0c3i-,</p>
          <p>The compression would be a fast and efficient way 13. W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin,
of estimating the effective database size. Even though T.H. Pringle, A.M. Zahler, and D. Haussler: The
humost of the general purpose compression algorithms man genome browser at UCSC. Genome Research, 12
(such as bzip2) cannot handle distant large blocks of (6), 2002, 996–1006.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          14. G.
          <article-title>Korodi and I. Tabus: An efficient normalized maximum likelihood algorithm for dna sequence compression</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <year>2005</year>
          ,
          <fpage>3</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          15.
          <string-name>
            <surname>M. Li</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          <string-name>
            <surname>Badger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kwong</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kearney</surname>
            , and
            <given-names>H. Zhang:</given-names>
          </string-name>
          <article-title>An information based sequence distance and its application to whole mitochondrial genome phylogeny</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ),
          <year>2001</year>
          ,
          <fpage>149</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          16.
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vit</surname>
          </string-name>
          <article-title>´anyi: An introduction to Kolmogorov complexity and its applications</article-title>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          17.
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Mitrophanov and M. Borodovsky</surname>
          </string-name>
          <article-title>: (</article-title>
          <year>2006</year>
          ).
          <article-title>Statistical significance in biological sequence analysis</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ),
          <year>2006</year>
          ,
          <fpage>2</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          18.
          <string-name>
            <given-names>O.U.</given-names>
            <surname>Nalbantoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Russell</surname>
          </string-name>
          , and
          <string-name>
            <surname>K.</surname>
          </string-name>
          <article-title>Sayood: Data compression concepts and algorithms and their applications to bioinformatics</article-title>
          . Entropy (Basel, Switzerland),
          <volume>12</volume>
          (
          <issue>1</issue>
          ),
          <year>2010</year>
          ,
          <volume>34</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          19.
          <string-name>
            <given-names>B.</given-names>
            <surname>Paten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Herrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fitzgerald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flicek</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Holmes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Birney</surname>
          </string-name>
          :
          <article-title>Genome-wide nucleotidelevel mammalian ancestor reconstruction</article-title>
          .
          <source>Genome Research</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <year>2008</year>
          ,
          <fpage>1829</fpage>
          -
          <lpage>1833</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          20.
          <string-name>
            <surname>Rissanen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Langdon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1979</year>
          ).
          <article-title>Arithmetic coding</article-title>
          .
          <source>IBM Journal of Research and Development</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>149</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          21.
          <string-name>
            <given-names>T.F.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <surname>M.S.</surname>
          </string-name>
          <article-title>Waterman: Identification of common molecular subsequences</article-title>
          .
          <source>Journal of Molecular Biology</source>
          ,
          <volume>147</volume>
          (
          <issue>1</issue>
          ),
          <year>1981</year>
          ,
          <fpage>195</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          22.
          <string-name>
            <given-names>B.E.</given-names>
            <surname>Suzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McGarvey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mazumder</surname>
          </string-name>
          , and
          <string-name>
            <surname>C.H.</surname>
          </string-name>
          <article-title>Wu: UniRef: comprehensive and nonredundant UniProt reference clusters</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>23</volume>
          (
          <issue>10</issue>
          ),
          <year>2007</year>
          ,
          <fpage>1282</fpage>
          -
          <lpage>1288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          23.
          <string-name>
            <surname>J.C.</surname>
          </string-name>
          <article-title>Venter: Multiple personal genomes await</article-title>
          .
          <source>Nature</source>
          ,
          <volume>464</volume>
          (
          <issue>7289</issue>
          ),
          <year>2010</year>
          ,
          <fpage>676</fpage>
          -
          <lpage>677</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>