<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Approximate Abundance Histograms and Their Use for Genome Size Estimation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mário Lipovský</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Vinarˇ</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bronˇa Brejová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics, and Informatics</addr-line>
          ,
          <institution>Comenius University</institution>
          ,
          <addr-line>Mlynská dolina, 842 48 Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>27</fpage>
      <lpage>34</lpage>
      <abstract>
        <p>DNA sequencing data is typically a large collection of short strings called reads. We can summarize such data by computing a histogram of the number of occurrences of substrings of a fixed length. Such histograms can be used for example to estimate the size of a genome. In this paper, we study a recent tool, Kmerlight, which computes approximate histograms. We discover an approximation bias, and we propose a new, unbiased version of Kmerlight. We also model the distribution of approximation errors and support our theoretical model by experimental data. Finally, we use another tool, CovEst, to compute genome size estimates from approximate histograms. Our results show that although CovEst was designed to work with exact histograms, it can be used with their approximate versions, which can be produced in a much smaller memory.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>A genome is a collection of DNA molecules storing
genetic information in a cell. It can be represented as a set of
long strings over the alphabet Σ = {A,C, G, T } (each string
corresponding to one chromosome). In a DNA
sequencing experiment, many reads are produced from a genome.
These reads are short substrings obtained from random
locations of the genome, potentially with some sequencing
errors. For a genome to be analyzed, the DNA sequence
of individual chromosomes must be first assembled from
the reads, but the process of genome assembly is
computationally demanding and error-prone.</p>
      <p>
        However, it is possible to estimate the genome size and
some other characteristics without the need of genome
assembly [
        <xref ref-type="bibr" rid="ref2 ref5 ref8 ref9">2, 9, 5, 8</xref>
        ]. These estimates are computed from
a summary statistic of reads called the k-mer abundance
histogram. A k-mer is a substring of length exactly k and
the histogram summarizes the number of occurrences of
individual k-mers in the input set of reads.
      </p>
      <p>
        Most of the histogram computing methods count the
occurrences of each k-mer in the input data using hash
tables or suffix arrays [
        <xref ref-type="bibr" rid="ref3 ref4 ref6">6, 4, 3</xref>
        ]. Consequently, their
memory usage increases at least linearly with the
number of processed distinct k-mers. To reduce the memory
requirements, k-mer abundance histograms can be
computed approximately. One of the newest such algorithms,
Kmerlight [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], combines the techniques of sampling and
hashing to maintain a sketch of k-mers and from the
contents of the sketch computes an estimate of the histogram.
      </p>
      <p>
        The approximate histograms were already used as an
input for genome size estimation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], however the impact of
the approximation errors on the accuracy of the resulting
estimates was not evaluated. In this paper, we study the
character of errors of the Kmerlight’s histograms and their
influence on the subsequent genome size estimation.
      </p>
      <p>We start with an empirical study of the
approximation errors of Kmerlight algorithm, and we discover that
Kmerlight produces systematically biased estimates of
some histograms. We explain the source of the bias and
propose an unbiased modification of Kmerlight. Next we
model the distribution of Kmerlight’s errors with the
normal distribution, and we propose a formula that describes
Kmerlight’s variance. We then experimentally test our
theoretical model and explore its limitations.</p>
      <p>
        Finally, we use CovEst software [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to estimate the sizes
of simulated genomes from approximate histograms
produced by Kmerlight. We describe how different properties
of the input influence the accuracy of the estimates, and
we compare the estimates based on exact histograms to
the estimates based on approximate histograms.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Errors in Kmerlight’s Approximate</title>
    </sec>
    <sec id="sec-3">
      <title>Abundance Histograms</title>
      <p>
        A k-mer is a substring of length exactly k. A read of length
r thus contains r − k + 1 k-mers. If a k-mer occurs i times
among all input reads, we say its abundance is i. In this
paper, we use the value k = 21 as in previous works [
        <xref ref-type="bibr" rid="ref2 ref9">9, 2</xref>
        ].
Definition 1. The k-mer abundance histogram is a
sequence f = f1, f2, . . . fm, where fi is the number of k-mers
that occur in the input set exactly i times, and m is the
maximum observed abundance.
      </p>
      <p>Counting the abundance of each individual k-mer
appearing in a large input file requires large memory. As
we are only interested in the histogram, the problem of
kmer counting can be avoided, allowing us to estimate the
histogram more efficiently. We can reduce the amount of
required memory from dozens of gigabytes to hundreds of
megabytes, allowing these computations to be performed
on a personal computer rather than on a cluster.</p>
      <p>We will consider streaming algorithms, which in
general process a sequence of items (in our context k-mers)
in a single pass using only a limited amount of memory
and time. These algorithms maintain an approximate
summary, or a sketch, of the previously viewed k-mers and
counters at level w as t0(w). Using the assumption t0(w) ≈ r · p
we can derive the estimator of F0:</p>
      <p>w ln(t0(w)/r)</p>
      <p>Fˆ0 = 2 · ln 1 − 1r .</p>
      <p>
        To estimate the number of distinct k-mers F0 using this
formula, we choose level w = w∗, so that the number of
empty counters t0(w) at this level is closest to r/2. It has
been shown that selecting this level provides a bounded
error of Fˆ0 with a guaranteed probability [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Estimator of fi. The expected number of distinct k-mers
with abundance i hashed to level w is fi/2w. When a k-mer
is hashed into level w, the probability that it is stored in a
collision-free counter is (1 − 1r )F0/2w−1, which is the
probability that no other k-mer from level w will get hashed
into the same counter. Thus we can estimate the number
of collision-free counters with value i as
1 F0/2w−1
fi/2w · 1 − r
with each new k-mer the sketch is updated. When all the
k-mers are processed, the sketch can be analyzed to
provide the estimate of the k-mer abundance histogram.</p>
      <p>
        In 2002, Bar-Yossef et al. presented three algorithms
for estimating the number of distinct elements in a stream
m
(F0 = ∑i=1 fi) with theoretical guarantees [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In 2014,
Melsted and Halldórsson [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] implemented and extended
Algorithm 2 from the aforementioned paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and used it
for k-mer counting. Their algorithm KmerStream can also
estimate f1, the number of k-mers with abundance one.
      </p>
      <p>
        KmerStream was further improved by Sivadasan et al.
in 2016 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and their software Kmerlight estimates the
whole histogram ( f1, . . . , fm). As we focus on Kmerlight
in our work, we will describe it in detail.
2.1
      </p>
      <sec id="sec-3-1">
        <title>Kmerlight</title>
        <p>Kmerlight maintains a sketch of processed k-mers. Its
output are estimates Fˆ0, fˆ1, . . . , fˆm obtained from the final
content of the sketch. The scheme uses parameters W , r,
u and two hash functions g : Σk → {0, . . . , 2W − 1} and
h : Σk → {0, . . . , r − 1} × {0, . . . , u − 1}. In the analysis,
we will assume that g and h hash each string from Σk
uniformly and independently over their range of values.</p>
        <p>The sketch consists of W levels. Each level w has a hash
table with r counters Tw[0], . . . , Tw[r − 1]. Each counter
stores its value Tw[c].v (the abundance of a particular
kmer) and an auxiliary information Tw[c].p ∈ {0, . . . , u − 1}.</p>
        <p>To process an input k-mer x we first select its levelw as
the number of trailing zeroes in the binary representation
of g(x). As a result, the probability of selecting level w is
1/2w. Next, using has function h, the k-mer is hashed into
a pair (c, j) and processed as follows:
• If counter Tw[c] is empty, its value is increased to 1
and j is stored as an auxiliary information Tw[c].p.
• If Tw[c] is not empty, and Tw[c].p equals j, the counter
value Tw[c].v is incremented.
• Finally, if Tw[c].p 6= j, the counter is marked as dirty.</p>
        <p>Dirty counters are not modified by future updates.
Note that all occurrences of the same k-mer will be stored
in the same counter at the same level. The value of this
counter should correspond to the abundance of this k-mer.
Since two or more different k-mers may hash into the same
counter, collisions may occur. The auxiliary information
helps to detect some of these collisions.</p>
        <p>Estimator of F0. Since on average F0/2w distinct k-mers
are hashed into level w, the probability that a counter
at this level remains empty is approximately p = (1 −
1 )F0/2w . In this estimate and in all subsequent analyses,
r
we assume that the number of distinct k-mers at level w
is exactly F0/2w, although in fact it is a binomial random
variable with this number as the mean.</p>
        <p>The expected number of empty counters at level w is
thus r · p. Let us denote the number of observed empty
(1)
(2)
(3)
If we denote the number of observed collision-free
counters with value i as ti(w), we can derive the estimator of fi:
1 1−F0/2w
fˆi= ti(w) · 2w · 1 − r</p>
        <p>To estimate fi, the algorithm selects level w = wi∗ so that
it maximizes ti(w) – the number of observed collision-free
counters with value i.</p>
        <p>
          Undetected collisions. The value ti(w) is based on the
number of non-dirty counters with value i, but these include
both true positives (collision-free counters) and false
positives (counters with undetected collisions). The expected
number of false positive at one level is at most r/u, where
0, . . . u − 1 is the range of auxiliary information [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
Parameter u can be set to make false positives negligible; thus we
will assume that all collisions are being detected.
Median amplification. To decrease the variance of the
estimates and to use of multiprocessing, t independent
instances of Kmerlight’s sketch are run concurrently.
Estimate Fˆ0 is selected as the median of Fˆ(1), . . . , Fˆ0(t), and
0
final estimates of fi are also the medians of t instances.
Accuracy and complexity. Parameters r, u ant t provide
a trade-off between the memory and accuracy. Assuming
that W = Θ(log F0), the algorithm uses O(t · r · log(F0))
memory words, and processing of one k-mer requires O(t)
time. The authors have shown that the algorithm
computes estimates Fˆ0 and fˆi for sufficiently large fi ( fi ≥
F0/λ ) with a bounded relative error (1 − ε)F0 ≤ Fˆ0 ≤
(1 + ε)F0, (1 − ε) fi ≤ fˆi≤ (1 + ε) fi with probability
at least 1 − δ , when the parameters are set as follows:
t = O(log(λ /δ )), r = O( ελ2 ) and u = O( λεF20 ). Due to loose
constants, these asymptotic estimates cannot be used to set
parameters for obtaining desirable error bounds. The
accuracy of the algorithm was tested experimentally with
arbitrary parameters W = 64,t = 7, r ∈ {216, 218}, u = 213.
To study the character of approximation errors, we start
with several observations on generated data. We have first
generated a genome: a random sequence g1 . . . gL
consisting of L characters A,C, G, T , each with probability 1/4.
Next we have generated c · L/` reads of length ` = 100,
where c is a parameter describing the depth of coverage of
the genome by reads. For each read, we have uniformly
selected a random starting position s ∈ {1, . . . , L − ` + 1}.
The read then consists of characters gsgs+1 . . . gs+`−1.
Finally we have simulated sequencing errors by modifying
each base randomly with probability e. Our initial
experiment has used a single data set with parameters L = 106,
c = 50, and e = 0.02.
        </p>
        <p>
          For this data set, Figure 1 shows the exact k-mer
abundance histogram (k = 21) produced by Jellyfish [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and the
means and standard deviations of the estimates fˆifrom 50
Kmerlight runs on the same input. Kmerlight was run with
parameters W = 64, t = 7, r = 215, u = 213.
        </p>
        <p>Perhaps the most striking feature of this histogram is
that Kmerlight systematically overestimates values of fi.
In Section 2.3, we clarify the source of this bias and
present an unbiased estimator.</p>
        <p>Kmerlight guarantees precise estimates of those values
of fi that are close to F0. Unfortunately, in sequencing
data, sequencing errors often create many unique k-mers,
and thus we have f1 close to F0, while the remaining fi
are much smaller. In the extreme case of very low values
of fi, the probability that at least one k-mer with
abundance i becomes stored in a collision-free counter at any
level approaches to zero. If no k-mer survives the
collisions (ti(w) = 0) then fˆi= 0. In our experiment, Kmerlight
produced zero estimates for fi &lt; 500.</p>
        <p>
          Figure 1 shows that the absolute error ( fˆi− fi) and
variance of fˆidecrease with decreasing fi. However, relative
errors ( fˆi− fi)/ fi and their variance increase with
decreasing fi (data not shown). Kmerlight guarantees bounded
errors for high values of fi, but a theoretical quantitative
analysis of the error distribution was not presented in the
previous work [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We provide a quantitative estimate of
the variance in Section 2.4.
2.3
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Kmerlight Approximation Bias and Unbiased</title>
      </sec>
      <sec id="sec-3-3">
        <title>Version</title>
        <p>As we will explain in this section, the bias of Kmerlight
estimates towards higher values is caused by its selection
ˆ
of level wi∗ used to calculate fi. Level wi∗ is chosen to
maximize ti(w) – the number of collision-free counters with
value i at level w. We believe that the authors hoped to
minimize the variance of the estimator fˆiby including as
many k-mers in the estimate as possible.
ing the inequalities we get
Analytical w+. To understand which levels wi∗ are selected
by Kmerlight, we will analytically find the level wi+ that
maximizes E(ti(w)) – the expected number of collision-free
counters with value i.</p>
        <p>As shown in equation (2), E(ti(w)) = fi/2w ·
1 − 1r F0/2w−1. The value of wi+ at which E(ti(w))
is maximized can be obtained from inequalities
E(ti(wi+−1)) ≤ E(ti(wi+)) ≤ E(ti(wi++1)). By
manipulat1
4 ≤</p>
        <p>1 F0/2wi+
1 − r</p>
        <p>1
≤ 2 ,
(4)
and finally, we can calculatewi+:</p>
        <p>r r
lg F0 + lg lg r − 1 − 1 ≤ wi+ ≤ lg F0 + lg lg r − 1
.</p>
        <p>Symbol lg denotes the binary logarithm. Note these
inequalities have at most two integer solutions.</p>
        <p>The choice of level wi+ does not depend on values of i or
fi, but only on F0 and r. Since w1+ = w2+ = · · · = wm+, from
now on we will refer to this level simply by w+. This level
also maximizes the expected number of all non-empty
collision-free counters E(t(w)), where t(w) = ∑i=1 ti(w).
m
Explanation of bias. The number of collision-free
counters at levels w+ − 1 and w+ + 1 is typically similar to the
number of collision-free counters at level w+. There are
two times more k-mers hashed into level w+ − 1 than into
w+, but also more collisions happen at w+ − 1. These two
effects partially cancel each other and maintain similar
values of E(t(w)) for w = w+ − 1, w+, w+ + 1. As a result, any
of these levels can hold the maximal ti(w) and become
chosen by Kmerlight as its w∗. This typically leads to values
i
of ti(wi∗) higher than the expected value E(ti(w)) for fixed
level w = w∗. To obtain fˆi, value ti(wi∗) is multiplied by
i
2wi∗ · 1 − 1r 1−F0/2wi∗ . Thus biased ti(wi∗) also leads to
biased estimator fˆi.</p>
        <p>To illustrate this, Figure 2 shows values wi∗ for different
fi extracted from one Kmerlight run. The analytical level
w+ = 9 is selected most frequently, but Kmerlight often
chooses level 8 or 10 to maximize ti(w). Note that for 3 ≤
i ≤ 15 and i ≥ 40, the values of fi are low, and thus ti(w)
have a high relative variance. The maximal ti(w) can thus
be reached at levels more distant from w+ = 9.</p>
        <p>In order to further demonstrate the similarity of E(t(w))
at the levels close to w+, Figure 3 displays the theoretical
fractions of empty, collided, and non-empty collision-free
counters at different levels of the sketch. Focusing on a
single level w, let us denote as pc f the probability that
a single counter is non-empty and collision-free, pe the
probability that this counter is empty, and pc the
probability that it holds a collision. The number of distinct k-mers
hashed into one counter (X ) follows a binomial
distribution, X ∼ Bin(F0/2w, 1/r), and we can use this
distribution to compute pc f = P[X = 1], pe = P[X = 0], and pc =
P[X &gt; 1]. Note that the expected number of non-empty
collision-free counters at one level is E(t(w)) = r · pc f .</p>
        <p>The presented settings incidentally represent an
unfEa(vto(9r)a)b.leIf tshietureatwioans afgorreaKtemr deriflifgehret,ncebebceatwuseeenEE((tt(8(w))+)≈),
E(t(w+−1)) and E(t(w++1)), Kmerlight would choose the
level w+ much more often than the other levels and so it
would have a smaller chance of choosing ti(wi∗) &gt; E(t(w+)).
Removing bias from estimates of fi. Our observation
suggests a simple modification to remove the bias from
estimates of fi. In particular, we will use the level w+ that
maximizes the expected number of collision-free counters
E(t(w)), instead of the level wi∗ that maximizes the
observed number of collision-free counters ti(w).</p>
        <p>The modified Kmerlight creates sketches and estimates
F0 in the same way as the original algorithm. Then
using Fˆ0 and r, it calculates w+, as discussed above.
Finally, values of fi are estimated from the observed counts
of collision-free counters at level w+ by equation (3).</p>
        <p>Note that we do not prove analytically that this
estimator is unbiased. Simplifications in our analysis may cause
some small biases in the estimator, but we demonstrate
experimentally that the estimator works well on our
generated data. In Figure 4, we compare the original and
modified Kmerlight in 300 trials of both versions on the same
data as in Section 2.2. While the original Kmerlight
overestimates fi significantly, the modified Kmerlight achieves
E( fˆi) ≈ fi, without much change in the variance.
2.4</p>
      </sec>
      <sec id="sec-3-4">
        <title>Evaluation of Approximation Variance</title>
        <p>In this section, we estimate the variance of fˆi. We will
consider estimates obtained at a fixed levelw (for example
w = w+). Recall that E(ti(w)) = fi/2w · (1 − 1/r)F0/2w−1.
We will consider ti(w) to follow a binomial distribution
Bin( fi, ps), where ps = 1/2w · (1 − 1/r)F0/2w−1. This
simplification corresponds to a simple sampling process in
which we sample each of fi k-mers with probability ps,
and we discard each k-mer with probability 1 − ps. Note
that this approach assumes that each k-mer remains
collision free independently of others, which is not the case.</p>
        <p>Since a random variable following Bin(n, p) has
variance of np(1 − p), Var(ti(w)) = fi · ps · (1 − ps). The
estimate of fi is obtained as ti/ps, so</p>
        <p>Var( fˆi) = Var
ti
ps</p>
        <p>1
= p2 · Var(ti) =
s
fi · (1 − ps)
ps
(5)
Effect of medians. Kmerlight chooses the estimate fˆias
a median of estimates of t independent sketches: fˆi=
med( fˆi(1), . . . fˆi(t)). For a continuous random variable with
a density function f (x) and mean x¯, its sample median
from a sample of size n is asymptotically1 normal:
(6)
(7)
med(x) ∼˙ N x¯,</p>
        <p>1
4n f (x¯)2
.</p>
        <p>As the binomial distribution of ti(w) is a discrete
distribution, we will approximate Bin( fi, ps) with N(μ =
fi ps, σ 2 = fi ps(1 − ps)). The density function of normal
distribution in its mean μ is √21πσ2 .</p>
        <p>Using approximations (5), (6), variance of fˆiselected as
a median of t instances can be derived as follows:</p>
        <p>Var( fˆi) ≈ 24πt Var( fˆi(l)) = 2πt fi · (1p−s ps) .</p>
        <p>Experiments. We ran the modified Kmerlight in 300
trials on the data presented in Section 2.2. Figure 5 shows
histograms of values of fˆifor selected2 values of i. We
compare these histograms with two normal distributions.
The best normal fit is a Gaussian with its mean and
stanˆ
dard deviation obtained from the observed values of fi.
The plotted theoretical prediction uses the exact value of
fi as its mean and the variance is calculated according to
equation (7) using the exact values of fi and F0.</p>
        <p>The quality of normal approximation largely depends
on the true value of fi. According to our approximation,
the expected number of counters with value i at level w+ is
E(ti) = fi · ps. For this dataset, w+ = 9, and thus the
sam1
pling probability ps is approximately 1/29 · 2 ≈ 1/1000
when we use the upper bound from equation (4).</p>
        <p>
          For the lowest values of fi (i.e. f6, f10), we have E(ti) &lt;
1. So typically no k-mers survive the collisions at level
1Even for a sample of only 7 observations drawn from the normal
distribution, the relative error of this approximation is only about 6% [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>2To select values i = 6, 10, 13, 16, 32, 23, we have sorted the values
fi and selected each eighth value. We have also included f1, f2, since
they differ from other fi by orders of magnitude.
w+, and thus ti = 0 and fˆi= 0. Due to the use of medians,
at least one k-mer must survive in at least four instances to
produce ti ≥ 1. As a result, we obtain fˆi= 0 in all trials
even for f10 = 140. Since the estimates fˆiare always zero,
our estimates of variances for these fi are very imprecise.</p>
        <p>If the value fi is such that E(ti) is around 1 (i.e. f13, f16),
a very small number of k-mers hash into collision-free
counters at level wi∗ or w+. Therefore the estimator fˆican
reach only a limited set of discrete values ( fˆi= 0 if no
kmer survives collisions, fˆi= 1/ps ≈ 1000 if one k-mer is
in collision-free counter, fˆi= 2/ps ≈ 2000 if two k-mers
survive, . . . ), as it can be seen in Figure 5. The
approximation with normal distribution is not precise for these values
fi, since the distribution of fˆiis clearly discrete.</p>
        <p>Finally, for higher fi, where E(ti) 1, the estimator fˆi
takes on various values, and the approximation with
normal distribution seems reasonable, as it can be seen from
the bottom row of Figure 5.</p>
        <p>We have also applied Kolmogorov-Smirnov tests to test
the normality of fˆi. These tests reject normality for fi &lt;
20, 000 with our dataset of 300 trials at the significance
level of 5%. Overall, we conclude that for sufficiently high
values of fi the distribution of fˆican be approximated by
Gaussian with variance calculated by (7).</p>
        <p>Finally we present the comparison of Kmerlight’s
variance and its theoretical prediction in Figure 6. In this
experiment, our estimate of the standard deviation of
Kmerlight’s estimates has error less than 5% even for
lower fi with values around 1000 ≈ 1/ps., where the
normal approximation was still inadequate. The experiment
further suggests that an accurate estimate of variance for
the lowest values of fi &lt; 100 would be zero.</p>
        <p>
          Figure 6 also reveals a difference in variances between
the original and modified Kmerlight’s estimates for fi ∈
(100, 1000). The original Kmerlight searches for a level w
that maximizes ti(w) so even if only one k-mer with
abundance i survives the collisions at any level of the sketch,
ˆ
Kmerlight will use it to estimate fi. We did not include
this effect into the variance estimation, hence the estimates
for these fi are not precise for the original Kmerlight.
Sivadasan et al [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] provide theoretical bounds on
Kmerlight’s approximation errors, but they do not provide
methods for setting parameters in practical scenarios. In
this section, we show how to set the parameters in order to
achieve relative error bounded by ε with probability 1 − δ ,
under the assumption that fˆiis distributed according to the
normal approximation derived in the previous section.
        </p>
        <p>We first derive a two-sided prediction interval of fˆi.
Let u α2 be the critical value of the normal
distribution for significance level α2 . By standardization we get
( fˆi− fi)/σ ∼˙N(0, 1), and so we have</p>
        <p>α
P −u 2 ≤
fˆi− fi
σ</p>
        <p>α
≤ u 2
≈ 1 − α.</p>
        <p>If we set ε = u α2 σ / fi, we obtain the bounds for fˆi:</p>
        <p>P (1 − ε) fi ≤ fˆi≤ (1 + ε) fi ≈ 1 − α.</p>
        <p>To achieve bounded error with probability at least 1 − δ
simultaneously for m different values of i, we set α = mδ
following the Bonferroni correction.</p>
        <p>Standard deviation σ can be calculated by the equation
(7) using F0, fi, r,t. Note that σ depends logarithmically
on F0 and thus a rough estimate of F0 is sufficient. We
restrict the bounded precision only for sufficiently large fi
by setting fi = F0/λ (λ = 1000 for example). The
estimates of larger fi will be even more precise. Finally, by
using ε = u α2 σ / fi, we can calculate error bound ε for
a given probability δ and parameters r,t, λ . This allows
us to efficiently explore various values ofr, and t and their
influence on approximation errors. For example, we can
find parameters which minimize the approximation error
for a fixed memory limit.
3</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Genome Size Estimation</title>
      <p>
        One motivation for computing k-mer abundance
histograms from sequencing data is to obtain genome size
estimates. Here we use CovEst software [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] developed in our
group to compare the accuracy of genome size estimates
produced from the exact and approximate histograms.
      </p>
      <p>
        We use data generated as described in Section 2.2, but
we vary parameters L (genome size), c (genome
coverage by reads), e (sequencing error rate). For each set of
genome parameters (c, e, L) we have generated 50 data
sets. Then we computed the exact histogram using
Jellyfish software [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and two approximate histograms using
the original and the modified version of Kmerlight. Finally
we ran CovEst on these three histograms using the model
without repeats, and we collected three estimates of
coverage cˆj, cˆok, cˆmk. In all experimental results, we focus on the
estimates of coverage cˆ. By dividing the sum of lengths of
all reads by cˆ, we can obtain estimates of genome size Lˆ.
      </p>
      <p>In the first experiment (top row of Figure 7) we
investigate the effect of increasing sequencing error rates on
the accuracy of coverage estimates. On exact histograms,
CovEst produces unbiased estimates of coverage with their
variance increasing with error rate. Estimates based on
approximate histograms are clearly less accurate but still
achieve a relatively good precision.</p>
      <p>Mean errors are consistently lower for modified
Kmerlight than the errors of the original Kmerlight. Pair
Student’s t-tests reject hypotheses mean(cˆok − cˆmk) = 0
for three of four presented datasets, with exception of
the dataset with e = 0.05 where the p-value is 0.3.
Thus we conclude that the estimates based on modified
Kmerlight’s histograms are significantly better than the
estimates based on original Kmerlight’s histograms. With
modified Kmerlight, CovEst produces estimates with all
errors bounded by 0.4%, 1%, 4%, 15% for sequencing
error rates of 0, 0.01, 0.05, 0.1 respectively, and we consider
these estimates sufficiently accurate.</p>
      <p>In the second experiment (middle row of Figure 7),
we study the accuracy for different genome coverage
values. All CovEst estimates cˆj in our experiment range in
30%, 6%, 0.3%, &lt; 0.1% intervals around the true coverage
for coverages of 0.5, 2, 10 and 50. The variance of cˆj
is comparable to variance of cˆok, cˆmk for two lower
coverages, but with increasing coverage, estimates based on
approximate histograms become less accurate. However,
since the maximal errors reach values of only 0.3, which
is less than 1% of the true coverage c = 50, we also
consider these estimates as useful. Modified Kmerlight
significantly outperforms the original Kmerlight on all four
presented datasets (verified by t-tests).</p>
      <p>Finally, with increasing genome size (bottom row of
Figure 7), CovEst’s estimates become more precise on
the exact histograms, and estimates on approximated
histograms maintain a roughly constant precision for the
examined genome sizes. As the coverage estimate
errors are bounded by 1% for all datasets, we also
consider the approximate histograms sufficiently accurate for
the genome size estimation. Modified Kmerlight reaches
smaller mean error on the two smaller genomes than the
original Kmerlight.
4</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        In this paper, we have studied the character of
approximation errors in the k-mer approximation histograms
produced by Kmerlight [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We have discovered that
Kmerlight’s estimates of fˆiare biased towards higher
values and provided a new estimate without this bias. We
have demonstrated that for sufficiently large values of fi,
the distribution of estimates can be well approximated by
a normal distribution. This approximation can be used to
tune the parameters of the method.
      </p>
      <p>We have also demonstrated that approximate histograms
produced by Kmerlight are sufficiently accurate to produce
reasonable estimates of genome size, and that for most
settings our modified version of Kmerlight leads to more
accurate genome size estimation than the original Kmerlight.
An important avenue for further research is to compare the
accuracy CovEst results on real sequencing data.</p>
      <p>The use of approximate histograms allows significant
reduction in memory; for example for genome size L =
108, coverage c = 50 and error rate e = 0.1 (F0 = 3.6 · 109),
Kmerlight can produce an approximate histogram in only
83MB of memory, whereas Jellyfish needs 34Gb.</p>
      <p>Note that our modified Kmerlight uses only one of 64
Kmerlight’s levels to compute estimates for all fi, so it
naively seems that we could maintain only this level and
reduce the memory consumption by a factor of 64.
However, our version still uses the full 64 levels to compute Fˆ0,
which is needed to find the desired level w+. We could
try to estimate Fˆ0 separately, particularly, if two passes
over the data are allowed. For example, we could roughly
guess F0 from the size of sequencing data and use fewer
levels. It might be also sufficient to estimate F0 with
smaller hash tables. We did not inquire deeper into this
topic, but we believe that by such techniques the memory
consumption could be further decreased by a significant
factor.</p>
      <p>Also note that the k-mer abundance histogram and many
algorithms used for its computation can be generalized to
a histogram of any input items. Thus the applicability of
this topic goes beyond the field of bioinformatics.
Acknowledgments. This research was funded by VEGA
grants 1/0684/16 (BB) and 1/0719/14 (TV), and a grant
from the Slovak Research and Development Agency
APVV-14-0253.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bar-Yossef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.S.</given-names>
            <surname>Jayram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sivakumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Trevisan</surname>
          </string-name>
          .
          <article-title>Counting distinct elements in a data stream</article-title>
          .
          <source>In International Workshop on Randomization and Approximation Techniques (RANDOM)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . Springer,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hozza</surname>
          </string-name>
          , T. Vinarˇ, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Brejová</surname>
          </string-name>
          .
          <article-title>How Big is That Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra</article-title>
          .
          <source>In String Processing and Information Retrieval (SPIRE)</source>
          , pages
          <fpage>199</fpage>
          -
          <lpage>209</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kurtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narechania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ware</surname>
          </string-name>
          .
          <article-title>A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes</article-title>
          .
          <source>BMC Genomics</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>517</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Marçais</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Kingsford</surname>
          </string-name>
          .
          <article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>27</volume>
          (
          <issue>6</issue>
          ):
          <fpage>764</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Melsted</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. V.</given-names>
            <surname>Halldórsson</surname>
          </string-name>
          .
          <article-title>Kmerstream: streaming algorithms for k-mer abundance estimation</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>30</volume>
          (
          <issue>24</issue>
          ):
          <fpage>3541</fpage>
          -
          <lpage>3547</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Melsted</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pritchard</surname>
          </string-name>
          .
          <article-title>Efficient counting of kmers in DNA sequences using a bloom filter</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):
          <fpage>333</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Rider</surname>
          </string-name>
          .
          <article-title>Variance of the median of small samples from several special populations</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>55</volume>
          (
          <issue>289</issue>
          ):
          <fpage>148</fpage>
          -
          <lpage>150</lpage>
          ,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sivadasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Goyal</surname>
          </string-name>
          .
          <article-title>Kmerlight: fast and accurate k-mer abundance estimation</article-title>
          .
          <source>CoRR, abs/1609.05626</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Trimble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shilts</surname>
          </string-name>
          , F. Meyer, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ochman</surname>
          </string-name>
          .
          <article-title>Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes</article-title>
          .
          <source>BMC Genomics</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>537</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>