1 Introduction

1613-0073

Approximate Abundance Histograms and Their Use for Genome Size Estimation

Mário Lipovský

Tomáš Vinarˇ

Bronˇa Brejová

0 0 Faculty of Mathematics , Physics, and Informatics , Comenius University , Mlynská dolina, 842 48 Bratislava , Slovakia

2017

1885 27 34

DNA sequencing data is typically a large collection of short strings called reads. We can summarize such data by computing a histogram of the number of occurrences of substrings of a fixed length. Such histograms can be used for example to estimate the size of a genome. In this paper, we study a recent tool, Kmerlight, which computes approximate histograms. We discover an approximation bias, and we propose a new, unbiased version of Kmerlight. We also model the distribution of approximation errors and support our theoretical model by experimental data. Finally, we use another tool, CovEst, to compute genome size estimates from approximate histograms. Our results show that although CovEst was designed to work with exact histograms, it can be used with their approximate versions, which can be produced in a much smaller memory.

1 Introduction

A genome is a collection of DNA molecules storing genetic information in a cell. It can be represented as a set of long strings over the alphabet Σ = {A,C, G, T } (each string corresponding to one chromosome). In a DNA sequencing experiment, many reads are produced from a genome. These reads are short substrings obtained from random locations of the genome, potentially with some sequencing errors. For a genome to be analyzed, the DNA sequence of individual chromosomes must be first assembled from the reads, but the process of genome assembly is computationally demanding and error-prone.

However, it is possible to estimate the genome size and some other characteristics without the need of genome assembly [ 2, 9, 5, 8 ]. These estimates are computed from a summary statistic of reads called the k-mer abundance histogram. A k-mer is a substring of length exactly k and the histogram summarizes the number of occurrences of individual k-mers in the input set of reads.

Most of the histogram computing methods count the occurrences of each k-mer in the input data using hash tables or suffix arrays [ 6, 4, 3 ]. Consequently, their memory usage increases at least linearly with the number of processed distinct k-mers. To reduce the memory requirements, k-mer abundance histograms can be computed approximately. One of the newest such algorithms, Kmerlight [ 8 ], combines the techniques of sampling and hashing to maintain a sketch of k-mers and from the contents of the sketch computes an estimate of the histogram.

The approximate histograms were already used as an input for genome size estimation [ 8 ], however the impact of the approximation errors on the accuracy of the resulting estimates was not evaluated. In this paper, we study the character of errors of the Kmerlight’s histograms and their influence on the subsequent genome size estimation.

We start with an empirical study of the approximation errors of Kmerlight algorithm, and we discover that Kmerlight produces systematically biased estimates of some histograms. We explain the source of the bias and propose an unbiased modification of Kmerlight. Next we model the distribution of Kmerlight’s errors with the normal distribution, and we propose a formula that describes Kmerlight’s variance. We then experimentally test our theoretical model and explore its limitations.

Finally, we use CovEst software [ 2 ] to estimate the sizes of simulated genomes from approximate histograms produced by Kmerlight. We describe how different properties of the input influence the accuracy of the estimates, and we compare the estimates based on exact histograms to the estimates based on approximate histograms. 2

Errors in Kmerlight’s Approximate Abundance Histograms

A k-mer is a substring of length exactly k. A read of length r thus contains r − k + 1 k-mers. If a k-mer occurs i times among all input reads, we say its abundance is i. In this paper, we use the value k = 21 as in previous works [ 9, 2 ]. Definition 1. The k-mer abundance histogram is a sequence f = f1, f2, . . . fm, where fi is the number of k-mers that occur in the input set exactly i times, and m is the maximum observed abundance.

Counting the abundance of each individual k-mer appearing in a large input file requires large memory. As we are only interested in the histogram, the problem of kmer counting can be avoided, allowing us to estimate the histogram more efficiently. We can reduce the amount of required memory from dozens of gigabytes to hundreds of megabytes, allowing these computations to be performed on a personal computer rather than on a cluster.

We will consider streaming algorithms, which in general process a sequence of items (in our context k-mers) in a single pass using only a limited amount of memory and time. These algorithms maintain an approximate summary, or a sketch, of the previously viewed k-mers and counters at level w as t0(w). Using the assumption t0(w) ≈ r · p we can derive the estimator of F0:

w ln(t0(w)/r)

Fˆ0 = 2 · ln 1 − 1r .

To estimate the number of distinct k-mers F0 using this formula, we choose level w = w∗, so that the number of empty counters t0(w) at this level is closest to r/2. It has been shown that selecting this level provides a bounded error of Fˆ0 with a guaranteed probability [ 1 ].

Estimator of fi. The expected number of distinct k-mers with abundance i hashed to level w is fi/2w. When a k-mer is hashed into level w, the probability that it is stored in a collision-free counter is (1 − 1r )F0/2w−1, which is the probability that no other k-mer from level w will get hashed into the same counter. Thus we can estimate the number of collision-free counters with value i as 1 F0/2w−1 fi/2w · 1 − r with each new k-mer the sketch is updated. When all the k-mers are processed, the sketch can be analyzed to provide the estimate of the k-mer abundance histogram.

In 2002, Bar-Yossef et al. presented three algorithms for estimating the number of distinct elements in a stream m (F0 = ∑i=1 fi) with theoretical guarantees [ 1 ]. In 2014, Melsted and Halldórsson [ 5 ] implemented and extended Algorithm 2 from the aforementioned paper [ 1 ] and used it for k-mer counting. Their algorithm KmerStream can also estimate f1, the number of k-mers with abundance one.

KmerStream was further improved by Sivadasan et al. in 2016 [ 8 ], and their software Kmerlight estimates the whole histogram ( f1, . . . , fm). As we focus on Kmerlight in our work, we will describe it in detail. 2.1

Kmerlight

Kmerlight maintains a sketch of processed k-mers. Its output are estimates Fˆ0, fˆ1, . . . , fˆm obtained from the final content of the sketch. The scheme uses parameters W , r, u and two hash functions g : Σk → {0, . . . , 2W − 1} and h : Σk → {0, . . . , r − 1} × {0, . . . , u − 1}. In the analysis, we will assume that g and h hash each string from Σk uniformly and independently over their range of values.

The sketch consists of W levels. Each level w has a hash table with r counters Tw[0], . . . , Tw[r − 1]. Each counter stores its value Tw[c].v (the abundance of a particular kmer) and an auxiliary information Tw[c].p ∈ {0, . . . , u − 1}.

To process an input k-mer x we first select its levelw as the number of trailing zeroes in the binary representation of g(x). As a result, the probability of selecting level w is 1/2w. Next, using has function h, the k-mer is hashed into a pair (c, j) and processed as follows: • If counter Tw[c] is empty, its value is increased to 1 and j is stored as an auxiliary information Tw[c].p. • If Tw[c] is not empty, and Tw[c].p equals j, the counter value Tw[c].v is incremented. • Finally, if Tw[c].p 6= j, the counter is marked as dirty.

Dirty counters are not modified by future updates. Note that all occurrences of the same k-mer will be stored in the same counter at the same level. The value of this counter should correspond to the abundance of this k-mer. Since two or more different k-mers may hash into the same counter, collisions may occur. The auxiliary information helps to detect some of these collisions.

Estimator of F0. Since on average F0/2w distinct k-mers are hashed into level w, the probability that a counter at this level remains empty is approximately p = (1 − 1 )F0/2w . In this estimate and in all subsequent analyses, r we assume that the number of distinct k-mers at level w is exactly F0/2w, although in fact it is a binomial random variable with this number as the mean.

The expected number of empty counters at level w is thus r · p. Let us denote the number of observed empty (1) (2) (3) If we denote the number of observed collision-free counters with value i as ti(w), we can derive the estimator of fi: 1 1−F0/2w fˆi= ti(w) · 2w · 1 − r

To estimate fi, the algorithm selects level w = wi∗ so that it maximizes ti(w) – the number of observed collision-free counters with value i.

Undetected collisions. The value ti(w) is based on the number of non-dirty counters with value i, but these include both true positives (collision-free counters) and false positives (counters with undetected collisions). The expected number of false positive at one level is at most r/u, where 0, . . . u − 1 is the range of auxiliary information [ 8 ]. Parameter u can be set to make false positives negligible; thus we will assume that all collisions are being detected. Median amplification. To decrease the variance of the estimates and to use of multiprocessing, t independent instances of Kmerlight’s sketch are run concurrently. Estimate Fˆ0 is selected as the median of Fˆ(1), . . . , Fˆ0(t), and 0 final estimates of fi are also the medians of t instances. Accuracy and complexity. Parameters r, u ant t provide a trade-off between the memory and accuracy. Assuming that W = Θ(log F0), the algorithm uses O(t · r · log(F0)) memory words, and processing of one k-mer requires O(t) time. The authors have shown that the algorithm computes estimates Fˆ0 and fˆi for sufficiently large fi ( fi ≥ F0/λ ) with a bounded relative error (1 − ε)F0 ≤ Fˆ0 ≤ (1 + ε)F0, (1 − ε) fi ≤ fˆi≤ (1 + ε) fi with probability at least 1 − δ , when the parameters are set as follows: t = O(log(λ /δ )), r = O( ελ2 ) and u = O( λεF20 ). Due to loose constants, these asymptotic estimates cannot be used to set parameters for obtaining desirable error bounds. The accuracy of the algorithm was tested experimentally with arbitrary parameters W = 64,t = 7, r ∈ {216, 218}, u = 213. To study the character of approximation errors, we start with several observations on generated data. We have first generated a genome: a random sequence g1 . . . gL consisting of L characters A,C, G, T , each with probability 1/4. Next we have generated c · L/` reads of length ` = 100, where c is a parameter describing the depth of coverage of the genome by reads. For each read, we have uniformly selected a random starting position s ∈ {1, . . . , L − ` + 1}. The read then consists of characters gsgs+1 . . . gs+`−1. Finally we have simulated sequencing errors by modifying each base randomly with probability e. Our initial experiment has used a single data set with parameters L = 106, c = 50, and e = 0.02.

For this data set, Figure 1 shows the exact k-mer abundance histogram (k = 21) produced by Jellyfish [ 4 ] and the means and standard deviations of the estimates fˆifrom 50 Kmerlight runs on the same input. Kmerlight was run with parameters W = 64, t = 7, r = 215, u = 213.

Perhaps the most striking feature of this histogram is that Kmerlight systematically overestimates values of fi. In Section 2.3, we clarify the source of this bias and present an unbiased estimator.

Kmerlight guarantees precise estimates of those values of fi that are close to F0. Unfortunately, in sequencing data, sequencing errors often create many unique k-mers, and thus we have f1 close to F0, while the remaining fi are much smaller. In the extreme case of very low values of fi, the probability that at least one k-mer with abundance i becomes stored in a collision-free counter at any level approaches to zero. If no k-mer survives the collisions (ti(w) = 0) then fˆi= 0. In our experiment, Kmerlight produced zero estimates for fi < 500.

Figure 1 shows that the absolute error ( fˆi− fi) and variance of fˆidecrease with decreasing fi. However, relative errors ( fˆi− fi)/ fi and their variance increase with decreasing fi (data not shown). Kmerlight guarantees bounded errors for high values of fi, but a theoretical quantitative analysis of the error distribution was not presented in the previous work [ 8 ]. We provide a quantitative estimate of the variance in Section 2.4. 2.3

Kmerlight Approximation Bias and Unbiased Version

As we will explain in this section, the bias of Kmerlight estimates towards higher values is caused by its selection ˆ of level wi∗ used to calculate fi. Level wi∗ is chosen to maximize ti(w) – the number of collision-free counters with value i at level w. We believe that the authors hoped to minimize the variance of the estimator fˆiby including as many k-mers in the estimate as possible. ing the inequalities we get Analytical w+. To understand which levels wi∗ are selected by Kmerlight, we will analytically find the level wi+ that maximizes E(ti(w)) – the expected number of collision-free counters with value i.

As shown in equation (2), E(ti(w)) = fi/2w · 1 − 1r F0/2w−1. The value of wi+ at which E(ti(w)) is maximized can be obtained from inequalities E(ti(wi+−1)) ≤ E(ti(wi+)) ≤ E(ti(wi++1)). By manipulat1 4 ≤

1 F0/2wi+ 1 − r

1 ≤ 2 , (4) and finally, we can calculatewi+:

r r lg F0 + lg lg r − 1 − 1 ≤ wi+ ≤ lg F0 + lg lg r − 1 .

Symbol lg denotes the binary logarithm. Note these inequalities have at most two integer solutions.

The choice of level wi+ does not depend on values of i or fi, but only on F0 and r. Since w1+ = w2+ = · · · = wm+, from now on we will refer to this level simply by w+. This level also maximizes the expected number of all non-empty collision-free counters E(t(w)), where t(w) = ∑i=1 ti(w). m Explanation of bias. The number of collision-free counters at levels w+ − 1 and w+ + 1 is typically similar to the number of collision-free counters at level w+. There are two times more k-mers hashed into level w+ − 1 than into w+, but also more collisions happen at w+ − 1. These two effects partially cancel each other and maintain similar values of E(t(w)) for w = w+ − 1, w+, w+ + 1. As a result, any of these levels can hold the maximal ti(w) and become chosen by Kmerlight as its w∗. This typically leads to values i of ti(wi∗) higher than the expected value E(ti(w)) for fixed level w = w∗. To obtain fˆi, value ti(wi∗) is multiplied by i 2wi∗ · 1 − 1r 1−F0/2wi∗ . Thus biased ti(wi∗) also leads to biased estimator fˆi.

To illustrate this, Figure 2 shows values wi∗ for different fi extracted from one Kmerlight run. The analytical level w+ = 9 is selected most frequently, but Kmerlight often chooses level 8 or 10 to maximize ti(w). Note that for 3 ≤ i ≤ 15 and i ≥ 40, the values of fi are low, and thus ti(w) have a high relative variance. The maximal ti(w) can thus be reached at levels more distant from w+ = 9.

In order to further demonstrate the similarity of E(t(w)) at the levels close to w+, Figure 3 displays the theoretical fractions of empty, collided, and non-empty collision-free counters at different levels of the sketch. Focusing on a single level w, let us denote as pc f the probability that a single counter is non-empty and collision-free, pe the probability that this counter is empty, and pc the probability that it holds a collision. The number of distinct k-mers hashed into one counter (X ) follows a binomial distribution, X ∼ Bin(F0/2w, 1/r), and we can use this distribution to compute pc f = P[X = 1], pe = P[X = 0], and pc = P[X > 1]. Note that the expected number of non-empty collision-free counters at one level is E(t(w)) = r · pc f .

The presented settings incidentally represent an unfEa(vto(9r)a)b.leIf tshietureatwioans afgorreaKtemr deriflifgehret,ncebebceatwuseeenEE((tt(8(w))+)≈), E(t(w+−1)) and E(t(w++1)), Kmerlight would choose the level w+ much more often than the other levels and so it would have a smaller chance of choosing ti(wi∗) > E(t(w+)). Removing bias from estimates of fi. Our observation suggests a simple modification to remove the bias from estimates of fi. In particular, we will use the level w+ that maximizes the expected number of collision-free counters E(t(w)), instead of the level wi∗ that maximizes the observed number of collision-free counters ti(w).

The modified Kmerlight creates sketches and estimates F0 in the same way as the original algorithm. Then using Fˆ0 and r, it calculates w+, as discussed above. Finally, values of fi are estimated from the observed counts of collision-free counters at level w+ by equation (3).

Note that we do not prove analytically that this estimator is unbiased. Simplifications in our analysis may cause some small biases in the estimator, but we demonstrate experimentally that the estimator works well on our generated data. In Figure 4, we compare the original and modified Kmerlight in 300 trials of both versions on the same data as in Section 2.2. While the original Kmerlight overestimates fi significantly, the modified Kmerlight achieves E( fˆi) ≈ fi, without much change in the variance. 2.4

Evaluation of Approximation Variance

In this section, we estimate the variance of fˆi. We will consider estimates obtained at a fixed levelw (for example w = w+). Recall that E(ti(w)) = fi/2w · (1 − 1/r)F0/2w−1. We will consider ti(w) to follow a binomial distribution Bin( fi, ps), where ps = 1/2w · (1 − 1/r)F0/2w−1. This simplification corresponds to a simple sampling process in which we sample each of fi k-mers with probability ps, and we discard each k-mer with probability 1 − ps. Note that this approach assumes that each k-mer remains collision free independently of others, which is not the case.

Since a random variable following Bin(n, p) has variance of np(1 − p), Var(ti(w)) = fi · ps · (1 − ps). The estimate of fi is obtained as ti/ps, so

Var( fˆi) = Var ti ps

1 = p2 · Var(ti) = s fi · (1 − ps) ps (5) Effect of medians. Kmerlight chooses the estimate fˆias a median of estimates of t independent sketches: fˆi= med( fˆi(1), . . . fˆi(t)). For a continuous random variable with a density function f (x) and mean x¯, its sample median from a sample of size n is asymptotically1 normal: (6) (7) med(x) ∼˙ N x¯,

1 4n f (x¯)2 .

As the binomial distribution of ti(w) is a discrete distribution, we will approximate Bin( fi, ps) with N(μ = fi ps, σ 2 = fi ps(1 − ps)). The density function of normal distribution in its mean μ is √21πσ2 .

Using approximations (5), (6), variance of fˆiselected as a median of t instances can be derived as follows:

Var( fˆi) ≈ 24πt Var( fˆi(l)) = 2πt fi · (1p−s ps) .

Experiments. We ran the modified Kmerlight in 300 trials on the data presented in Section 2.2. Figure 5 shows histograms of values of fˆifor selected2 values of i. We compare these histograms with two normal distributions. The best normal fit is a Gaussian with its mean and stanˆ dard deviation obtained from the observed values of fi. The plotted theoretical prediction uses the exact value of fi as its mean and the variance is calculated according to equation (7) using the exact values of fi and F0.

The quality of normal approximation largely depends on the true value of fi. According to our approximation, the expected number of counters with value i at level w+ is E(ti) = fi · ps. For this dataset, w+ = 9, and thus the sam1 pling probability ps is approximately 1/29 · 2 ≈ 1/1000 when we use the upper bound from equation (4).

For the lowest values of fi (i.e. f6, f10), we have E(ti) < 1. So typically no k-mers survive the collisions at level 1Even for a sample of only 7 observations drawn from the normal distribution, the relative error of this approximation is only about 6% [ 7 ].

2To select values i = 6, 10, 13, 16, 32, 23, we have sorted the values fi and selected each eighth value. We have also included f1, f2, since they differ from other fi by orders of magnitude. w+, and thus ti = 0 and fˆi= 0. Due to the use of medians, at least one k-mer must survive in at least four instances to produce ti ≥ 1. As a result, we obtain fˆi= 0 in all trials even for f10 = 140. Since the estimates fˆiare always zero, our estimates of variances for these fi are very imprecise.

If the value fi is such that E(ti) is around 1 (i.e. f13, f16), a very small number of k-mers hash into collision-free counters at level wi∗ or w+. Therefore the estimator fˆican reach only a limited set of discrete values ( fˆi= 0 if no kmer survives collisions, fˆi= 1/ps ≈ 1000 if one k-mer is in collision-free counter, fˆi= 2/ps ≈ 2000 if two k-mers survive, . . . ), as it can be seen in Figure 5. The approximation with normal distribution is not precise for these values fi, since the distribution of fˆiis clearly discrete.

Finally, for higher fi, where E(ti) 1, the estimator fˆi takes on various values, and the approximation with normal distribution seems reasonable, as it can be seen from the bottom row of Figure 5.

We have also applied Kolmogorov-Smirnov tests to test the normality of fˆi. These tests reject normality for fi < 20, 000 with our dataset of 300 trials at the significance level of 5%. Overall, we conclude that for sufficiently high values of fi the distribution of fˆican be approximated by Gaussian with variance calculated by (7).

Finally we present the comparison of Kmerlight’s variance and its theoretical prediction in Figure 6. In this experiment, our estimate of the standard deviation of Kmerlight’s estimates has error less than 5% even for lower fi with values around 1000 ≈ 1/ps., where the normal approximation was still inadequate. The experiment further suggests that an accurate estimate of variance for the lowest values of fi < 100 would be zero.

Figure 6 also reveals a difference in variances between the original and modified Kmerlight’s estimates for fi ∈ (100, 1000). The original Kmerlight searches for a level w that maximizes ti(w) so even if only one k-mer with abundance i survives the collisions at any level of the sketch, ˆ Kmerlight will use it to estimate fi. We did not include this effect into the variance estimation, hence the estimates for these fi are not precise for the original Kmerlight. Sivadasan et al [ 8 ] provide theoretical bounds on Kmerlight’s approximation errors, but they do not provide methods for setting parameters in practical scenarios. In this section, we show how to set the parameters in order to achieve relative error bounded by ε with probability 1 − δ , under the assumption that fˆiis distributed according to the normal approximation derived in the previous section.

We first derive a two-sided prediction interval of fˆi. Let u α2 be the critical value of the normal distribution for significance level α2 . By standardization we get ( fˆi− fi)/σ ∼˙N(0, 1), and so we have

α P −u 2 ≤ fˆi− fi σ

α ≤ u 2 ≈ 1 − α.

If we set ε = u α2 σ / fi, we obtain the bounds for fˆi:

P (1 − ε) fi ≤ fˆi≤ (1 + ε) fi ≈ 1 − α.

To achieve bounded error with probability at least 1 − δ simultaneously for m different values of i, we set α = mδ following the Bonferroni correction.

Standard deviation σ can be calculated by the equation (7) using F0, fi, r,t. Note that σ depends logarithmically on F0 and thus a rough estimate of F0 is sufficient. We restrict the bounded precision only for sufficiently large fi by setting fi = F0/λ (λ = 1000 for example). The estimates of larger fi will be even more precise. Finally, by using ε = u α2 σ / fi, we can calculate error bound ε for a given probability δ and parameters r,t, λ . This allows us to efficiently explore various values ofr, and t and their influence on approximation errors. For example, we can find parameters which minimize the approximation error for a fixed memory limit. 3

Genome Size Estimation

One motivation for computing k-mer abundance histograms from sequencing data is to obtain genome size estimates. Here we use CovEst software [ 2 ] developed in our group to compare the accuracy of genome size estimates produced from the exact and approximate histograms.

We use data generated as described in Section 2.2, but we vary parameters L (genome size), c (genome coverage by reads), e (sequencing error rate). For each set of genome parameters (c, e, L) we have generated 50 data sets. Then we computed the exact histogram using Jellyfish software [ 4 ] and two approximate histograms using the original and the modified version of Kmerlight. Finally we ran CovEst on these three histograms using the model without repeats, and we collected three estimates of coverage cˆj, cˆok, cˆmk. In all experimental results, we focus on the estimates of coverage cˆ. By dividing the sum of lengths of all reads by cˆ, we can obtain estimates of genome size Lˆ.

In the first experiment (top row of Figure 7) we investigate the effect of increasing sequencing error rates on the accuracy of coverage estimates. On exact histograms, CovEst produces unbiased estimates of coverage with their variance increasing with error rate. Estimates based on approximate histograms are clearly less accurate but still achieve a relatively good precision.

Mean errors are consistently lower for modified Kmerlight than the errors of the original Kmerlight. Pair Student’s t-tests reject hypotheses mean(cˆok − cˆmk) = 0 for three of four presented datasets, with exception of the dataset with e = 0.05 where the p-value is 0.3. Thus we conclude that the estimates based on modified Kmerlight’s histograms are significantly better than the estimates based on original Kmerlight’s histograms. With modified Kmerlight, CovEst produces estimates with all errors bounded by 0.4%, 1%, 4%, 15% for sequencing error rates of 0, 0.01, 0.05, 0.1 respectively, and we consider these estimates sufficiently accurate.

In the second experiment (middle row of Figure 7), we study the accuracy for different genome coverage values. All CovEst estimates cˆj in our experiment range in 30%, 6%, 0.3%, < 0.1% intervals around the true coverage for coverages of 0.5, 2, 10 and 50. The variance of cˆj is comparable to variance of cˆok, cˆmk for two lower coverages, but with increasing coverage, estimates based on approximate histograms become less accurate. However, since the maximal errors reach values of only 0.3, which is less than 1% of the true coverage c = 50, we also consider these estimates as useful. Modified Kmerlight significantly outperforms the original Kmerlight on all four presented datasets (verified by t-tests).

Finally, with increasing genome size (bottom row of Figure 7), CovEst’s estimates become more precise on the exact histograms, and estimates on approximated histograms maintain a roughly constant precision for the examined genome sizes. As the coverage estimate errors are bounded by 1% for all datasets, we also consider the approximate histograms sufficiently accurate for the genome size estimation. Modified Kmerlight reaches smaller mean error on the two smaller genomes than the original Kmerlight. 4

Conclusion

In this paper, we have studied the character of approximation errors in the k-mer approximation histograms produced by Kmerlight [ 8 ]. We have discovered that Kmerlight’s estimates of fˆiare biased towards higher values and provided a new estimate without this bias. We have demonstrated that for sufficiently large values of fi, the distribution of estimates can be well approximated by a normal distribution. This approximation can be used to tune the parameters of the method.

We have also demonstrated that approximate histograms produced by Kmerlight are sufficiently accurate to produce reasonable estimates of genome size, and that for most settings our modified version of Kmerlight leads to more accurate genome size estimation than the original Kmerlight. An important avenue for further research is to compare the accuracy CovEst results on real sequencing data.

The use of approximate histograms allows significant reduction in memory; for example for genome size L = 108, coverage c = 50 and error rate e = 0.1 (F0 = 3.6 · 109), Kmerlight can produce an approximate histogram in only 83MB of memory, whereas Jellyfish needs 34Gb.

Note that our modified Kmerlight uses only one of 64 Kmerlight’s levels to compute estimates for all fi, so it naively seems that we could maintain only this level and reduce the memory consumption by a factor of 64. However, our version still uses the full 64 levels to compute Fˆ0, which is needed to find the desired level w+. We could try to estimate Fˆ0 separately, particularly, if two passes over the data are allowed. For example, we could roughly guess F0 from the size of sequencing data and use fewer levels. It might be also sufficient to estimate F0 with smaller hash tables. We did not inquire deeper into this topic, but we believe that by such techniques the memory consumption could be further decreased by a significant factor.

Also note that the k-mer abundance histogram and many algorithms used for its computation can be generalized to a histogram of any input items. Thus the applicability of this topic goes beyond the field of bioinformatics. Acknowledgments. This research was funded by VEGA grants 1/0684/16 (BB) and 1/0719/14 (TV), and a grant from the Slovak Research and Development Agency APVV-14-0253.

[1]

Bar-Yossef ,

T.S.

Jayram ,

Kumar ,

Sivakumar , and

Trevisan . Counting distinct elements in a data stream . In International Workshop on Randomization and Approximation Techniques (RANDOM) , pages 1 - 10 . Springer, 2002 .

[2]

Hozza , T. Vinarˇ, and

Brejová . How Big is That Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra . In String Processing and Information Retrieval (SPIRE) , pages 199 - 209 . Springer, 2015 .

[3]

Kurtz ,

Narechania ,

J. C.

Stein , and

Ware . A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes . BMC Genomics , 9 ( 1 ): 517 , 2008 .

[4]

Marçais and

Kingsford . A fast, lock-free approach for efficient parallel counting of occurrences of k-mers . Bioinformatics , 27 ( 6 ): 764 , 2011 .

[5]

Melsted and

B. V.

Halldórsson . Kmerstream: streaming algorithms for k-mer abundance estimation . Bioinformatics , 30 ( 24 ): 3541 - 3547 , 2014 .

[6]

Melsted and

Pritchard . Efficient counting of kmers in DNA sequences using a bloom filter . BMC Bioinformatics , 12 ( 1 ): 333 , 2011 .

[7]

P. R.

Rider . Variance of the median of small samples from several special populations . Journal of the American Statistical Association , 55 ( 289 ): 148 - 150 , 1960 .

[8]

Sivadasan ,

Srinivasan , and

Goyal . Kmerlight: fast and accurate k-mer abundance estimation . CoRR, abs/1609.05626 , 2016 .

[9]

Williams ,

W. L.

Trimble ,

Shilts , F. Meyer, and

Ochman . Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes . BMC Genomics , 14 ( 1 ): 537 , 2013 .