1. Introduction

Local Enrichment of Genomic Annotations

Jakub Drobný

Broňa Brejová

0 0 Faculty of Mathematics , Physics and Informatics , Comenius University in Bratislava , Slovakia

2025

Genomic annotation identifies functional elements within a genome. Many annotations can be expressed as sets of non-overlapping intervals. To find a putative biological connection between two annotations, one may choose to compute how much their intervals overlap. Even in randomly generated annotations some overlap can be expected, and therefore we need to compare the observed overlap with a suitable null model. Many tools perform such analyses on the whole genome. In this work, we extend one of these algorithms to also identify regions with significant overlap within the genome. Our tool eMCDP is able to perform such analyses eficiently on any set of genomic windows, even overlapping ones. We use it to analyze human genome annotations, successfully identifying significantly enriched and depleted regions. The tool is available at https://github.com/jakubdrobny/e-mcdp.

genome annotation statistical significance Markov model

1. Introduction

In this work, we provide a new tool for computing statistical significance of genome annotation overlaps in individual regions of the genome. A genome is a collection of genetic information of an organism. In bioinformatics, it is represented as a set of sequences over the DNA alphabet {, , , } . The process of genome annotation seeks to identify functional elements or other important positions within a genome. Examples include genes, gene regulatory sequences, regions conserved in evolution, mutations within or between species, binding sites of various proteins, and chemical epigenetic marks on the DNA or associated proteins.

In this article, we represent each annotation (corresponding to one of these functional classes) as a set of non-overlapping intervals of a genomic sequence. To find a potential biological connection between two annotations, we may choose to compute the amount of overlap between the regions they cover and to determine if this overlap is significantly larger or smaller than would be expected by chance in a suitable null model of independent annotations.

Several diferent null models were explored in the literature, providing a tradeof between faithfulness to the observed data and computational eficiency [ 1, 2]. The simplest model characterizes an annotation just by the fraction of the genome it covers, assuming that in the null model each position is covered by the annotation independently with probability [3, 4]. This model allows simple computation of p-values for example by the Fisher exact test, but it does not capture the fact that annotation intervals may be long, making the independence assumption unrealistic. The other extreme are null models, which keep the number and individual lengths of intervals fixed and randomly shufle the intervals along the chromosomes and potentially also permute their order [5, 6, 7]. With a fixed order of intervals, it is possible to compute the p-value by dynamic programming with pseudo-polynomial running time, which is slow in practice [8]. When we allow reordering, the problem of p-value computation becomes NP hard [9], and in practice it is solved by sampling from the model.

Here, we consider a Markov chain null hypothesis [9], which lies between these two extremes. In this hypothesis, one of the annotations is generated by a two-state Markov chain. This model can capture mean lengths of the annotation intervals and gaps between them, while allowing eficient algorithms

CEUR Workshop

ISSN1613-0073

Q: Position for computing or approximating p-values [9, 10]. However, these algorithms compute a single p-value for the whole genome. In this work, we consider computing p-values for individual windows of the genome to explore if enrichment of one annotation within the other happens throughout the whole genome or if it is concentrated in specific regions.

We adapt the original MCDP algorithm [9] for computing the probability mass function of the overlap size under the Markov chain null hypothesis to work on a single window. To compute p-values for a set of windows, we can run this algorithm on each window separately. However, if the input windows overlap, this approach repeatedly recomputes shared parts of the windows. Therefore we also provide two algorithms seeking to avoid this redundancy. The first one splits the genome into sections at window boundaries, computes necessary quantities for each section and then combines sections in each window. The second one uses the segment tree data structure [11] to precompute some joins and reuse them. We implemented these algorithms in a tool called eMCDP and compared their running time with the original MCDP implementation [9]. We also show the efect of window size on the ability of the model to detect significant overlaps. Lastly, we use our software to analyze real annotations of the human genome, identifying several regions of significant enrichment and depletion.

The Genomic HyperBrowser by Sandve et al. [7] also provides users with window-based analysis of statistical significance, but their windows are required to be non-overlapping, and the software does not support the Markov chain null model.

2. Preliminaries and problem definition

Basic notation. To simplify notation, we will consider a single chromosome of length , which we represent as a sequence of nucleotides numbered 0, … , − 1 . We will discuss multiple chromosomes at the end of this section. A genomic interval is a half-open interval [ℓ, ) with 0 ≤ ℓ < ≤ describing a contiguous sequence of positions. An annotation is a set of non-overlapping genomic intervals. Its size || is defined as the number of its intervals and weight(A) denotes the sum of lengths of intervals in .

Colocalization statistics and p-values. In colocalization analysis, we are given two genomic annotations, denoted reference and query . We will measure their overlap by two statistics. The overlap statistic (, ) is defined as the number of intervals in intersecting with at least one interval in . The shared bases statistic (, ) is defined as the sum of the lengths of the intervals in the intersection of and . An example of both statistics can be seen in Figure 1.

To assess if the observed value of a statistic is significant, we will use the null hypothesis testing. We first select a suitable null hypothesis ℋ0, which assumes that there is no true association. Then we calculate the probability of the statistic achieving value or even more extreme than assuming the null hypothesis is true, also known as the p-value. We reject the null hypothesis if the p-value is lower than some predefined threshold , such as = 0.01 . In this case, we assume that the overlap between the two annotations under ℋ0 did not happen due to pure chance, and the two annotations might have an underlying biological connection.

The Markov chain null hypothesis. The Markov chain null hypothesis ℋ 0 keeps the reference annotation fixed and generates query annotation by a two-state Markov chain whose parameters are estimated from the observed query annotation [9]. State 0 of the Markov chain corresponds to nucleotides outside of the query annotation intervals and state 1 to those inside. For example, sequence of states ( 0, 1, 1, 0, 1, 0, 0, 1, 1 )represents annotation {[1, 3), [4, 5), [7, 9). }The Markov chain uses the following transition matrix: = ( 00 01)

10 11 = ( 1 − 01 ||− ℎ()−

|| −ℎ()−(1−) 1 − 10 ) , where is the probability of transitioning from state to state and is a binary indicator with value 1 if the last nucleotide of the chromosome is covered by . Matrix maximizes the likelihood of the query annotation being generated by the Markov chain. It preserves the expected lengths of gaps and intervals [9]. We will set the starting distribution equal to the stationary distribution of the transition matrix as in the MCDP2 software [ 10]. This means that the probability of state 1 is the same at any position of the sequence (but states are not independent).

The MCDP software [ 9] computes the full probability mass function (pmf) of the (, ) statistic under this null hypothesis, i.e. the values Pr ∼ℋ 0 2[)(. , The requir)ed=p]-vfaolrue=is0t,h…en, |c|o m.Iptuutesdesaas dynamic programming algorithm running in time (|| the sum of the tail of this distribution.

The newer version MCDP2 [10] includes several extensions. First, it also computes the pmf for the (, ) statistics by splitting each interval of into adjacent intervals of size 1 and using the algorithm for (, ) . However, this makes the running time quadratic in ℎ() rather than || , which is significantly slower for longer intervals. Second, MCDP2 can approximate the pmf by the normal distribution. The exact mean and variance of this distribution can be computed in the running time linear in || for the (, ) statistic and quasi-linear for the (, ) statistic, allowing fast computation of approximate p-values. Finally, it supports a richer parameterization of the null model, where the genome is segmented by the user into several classes of intervals called contexts and a separate transition matrix is estimated for each context. In this work, we will not consider such context-dependent models, and we will consider only the exact algorithm from MCDP.

Problem statement. Our goal is to provide algorithms for computing the p-values under the Markov chain null hypothesis not only for the genome as a whole but also for a collection of windows to explore local colocalization of and . Each window is a genomic interval = [ℓ, ) . Applying window on annotation results in a new annotation ′ = {[, ) | ∃[, ) ∈ ∶ [, ) = [, ) ∩ ∧ < } . In other words, we remove the intervals that are outside and resize the intervals that are only partially intersecting it. Now we can define the problem we are going to solve in this work.

Problem statement 1. We are given the length of the chromosome , reference annotation , query annotation and windows 0, … , −1 . The goal is to compute for each window the p-value of the overlap ( , ) under ℋ 0 , where annotations and are the result of applying window on annotations and respectively. For enrichment, the p-value is Pr ∼ℋ 0 [ ( , ) ≥ ( , )], and for depletion, it is Pr ∼ℋ 0 [ ( , ) ≤ ( , )]. The null hypothesis ℋ 0 uses the same transition matrix estimated from the entire for all windows.

We can easily adapt the algorithm of MCDP to run on a single window, but our goal is to provide algorithms that work more eficiently for overlapping windows by reusing some computations.

A typical example of overlapping windows are sliding windows defined by a window size and a step resulting in a set of windows {[0, ), [, + ), [2, 2 + ), … , [, + ), [( + 1), )} , where = ⌈( − )/⌉ − 1

. In contrast to non-overlapping bins used in previous work [7], sliding windows can better pinpoint locations exhibiting significant enrichment or depletion of overlaps between annotations. Our algorithms work for arbitrary window sets. In addition to sets with a regular structure, one may consider biologically-motivated window sets, such as topologically associating domains (TADs) [12] which have a hierarchical structure or regions delineated by recombination hotspots [13].

Multiple chromosomes.

Our notation and problem statement assumed that the annotations only contain intervals from a single chromosome, but in many cases, the annotations cover multiple chromosomes. To compute the overall p-value for the whole genome, it is usually assumed that the chromosomes are independent under the null model. The pmfs computed for individual chromosomes can be, therefore, combined by a straightforward convolution [8, 9]. In our problem, each genomic window is located on a single chromosome. Therefore we can apply our algorithms for each set of windows located on a single chromosome to obtain their p-values.

3. Algorithms

In our software, we implemented three algorithms. The naive algorithm runs an adapted version of the original MCDP algorithm for each window separately. The two other algorithms first split the chromosome into disjoint sections, with section boundaries at the starts and ends of all windows. Then we compute some quantities called multiprobs for each section, and we provide a procedure for joining two adjacent sections into a single section. The section-based algorithm then takes sections of each window and combines them sequentially to obtain the pmf for the whole window. The tree-based algorithm always does at most logarithmic number of section joins per window by applying the segment tree data structure [11].

The original MCDP and MCDP2 algorithms.

For completeness, we start by summarizing the original MCDP algorithm [9] with some minor modifications from MCDP2 [ 10]. At its core is a dynamic programming algorithm which gets as inputs a Markov chain transition matrix , the initial state distribution applied at position -1 and the reference annotation = {[ 1, 1), … , [ , )}ordered by interval starts. We will define 0 = 0. Denote the state sequence generated by the Markov chain as 0, … , −1 .

The algorithm computes quantities [, , ] for 0 ≤ ≤ and 0 ≤ ≤ and ∈ {0, 1} . This quantity is defined as the probability that if we run the Markov chain to generate 0, … , −1 (i.e. ifnishing at the end of the -th interval of ), it will intersect exactly intervals of and finish in state .

The base cases are defined for = 0 as [0, 0, ] = and [0, , ] = 0 for > 0 . To simplify the main recurrence, [, −1, ] is also defined as 0. The main recurrence for > 0 is computed as follows: [, , ] =

[ − 1, , 0] ℎ + [ − 1, , 1] ℎ + [ − 1, − 1, 0] + [ − 1, − 1, 1] ℎ (, 1, ), (, 0, ) (, 1, ) ℎ (, 0, ) where ℎ (, , ) is defined as the probability that −1 = and = 0 for all such that ≤ < , hit the -th reference interval. Similarly, ℎ (, , ) is defined as the probability that given that −1 −1 = . This corresponds to the case that the generated sequence of states does not −1 = and case that the generated sequence of states hits the -th reference interval. Quantities ℎ = 1 for at least one such that ≤ < , given that −1 −1 = . This corresponds to the (, , ) and ℎ (, , ) can be computed in ( 1 ) time [9], leading to the overall (|| 2)running time of the whole dynamic programming algorithm. The desired pmf for the whole chromosome is obtained as Pr( (, ) = ) = [, , 0] + [, , 1] .

The MCDP2 algorithm [10] computes the mean and variance of the selected statistic without computing the full pmf. It computes a data structure called plumbus for each reference interval and for the gaps between them. A plumbus stores the mean and variance for all combinations of the starting and ending states of the Markov chain at the endpoints the interval. It also stores the probability distribution of the ending state for diferent values of the starting state. Such plumbuses can be then combined in constant time into plumbuses for longer intervals, leading to overall linear running time for the (, ) statistic. Our multiprobs data structure. We will now introduce the multiprobs data structure inspired by the plumbus data structure from MCDP2, but adapted for computing the full pmf.

Let us now consider a genomic interval = [, ) and let ′ be the annotation obtained by applying window on . The multiprobs data structure for interval is a table ,, for , ∈ {0, 1} and 0 ≤ ≤ | ′|. Value ,, is the probability Pr( ( ′, ) = , −1 = | −1 = ), i.e. the probability of overlapping intervals of ′ and ending in state at position − 1 , given that we started in state at position − 1 .

To compute multiprobs, we run the MCDP algorithm twice, with starting distributions ( 1, 0 )and ( 0, 1 )at position − 1 to compute 0,, and 1,, respectively. We modify the base case to apply to position − 1 instead of -1, but we also need to add computation of the probability of ending in state at position − 1 , whereas MCDP ignores the states of the Markov chain after the end of the last interval, as they do not influence the overlap. In our case, these probabilities are needed to correctly join adjacent intervals. This extension can be easily done in ( 1 ) time. The overall running time is (| ′|2).

To obtain the pmf for window from its multiprobs assuming the Markov chain starts in its stationary distribution , we use the law of total probability to marginalize over the starting state and ending state :

Pr( ( ′, ) = ) = ∑ ∑ ⋅ ,, .

∈{0,1} ∈{0,1} Joining adjacent multiprobs. Consider two adjacent intervals 1 = [, ) and 2 = [, ) . Given multiprobs 1 and 2 of these intervals, we want to compute multiprobs for interval = [, ) . Let ′ be the annotation obtained by applying on and for ∈ {1, 2} let be the annotation obtained by applying on .

First let us assume that no interval of spans the boundary point . This means that for any ′, ( ′, ′) = ( 1, ′) + ( 2, ′). To compute ,, , we marginalize over the state at the boundary point and over 1 which is the value of the statistics in 1.

Thanks to the Markov property, events in window 2 are independent of how the state −1 was reached, given that −1 = . Namely, assuming −1 = , events ( 2, ) = − 1 and −1 = are independent of events ( 1, ) = 1 and −1 = . This allows us to obtain the following equation for ,, : ,, =

min(,| 1|) ∑ ∑ ∈{0,1} 1=max(0,−| 2|)

1 2 ,, 1 ⋅ ,,− 1 ( 1 )

Now let us assume that there is an interval [ℓ, ) ∈ ′ which spans the boundary point between 1 and 2, i.e. ℓ < < . Formula ( 1 ) cannot be directly applied in this case because ( ′, ′)is either ( 1, ′) + ( 2, ′)or ( 1, ′) + ( 2, ′) − 1, depending on whether ′ overlaps with [ℓ, ).

To overcome this issue, we split into three parts: = [, ℓ) , = [ℓ, ) and = [ , ) . Notice that no interval in annotation crosses either of the boundary points ℓ and . We can therefore join , and into by two applications of formula ( 1 ).

As a special case, we will sometimes construct multiprobs for an empty interval = [, ) . In this case, we need to consider only = 0 and we set ,,0 = 1 for = and ,,0 = 0 for ≠ . By joining such empty with an adjacent multiprobs ′ for interval ′ = [, ) ′, and therefore such is an identity element for the operation of joining two multiprobs. or ′ = [, ) , we obtain again

Note that the operation of joining multiprobs has a time complexity of ((| 1| + 1)(| 2| + 1) ).The multiprobs for can be computed in ( 1 ) because it involves only a single interval of .

Sections.

To deal with overlapping input windows, we use section data structure. We split the whole chromosome into intervals at each position that starts or ends one of the windows and build one section for each such interval. For each section, we keep the following data.

• Interval = [, ) that the section describes. • Interval = [ , ) ⊆ that is obtained from by removing the parts covered by the reference intervals spanning the boundaries of , if such intervals exist. If no interval spans the boundaries of , then = . In the special case that the entire is covered by a single interval that also spans both boundaries, will be an empty interval [, ) . We also keep a pair of booleans indicating if a particular boundary of is spanned by an interval. • Annotations ′ and ′ obtained by applying window on and , respectively. • The value of the statistic = ( • Multiprobs for interval .

′, ′).

Given two adjacent sections for intervals 1 = [, ) and 2 = [, ) , we can easily build a section for interval = [, )

. We have already described the process of joining the corresponding multiprobs.

The value of statistic is added from sections 1 and 2 and potentially adjusted for double counting of the interval spanning boundary . Annotations ′, ′ and intervals , can be also easily combined from the two sections.

Algorithms.

The running time of our algorithms can be expressed in terms of parameters = || , = ||

and being the number of input windows.

All three algorithms first sort intervals of in ( log ) time and estimate the parameters of the Markov model from annotation in ()

time.

The naive algorithm runs the original MCDP algorithm for each window separately, which takes (

The section-based and tree-based algorithms instead continue by computing initial sections. Section boundaries are found by sorting 2 endpoints of input windows. We then compute the subsets of annotations and corresponding to each section, which can be done by a linear pass through sorted annotations. Computation of the overlap statistic in all sections also takes linear time. Computing multiprobs for a section with reference intervals takes (( + 1)2)time. The sum sections is at most + 2 , because each section boundary can split a reference interval into two parts. Overall the running time of computing multiprobs for all sections is at most (( + )) because ∑ for all ∑ ( + 1)2 ≤ ( + 1) ∑ ( + 1) ≤ ( + 1)( + 4) .

The section-based algorithm then joins sections of each window sequentially. The total number of joining operations is thus ( 2). However, considering that one joining operation takes (( + 1)) time, the running time of all joining operations for all windows is (( + )) .

The tree-based algorithm instead builds a segment tree [11, 14]. This is a binary tree which has sections at its leaves. Each internal node contains a multiprobs for the union of its two children intervals. The tree has height ( log ) and at each depth it covers the entire chromosome. Therefore multiprobs joining at each depth takes (( + )) time, resulting in (( + ) log ) total time. Then we can express each window as a union of a logarithmic number of sections from the segment tree and join these sections together in (( +

log ))total time for all windows. We perform () join operations for building the tree and ( log ) for computing results for individual windows.

Although the tree-based algorithm does only ( log ) joining operations compared to ( 2)for the section-based one, the overall running time is cubic in the term = + for all three algorithms, because it is dominated by the largest joining operations. 100 1000 Number of reference intervals 10000

4. Experiments

We implemented all three algorithms in a C++ software tool eMCDP available at https://github. com/jakubdrobny/e-mcdp. The reproducibility data is available at https://github.com/jakubdrobny/ e-mcdp-reproducibility. In addition to the three variants of the algorithm, eMCDP can also be used to perform whole-genome analysis as with the previous MCDP software [ 9] and the MCDP2 software [10]. When we compute p-values for multiple windows of a genome, our software also outputs p-values adjusted by the Bonferroni multiple testing correction, where the number of tests is the number of windows in a single run of the tool.

In this section, we evaluate our software in terms of running time and its power to detect enrichments, and we apply it to real annotations of the human genome. All experiments have been run on a laptop with Intel© Core™ i7-9750H CPU at 2.60GHz x 12 and 32 GB DDR4 RAM.

Comparison with the original MCDP implementation. Figure 2 shows the running time comparison of the MCDP [9], MCDP2 [10] and eMCDP with diferent reference annotation sizes. Here, no window set it used, and all three algorithms analyse the whole genome. For this experiment, we used the 01-synthetic data sets from the MCDP2 study [10]. These data sets contain randomly generated reference and query annotations along a single chromosome of length 100Mbp, with all intervals having length 500bp. All query annotations have 50000 intervals, while the size of the reference annotation varies from 10 to 20000. For each reference size, 20 pairs of inputs were averaged. MCDP was run only for inputs with || ≤ 5000 due to its slow running time. We see that eMCDP is orders of magnitude (∼100x) faster than the original MCDP algorithm [9] and for small annotations of sizes up to 5000 is faster than even the linear-time MCDP2 algorithm [10], which does not compute the exact p-values. Comparison of algorithms. Here, we compare the performance of the naive, section-based and tree-based variants with a sliding window size of 1Mbp and a window step of 50kbp, that is, each window is made up of 20 non-overlapping sections. For this experiment, we again used the 01-synthetic data set from the MCDP2 study [10]. We ran the tree-based algorithm on reference annotations of sizes up to 5000 and the naive and section-based variants on reference annotations of sizes up to 20000. Figure 3 shows that the tree-based variant is substantially slower than the other two algorithms. This is likely due to the large overhead of the segment tree which is built for the whole chromosome and thus contains also costly multiprobs for sections longer than the longest window. We can also observe that the section-based variant slightly outperforms the naive variant for large )10.0 s ( e m i t n u r e g a r e v A1.0 naive segment-based tree-based 10 100 1000 Number of reference intervals 10000

Enrichment detection. We now examine the power of eMCDP to detect local colocalizations. We generated annotations along a genome consisting of two chromosomes of length 1Mbp with genome coverage = 0.1 and a fixed interval length ℓ = 100. The annotations were generated dependently along the first chromosome and independently along the second chromosome. To generate an annotation with a coverage and length ℓ, we simply iterated over all chromosomal positions and at each position the probability of a new interval starting is /ℓ . If a new interval was generated, we continued the process from the last position of the newly generated interval instead of the next position. To produce independent annotations, we used this method to generate both annotations.

To generate dependent annotations with a dependency factor ≥ 1 , we first generate . To generate , we again iterate over all positions and at each position we first check if it is covered . If it is, the probability of a new interval starting at this position is multiplied by the dependency factor . At positions not covered by an interval from the reference annotation, the probability of a new interval starting is multiplied by 1/ .

We ran the software with a window size of 1Mbp, therefore covering each chromosome with a single window. We generated pairs of annotations with dependency factors ranging from 1, i.e. the annotations are truly independent to a dependency factor of 2. For each dependency factor we generated 10 pairs of reference and query annotations and took the median p-value of the 10 tests. Figure 4 shows that the significance of the overlap of the annotations along the first chromosome is increasing with the increasing dependency factor. The overlap of the annotations along the second chromosome is not significant, as expected.

Efect of the window size. To examine the efect of the window size on the ability of our software to detect a significant overlap, we generated annotations along a genome consisting of a single chromosome of length 1Mbp with a coverage of 0.1. The annotations were generated independently along most of the chromosome with the exception of the range of positions 400kbp to 600kbp, where the dependency factor was set to 10 to make the overlap of the reference and query annotations really significant. We generated 10 pairs of reference and query annotation and ran our software with diferent window sizes in the range of 10kbp to 800kbp and a window step of 5kbp or 10% of the window size, whichever was smaller. For each window in each test, we plot the median adjusted p-value across all 10 tests.

The results are shown in Figure 5, separately for windows shorter and longer than the target region of length 200kbp. For long windows, our detection ability increases as we decrease the window size to match the size of the region. This is due to the fact that longer windows cover more positions of the genome where the annotations are independent, and thus they have a higher chance of overlap occurring due to pure chance.

If we consider windows shorter than the target region, our ability to detect the significance is the best with window sizes slightly smaller than the size of the region with the significant overlap. Very small windows contain only a small number of intervals, and thus the probability of overlap occurring due to pure chance increases.

A real data set. To evaluate eMCDP on real annotations, we used the Zarrei et al. (2015) [15] data set from the original MCDP study [9]. This data set compares copy number losses (23438 intervals) and exons of the Refseq-annotated genes in the human genome. We consider exons of all non-coding genes (9975 intervals after merging overlaps), exons of all protein coding genes (208558 intervals after merging) and exons of all genes (216813 intervals after merging). The annotation of copy number losses was used as and one of the three gene annotations as . We tested for both enrichment and depletion with sliding windows of size 1Mbp and step 100kbp. We used p-value threshold of 0.01 after Bonferroni multiple testing correction.

Table 1 shows that significant enrichment was only detected when comparing the annotation of copy number losses with the annotation of exons of all non-coding genes. This enrichment was found in two overlapping windows located on chromosome 15, which contain a dense cluster of 76 CD Box snoRNA non-coding genes which overlap 5 loss intervals (see Figure 6). The expected number of overlaps in these windows is only 0.2 with standard deviation 0.4. Both the Zarrei [15] and MCDP [9] studies reported enrichment when comparing the annotations of copy number losses and exons of all non-coding genes. However, the counts in this region are relatively low to be a significant contribution the overall enrichment. We tried increasing the window size to 10Mbp, since there were at most reference 13 intervals present in any of the 1Mbp windows, but that resulted in no windows with significant enrichment.

When analyzing depletion, we detected significant results for almost all chromosomes when comparing the annotations of copy number losses and exons of all protein coding genes. Both the Zarrei [15] and MCDP [9] studies detected significant depletion when comparing these annotations, a finding which our results also support. An example of a depleted region on chromosome 21 is shown in Figure 7. The annotation of exons of all protein coding genes is depleted around position 18Mbp from chromosome start. This regions contains a long loss interval (1.3Mbp). As a result, the expected number of overlaps is 1 with the standard deviation 10−10, but the actual number of overlaps is 0.

5. Conclusion

In this work, we developed three algorithms for colocalization analysis of two genomic annotations in genomic windows. This allows the user to determine if any enrichment or depletion found on the whole-genome level is concentrated in specific genomic regions.

All three algorithms compute the full probability mass function for each window. Although two of the algorithms try to reduce redundant computations in overlapping windows, the running time is dominated by the quadratic term needed to combine the largest sections, and thus these changes do not reduce the running time. Indeed, the algorithm using segment trees was by far the slowest. This can be explained by the algorithm computing the full tree, even long intervals that will never be used and are costly to compute. An improvement would be to compute multiprobs in the tree on demand so that these redundant multiprobs are never computed.

On the other hand, the tree-based approach would be very eficient if we replaced computation of the multiprobs with the plumbuses from MCDP2 [10], which can be joined in constant time. Thus the reduction of the number of join operations from ( 2)to ( log ) would bring time savings. However, normal approximation of p-values computed using plumbuses is appropriate only for data sets with a large number of intervals, so that this approach would be applicable only to very long windows.

Our algorithms can work for any set of input windows. Nonetheless, typical windows required by an user are likely to be sliding windows of constant size and step. In this special case, one could replace the segment tree with sliding window algorithms appropriate for an arbitrary associative operation, Scale 2 Mb hg19 chr21: 15,000,000 15,500,000 16,000,000 16,500,000 17,000,000 17,500,000 18,000,000 18,500,000 19,000,000 19,500,000

Copy number losses CNV losses

Exons of protein coding genes Prot. coding

15.4035 _ Depletion p-value, -log10 such as the De-Amortized Banker’s Aggregator [17], which would require only () joining operations for all windows combined.

Another desired extension of our current implementation would be the addition of context-dependent models from MCDP2. These models can avoid artifacts stemming from assembly gaps, and they can also help to study the influence of confounding factors, such as the GC content.

Acknowledgments

This research was supported in part by the Slovak Grant Agency VEGA grants 1/0538/22 and 1/0140/25 and by the Slovak Research and Development Agency grant APVV-22-0144. The research was also supported by the EU Horizon 2020 programme under the Marie Skłodowska-Curie grant agreement No. 872539 (PANGAIA).

Declaration on Generative AI

During the preparation of this work, the authors used Generative AI tool Gemini 2.5 Pro in Google AI Studio in order to: Grammar and spelling check. After using these tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content. [8] S. Sarmashghi, V. Bafna, Computing the statistical significance of overlap between genome annotations with iStat, Cell Systems 8 (2019) 523–529.e4. [9] A. Gafurov, B. Brejová, P. Medvedev, Markov chains improve the significance computation of overlapping genome annotations, Bioinformatics 38 (2022) i203–i211. [10] A. Gafurov, T. Vinař, P. Medvedev, B. Brejová, Eficient analysis of annotation colocalization accounting for genomic contexts, in: Research in Computational Molecular Biology, Springer Nature Switzerland, Cham, 2024, pp. 38–53. [11] J. L. Bentley, Solutions to Klee’s rectangle problems, Unpublished manuscript (1977) 282–300. [12] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, B. Ren, Topological domains in mammalian genomes identified by analysis of chromatin interactions, Nature 485 (2012) 376–380. [13] S. Myers, L. Bottolo, C. Freeman, G. McVean, P. Donnelly, A fine-scale map of recombination rates and hotspots across the human genome, Science 310 (2005) 321–324. [14] M. De Berg, O. Cheong, M. van Kreveld, M. Overmars, Computational geometry: algorithms and applications, Springer, 2008. [15] M. Zarrei, J. R. MacDonald, D. Merico, S. W. Scherer, A copy number variation map of the human genome, Nature Reviews Genetics 16 (2015) 172–183. [16] G. Perez, G. P. Barber, A. Benet-Pages, et al., The UCSC Genome Browser database: 2025 update,

Nucleic Acids Research 53 (2024) D1243–D1249. [17] K. Tangwongsan, M. Hirzel, S. Schneider, In-order sliding-window aggregation in worst-case constant time, The VLDB Journal 30 (2021) 933–957.

[1]

M. G.

Dozmorov , Epigenomic annotation-based interpretation of genomic data: From enrichment analysis to machine learning , Bioinformatics 33 ( 2017 ) 3323 - 3330 .

[2]

Kanduri ,

Bock ,

Gundersen , E. Hovig,

G. K.

Sandve , Colocalization analyses of genomic elements: Approaches, recommendations and challenges , Bioinformatics 35 ( 2019 ) 1615 - 1624 .

[3]

M. G.

Dozmorov ,

L. R.

Cara ,

C. B.

Giles ,

J. D.

Wren , GenomeRunner web server: regulatory similarity and diferences define the functional impact of SNP sets , Bioinformatics 32 ( 2016 ) 2256 - 2263 .

[4]

N. C.

Shefield , C. Bock, LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor , Bioinformatics 32 ( 2016 ) 587 - 589 .

[5]

Gel ,

Diez-Villanueva ,

Serra ,

Buschbeck ,

M. A.

Peinado ,

Malinverni , regioneR: an R/ Bioconductor package for the association analysis of genomic regions based on permutation tests , Bioinformatics 32 ( 2016 ) 289 - 291 .

[6]

Heger ,

Webber ,

Goodson ,

C. P.

Ponting , G. Lunter, GAT: a simulation framework for testing the association of genomic intervals , Bioinformatics 29 ( 2013 ) 2046 - 2048 .

[7]

G. K.

Sandve ,

Gundersen ,

Rydbeck , et al., The genomic HyperBrowser: Inferential genomics at the sequence level , Genome Biology 11 ( 2010 ).