<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Local Enrichment of Genomic Annotations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jakub Drobný</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Broňa Brejová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics and Informatics</addr-line>
          ,
          <institution>Comenius University in Bratislava</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Genomic annotation identifies functional elements within a genome. Many annotations can be expressed as sets of non-overlapping intervals. To find a putative biological connection between two annotations, one may choose to compute how much their intervals overlap. Even in randomly generated annotations some overlap can be expected, and therefore we need to compare the observed overlap with a suitable null model. Many tools perform such analyses on the whole genome. In this work, we extend one of these algorithms to also identify regions with significant overlap within the genome. Our tool eMCDP is able to perform such analyses eficiently on any set of genomic windows, even overlapping ones. We use it to analyze human genome annotations, successfully identifying significantly enriched and depleted regions. The tool is available at https://github.com/jakubdrobny/e-mcdp.</p>
      </abstract>
      <kwd-group>
        <kwd>genome annotation</kwd>
        <kwd>statistical significance</kwd>
        <kwd>Markov model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this work, we provide a new tool for computing statistical significance of genome annotation overlaps
in individual regions of the genome. A genome is a collection of genetic information of an organism. In
bioinformatics, it is represented as a set of sequences over the DNA alphabet {, , ,  }
. The process of
genome annotation seeks to identify functional elements or other important positions within a genome.
Examples include genes, gene regulatory sequences, regions conserved in evolution, mutations within
or between species, binding sites of various proteins, and chemical epigenetic marks on the DNA or
associated proteins.</p>
      <p>In this article, we represent each annotation (corresponding to one of these functional classes) as a set
of non-overlapping intervals of a genomic sequence. To find a potential biological connection between
two annotations, we may choose to compute the amount of overlap between the regions they cover and
to determine if this overlap is significantly larger or smaller than would be expected by chance in a
suitable null model of independent annotations.</p>
      <p>Several diferent null models were explored in the literature, providing a tradeof between faithfulness
to the observed data and computational eficiency [ 1, 2]. The simplest model characterizes an annotation
just by the fraction  of the genome it covers, assuming that in the null model each position is covered
by the annotation independently with probability  [3, 4]. This model allows simple computation of
p-values for example by the Fisher exact test, but it does not capture the fact that annotation intervals
may be long, making the independence assumption unrealistic. The other extreme are null models,
which keep the number and individual lengths of intervals fixed and randomly shufle the intervals
along the chromosomes and potentially also permute their order [5, 6, 7]. With a fixed order of intervals,
it is possible to compute the p-value by dynamic programming with pseudo-polynomial running time,
which is slow in practice [8]. When we allow reordering, the problem of p-value computation becomes
NP hard [9], and in practice it is solved by sampling from the model.</p>
      <p>Here, we consider a Markov chain null hypothesis [9], which lies between these two extremes. In this
hypothesis, one of the annotations is generated by a two-state Markov chain. This model can capture
mean lengths of the annotation intervals and gaps between them, while allowing eficient algorithms</p>
      <p>CEUR
Workshop</p>
      <p>ISSN1613-0073</p>
      <p>R:</p>
      <p>Q:
Position
for computing or approximating p-values [9, 10]. However, these algorithms compute a single p-value
for the whole genome. In this work, we consider computing p-values for individual windows of the
genome to explore if enrichment of one annotation within the other happens throughout the whole
genome or if it is concentrated in specific regions.</p>
      <p>We adapt the original MCDP algorithm [9] for computing the probability mass function of the overlap
size under the Markov chain null hypothesis to work on a single window. To compute p-values for a
set of windows, we can run this algorithm on each window separately. However, if the input windows
overlap, this approach repeatedly recomputes shared parts of the windows. Therefore we also provide
two algorithms seeking to avoid this redundancy. The first one splits the genome into sections at
window boundaries, computes necessary quantities for each section and then combines sections in
each window. The second one uses the segment tree data structure [11] to precompute some joins and
reuse them. We implemented these algorithms in a tool called eMCDP and compared their running
time with the original MCDP implementation [9]. We also show the efect of window size on the ability
of the model to detect significant overlaps. Lastly, we use our software to analyze real annotations of
the human genome, identifying several regions of significant enrichment and depletion.</p>
      <p>The Genomic HyperBrowser by Sandve et al. [7] also provides users with window-based analysis of
statistical significance, but their windows are required to be non-overlapping, and the software does
not support the Markov chain null model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries and problem definition</title>
      <p>Basic notation. To simplify notation, we will consider a single chromosome of length  , which we
represent as a sequence of nucleotides numbered 0, … ,  − 1 . We will discuss multiple chromosomes at
the end of this section. A genomic interval is a half-open interval [ℓ,  ) with 0 ≤ ℓ &lt;  ≤  describing a
contiguous sequence of positions. An annotation  is a set of non-overlapping genomic intervals. Its
size || is defined as the number of its intervals and weight(A) denotes the sum of lengths of intervals
in  .</p>
      <p>Colocalization statistics and p-values. In colocalization analysis, we are given two genomic
annotations, denoted reference  and query  . We will measure their overlap by two statistics. The
overlap statistic  (, ) is defined as the number of intervals in  intersecting with at least one interval
in  . The shared bases statistic (, ) is defined as the sum of the lengths of the intervals in the
intersection of  and  . An example of both statistics can be seen in Figure 1.</p>
      <p>To assess if the observed value  of a statistic is significant, we will use the null hypothesis testing.
We first select a suitable null hypothesis ℋ0, which assumes that there is no true association. Then we
calculate the probability of the statistic achieving value  or even more extreme than  assuming the
null hypothesis is true, also known as the p-value. We reject the null hypothesis if the p-value is lower
than some predefined threshold  , such as  = 0.01 . In this case, we assume that the overlap between
the two annotations under ℋ0 did not happen due to pure chance, and the two annotations might have
an underlying biological connection.</p>
      <p>
        The Markov chain null hypothesis. The Markov chain null hypothesis ℋ 0 keeps the reference
annotation  fixed and generates query annotation    by a two-state Markov chain whose
parameters are estimated from the observed query annotation  [9]. State 0 of the Markov chain corresponds to
nucleotides outside of the query annotation intervals and state 1 to those inside. For example, sequence
of states (
        <xref ref-type="bibr" rid="ref1 ref1 ref1 ref1 ref1">0, 1, 1, 0, 1, 0, 0, 1, 1</xref>
        )represents annotation {[1, 3), [4, 5), [7, 9). }The Markov chain uses the
following transition matrix:
 
=
( 00  01)
      </p>
      <p>10  11
= (
1 −  01
||−
ℎ()−</p>
      <p>||
−ℎ()−(1−)
1 −  10
) ,
where   is the probability of transitioning from state  to state  and  is a binary indicator with value 1
if the last nucleotide of the chromosome is covered by  . Matrix   maximizes the likelihood of the
query annotation  being generated by the Markov chain. It preserves the expected lengths of gaps and
intervals [9]. We will set the starting distribution equal to the stationary distribution of the transition
matrix   as in the MCDP2 software [ 10]. This means that the probability of state 1 is the same at any
position of the sequence (but states are not independent).</p>
      <p>The MCDP software [ 9] computes the full probability mass function (pmf) of the  (, ) statistic
under this null hypothesis, i.e. the values Pr  ∼ℋ 0 2[)(. , The  requir)ed=p]-vfaolrue=is0t,h…en, |c|o m.Iptuutesdesaas
dynamic programming algorithm running in time  (||
the sum of the tail of this distribution.</p>
      <p>The newer version MCDP2 [10] includes several extensions. First, it also computes the pmf for the
(, ) statistics by splitting each interval of  into adjacent intervals of size 1 and using the algorithm
for  (, ) . However, this makes the running time quadratic in  ℎ() rather than || , which is
significantly slower for longer intervals. Second, MCDP2 can approximate the pmf by the normal
distribution. The exact mean and variance of this distribution can be computed in the running time
linear in || for the  (, ) statistic and quasi-linear for the (, ) statistic, allowing fast computation
of approximate p-values. Finally, it supports a richer parameterization of the null model, where the
genome is segmented by the user into several classes of intervals called contexts and a separate transition
matrix is estimated for each context. In this work, we will not consider such context-dependent models,
and we will consider only the exact algorithm from MCDP.</p>
      <p>Problem statement. Our goal is to provide algorithms for computing the p-values under the Markov
chain null hypothesis not only for the genome as a whole but also for a collection of windows to explore
local colocalization of  and  . Each window is a genomic interval  = [ℓ,  ) . Applying window  on
annotation  results in a new annotation  ′ = {[,  ) | ∃[, ) ∈  ∶ [,  ) = [, ) ∩  ∧  &lt;  } . In
other words, we remove the intervals that are outside  and resize the intervals that are only partially
intersecting it. Now we can define the problem we are going to solve in this work.</p>
      <p>Problem statement 1. We are given the length of the chromosome  , reference annotation  , query
annotation  and windows  0, … ,  −1 . The goal is to compute for each window   the p-value of the
overlap  (  ,   ) under ℋ 0 , where annotations   and   are the result of applying window   on
annotations  and  respectively. For enrichment, the p-value is Pr  ∼ℋ 0 [ (  ,    ) ≥  (  ,   )],
and for depletion, it is Pr  ∼ℋ 0 [ (  ,    ) ≤  (  ,   )]. The null hypothesis ℋ 0 uses the same
transition matrix   estimated from the entire  for all windows.</p>
      <p>We can easily adapt the algorithm of MCDP to run on a single window, but our goal is to provide
algorithms that work more eficiently for overlapping windows by reusing some computations.</p>
      <p>A typical example of overlapping windows are sliding windows defined by a window size  and a step
 resulting in a set of windows {[0,  ), [,  +  ), [2, 2 +  ), … , [,  +  ), [( + 1), )}
, where
 = ⌈( −  )/⌉ − 1</p>
      <p>. In contrast to non-overlapping bins used in previous work [7], sliding windows can
better pinpoint locations exhibiting significant enrichment or depletion of overlaps between annotations.
Our algorithms work for arbitrary window sets. In addition to sets with a regular structure, one may
consider biologically-motivated window sets, such as topologically associating domains (TADs) [12]
which have a hierarchical structure or regions delineated by recombination hotspots [13].</p>
      <sec id="sec-2-1">
        <title>Multiple chromosomes.</title>
        <p>Our notation and problem statement assumed that the annotations only
contain intervals from a single chromosome, but in many cases, the annotations cover multiple
chromosomes. To compute the overall p-value for the whole genome, it is usually assumed that the
chromosomes are independent under the null model. The pmfs computed for individual chromosomes
can be, therefore, combined by a straightforward convolution [8, 9]. In our problem, each genomic
window is located on a single chromosome. Therefore we can apply our algorithms for each set of
windows located on a single chromosome to obtain their p-values.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Algorithms</title>
      <p>In our software, we implemented three algorithms. The naive algorithm runs an adapted version of
the original MCDP algorithm for each window separately. The two other algorithms first split the
chromosome into disjoint sections, with section boundaries at the starts and ends of all windows. Then
we compute some quantities called multiprobs for each section, and we provide a procedure for joining
two adjacent sections into a single section. The section-based algorithm then takes sections of each
window and combines them sequentially to obtain the pmf for the whole window. The tree-based
algorithm always does at most logarithmic number of section joins per window by applying the segment
tree data structure [11].</p>
      <sec id="sec-3-1">
        <title>The original MCDP and MCDP2 algorithms.</title>
        <p>For completeness, we start by summarizing the
original MCDP algorithm [9] with some minor modifications from MCDP2 [ 10]. At its core is a dynamic
programming algorithm which gets as inputs a Markov chain transition matrix  , the initial state
distribution  applied at position -1 and the reference annotation  = {[ 1,  1), … , [  ,   )}ordered by
interval starts. We will define  0 = 0. Denote the state sequence generated by the Markov chain as
 0, … ,  −1 .</p>
        <p>The algorithm computes quantities  
[, , ]
for 0 ≤  ≤  and 0 ≤  ≤ 
and  ∈ {0, 1} . This
quantity is defined as the probability that if we run the Markov chain to generate  0, … ,    −1 (i.e.
ifnishing at the end of the  -th interval of  ), it will intersect exactly  intervals of  and finish in state  .</p>
        <p>The base cases are defined for  = 0 as   [0, 0, ] =   and  
[0, , ] = 0
for  &gt; 0 . To simplify the
main recurrence,   [, −1, ] is also defined as 0. The main recurrence for  &gt; 0 is computed as follows:
  [, , ] =</p>
        <p>
          [ − 1, , 0] ℎ
+   [ − 1, , 1] ℎ
+   [ − 1,  − 1, 0]
+   [ − 1,  − 1, 1] ℎ (, 1, ),
(, 0, )
(, 1, )
ℎ (, 0, )
where  ℎ
(, , ) is defined as the probability that 
  −1 =  and   = 0 for all  such that   ≤  &lt;   ,
hit the  -th reference interval. Similarly,  ℎ (, , ) is defined as the probability that 
given that   −1 −1 =  . This corresponds to the case that the generated sequence of states does not
  −1 =  and

case that the generated sequence of states hits the  -th reference interval. Quantities  ℎ
 = 1 for at least one  such that   ≤  &lt;   , given that   −1 −1 =  . This corresponds to the
(, , )
and  ℎ (, , ) can be computed in  (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) time [9], leading to the overall  (|| 2)running time of the
whole dynamic programming algorithm. The desired pmf for the whole chromosome is obtained as
Pr( (,    ) = ) =   [, , 0] +   [, , 1] .
        </p>
        <p>The MCDP2 algorithm [10] computes the mean and variance of the selected statistic without
computing the full pmf. It computes a data structure called plumbus for each reference interval and for the gaps
between them. A plumbus stores the mean and variance for all combinations of the starting and ending
states of the Markov chain at the endpoints the interval. It also stores the probability distribution of the
ending state for diferent values of the starting state. Such plumbuses can be then combined in constant
time into plumbuses for longer intervals, leading to overall linear running time for the  (, ) statistic.
Our multiprobs data structure. We will now introduce the multiprobs data structure inspired by
the plumbus data structure from MCDP2, but adapted for computing the full pmf.</p>
        <p>Let us now consider a genomic interval  = [, ) and let  ′ be the annotation obtained by applying
window  on  . The multiprobs data structure for interval  is a table  ,, for ,  ∈ {0, 1} and
0 ≤  ≤ | ′|. Value  ,, is the probability Pr( ( ′,    ) = ,  −1 =  |  −1 = ), i.e. the probability
of    overlapping  intervals of  ′ and ending in state  at position  − 1 , given that we started in
state  at position  − 1 .</p>
        <p>
          To compute multiprobs, we run the MCDP algorithm twice, with starting distributions (
          <xref ref-type="bibr" rid="ref1">1, 0</xref>
          )and
(
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          )at position  − 1 to compute  0,, and  1,, respectively. We modify the base case to apply to
position  − 1 instead of -1, but we also need to add computation of the probability of ending in state  at
position  − 1 , whereas MCDP ignores the states of the Markov chain after the end of the last interval, as
they do not influence the overlap. In our case, these probabilities are needed to correctly join adjacent
intervals. This extension can be easily done in  (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) time. The overall running time is  (| ′|2).
        </p>
        <p>To obtain the pmf for window  from its multiprobs   assuming the Markov chain starts in its
stationary distribution  , we use the law of total probability to marginalize over the starting state  and
ending state  :</p>
        <p>Pr( ( ′,   
) = ) = ∑ ∑   ⋅  ,, .</p>
        <p>∈{0,1} ∈{0,1}
Joining adjacent multiprobs. Consider two adjacent intervals  1 = [, ) and  2 = [, ) . Given
multiprobs   1 and   2 of these intervals, we want to compute multiprobs   for interval  = [, ) .
Let  ′ be the annotation obtained by applying  on  and for  ∈ {1, 2} let   be the annotation obtained
by applying   on  .</p>
        <p>First let us assume that no interval of  spans the boundary point  . This means that for any  ′,
 ( ′,  ′) =  ( 1,  ′) +  ( 2,  ′). To compute  ,, , we marginalize over the state  at the boundary
point  and over  1 which is the value of the statistics in  1.</p>
        <p>Thanks to the Markov property, events in window  2 are independent of how the state  −1 was
reached, given that  −1 =  . Namely, assuming  −1 =  , events  ( 2,    ) =  −  1 and  −1 = 
are independent of events  ( 1,    ) =  1 and  −1 =  . This allows us to obtain the following
equation for  ,, :
 ,, =</p>
        <p>min(,| 1|)
∑ ∑
∈{0,1}  1=max(0,−| 2|)</p>
        <p>
          1  2
 ,, 1 ⋅  ,,− 1
(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
        </p>
        <p>
          Now let us assume that there is an interval [ℓ,  ) ∈  ′ which spans the boundary point  between
 1 and  2, i.e. ℓ &lt;  &lt;  . Formula (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) cannot be directly applied in this case because  ( ′,  ′)is either
 ( 1,  ′) +  ( 2,  ′)or  ( 1,  ′) +  ( 2,  ′) − 1, depending on whether  ′ overlaps with [ℓ,  ).
        </p>
        <p>
          To overcome this issue, we split  into three parts:   = [, ℓ) ,   = [ℓ,  ) and   = [ , ) . Notice that
no interval in annotation  crosses either of the boundary points ℓ and  . We can therefore join    ,
   and    into   by two applications of formula (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ).
        </p>
        <p>As a special case, we will sometimes construct multiprobs for an empty interval  = [, ) . In this
case, we need to consider only  = 0 and we set  ,,0 = 1 for  =  and  ,,0 = 0 for  ≠  . By joining
such empty   with an adjacent multiprobs   ′ for interval  ′ = [, )
  ′, and therefore such   is an identity element for the operation of joining two multiprobs.
or  ′ = [, ) , we obtain again</p>
        <p>
          Note that the operation of joining multiprobs has a time complexity of  ((| 1| + 1)(| 2| + 1) ).The
multiprobs for   can be computed in  (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) because it involves only a single interval of  .
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Sections.</title>
        <p>To deal with overlapping input windows, we use section data structure. We split the whole
chromosome into intervals at each position that starts or ends one of the windows and build one section
for each such interval. For each section, we keep the following data.</p>
        <p>• Interval  = [, ) that the section describes.
• Interval  = [  ,   ) ⊆  that is obtained from  by removing the parts covered by the reference
intervals spanning the boundaries of  , if such intervals exist. If no interval spans the boundaries
of  , then  =  . In the special case that the entire  is covered by a single interval that also spans
both boundaries,  will be an empty interval [, ) . We also keep a pair of booleans indicating if a
particular boundary of  is spanned by an interval.
• Annotations  ′ and  ′ obtained by applying window  on  and  , respectively.
• The value of the statistic  =  (
• Multiprobs for interval  .</p>
        <p>′,  ′).</p>
        <p>Given two adjacent sections for intervals  1 = [, ) and  2 = [, ) , we can easily build a section
for interval  = [, )</p>
        <p>. We have already described the process of joining the corresponding multiprobs.</p>
        <p>The value of statistic  is added from sections  1 and  2 and potentially adjusted for double counting of
the interval spanning boundary  . Annotations  ′,  ′ and intervals  ,  can be also easily combined
from the two sections.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Algorithms.</title>
        <p>The running time of our algorithms can be expressed in terms of parameters  = || ,
 = ||</p>
        <p>and  being the number of input windows.</p>
        <p>All three algorithms first sort intervals of  in  ( log ) time and estimate the parameters of the
Markov model from annotation  in  ()</p>
        <p>time.</p>
        <p>The naive algorithm runs the original MCDP algorithm for each window separately, which takes
 (</p>
        <p>The section-based and tree-based algorithms instead continue by computing initial sections.
Section boundaries are found by sorting 2 endpoints of input windows. We then compute the subsets of
annotations  and  corresponding to each section, which can be done by a linear pass through sorted
annotations. Computation of the overlap statistic in all sections also takes linear time. Computing
multiprobs for a section with   reference intervals takes  ((  + 1)2)time. The sum
sections is at most  + 2 , because each section boundary can split a reference interval into two parts.
Overall the running time of computing multiprobs for all sections is at most  (( + ))
because
∑  for all

∑ (  + 1)2 ≤ ( + 1) ∑ (  + 1) ≤ ( + 1)( + 4) .</p>
        <p>The section-based algorithm then joins sections of each window sequentially. The total number of
joining operations is thus  ( 2). However, considering that one joining operation takes  ((  + 1))
time, the running time of all joining operations for all windows is  (( + ))
.</p>
        <p>The tree-based algorithm instead builds a segment tree [11, 14]. This is a binary tree which has
sections at its leaves. Each internal node contains a multiprobs for the union of its two children intervals.
The tree has height  ( log ) and at each depth it covers the entire chromosome. Therefore multiprobs
joining at each depth takes  (( + ))
time, resulting in  (( + )
log ) total time. Then we can
express each window as a union of a logarithmic number of sections from the segment tree and join
these sections together in  (( +</p>
        <p>log ))total time for all windows. We perform  () join operations
for building the tree and  ( log ) for computing results for individual windows.</p>
        <p>Although the tree-based algorithm does only  ( log ) joining operations compared to  ( 2)for
the section-based one, the overall running time is cubic in the term  =  + 
for all three algorithms,
because it is dominated by the largest joining operations.
100 1000
Number of reference intervals
10000</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We implemented all three algorithms in a C++ software tool eMCDP available at https://github.
com/jakubdrobny/e-mcdp. The reproducibility data is available at https://github.com/jakubdrobny/
e-mcdp-reproducibility. In addition to the three variants of the algorithm, eMCDP can also be used to
perform whole-genome analysis as with the previous MCDP software [ 9] and the MCDP2 software
[10]. When we compute p-values for multiple windows of a genome, our software also outputs p-values
adjusted by the Bonferroni multiple testing correction, where the number of tests is the number of
windows in a single run of the tool.</p>
      <p>In this section, we evaluate our software in terms of running time and its power to detect enrichments,
and we apply it to real annotations of the human genome. All experiments have been run on a laptop
with Intel© Core™ i7-9750H CPU at 2.60GHz x 12 and 32 GB DDR4 RAM.</p>
      <p>Comparison with the original MCDP implementation. Figure 2 shows the running time
comparison of the MCDP [9], MCDP2 [10] and eMCDP with diferent reference annotation sizes. Here, no
window set it used, and all three algorithms analyse the whole genome. For this experiment, we used
the 01-synthetic data sets from the MCDP2 study [10]. These data sets contain randomly generated
reference and query annotations along a single chromosome of length 100Mbp, with all intervals having
length 500bp. All query annotations have 50000 intervals, while the size of the reference annotation
varies from 10 to 20000. For each reference size, 20 pairs of inputs were averaged. MCDP was run only
for inputs with || ≤ 5000 due to its slow running time. We see that eMCDP is orders of magnitude
(∼100x) faster than the original MCDP algorithm [9] and for small annotations of sizes up to 5000 is
faster than even the linear-time MCDP2 algorithm [10], which does not compute the exact p-values.
Comparison of algorithms. Here, we compare the performance of the naive, section-based
and tree-based variants with a sliding window size of 1Mbp and a window step of 50kbp, that is,
each window is made up of 20 non-overlapping sections. For this experiment, we again used the
01-synthetic data set from the MCDP2 study [10]. We ran the tree-based algorithm on reference
annotations of sizes up to 5000 and the naive and section-based variants on reference annotations of
sizes up to 20000. Figure 3 shows that the tree-based variant is substantially slower than the other
two algorithms. This is likely due to the large overhead of the segment tree which is built for the whole
chromosome and thus contains also costly multiprobs for sections longer than the longest window.
We can also observe that the section-based variant slightly outperforms the naive variant for large
)10.0
s
(
e
m
i
t
n
u
r
e
g
a
r
e
v
A1.0
naive
segment-based
tree-based
10
100 1000
Number of reference intervals
10000</p>
      <p>Enrichment detection. We now examine the power of eMCDP to detect local colocalizations. We
generated annotations along a genome consisting of two chromosomes of length 1Mbp with genome
coverage  = 0.1 and a fixed interval length ℓ = 100. The annotations were generated dependently along
the first chromosome and independently along the second chromosome. To generate an annotation
with a coverage  and length ℓ, we simply iterated over all chromosomal positions and at each position
the probability of a new interval starting is /ℓ . If a new interval was generated, we continued the
process from the last position of the newly generated interval instead of the next position. To produce
independent annotations, we used this method to generate both annotations.</p>
      <p>To generate dependent annotations with a dependency factor  ≥ 1 , we first generate  . To generate
 , we again iterate over all positions and at each position we first check if it is covered  . If it is, the
probability of a new interval starting at this position is multiplied by the dependency factor  . At
positions not covered by an interval from the reference annotation, the probability of a new interval
starting is multiplied by 1/ .</p>
      <p>We ran the software with a window size of 1Mbp, therefore covering each chromosome with a single
window. We generated pairs of annotations with dependency factors ranging from 1, i.e. the annotations
are truly independent to a dependency factor of 2. For each dependency factor we generated 10 pairs
of reference and query annotations and took the median p-value of the 10 tests. Figure 4 shows that
the significance of the overlap of the annotations along the first chromosome is increasing with the
increasing dependency factor. The overlap of the annotations along the second chromosome is not
significant, as expected.</p>
      <p>Efect of the window size. To examine the efect of the window size on the ability of our software to
detect a significant overlap, we generated annotations along a genome consisting of a single chromosome
of length 1Mbp with a coverage of 0.1. The annotations were generated independently along most of
the chromosome with the exception of the range of positions 400kbp to 600kbp, where the dependency
factor  was set to 10 to make the overlap of the reference and query annotations really significant. We
generated 10 pairs of reference and query annotation and ran our software with diferent window sizes
in the range of 10kbp to 800kbp and a window step of 5kbp or 10% of the window size, whichever was
smaller. For each window in each test, we plot the median adjusted p-value across all 10 tests.</p>
      <p>The results are shown in Figure 5, separately for windows shorter and longer than the target region
of length 200kbp. For long windows, our detection ability increases as we decrease the window size
to match the size of the region. This is due to the fact that longer windows cover more positions of
the genome where the annotations are independent, and thus they have a higher chance of overlap
occurring due to pure chance.</p>
      <p>If we consider windows shorter than the target region, our ability to detect the significance is the
best with window sizes slightly smaller than the size of the region with the significant overlap. Very
small windows contain only a small number of intervals, and thus the probability of overlap occurring
due to pure chance increases.</p>
      <p>A real data set. To evaluate eMCDP on real annotations, we used the Zarrei et al. (2015) [15] data
set from the original MCDP study [9]. This data set compares copy number losses (23438 intervals)
and exons of the Refseq-annotated genes in the human genome. We consider exons of all non-coding
genes (9975 intervals after merging overlaps), exons of all protein coding genes (208558 intervals after
merging) and exons of all genes (216813 intervals after merging). The annotation of copy number losses
was used as  and one of the three gene annotations as  . We tested for both enrichment and depletion
with sliding windows of size 1Mbp and step 100kbp. We used p-value threshold of 0.01 after Bonferroni
multiple testing correction.</p>
      <p>Table 1 shows that significant enrichment was only detected when comparing the annotation of
copy number losses with the annotation of exons of all non-coding genes. This enrichment was found
in two overlapping windows located on chromosome 15, which contain a dense cluster of 76 CD Box
snoRNA non-coding genes which overlap 5 loss intervals (see Figure 6). The expected number of
overlaps in these windows is only 0.2 with standard deviation 0.4. Both the Zarrei [15] and MCDP
[9] studies reported enrichment when comparing the annotations of copy number losses and exons
of all non-coding genes. However, the counts in this region are relatively low to be a significant
contribution the overall enrichment. We tried increasing the window size to 10Mbp, since there were at
most reference 13 intervals present in any of the 1Mbp windows, but that resulted in no windows with
significant enrichment.</p>
      <p>When analyzing depletion, we detected significant results for almost all chromosomes when
comparing the annotations of copy number losses and exons of all protein coding genes. Both the Zarrei [15]
and MCDP [9] studies detected significant depletion when comparing these annotations, a finding which
our results also support. An example of a depleted region on chromosome 21 is shown in Figure 7. The
annotation of exons of all protein coding genes is depleted around position 18Mbp from chromosome
start. This regions contains a long loss interval (1.3Mbp). As a result, the expected number of overlaps
is 1 with the standard deviation 10−10, but the actual number of overlaps is 0.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we developed three algorithms for colocalization analysis of two genomic annotations
in genomic windows. This allows the user to determine if any enrichment or depletion found on the
whole-genome level is concentrated in specific genomic regions.</p>
      <p>All three algorithms compute the full probability mass function for each window. Although two
of the algorithms try to reduce redundant computations in overlapping windows, the running time is
dominated by the quadratic term needed to combine the largest sections, and thus these changes do not
reduce the running time. Indeed, the algorithm using segment trees was by far the slowest. This can be
explained by the algorithm computing the full tree, even long intervals that will never be used and are
costly to compute. An improvement would be to compute multiprobs in the tree on demand so that
these redundant multiprobs are never computed.</p>
      <p>On the other hand, the tree-based approach would be very eficient if we replaced computation of
the multiprobs with the plumbuses from MCDP2 [10], which can be joined in constant time. Thus the
reduction of the number of join operations from  ( 2)to  ( log ) would bring time savings. However,
normal approximation of p-values computed using plumbuses is appropriate only for data sets with a
large number of intervals, so that this approach would be applicable only to very long windows.</p>
      <p>Our algorithms can work for any set of input windows. Nonetheless, typical windows required by an
user are likely to be sliding windows of constant size and step. In this special case, one could replace
the segment tree with sliding window algorithms appropriate for an arbitrary associative operation,
Scale 2 Mb hg19
chr21: 15,000,000 15,500,000 16,000,000 16,500,000 17,000,000 17,500,000 18,000,000 18,500,000 19,000,000 19,500,000</p>
      <p>Copy number losses
CNV losses</p>
      <p>Exons of protein coding genes
Prot. coding</p>
      <p>15.4035 _ Depletion p-value, -log10
such as the De-Amortized Banker’s Aggregator [17], which would require only  () joining operations
for all windows combined.</p>
      <p>Another desired extension of our current implementation would be the addition of context-dependent
models from MCDP2. These models can avoid artifacts stemming from assembly gaps, and they can
also help to study the influence of confounding factors, such as the GC content.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported in part by the Slovak Grant Agency VEGA grants 1/0538/22 and 1/0140/25
and by the Slovak Research and Development Agency grant APVV-22-0144. The research was also
supported by the EU Horizon 2020 programme under the Marie Skłodowska-Curie grant agreement No.
872539 (PANGAIA).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Generative AI tool Gemini 2.5 Pro in Google
AI Studio in order to: Grammar and spelling check. After using these tool, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.
[8] S. Sarmashghi, V. Bafna, Computing the statistical significance of overlap between genome
annotations with iStat, Cell Systems 8 (2019) 523–529.e4.
[9] A. Gafurov, B. Brejová, P. Medvedev, Markov chains improve the significance computation of
overlapping genome annotations, Bioinformatics 38 (2022) i203–i211.
[10] A. Gafurov, T. Vinař, P. Medvedev, B. Brejová, Eficient analysis of annotation colocalization
accounting for genomic contexts, in: Research in Computational Molecular Biology, Springer
Nature Switzerland, Cham, 2024, pp. 38–53.
[11] J. L. Bentley, Solutions to Klee’s rectangle problems, Unpublished manuscript (1977) 282–300.
[12] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, B. Ren, Topological domains in
mammalian genomes identified by analysis of chromatin interactions, Nature 485 (2012) 376–380.
[13] S. Myers, L. Bottolo, C. Freeman, G. McVean, P. Donnelly, A fine-scale map of recombination rates
and hotspots across the human genome, Science 310 (2005) 321–324.
[14] M. De Berg, O. Cheong, M. van Kreveld, M. Overmars, Computational geometry: algorithms and
applications, Springer, 2008.
[15] M. Zarrei, J. R. MacDonald, D. Merico, S. W. Scherer, A copy number variation map of the human
genome, Nature Reviews Genetics 16 (2015) 172–183.
[16] G. Perez, G. P. Barber, A. Benet-Pages, et al., The UCSC Genome Browser database: 2025 update,</p>
      <p>Nucleic Acids Research 53 (2024) D1243–D1249.
[17] K. Tangwongsan, M. Hirzel, S. Schneider, In-order sliding-window aggregation in worst-case
constant time, The VLDB Journal 30 (2021) 933–957.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Dozmorov</surname>
          </string-name>
          ,
          <article-title>Epigenomic annotation-based interpretation of genomic data: From enrichment analysis to machine learning</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>33</volume>
          (
          <year>2017</year>
          )
          <fpage>3323</fpage>
          -
          <lpage>3330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kanduri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gundersen</surname>
          </string-name>
          , E. Hovig,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Sandve</surname>
          </string-name>
          ,
          <article-title>Colocalization analyses of genomic elements: Approaches, recommendations and challenges</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>35</volume>
          (
          <year>2019</year>
          )
          <fpage>1615</fpage>
          -
          <lpage>1624</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Dozmorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Cara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Wren</surname>
          </string-name>
          ,
          <article-title>GenomeRunner web server: regulatory similarity and diferences define the functional impact of SNP sets</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>32</volume>
          (
          <year>2016</year>
          )
          <fpage>2256</fpage>
          -
          <lpage>2263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Shefield</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Bock, LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>32</volume>
          (
          <year>2016</year>
          )
          <fpage>587</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Diez-Villanueva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buschbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Peinado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Malinverni</surname>
          </string-name>
          , regioneR: an R/
          <article-title>Bioconductor package for the association analysis of genomic regions based on permutation tests</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>32</volume>
          (
          <year>2016</year>
          )
          <fpage>289</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Heger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goodson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Ponting</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Lunter, GAT: a simulation framework for testing the association of genomic intervals</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>29</volume>
          (
          <year>2013</year>
          )
          <fpage>2046</fpage>
          -
          <lpage>2048</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Sandve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gundersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rydbeck</surname>
          </string-name>
          , et al.,
          <article-title>The genomic HyperBrowser: Inferential genomics at the sequence level</article-title>
          ,
          <source>Genome Biology</source>
          <volume>11</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>