Enhancing Scalability of Distributed SNPs Calling Pipelines Using Cluster-Driven Partitioning Strategy

Lorenzo Di Rocco1,∗, Umberto Ferraro Petrillo1 and Giorgio Grani1
1 Dipartimento di Scienze Statistiche, Università di Roma “La Sapienza”, P.le Aldo Moro 5, I-00185 Rome, Italy


Abstract
The genetic composition of individuals within the same species may differ due to mutations in their genomes. Variants calling procedures are pivotal to detect these polymorphisms. Typically, variants calling algorithms use the De Bruijn graph data structure for storing the k-nucleotide long strings sequenced from the genomes of multiple individuals. Subsequently, mutations in an individual can be identified by searching for divergent paths (known as bubbles) in the corresponding De Bruijn graph.
   The analysis of a very large De Bruijn graph can be very computationally intensive. A novel pipeline has recently been proposed to build and analyze very large De Bruijn graphs using distributed computing. This approach has shown competitive results in the specific case of calling SNPs. However, experimental results have shown that its scalability is limited, due to the overhead of traversing paths scattered across different computers.
   In this work, we address one of the main problems that prevent distributed De Bruijn graphs from performing well in practice. Namely, we analyze how a bad distributed partitioning of a De Bruijn graph may hinder the efficiency of the SNPs detection process. Then, we introduce a distributed partitioning strategy, based on a hierarchical clustering method, able to overcome this problem. We also provide an experimental analysis showing how this approach allows for a relevant performance boost of the considered pipeline and ensures better scalability.

Keywords
Graph Partitioning, Distributed Computing, Computational Genomics




1. Introduction
Advances in Next Generation Sequencing (NGS) technologies have led to an explosion in the amount of genomic data produced at affordable costs. Because of this massive production of biological datasets, pivotal tasks in Bioinformatics have become computationally expensive.
   One of the most relevant tasks is variants calling. We use this term to denote a procedure aimed at detecting variants between the genomes of two or more different individuals. Reference-based algorithms are commonly used to detect the presence of mutations with respect to the average genome of the corresponding species. However, high-quality reference genomes are not available for all species and for all applications. For these reasons, it is useful to consider


ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
∗ Corresponding author.
lorenzo.dirocco@uniroma1.it (L. Di Rocco); umberto.ferraro@uniroma1.it (U. Ferraro Petrillo); g.grani@uniroma1.it (G. Grani)
ORCID: 0000-0001-8744-7048 (L. Di Rocco); 0000-0002-4308-5126 (U. Ferraro Petrillo); 0000-0001-6049-0062 (G. Grani)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




reference-free algorithms. These allow comparison of individuals by evaluating reads sequenced
from the corresponding genomes, without relying on an average species sequence.
   The state-of-the-art reference-free methods are usually based on the exploration of a graph-
based data structure, called De Bruijn (DB) graph, that is built from an input collection of reads.
The presence of a polymorphism is then detected by finding some specific types of divergent
paths in the corresponding DB graph. However, the computational and memory requirements of
these algorithms can be prohibitive, making them unsuitable for processing large DB graphs on a single
workstation. Some alternative methods that have emerged recently therefore use distributed
computing to develop reference-free pipelines for variants calling, able to take advantage of the
large availability of memory and computational resources of a computer cluster.
   Despite the theoretical advantages, analyzing a DB graph on a distributed architecture poses
significant challenges. For example, the data traffic required to explore a region of the graph
where each vertex of a path is on a different node of the underlying distributed system can
create excessive overhead that negatively affects algorithm performance and scalability.
   In this work, we focus on the specific problem of reducing the communication overhead incurred when exploring distributed DB graphs, by considering a recent distributed pipeline presented in [1] that aims at calling a specific type of polymorphism, namely Single Nucleotide Polymorphisms (SNPs). The solution we present is based on a cluster-driven partitioning strategy used in the initial creation of a DB graph. We expect that this will significantly reduce the amount of data exchanged between nodes during SNPs calling, which should have a positive impact on the efficiency and scalability of the pipeline.
   We also present the results of an experimental analysis performed on a real dataset to compare
our improved partitioning scheme with the default partitioning strategy used by [1].


2. Background
2.1. De Bruijn graphs
The DNA structure consists of two antiparallel strands comprising nucleotide bases, commonly
represented by the alphabet Σ = {𝐴, 𝐶, 𝐺, 𝑇 }. The strands run in opposite directions, and the
bases in the corresponding positions are paired according to specific rules (A-T and C-G).
Consequently, a genomic sequence is usually encoded as a single string of characters drawn
from the alphabet Σ, as the reverse strand is easily obtainable. However, modern sequencing
technologies do not return the entire genome in one continuous sequence. Instead, they generate
huge amounts of nucleotide fragments, called reads. These reads are randomly sequenced from
both strands, and the coverage level determines the number of reads associated with a particular
genomic region.
   Consequently, reconstructing a genome becomes a complex task, due to the unordered collection of reads that needs to be assembled. The assembly phase is extremely expensive from the computational viewpoint, since it requires computing a matching score for each pair of reads.
Figure 1: The DB graph obtained from two sequenced reads (𝑅1 and 𝑅2), when 𝑘 = 3. The set of vertices represents all the distinct 3-mers extracted from 𝑅1 and 𝑅2, while the edges connect the overlapping 3-mers. The SNP (marked in red) between 𝑅1 and 𝑅2 generates two parallel paths connecting CGT with GAT. The substring GAT occurs twice, determining a loop in the graph and two branching paths.

   Assembly techniques based on k-mers and the DB graph have revolutionized genomic analysis. The k-mer-based approach involves splitting the genomic reads into smaller overlapping subsequences of length k (called k-mers) and building the corresponding DB graph. The vertices of a DB graph represent the distinct k-mers extracted from the reads sample, while the edges connect the k-mers that overlap by 𝑘 − 1 characters. The DB graph provides a compact and efficient representation of the genome, and traversing its paths of overlapping sequences enables an efficient reconstruction of the original genomic sequence.
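To make the construction concrete, the following is a minimal sketch in plain Python (not part of the pipeline discussed later; the toy reads, the value of k, and the function names are illustrative) that builds a small DB graph by extracting (k+1)-mers and linking their prefix and suffix k-mers.

```python
from collections import defaultdict

def kmers(seq, size):
    """Yield all substrings of the given size (k-mers) of a sequence."""
    for i in range(len(seq) - size + 1):
        yield seq[i:i + size]

def build_db_graph(reads, k):
    """Vertices are distinct k-mers; an edge links the prefix and suffix
    k-mers of every (k+1)-mer, i.e. k-mers overlapping by k-1 characters."""
    adjacency = defaultdict(set)
    for read in reads:
        for kp1 in kmers(read, k + 1):
            adjacency[kp1[:-1]].add(kp1[1:])
    return adjacency

# Toy example: two reads differing by a single nucleotide (a SNP)
R1, R2 = "ACGTAGAT", "ACGGAGAT"
for src, dsts in build_db_graph([R1, R2], k=3).items():
    print(src, "->", sorted(dsts))
```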
   DB graphs also play a crucial role in variants calling. Variants calling refers to the process of identifying genetic variations between different individuals, such as single nucleotide polymorphisms (SNPs) and insertions/deletions (indels), in a given genomic dataset. The structure of the DB graph enables efficient and accurate detection of variants. Indeed, when dealing with sequencing data referring to two or more individuals, the presence of polymorphisms generates divergent paths, called bubbles. There exist different types of bubbles, according to the corresponding polymorphisms. In the case of isolated SNPs, we have simple bubbles. Figure 1 shows an example. In real-world scenarios, the topology of the graph is extremely complex and the detection of the divergent paths is not straightforward.

2.2. Large-scale variants calling
DB graphs tend to become very large and, consequently, difficult to store, when based on
genomic datasets returned by modern sequencing technologies. A possible solution to this
problem consists of introducing compressed data structures for storing a DB graph. This problem
is addressed in [2, 3], where a concise representation of a colored DB graph is obtained by
using data structures for string compression and indexing, such as the positional Burrows-Wheeler transform [4].
   Further solutions take into account a probabilistic representation of the DB graph. Among
the proposed techniques, 𝐷𝑖𝑠𝑐𝑜𝑆𝑛𝑝 [5] is one that outperforms many others in terms of space
and time efficiency, when used to retrieve isolated SNPs on genomes of complex organisms. It
features an efficient storage layout where only the nodes of an input DB graph are stored using
a cascade of Bloom Filters [6]. The edges are found by properly querying these filters to obtain
paths corresponding to isolated SNPs, even though the Bloom Filters may return non-existing paths due to false-positive matches.
   Despite the potential advantages, the usage of distributed computing for large-scale variants
calling analysis still remains a relatively unexplored field. To the best of our knowledge, [1]
is the only pipeline that leverages a distributed representation of a DB graph to perform SNPs
calling.

2.3. Distributed Computing
Computations on distributed systems enable the processing of huge collections of data. In
distributed computing architectures, a number of computers (called nodes) work together
to solve a single problem. The nodes coordinate their hardware resources using messages
exchanged over a network.

2.3.1. MapReduce
The MapReduce (MR) paradigm [7] is a programming model designed for implementing dis-
tributed algorithms. It considers an input dataset of key-value pairs distributed across a computer
cluster and provides a flow that combines two types of functions:

    • map functions. They process each key-value pair to obtain a new (possibly empty) set of
      key-value pairs.
    • reduce functions. They first collect on the same node all pairs sharing the same key and, then, they aggregate all values in this collection, returning a (possibly empty) set of key-value pairs.

   Apache Spark [8] is the standard framework for developing MapReduce algorithms to be
executed on a distributed computing architecture. Resilient Distributed Datasets (RDDs) are
the core data structures of Spark. RDDs are an abstract representation of the key-value pairs
distributed across the nodes of the distributed system.
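   As a hedged illustration of how the two function types combine over an RDD, the following PySpark sketch counts k-mer frequencies from a toy collection of reads. The reads, the value of k, and the local master setting are placeholders; this is not the code of the pipeline discussed later.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("kmer-count").getOrCreate()
sc = spark.sparkContext

reads = ["ACGTAGAT", "ACGGAGAT"]   # placeholder input reads
k = 3

kmer_counts = (
    sc.parallelize(reads)                                               # RDD of reads
      .flatMap(lambda r: [r[i:i + k] for i in range(len(r) - k + 1)])   # map: read -> k-mers
      .map(lambda km: (km, 1))                                          # key-value pairs
      .reduceByKey(lambda a, b: a + b)                                  # reduce: sum counts per key
)

print(kmer_counts.collect())
spark.stop()
```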

2.3.2. Apache Spark for distributed graph processing
Apache Spark also provides libraries (built on top of RDDs) dedicated to specific tasks. The
GraphX API [9] is designed for large-scale graph processing on a Spark cluster. This API
introduces a distributed representation of a graph built on top of two RDDs, storing respectively
the vertices and the edges of a graph, with their associated properties. In detail, each vertex is associated with an automatically-generated unique id (vid) and (optionally) a set of user-defined attributes. The edges are modeled as triplets containing the source vertex id, the destination vertex id, a virtual partition identifier (pid) and (optionally) a set of user-defined attributes. Vertices and edges are scattered across the nodes of a computer cluster, according to their vid/pid. The GraphX API employs a default partitioning strategy and generates the so-called VertexMap RDD, which maps each vertex to the list of partitions containing its incident edges. Leveraging the vertex and edge identifiers, GraphX also allows the use of user-defined graph partitioning strategies that may minimize data traffic and balance the workload of the computing nodes. GraphX allows traversing and analyzing a graph by means of algorithms written using a special-purpose message-passing paradigm.
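   GraphX exposes its partitioning interfaces through its Scala API; purely as a language-neutral illustration of what an edge partitioning rule does, and of how a path can end up crossing several partitions, the plain-Python sketch below assigns each edge to a partition by hashing its source vertex and counts the partitions touched by a toy path. The helper names are hypothetical and this is not the GraphX implementation.

```python
import zlib

def edge_partition(src, dst, num_partitions):
    """Illustrative hash-based rule: the partition of an edge depends only
    on its source vertex (deterministic CRC32 hash, for reproducibility)."""
    return zlib.crc32(src.encode()) % num_partitions

def partitions_touched(path, num_partitions):
    """Number of distinct partitions containing at least one edge of the path."""
    return len({edge_partition(u, v, num_partitions)
                for u, v in zip(path, path[1:])})

# A toy path of 3-mers: with 4 partitions, its edges may be spread over several of them.
path = ["CGT", "GTA", "TAG", "AGA", "GAT"]
print(partitions_touched(path, num_partitions=4))
```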


3. A MapReduce Pipeline for Isolated SNPs Calling
The distributed pipeline we consider in this work has been recently proposed in [1]. It has
been designed according to the MapReduce paradigm (see Section 2.3.1) and it is a distributed
reformulation of the 𝐷𝑖𝑠𝑐𝑜𝑆𝑛𝑝 algorithm [5], with some relevant improvements. We recall
that 𝐷𝑖𝑠𝑐𝑜𝑆𝑛𝑝 uses a cascade of Bloom filters to store the DB graph. This allows maintaining very large graphs in a small amount of memory, but it has the disadvantage of leading to the generation of paths that were not present in the original graph (i.e., chimeric sequences), which, therefore, have to be eliminated in a subsequent step.
   In a distributed system, the amount of available memory is usually much larger, since we can leverage the computational resources of all nodes of the system. Therefore, in the approach presented in [1], the input DB graph is explicitly represented, with no compression at all, thanks to a distributed representation, thus avoiding the creation of useless chimeric sequences.
   Specifically, the proposed algorithm is formulated in three steps, each made of a sequence of distributed transformations. In the first step, a DB graph 𝒢 is created from an initial
distributed collection 𝒞 of reads. In the second step, 𝒢 is processed by a distributed algorithm
that searches for all simple bubbles. In the third step, all isolated bubbles, i.e., the SNPs, are
returned in the output. More details on the algorithm follow.

3.1. Step 1: Graph Creation
Given two FASTA/FASTQ files, each containing the set of reads extracted from one of two different individuals, their contents are loaded in memory using the FASTdoop library [10]. Then, (𝑘 + 1)-mers are extracted from each read and stored in an RDD distributed data structure. Finally, a
reduce transformation is executed to count the frequency of each distinct (𝑘 + 1)-mer. The
results are stored in a new RDD, containing all the (𝑘 + 1)-mers that have been found in the
input reads collection, with their associated overall frequencies.
   A map transformation is used to filter out (𝑘 +1)-mers with low coverage. Then, after merging
all samples, a DB distributed graph is created using the GraphX library (see Section 2.3.2). Next,
we extract consecutive k-mers from the (𝑘 + 1)-mers to feed the distributed collection of vertices
of the DB graph. Then, (𝑘 + 1)-mers are added to the distributed collection storing the edges of
the DB graph. The edges connect the vertices corresponding to the two consecutive k-mers.
This process is also performed for the backward strand of each (𝑘 + 1)-mer.
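   A minimal PySpark outline of this step is sketched below, under the assumption of an in-memory toy read collection and an arbitrary coverage threshold; the real pipeline relies on FASTdoop for the input and GraphX for the graph (both JVM libraries), so only the sequence of transformations is reproduced here, and the backward strand handling is omitted.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("db-graph-step1").getOrCreate()
sc = spark.sparkContext

reads = ["ACGTAGAT", "ACGGAGAT"]   # placeholder for the two read samples
k, min_coverage = 3, 1             # illustrative parameters

def substrings(read, size):
    return [read[i:i + size] for i in range(len(read) - size + 1)]

kp1_counts = (
    sc.parallelize(reads)
      .flatMap(lambda r: substrings(r, k + 1))      # extract (k+1)-mers
      .map(lambda s: (s, 1))
      .reduceByKey(lambda a, b: a + b)              # overall frequency of each (k+1)-mer
      .filter(lambda kv: kv[1] >= min_coverage)     # drop low-coverage (k+1)-mers
)

# Each surviving (k+1)-mer yields one edge between its two consecutive k-mers;
# the vertices are the distinct k-mers appearing in these edges.
edges = kp1_counts.map(lambda kv: (kv[0][:-1], kv[0][1:]))
vertices = edges.flatMap(lambda e: e).distinct()

print(vertices.count(), edges.count())
spark.stop()
```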

3.2. Step 2: Centrality Indices Evaluation
In this step, two centrality indices indicating the number of incoming and outgoing edges of each vertex 𝑣, respectively denoted inDegree and outDegree, are evaluated.
   To do this, for each vertex 𝑣, the number of triplets containing 𝑣 in the distributed representation of 𝒢 is counted, distinguishing the cases where 𝑣 is the source vertex from those where 𝑣 is the destination vertex. Then, for each vertex, these counts are aggregated by a distributed reduce transformation to obtain the corresponding centrality indices.
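   A compact PySpark sketch of how the two indices can be derived with map and reduce transformations over the distributed edge collection is given below; the toy edge list stands in for the RDD of the previous step.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("degrees").getOrCreate()
sc = spark.sparkContext

# Toy distributed edge list of (source k-mer, destination k-mer) pairs.
edges = sc.parallelize([("CGT", "GTA"), ("GTA", "TAG"), ("GTA", "TAC")])

out_degree = edges.map(lambda e: (e[0], 1)).reduceByKey(lambda a, b: a + b)
in_degree  = edges.map(lambda e: (e[1], 1)).reduceByKey(lambda a, b: a + b)

print(dict(out_degree.collect()))   # GTA has outDegree 2: a branching vertex
print(dict(in_degree.collect()))
spark.stop()
```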

3.3. Step 3: Simple Bubbles Detection
In this step, a distributed algorithm is executed over 𝒢 to find all paths that represent simple
bubbles.
   Initially, vertices with an outDegree greater than 1 are considered as starting vertices and
are marked as enabled. In the first iteration, each enabled triplet sends a path-building message
to its adjacent vertices, including information about the current path (which initially contains
only the starting point) and the iteration number.
   In subsequent iterations, the triplets containing vertices that received messages in the previous
iteration are enabled. When a vertex receives a path-building message from a neighbor, it updates
the received message by adding its own identity and increasing the iteration number. The
vertex stores a copy of the iteration number in its state and broadcasts the updated message to
its adjacent vertices.
   To handle branching bubbles, the algorithm modifies the transmission of path-building messages. In the first 𝑘 − 1 iterations, the message is only transmitted to destination vertices with one incoming edge and one outgoing edge. Once the number of iterations reaches 𝑘, the messages are forwarded without checking the degrees of the destination vertex. This is because, at iteration 𝑘, the destination vertex is expected to be the end vertex of the bubble, which is allowed to have multiple incoming and outgoing edges.
   At the end of the execution, after running 𝑘 iterations, the algorithm identifies the vertices with iteration number 𝑘 + 1 that have received two messages. These messages contain paths starting from the same vertex and having the same length. Such vertices represent the terminal nodes of pairs of sequences containing simple bubbles, which are considered as isolated SNPs. These pairs are encoded as a distributed collection of string pairs.
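   The following is a simplified, serial sketch of this logic in plain Python (in-memory adjacency lists, no backward strand, hypothetical function name): starting from every branching vertex, paths are extended through unary vertices, and two equal-length paths from the same start that meet at the same end vertex form a simple bubble. The pipeline of [1] expresses the same idea through GraphX message passing.

```python
from collections import defaultdict

def find_simple_bubbles(edges, k):
    """edges: set of (src_kmer, dst_kmer) pairs. Returns pairs of divergent
    paths of k+1 edges sharing their start and end vertices (simple bubbles)."""
    out_adj, in_deg, out_deg = defaultdict(list), defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_adj[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1

    bubbles = []
    for start in [v for v in out_adj if out_deg[v] > 1]:     # branching start vertices
        frontier = [[start, nxt] for nxt in out_adj[start]]  # first iteration
        for _ in range(k):                                   # k further extensions
            # Only unary vertices (one incoming and one outgoing edge) are traversed.
            frontier = [path + [out_adj[path[-1]][0]]
                        for path in frontier
                        if in_deg[path[-1]] == 1 and out_deg[path[-1]] == 1]
        ends = defaultdict(list)
        for path in frontier:                                # group paths by end vertex
            ends[path[-1]].append(path)
        bubbles += [tuple(p) for p in ends.values() if len(p) == 2]
    return bubbles

# Toy example: two reads differing by one SNP, k = 3
R1, R2, k = "ACGTAGAT", "ACGGAGAT", 3
edges = {(r[i:i + k], r[i + 1:i + 1 + k]) for r in (R1, R2) for i in range(len(r) - k)}
print(find_simple_bubbles(edges, k))
```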


4. Our Proposal
4.1. Motivation
When using a distributed approach, one expects that the solution time of a given problem could
be reduced by just employing more computational resources. However, the ability to efficiently
exploit the computational capability of a distributed system is not always a given. This is either
due to the inherent non-parallelism of the problem being solved, or to issues involving the
algorithm itself, its implementation and the way it interacts with the underlying distributed
system. This is the case of the pipeline presented in [1]. As shown in Figure 2, the experiments presented therein exhibit a promising level of scalability when exploiting an increasing number of computing units. However, the reduction in execution times becomes less and less pronounced.
   A more detailed investigation revealed that this lack of scalability is mainly due to the
suboptimal strategy GraphX uses for partitioning the vertices of the DB graph across different
Figure 2: Execution times required by [1] for isolated SNPs calling on a dataset including 32 GB of
reads sequenced from the chromosome 7 of two human individuals, using an increasing number of
computing units.


nodes of a distributed system, resulting in a very high number of x-cross paths, i.e., paths of a distributed graph whose vertices are placed on two or more nodes of the distributed system.
   In this scenario, the advantages of adding further computing nodes may be offset by the
communication overhead introduced by the message-passing mechanism required to explore
the distributed graph. This overhead includes both the exchange of data between computing
units within a node and across different nodes. The latter case is significantly more expensive
in terms of network communication times and scalability level.
   Our work focuses on assessing the positive impact on the performance of the considered algorithm coming from the adoption of a partitioning strategy different from the standard one available with GraphX. Our expectation is that the usage of an improved partitioning strategy would significantly reduce, if not eliminate, the number of x-cross paths. In turn, this would lead to a significant reduction in the communication overhead between different computing nodes and/or computing units, thus allowing for shorter overall execution times. To achieve this, our proposal leverages a hierarchical clustering approach to group reads referring to the same genomic region, enabling an intra-cluster solution for bubbles detection.

4.2. Cluster-driven Partitioning Strategy
The strategy we propose to improve the partitioning of a DB graph over a distributed system is based on the assumption that clustering the reads according to an appropriate similarity metric isolates specific genomic regions, allowing the bubble detection procedure to be performed mostly within them. Consequently, assigning these clusters to different independent computing units for parallel processing is expected to require less communication overhead than using the standard partitioning strategy, thus enhancing the scalability of the DB graph exploration in a distributed environment.
   It requires two steps. In the Hierarchical Clustering step, the collections of input reads from two distinct individuals are clustered using a bottom-up approach and a greedy stop criterion. In the Clusters Binning step, the clusters returned in the previous step are grouped into 𝐵 distinct bins scattered across the computational units of the distributed architecture. The generation of the bins is driven by the application of the longest processing time first rule (i.e., the 𝐿𝑃𝑇 rule) [11, 12], to avoid overloading certain nodes. In the following, we explain the proposed partitioning strategy in more detail.

4.2.1. Hierarchical Clustering
Given an input set of reads sequenced from two individuals, we use a hierarchical bottom-up
algorithm to group those reads that are likely to cover the same genomic region.
   An algorithm of this type implements an iterative procedure that progressively merges similar elements into larger clusters, eventually obtaining a dendrogram, i.e., a hierarchical tree-like structure representing the clustering relationships. In the initialization step, each read is considered as a cluster with a single element. Then, at the 𝑖-th iteration, for each cluster, the read closest to the others in the same cluster is selected as representative, the pairwise similarity matrix between all clusters is calculated, and the two closest clusters are merged. This process continues until a single large cluster containing all the reads in the input dataset is obtained.
   Creating the entire dendrogram can be very computationally intensive when working with
large datasets. Moreover, solutions with a small number of clusters should be avoided, as our
strategy aims to detect clusters consisting of very similar reads that refer to the same genomic
region. To achieve this, we incorporated a greedy stopping criterion into the algorithm. Namely,
our hierarchical iterative process stops when the similarity between the two clusters to be
merged at the 𝑖-th iteration is less than a certain threshold 𝑐. In this way, the choice of the
number of clusters is also automated without the need for a global evaluation of the dendrogram.
If the threshold 𝑐 is sufficiently large, hierarchical clustering yields a large number of clusters
containing a relatively small number of reads. However, if 𝑐 is too large, reads related to the
same genomic section may be assigned to different clusters, causing many bubbles to be missed
during the detection step.
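   A naive, in-memory sketch of this bottom-up procedure with the greedy stop is given below (plain Python; the function names, the representative rule, and the quadratic search are illustrative simplifications, not the implementation used in our experiments). Any read-to-read similarity function, such as the Jaccard index discussed next, can be plugged in.

```python
def greedy_agglomerative_clustering(reads, similarity, threshold):
    """Bottom-up clustering: repeatedly merge the two most similar clusters
    (compared through their representatives) until the best similarity
    drops below `threshold` (greedy stop criterion)."""
    clusters = [[r] for r in reads]          # initialization: one cluster per read

    def representative(cluster):
        # Read with the highest total similarity to the other reads of its cluster.
        if len(cluster) == 1:
            return cluster[0]
        return max(cluster,
                   key=lambda r: sum(similarity(r, o) for o in cluster if o is not r))

    while len(clusters) > 1:
        reps = [representative(c) for c in clusters]
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):       # pairwise similarity between clusters
            for j in range(i + 1, len(clusters)):
                s = similarity(reps[i], reps[j])
                if s > best_sim:
                    best_sim, best_pair = s, (i, j)
        if best_sim < threshold:             # greedy stop: remaining clusters too dissimilar
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])      # merge the two closest clusters
        del clusters[j]
    return clusters
```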

4.2.2. Choice of a similarity measure
We recall that a genomic read is simply a string of characters from the alphabet Σ = {𝐴, 𝐶, 𝐺, 𝑇 }.
Our clustering framework, therefore, requires a similarity measure that can capture the under-
lying similarity patterns between strings and is not extremely sensitive to slight variations in
characters. This helps to ensure that reads sequenced from multiple individuals and covering the same genomic regions are clustered together, despite the presence of SNPs.
   In this work, we consider the Jaccard index, also known as Jaccard similarity coefficient. This
measure is widely used in computational genomics for assessing similarity between nucleotide
strings and for clustering [13, 14]. The Jaccard index quantifies the similarity between two
reads with respect to their shared k-mers. To calculate this coefficient, it is required to extract
the corresponding sets of k-mers from a pair of reads and calculate the percentage of common
k-mers. Mathematically, the Jaccard index can be expressed as follows:
                                         𝐽(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵|
where:
     • 𝐴 and 𝐵 represent the sets of k-mers extracted from the two genomic reads;
     • |𝐴| and |𝐵| represent, respectively, the number of distinct k-mers in 𝐴 and 𝐵;
     • |𝐴 ∩ 𝐵| represents the number of shared k-mers between 𝐴 and 𝐵;
     • |𝐴 ∪ 𝐵| represents the total number of unique k-mers in the union between 𝐴 and 𝐵.
The Jaccard index ranges between 0 and 1. The larger it is, the more similar the reads are. When it is equal to 1, the reads are identical. Since the Jaccard index evaluates the presence or the absence of shared k-mers, it is robust to small variations in the reads being compared.
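   A minimal Python sketch of the Jaccard index computed on k-mer sets follows; the toy reads and the value of k are illustrative.

```python
def kmer_set(read, k):
    """Set of distinct k-mers of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def jaccard(read_a, read_b, k):
    """Jaccard index between two reads, computed on their k-mer sets."""
    a, b = kmer_set(read_a, k), kmer_set(read_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Two reads covering the same region and differing by a single SNP remain similar.
print(jaccard("ACGTAGATTACCA", "ACGGAGATTACCA", k=3))
```

Such a function can be used as the similarity measure of the clustering sketch in Section 4.2.1, e.g. passing lambda a, b: jaccard(a, b, k) together with an illustrative threshold.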

4.2.3. Binning the clusters
The clusters returned in the previous step need to be distributed among the computational units
of the distributed system to perform bubble detection in parallel. In this way, each computing
unit processes a bin of clusters according to the available resources. However, the number of reads varies from cluster to cluster, and some clusters may be overloaded due to the presence of highly repetitive genomic regions. For this reason, the binning strategy is controlled by an
appropriate scheduling model, to ensure a balanced workload on the computational units.
   We consider this problem as a scheduling problem with identical parallel machines and no preemption. The objective is to minimize the maximum completion time (makespan) by assigning each of the 𝑁 jobs, characterized by their respective processing times {𝑝𝑗 , 𝑗 = 1, … , 𝑁 }, to one of the 𝑚 parallel identical machines. In our context, these jobs correspond to the tasks associated with building and exploring the local DB graph belonging to a cluster of reads. The processing time of each job is estimated based on the number of k-mers extracted from the reads within the same cluster. The machines, on the other hand, represent the available computational units. This problem, referred to as 𝑃||𝐶𝑚𝑎𝑥 , is known to be NP-hard and can be formulated mathematically as follows:


min 𝐶𝑚𝑎𝑥                                                                                          (1)
𝐶𝑚𝑎𝑥 ≥ ∑𝑗=1,…,𝑁 𝑝𝑗 𝑥𝑖𝑗    ∀𝑖 ∈ {1, …, 𝑚}                                                          (2)
∑𝑖=1,…,𝑚 𝑥𝑖𝑗 = 1    ∀𝑗 ∈ {1, …, 𝑁}                                                                (3)
𝑥𝑖𝑗 ∈ {0, 1}    ∀𝑖 ∈ {1, …, 𝑚}, 𝑗 ∈ {1, …, 𝑁}                                                     (4)
where 𝐶𝑚𝑎𝑥 is the makespan, 𝑁 is the number of jobs, 𝑚 is the number of machines, 𝑝𝑗 is the processing time of job 𝑗, and 𝑥𝑖𝑗 is the decision variable, whose value is 1 if job 𝑗 is assigned to machine 𝑖 and 0 otherwise. Given the complexity of this problem, many heuristic solutions have been proposed in the literature, such as the 𝐿𝑃𝑇 rule. It involves sorting the jobs in non-increasing order of their processing times and iteratively assigning each job to the machine that currently has the minimum completion time. The completion time of a machine is determined by summing the processing times of all the jobs assigned to that particular machine.
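   A short Python sketch of the 𝐿𝑃𝑇 rule applied to our binning problem follows; the cluster sizes, the number of computing units, and the function name are placeholders.

```python
import heapq

def lpt_binning(cluster_sizes, num_units):
    """Longest Processing Time first: sort clusters by non-increasing size
    (estimated, e.g., by their number of k-mers) and assign each one to the
    computing unit with the smallest current load. Returns (bins, makespan)."""
    units = [(0, u, []) for u in range(num_units)]          # (load, unit id, assigned clusters)
    heapq.heapify(units)
    for c in sorted(range(len(cluster_sizes)),
                    key=lambda j: cluster_sizes[j], reverse=True):
        load, uid, assigned = heapq.heappop(units)          # least-loaded unit so far
        assigned.append(c)
        heapq.heappush(units, (load + cluster_sizes[c], uid, assigned))
    bins = {uid: assigned for _, uid, assigned in units}
    makespan = max(load for load, _, _ in units)
    return bins, makespan

# Example: 7 clusters (sizes = estimated k-mer counts) over 3 computing units
print(lpt_binning([50, 40, 30, 20, 20, 10, 5], num_units=3))
```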
5. Experimental Analysis
We conducted an experimental analysis to assess the quality of the proposed clustering-driven
partitioning strategy and evaluate its impact on scalability.

5.1. Dataset
We downloaded the FASTA file corresponding to chromosome 7 of the average human genome
assembly GRCh37.p13 [15]. From this FASTA file, we simulated a sample of reads for our
experiments. Additionally, we used SAMtools [16] to extract the sequencing data referring
to chromosome 7 of the European individual NA12878 from the 30x downsampled alignment
provided by the Genome-in-a-Bottle Consortium [17]. Then, we built a distributed DB graph
from the two sets of reads fixing 𝑘 = 31, as described in Section 3.1.
   The individual NA12878 is accompanied by a VCF file containing the set of high-confidence
variants with respect to the average GRCh37.p13 genome. From this VCF file, we selected 10, 000
SNPs for our analysis. Thereafter, we extracted the portions of the DB graph corresponding to
these SNPs, to compare the GraphX default partitioning strategy with the one resulting from
our cluster-driven partitioning strategy. From now on, we refer to the filtered DB graph used in
our analysis as DB1e4.

5.2. Computing Environment
We conducted our experiments on a high-performance computing system that consists of eight
computing nodes. Each node runs on the Linux operating system and is equipped with two
AMD Epyc 7452 processors, 64 compute cores, and 256 GB of RAM. For our experiments, we employed Apache Spark version 3.1.3 and Apache Hadoop version 2.7.

5.3. Evaluating default partitioning strategy
First, we evaluated the GraphX default partitioning strategy on DB1e4. Thanks to the information provided by the VCF file, containing the mutations of the individual NA12878 with respect to the GRCh37.p13 version of the average human genome, we easily identified the bubbles
occurring in the graph. For each bubble, we determined the number of computing units that
need to communicate in order to traverse each edge of the bubble. Since we set 𝑘 = 31, each
bubble may be split into up to 31 different computing units.
   The histogram in Figure 3 shows that there is no bubble that can be traversed by staying on a single computing unit. Instead, at least two computing units must be involved, while the most frequent case requires the involvement of three computing units.

5.4. Scalability
We compared, in terms of scalability, the time performance of the distributed pipeline described in Section 3 when using the GraphX default partitioning strategy against a version based on our cluster-driven partitioning scheme.
Figure 3: The histogram illustrates the distribution of the bubbles, in terms of the number (in percentage)
of computing units to involve for traversing them, according to the default GraphX partitioning strategy.




  First, we performed the SNPs calling procedure on DB1e4 using the aforementioned pipeline
and the default GraphX partitioning strategy, while increasing the number of computing units.
Then, we repeated the same experiment, but using our hierarchical clustering framework and
binning strategy. In this last case, we considered for each run a number of bins that was 4 times
the number of allocated computing units. The resulting bins were then loaded into memory on
the distributed computing architecture, to perform the bubble detection procedure.
   Figure 4 compares the scalability deriving from the application of the two partitioning approaches, highlighting how our proposed strategy significantly improves the ability to leverage an increasing number of computing resources.


6. Conclusions
In this work, we focused on a performance bottleneck affecting Spark-based distributed pipelines used for SNPs detection, due to the poor performance of the default partitioning strategy employed by the Spark framework when creating distributed graphs. Thus, we have proposed an improved partitioning strategy, based on a hierarchical clustering algorithm. According to our experimental results, the proposed strategy is able to significantly enhance the scalability of the considered distributed computing pipeline for SNPs calling. As future directions, it will be pivotal to explore different similarity measures and algorithms for genomic reads clustering, as the Jaccard index and hierarchical clustering algorithms can be computationally expensive when dealing with huge-scale analyses. Furthermore, ongoing research is evaluating strategies for detecting more complex mutations and dealing with genomic regions characterized by a high mutation intensity, whose corresponding reads may not fall within the same cluster.
Figure 4: Time performances of the SNPs calling procedure executed on the DB1e4 graph, when
considering the GraphX default partitioning and the cluster-driven partitioning strategies. The execution
times are normalized and reported as a function of the employed computing units.




Acknowledgments
The authors would like to thank the Department of Statistical Sciences of University of Rome
- La Sapienza for computing time on the TeraStat cluster. This work was partially supported
by Università di Roma - La Sapienza Research Project 2021 “Caratterizzazione, sviluppo e
sperimentazione di algoritmi efficienti”. It was also supported in part by INdAM – GNCS Project
2023 “Approcci computazionali per il supporto alle decisioni nella Medicina di Precisione”.


References
 [1] D. Geiger, C. Meek, Structured variational inference procedures and their realizations (as
     incol), in: Proceedings of Tenth International Workshop on Artificial Intelligence and
     Statistics, The Barbados, The Society for Artificial Intelligence and Statistics, 2005.
 [2] M. D. Muggli, A. Bowe, N. R. Noyes, P. S. Morley, K. E. Belk, R. Raymond, T. Gagie, S. J.
     Puglisi, C. Boucher, Succinct colored de bruijn graphs, Bioinformatics 33 (2017) 3181–3187.
 [3] G. Holley, P. Melsted, Bifrost: highly parallel construction and indexing of colored and
     compacted de bruijn graphs, Genome biology 21 (2020) 1–20.
 [4] R. Durbin, Efficient haplotype matching and storage using the positional burrows–wheeler
     transform (pbwt), Bioinformatics 30 (2014) 1266–1272.
 [5] R. Uricaru, G. Rizk, V. Lacroix, E. Quillery, O. Plantard, R. Chikhi, C. Lemaitre, P. Peterlongo,
     Reference-free detection of isolated snps, Nucleic acids research 43 (2015) e11–e11.
 [6] K. Salikhov, G. Sacomoto, G. Kucherov, Using cascading bloom filters to improve the
     memory usage for de brujin graphs, in: International Workshop on Algorithms in Bioin-
     formatics, Springer, 2013, pp. 364–376.
 [7] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commu-
     nications of the ACM 51 (2008) 107–113.
 [8] Apache Software Foundation, Apache Spark, (Available from: http://spark.apache.org),
     2016.
 [9] R. S. Xin, J. E. Gonzalez, M. J. Franklin, I. Stoica, Graphx: A resilient distributed graph
     system on spark, in: First international workshop on graph data management experiences
     and systems, 2013, pp. 1–6.
[10] U. Ferraro Petrillo, G. Roscigno, G. Cattaneo, R. Giancarlo, Fastdoop: a versatile and
     efficient library for the input of fasta and fastq files for mapreduce hadoop bioinformatics
     applications, Bioinformatics 33 (2017) 1575–1577.
[11] R. L. Graham, Bounds on Multiprocessing Timing Anomalies., SIAM Journal on Applied
     Mathematics 17 (1969) 416–429.
[12] L. Amorosi, L. D. Rocco, U. F. Petrillo, Scheduling k-mers counting in a distributed
     environment, in: Optimization in Artificial Intelligence and Data Sciences: ODS, First
     Hybrid Conference, Rome, Italy, September 14-17, 2021, Springer, 2022, pp. 73–83.
[13] M. Besta, R. Kanakagiri, H. Mustafa, M. Karasikov, G. Rätsch, T. Hoefler, E. Solomonik,
     Communication-efficient jaccard similarity for high-performance distributed genome
     comparisons, in: 2020 IEEE International Parallel and Distributed Processing Symposium
     (IPDPS), IEEE, 2020, pp. 1122–1132.
[14] Z. Rasheed, H. Rangwala, A map-reduce framework for clustering metagenomes, in: 2013
     IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd
     Forum, IEEE, 2013, pp. 549–558.
[15] D. M. Church, V. A. Schneider, T. Graves, K. Auger, F. Cunningham, N. Bouk, H.-C. Chen,
     R. Agarwala, W. M. McLaren, G. R. Ritchie, et al., Modernizing reference genome assemblies,
     PLoS biology 9 (2011) e1001091.
[16] P. Danecek, J. K. Bonfield, J. Liddle, J. Marshall, V. Ohan, M. O. Pollard, A. Whitwham,
     T. Keane, S. A. McCarthy, R. M. Davies, et al., Twelve years of samtools and bcftools,
     Gigascience 10 (2021) giab008.
[17] J. M. Zook, B. Chapman, J. Wang, D. Mittelman, O. Hofmann, W. Hide, M. Salit, Integrating
     human sequence data sets provides a resource of benchmark snp and indel genotype calls,
     Nature biotechnology 32 (2014) 246–251.