<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Weighted de novo clustering of third-generation transcriptomic datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Denti</string-name>
          <email>denti1@uniba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoshihiro Shibuya</string-name>
          <email>yoshihiro.shibuya@pasteur.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics</institution>
          ,
          <institution>Comenius University in Bratislava</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pasteur Institute</institution>
          ,
          <addr-line>25-28 Rue du Dr Roux, 75015 Paris</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ability of third-generation sequencing technologies to sequence end-to-end transcripts is creating new opportunities to enhance our understanding of the transcriptomic landscape in eukaryotic organisms. In this context, a common task is the de novo clustering of long transcriptomic reads, that is, splitting a long-read sample into smaller samples (one per gene family) that can then be more easily analyzed. Solving this computational problem in a reference- and annotation-free fashion is of the utmost importance when the organism under investigation is not well studied and complete gene annotations are not available. To this end, we present SolidClust, an approach for the de novo clustering of long reads. SolidClust is heavily inspired by its predecessor, isOnClust3: both are greedy algorithms based on the notion of high-confidence minimizers. We extend upon this basis by introducing the notion of solid high-confidence minimizers. In our experimental evaluation on real datasets, SolidClust achieves comparable or higher clustering accuracy w.r.t. isOnClust3 while drastically reducing the memory requirements. SolidClust is freely available at https://github.com/ldenti/solidclust.</p>
      </abstract>
      <kwd-group>
        <kwd>Long reads</kwd>
        <kwd>Transcriptomic</kwd>
        <kwd>De novo clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        such as aging [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and tissue differentiation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Many approaches have been designed to analyze transcription and splicing using short-read
sequencing technologies [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref6 ref7 ref8 ref9">6, 7, 8, 9, 10, 11, 12</xref>
        ], whereas long reads can fully cover the transcripts [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], creating
new opportunities to thoroughly analyze the transcriptomic landscape and enhance our understanding
of it. Many reference- and annotation-based approaches have been proposed to analyze long-read
RNA samples [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17">14, 15, 16, 17</xref>
        ]. However, these approaches rely on the availability of accurate
reference genomes and complete gene annotations and can hardly be applied to under-studied organisms.
To overcome this limitation, reference- and annotation-free approaches have been proposed. These
approaches do not rely on any prior knowledge and can thus work directly on input reads. In this
context, a well-established computational problem is the de novo clustering of reads by gene locus (or
gene family). In other words, the goal of this problem is to split a read sample into multiple smaller
samples, each roughly referring to the same set of genes. By doing so, the analysis of a large sample
(e.g., transcript reconstruction) can be reduced to multiple independent analyses of smaller samples.
Recent advancements in third-generation sequencing technologies have enhanced the quality of the
reads, simplifying this clustering problem, but have also increased the sequencing throughput and the
size of the generated read sample. These technologies are now capable of generating tens of millions of
accurate reads in a single experiment [18]. This poses a scalability challenge. Among the several de
novo approaches proposed in the literature [19, 20, 21, 22], only the latest isOnClust3 [23] is capable
of analyzing large long-read samples produced by third-generation sequencing technologies.</p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>Heavily inspired by isOnClust3, we introduce here SolidClust, a new approach for the de novo
clustering of third-generation transcriptomic datasets. Similarly to isOnClust3, SolidClust uses
high-confidence minimizers to sort and cluster reads in a greedy fashion, adding a read to an already
created cluster if they share a sufficient fraction of minimizers. Although SolidClust follows the
same general greedy idea proposed in [23], it enhances its predecessor in several ways, which we describe
intuitively here. For more details on SolidClust, we refer the reader to Section 2. On the algorithmic side,
SolidClust tags each minimizer as solid or non-solid depending on how many times it has been seen
globally among all clusters and how many times it has been included in the cluster it is currently being
compared to. Only solid minimizers are considered when deciding whether a read shares enough
minimizers with a cluster. Indeed, if a minimizer has been seen several times
among all created clusters and too few times in the cluster currently under investigation, it is considered
an erroneous (non-solid) minimizer and filtered out. This additional check allows SolidClust to cope
with two situations that can arise when a minimizer is the product of sequencing errors: (i) when a
minimizer is too frequent among most of the clusters and (ii) when a minimizer occurs many times in a
single cluster but fewer times in the current cluster under investigation. The use of solid minimizers
makes SolidClust more robust to minimizers that occur in a read by chance due to sequencing
errors. A second algorithmic difference between isOnClust3 and SolidClust is that the latter does not
consider repetitions when extracting minimizers from a read. In this way, SolidClust clustering is
less impacted by minimizers that are too frequent, such as those produced by long stretches of the same
nucleotide, which can be observed in the datasets produced by third-generation sequencing technologies
due to biological and technical factors. Finally, our new efficient C implementation reduces the memory
requirement by more than 66% while maintaining very high clustering accuracy and clustering speed,
as demonstrated by our experimental evaluation on real datasets, described in Section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>Our work is a modification of isOnClust3 [23], but with extended functionality and better time/memory
tradeoffs. For these reasons, and to keep the paper self-contained, we first report the main ideas behind
isOnClust3 in Section 2.2. We then introduce SolidClust in Section 2.3, highlighting the novelties of
our approach compared to its predecessor.</p>
      <sec id="sec-2-1">
        <title>2.1. Preliminaries</title>
        <sec id="sec-2-1-1">
          <title>2.1.1. Minimizers</title>
          <p>Consider a set of strings over the alphabet Σ = {A, C, G, T}. We call k-mers the substrings of length k
obtained by sliding a window over these strings.</p>
          <p>Minimizers are the (left-most) minimal k-mers over windows of w consecutive k-mers for a given order
on k-mers [24, 25]. Since the minimum k-mer can only change when a new minimum is found, or the old one
goes out of the window, minimizers are an effective sampling scheme with a wide range of applications
in bioinformatics [26, 27, 28]. In practice, a fast, non-cryptographic random hash function is used to
define this order (random minimizer schemes [29]).</p>
          <p>In the following, we will use canonical minimizers: each k-mer and its reverse complement are treated
as the same object by taking the (lexicographic) minimum of the two.</p>
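          <p>To make the definitions above concrete, the following Python sketch extracts canonical minimizers from a single sequence. It is only an illustration under our own assumptions: the function names are hypothetical, the order on k-mers is induced by CRC32 (a stand-in for the fast non-cryptographic hash mentioned above), and no attempt is made at efficiency.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import zlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Lexicographic minimum of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_rank(kmer: str) -> int:
    """Assumed pseudo-random order on k-mers (CRC32 used as a stand-in hash)."""
    return zlib.crc32(kmer.encode("ascii"))


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Left-most minimal canonical k-mer of every window of w consecutive k-mers.

    Returns (position, canonical k-mer) pairs; a minimizer shared by
    consecutive windows is reported once.
    """
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    ranks = [kmer_rank(x) for x in kmers]
    picked: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        # Ties on the rank are broken by position, i.e. left-most wins.
        best = min(range(start, start + w), key=lambda i: (ranks[i], i))
        if not picked or picked[-1][0] != best:
            picked.append((best, kmers[best]))
    return picked
```
]]></preformat>
          <p>Because the minimum of a window changes only when a smaller k-mer enters or the current minimum leaves, consecutive windows often select the same position, which is why the returned list is usually much shorter than the number of windows.</p>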
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. High confidence seeds (HCS)</title>
          <p>Reusing the isOnClust3 terminology, we define High Confidence Seeds (HCS) as minimizers whose
probability is higher than a given threshold. The probability of a minimizer can be computed as the product
of the probabilities of its bases, which are encoded as Phred quality scores (Q scores) in FASTQ files following the
relation:</p>
          <disp-formula id="eq-1"><label>(1)</label><tex-math><![CDATA[Q = -10 \log_{10}(P)]]></tex-math></disp-formula>
          <p>where P is the estimated probability that the base call is incorrect. Quality values are further
encoded in FASTQ files as ASCII characters with code Q + 33.
In the following, we will use the terms seeds and minimizers interchangeably, so that HCS can also be
read as high-confidence minimizers.</p>
          <p><bold>2.2. isOnClust3</bold></p>
          <p>isOnClust3 is a greedy algorithm designed for the de novo clustering of third-generation RNA
sequencing datasets, such as those generated by Oxford Nanopore or PacBio platforms. These datasets
are characterized by long read lengths and comparatively high error rates, presenting unique challenges
for accurate sequence clustering. The algorithm employs a greedy strategy, which does not guarantee
globally optimal results, but provides substantial gains in computational efficiency and scalability,
making it suitable for large-scale transcriptome analyses. The overall workflow can be divided into
three principal stages: read sketching, clustering, and an optional cluster merging phase intended to
refine the final clusters.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>2.2.1. Read sketching</title>
          <p>In the initial stage, each read is transformed into a compact representation to facilitate efficient
comparison. This is achieved by extracting a set of canonical minimizers, which are representative k-mers
selected from overlapping windows across the read. To ensure strand consistency, the canonical form is
defined as the lexicographically smaller of the forward sequence and its reverse complement. These
minimizer sets are subsequently filtered according to base-level quality scores, typically represented as
Phred scores [30]. The quality of a k-mer is computed as the product of the base-wise quality
probabilities, enabling the removal of low-confidence k-mers that are more likely to arise from sequencing
errors. Only k-mers whose quality exceeds a user-defined quality threshold are retained. This filtering step
reduces noise while preserving the most informative features of each read. The filtered reads are then
sorted in descending order of sketch size, such that longer and higher-quality reads are prioritized
during clustering. This ordering increases the likelihood that the earliest clusters will be seeded by
representative, high-confidence reads. The processed reads are written to a temporary FASTQ file to
allow sequential disk-based streaming in the subsequent stage.</p>
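          <p>The quality filter and the ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the isOnClust3 code: all function names are hypothetical, and we assume that the "quality probability" of a base is the probability that the base call is correct, i.e. one minus the Phred error probability, with the usual ASCII offset of 33.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def phred_to_error(qchar: str, offset: int = 33) -> float:
    """Error probability of one base from its FASTQ quality character."""
    q = ord(qchar) - offset
    return 10 ** (-q / 10)


def kmer_confidence(quals: str) -> float:
    """Product of the per-base probabilities of being correct (assumption)."""
    confidence = 1.0
    for qchar in quals:
        confidence *= 1.0 - phred_to_error(qchar)
    return confidence


def high_confidence_seeds(seq, quals, positions, k, threshold):
    """Keep the k-mers starting at `positions` whose confidence exceeds `threshold`."""
    return {
        seq[p:p + k]
        for p in positions
        if kmer_confidence(quals[p:p + k]) > threshold
    }


def sort_sketches(sketches):
    """Order sketches by decreasing size, so that large sketches seed clusters first."""
    return sorted(sketches, key=len, reverse=True)
```
]]></preformat>
          <p>For instance, a 3-mer whose bases all have quality character "I" (Q = 40) has a confidence of about 0.9997 and passes a threshold of 0.98, whereas a single "+" (Q = 10) among them lowers the confidence to about 0.90 and the 3-mer is discarded.</p>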
        </sec>
        <sec id="sec-2-1-4">
          <title>2.2.2. Clustering</title>
          <p>In the clustering stage, the sorted reads are processed sequentially to form clusters in a greedy fashion.
The first read encountered becomes the seed for the initial cluster, with its High Confidence Seeds (HCS),
i.e. the quality-filtered minimizers, serving as the defining elements of that cluster. Each subsequent
read is compared against all clusters generated thus far. The comparison is asymmetric: the read is
represented by all of its minimizers, whereas each cluster is represented solely by its HCS. For each
cluster, the fraction of shared seeds between the read and the cluster’s HCS is calculated. If this fraction
meets or exceeds a user-defined similarity threshold (named differently in the original paper [23]), the read is assigned
to that cluster, and its HCS are incorporated into the cluster’s seed set. If no existing cluster meets the
similarity criterion, a new cluster is initiated using the read’s HCS. Because the clustering is performed
in a single pass without re-evaluating earlier assignments, the method is computationally efficient, but
sensitive to errors in the early stages of processing.</p>
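          <p>The greedy pass described above can be summarized by the following Python sketch. It is a simplified illustration under our own assumptions rather than the isOnClust3 implementation: the data layout and the function name are hypothetical, and when several clusters pass the threshold we assume that the one with the highest fraction is chosen.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def greedy_cluster(reads, similarity_threshold):
    """Single-pass greedy clustering.

    `reads` is a list of (all_minimizers, hcs) pairs, already sorted;
    `all_minimizers` is a list (repetitions allowed) and `hcs` is a set.
    Returns a list of clusters, each with its member indices and HCS set.
    """
    clusters = []
    for index, (all_minimizers, hcs) in enumerate(reads):
        best, best_fraction = None, 0.0
        for cluster_id, cluster in enumerate(clusters):
            # Asymmetric comparison: all read minimizers vs. cluster HCS only.
            shared = sum(1 for m in all_minimizers if m in cluster["hcs"])
            fraction = shared / len(all_minimizers) if all_minimizers else 0.0
            if fraction > best_fraction:
                best, best_fraction = cluster_id, fraction
        if best is not None and best_fraction >= similarity_threshold:
            clusters[best]["members"].append(index)
            clusters[best]["hcs"] |= hcs
        else:
            # No cluster is similar enough: the read seeds a new cluster.
            clusters.append({"members": [index], "hcs": set(hcs)})
    return clusters
```
]]></preformat>
          <p>Since earlier assignments are never revisited, the result depends on the input order, which is why the reads are sorted by sketch size before this step.</p>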
        </sec>
        <sec id="sec-2-1-5">
          <title>2.2.3. Cluster merging</title>
          <p>The optional cluster merging stage is designed to address over-clustering that may arise from the
greedy nature of the algorithm or from noise in the sequencing data. Clusters are processed in order
of increasing size, ensuring that smaller clusters are evaluated first for potential merging into larger,
more established clusters. For each candidate pair, the fraction of shared HCS is computed; if this
proportion exceeds a merging threshold, the clusters are combined, and their reads are pooled into a
single, consolidated cluster. This procedure is repeated iteratively until no further merges are possible.
The merging step serves as a corrective measure, mitigating the impact of early misclassifications and
producing a final clustering that more accurately reflects the true transcriptomic structure of the dataset.</p>
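          <p>A possible reading of this merging procedure is sketched below in Python. The text does not specify the denominator of the shared fraction, so we assume it is the number of HCS of the smaller cluster; the function names and the cluster layout (a list of member indices plus a set of HCS) are likewise hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def shared_fraction(small, big):
    """Fraction of the smaller cluster's HCS also found in the bigger one (assumption)."""
    if not small["hcs"]:
        return 0.0
    return len(small["hcs"] & big["hcs"]) / len(small["hcs"])


def merge_once(clusters, merge_threshold):
    """Try to merge one small cluster into a larger one; report whether a merge happened."""
    clusters.sort(key=lambda c: len(c["members"]))
    for i, small in enumerate(clusters):
        # Candidate targets are the clusters that follow in size order, largest first.
        for big in reversed(clusters[i + 1:]):
            if shared_fraction(small, big) > merge_threshold:
                big["members"].extend(small["members"])
                big["hcs"] |= small["hcs"]
                del clusters[i]
                return True
    return False


def merge_clusters(clusters, merge_threshold):
    """Repeat the merging step until no further merge is possible."""
    work = [{"members": list(c["members"]), "hcs": set(c["hcs"])} for c in clusters]
    while merge_once(work, merge_threshold):
        pass
    return work
```
]]></preformat>
          <p>The input list is copied before merging, so the clustering produced by the previous stage remains available for comparison.</p>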
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. SolidClust</title>
        <p>Our implementation, SolidClust, follows the overall steps outlined in
Section 2.2 but deviates from its predecessor in several important ways. The main difference
is that our algorithm works on sketched sequences only, thus eliminating the need to load and sort
the whole input file. Reads are sketched as sets of HCS, which are then sorted by (decreasing) size and
stored on disk. Not having access to the original reads means we only compare the HCS of sketches
to clusters. This is in contrast to the original implementation, where all seeds of a read (including
repetitions) are compared to the HCS contained in clusters. From our preliminary investigation of the
data, considering repetitions produces an unexpected sorting of the reads, with reads containing long
stretches of As (or Ts) being placed very high in the ordering (although most of their minimizers were simple
stretches of a single nucleotide).</p>
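        <p>The effect of repetitions on the ordering can be seen on a toy example. The Python sketch below is purely illustrative: for brevity it counts all k-mers of a sequence instead of its minimizers, and the two sequences and the function name are made up for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def sketch_sizes(seq: str, k: int) -> tuple[int, int]:
    """Sketch size when repetitions are counted vs. when the sketch is a set."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(kmers), len(set(kmers))


poly_a = "ACGT" + "A" * 30      # read dominated by a homopolymer stretch
mixed = "ACGTTGCAAGCTTAGG"      # shorter read with no repeated 5-mer

print(sketch_sizes(poly_a, 5))  # (30, 5)
print(sketch_sizes(mixed, 5))   # (12, 12)
```
]]></preformat>
        <p>When repetitions are counted, the homopolymer-rich read has the larger sketch (30 versus 12) and would be processed first; when sketches are sets, it drops behind the more informative read (5 versus 12).</p>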
        <p>The other main difference is the way we compute similarities between reads and clusters. The
original isOnClust3 computes a simple Jaccard similarity between reads and clusters (albeit allowing
multiple copies of the same minimizers in the reads), whereas SolidClust only allows certain seeds to
contribute to the similarity. Our filtering strategy is based on the past history of the clustering process.
At the time of adding a read r, let M and C be the set of minimizers seen so far and the set of
clusters built so far, respectively. For each pair (m, c) with m ∈ M and c ∈ C, we count how many times minimizer m has
been added to c. For ease of visualisation, one can imagine a matrix W of counters, with minimizers as
rows and clusters as columns.</p>
        <p>A minimizer m<sub>i</sub> in r is solid w.r.t. a given cluster c<sub>j</sub> if its count is a tangible fraction of all the counts
associated to m<sub>i</sub> (i.e. a row in W). In practice, for each minimizer we check if the ratio <inline-formula><tex-math><![CDATA[W_{i,j} / \sum_{j} W_{i,j}]]></tex-math></inline-formula> is
above a user-defined solidity threshold. If yes, we consider m<sub>i</sub> to be a valid minimizer for similarity computation.
See Algorithm 1 for an overview. Clustering is then performed as before, with sketched reads merged
to the best cluster if the fraction of shared (solid) minimizers is above the similarity threshold. Cluster merging is
left as a future development and it is not used in any of the comparisons presented in Section 3.</p>
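        <p>The counting scheme and the solidity check described above can be illustrated with the following Python sketch. It is a simplified rendition under our own assumptions and not the C implementation: the function and parameter names (sim_threshold, solid_threshold) are hypothetical, the counters are kept in nested dictionaries rather than in a matrix, and ties between clusters are broken arbitrarily.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def solidclust(sketches, sim_threshold, solid_threshold):
    """Greedy clustering of sketches (sets of minimizers) using solid minimizers only.

    `sketches` is expected to be sorted by decreasing size.
    """
    counts = defaultdict(dict)  # counts[minimizer][cluster] = times added
    clusters = []
    for sketch in sketches:
        hits = defaultdict(int)
        for m in sketch:
            row = counts.get(m)
            if not row:
                continue  # minimizer never seen before
            total = sum(row.values())
            for cluster_id, n in row.items():
                # The minimizer is solid w.r.t. this cluster only if the
                # cluster holds a large enough share of its occurrences.
                if n / total > solid_threshold:
                    hits[cluster_id] += 1
        best = max(hits, key=hits.get) if hits else None
        if best is not None and hits[best] / len(sketch) > sim_threshold:
            clusters[best].append(sketch)
        else:
            clusters.append([sketch])
            best = len(clusters) - 1
        for m in sketch:
            counts[m][best] = counts[m].get(best, 0) + 1
    return clusters
```
]]></preformat>
        <p>With a solidity threshold of 0, every previously seen minimizer contributes to the similarity. Raising the threshold discards minimizers whose occurrences are spread over other clusters, which lowers the number of shared minimizers while the size of the sketch (the denominator) stays the same.</p>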
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>To evaluate SolidClust, we considered 3 real PacBio datasets with varying coverages. The datasets used
in our experiments are described in Table 1. To put our results in perspective, we compared SolidClust
with isOnClust3. We decided not to include RATTLE [21] and GeLuster [22] in our evaluation since,
according to [23], isOnClust3 proved to be the most accurate and the only approach capable of analyzing recent
long-read RNA-Seq datasets. The objective of our experimental evaluation was twofold: (i) to evaluate
the impact of the new solidity threshold and to establish the values of the SolidClust parameters (the similarity
and solidity thresholds) that maximize its clustering accuracy, and (ii) to compare the accuracy and efficiency
of SolidClust with those of isOnClust3. Our experimental evaluation can be reproduced using the
Snakemake workflow available at https://github.com/ldenti/solidclust.</p>
      <p>Algorithm 1: SolidClust algorithm.</p>
      <preformat>Data: a list S of sketches sorted by decreasing size; similarity and solidity thresholds
Result: clustering C of S
C ← [[S[0]]]
for minimizer m in S[0] do
    W[m][0] ← 1
end
i ← 1
while i &lt; |S| do
    s ← S[i]
    h ← vector of 0 of size |C|
    for minimizer m in s do
        if m already seen before then
            t ← Σ W[m]
            foreach cluster c in row W[m] do
                if W[m][c] ÷ t &gt; solidity threshold then
                    h[c] ← h[c] + 1
                end
            end
        end
    end
    c ← argmax(h)
    ratio ← h[c] / |s|
    if ratio &gt; similarity threshold then
        add s to C[c]
        for minimizer m in s do
            if m in W then
                if c in W[m] then
                    W[m][c] ← W[m][c] + 1
                else
                    W[m][c] ← 1
                end
            else
                W[m] ← ∅
                W[m][c] ← 1
            end
        end
    else
        C.push([s])
        c ← |C| − 1
        for minimizer m in s do
            W[m][c] ← 1
        end
    end
    i ← i + 1
end</preformat>
      <p>Table 1: Datasets used in the experimental evaluation (PB, ALZ, and HG002), with the number of reads and the ground-truth statistics of each dataset.</p>
      <sec id="sec-3-1">
        <title>3.1. Evaluation criteria</title>
        <p>We evaluated the accuracy of the clustering in the same way as proposed in [23]. For completeness, we
intuitively describe here the methodology used to create the ground truth and the evaluation metrics.</p>
        <p>The ground truth (i.e., the true clusters) is computed using a reference-guided clustering. We
considered the T2T reference genome [31] and we used read alignments computed with minimap2 [26]
as a proxy to assign a class (i.e., the ground truth cluster) to each read. Although such a proxy can result
in misclassifications due to misalignments and mapping artifacts, it is the benchmarking approach
used in many previous studies [23]. A read is denoted as “unclassified” if it could not be aligned. A
true cluster is defined as “non-singleton” (NS) if it contains more than one read, “singleton” otherwise.
Table 1 reports the results of this ground truth creation for the datasets considered in our evaluation.</p>
        <p>As in [23], we used the V-measure [32] and the Adjusted Rand Index [33] to evaluate the clustering
accuracy. We give here the intuition behind these metrics, referring the reader to the corresponding
papers for their formal definition. The V-measure (V) is computed as the harmonic mean between
homogeneity (h) and completeness (c). A clustering is homogeneous if its clusters contain reads coming
from the same class, whereas it is complete if reads coming from the same class are assigned to the
same cluster. All these metrics range from 0 to 1. A homogeneity of 1 indicates that all clusters
contain only members of a single class, whereas a completeness of 1 indicates that all members of a
given class are assigned to the same cluster. In other terms, completeness penalizes over-clustering
(a class split across several clusters) and homogeneity penalizes under-clustering (several classes merged
into one cluster). The Adjusted Rand Index (ARI), instead, measures the
percentage of correct pairings of elements (reads), corrected for chance agreements. The ARI evaluates
the similarity between two different clusterings and provides a score in the range [−1, 1], where 1
indicates perfect agreement, 0 indicates random clustering, and negative values indicate a clustering
worse than random. These measures are computed using the same Python script used in the evaluation
described in the isOnClust3 paper.</p>
        <p><bold>3.2. Impact of the similarity and solidity thresholds</bold></p>
        <p>We first focused on the smallest dataset (PB) and evaluated the impact of the similarity threshold and
the solidity threshold on the clustering accuracy of SolidClust. To this end, we ran SolidClust with different
combinations of the two parameters, namely a similarity threshold in {0.1, 0.25, 0.5} and a solidity threshold
in {0, 0.1, 0.25, 0.33, 0.5, 0.9}. We decided to focus on
these two parameters since the solidity threshold is the new parameter introduced in SolidClust and the
similarity threshold is highly affected by its choice. Indeed, by considering only solid minimizers, we may end up with a lower similarity
between a read and a cluster, since we might reduce the number of shared minimizers (numerator in
the Jaccard similarity formula) but the total number of minimizers from the read (the denominator)
does not change. We did not test the other parameters (minimizer size k, window length w, and quality
threshold) since we believe that they do not have a great impact on the results of SolidClust. As
in [23], we set k = 15, w = 51, and the quality threshold to 0.98 (since the dataset is a PacBio dataset). To allow for a better
comparison, we also ran the original isOnClust3 varying its similarity threshold, which is
independent of the sequencing technology and, by default, is set to 0.5. Figure 1 reports the results of
this analysis in terms of V-measure and ARI. Table 2 reports the full results, including completeness,
homogeneity, and number of clusters reported by each approach.</p>
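        <p>The relation between homogeneity, completeness, and V-measure introduced in Section 3.1 can be checked on toy labelings with the following Python sketch. It follows the standard entropy-based definitions of these metrics and is not the evaluation script used in our experiments; the function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """Entropy of labeling `a` given labeling `b`."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * log(c / b_counts[y]) for (_, y), c in joint.items())


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V) for true classes vs. predicted clusters."""
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    hom = 1.0 if h_classes == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_classes
    com = 1.0 if h_clusters == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_clusters
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v
```
]]></preformat>
        <p>With two classes of two reads each, placing every read in its own cluster keeps homogeneity at 1 but halves completeness, whereas placing all reads in a single cluster keeps completeness at 1 and brings homogeneity to 0.</p>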
        <p>Overall, in terms of V-measure, the two thresholds do not greatly affect the accuracy of SolidClust. SolidClust
always achieves a very high V-measure, indicating that its clustering is complete and homogeneous. The
completeness of the SolidClust clustering starts to drop when the similarity threshold is 0.5 and the solidity
threshold is greater than 0.25, lowering the V-measure.
This is somewhat expected since using a high value for both parameters reduces the likelihood for a read
to be included in an already created cluster, as shown by the number of clusters reported by SolidClust
(Table 2). The same trend can be observed when considering the ARI (which drops from 0.810
to 0.153 when moving from a solidity threshold of 0.25 to 0.33 with a similarity threshold of 0.5). However, the choice of the
solidity threshold seems to have a greater impact on this measure. Indeed, increasing the solidity threshold results
in a higher ARI. This holds up to the point (combination of the two thresholds) where the ARI remains stable or drops
significantly.</p>
        <p>When comparing SolidClust to isOnClust3, we can clearly see that isOnClust3 achieves higher
accuracy when its similarity threshold is set to 0.5 (the default value). However, even in this case, SolidClust is able
to achieve higher clustering accuracy for some combinations of its two thresholds. Remarkably,
SolidClust is able to achieve good clustering accuracy even with a low similarity threshold (when the
solidity threshold is correctly set). Overall, when using the default similarity threshold in isOnClust3
and correctly setting the two SolidClust parameters (which is not a straightforward task), the two
approaches achieve comparable results.</p>
        <p>Finally, we notice that the clustering accuracy of isOnClust3 and that of SolidClust (run with a solidity threshold of 0)
differ greatly, especially for smaller values of the similarity threshold. We recall that when the solidity threshold is set to 0, SolidClust considers
all minimizers from the reads and not only solid minimizers. Although we were expecting the two
approaches to yield similar results in this setting, we believe that the different way the two approaches
handle repetitions (i.e., those minimizers repeated more than once in a read) is the reason behind
this discrepancy. This is additional evidence that, although SolidClust follows the same greedy
methodology as isOnClust3, it can be considered a novel method that exhibits its own
peculiarities (described in Section 2 and here experimentally validated).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Clustering accuracy and efficiency</title>
        <p>We then focused on the bigger datasets and evaluated the clustering accuracy and efficiency of
SolidClust. We first considered the ALZ dataset and ran SolidClust using a similarity threshold in {0.25, 0.5} and a
solidity threshold in {0.1, 0.25, 0.5}. We decided to limit our analysis to these values since we believe they are the values
that could provide the best accuracy results, based on the results presented in the previous section.
Table 3 reports the results of this analysis.</p>
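        <p>The best clustering accuracy is achieved by SolidClust when the similarity threshold is set to 0.25 and the solidity threshold to 0.1. With this
combination, SolidClust is able to achieve a V-measure of 0.878 (+0.017 w.r.t. the original isOnClust3
run with a similarity threshold of 0.5) and an ARI of 0.701 (+0.038). Surprisingly, all other tested combinations of the two
thresholds provide worse results. In particular, the number of singleton clusters computed by SolidClust is
extremely high. This is in contrast to what we saw when considering the smaller PB dataset. We
therefore believe that the choice of the two parameters should also depend on the expected coverage of
the input sample and the expected number of clusters (e.g., how many genes have been sequenced).
Remarkably, the best run of SolidClust (similarity threshold 0.25, solidity threshold 0.1) was twice as fast as the original isOnClust3
while requiring about 1/3 of the RAM (11GB instead of 37GB). We note that the efficiency of SolidClust
seems directly tied to the number of clusters computed: the more clusters, the longer the
running time and the higher the memory requirement. However, it is not fully clear how the number
of clusters produced in output is affected by the two tested parameters. Investigating this
relationship is a compelling future direction that we plan to explore.</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption><p>V-measure (V), completeness (c), homogeneity (h), and ARI of SolidClust on the PB dataset for each tested combination of similarity and solidity thresholds.</p></caption>
          <table>
            <thead>
              <tr><th>Similarity threshold</th><th>Solidity threshold</th><th>V</th><th>c</th><th>h</th><th>ARI</th></tr>
            </thead>
            <tbody>
              <tr><td>0.1</td><td>0</td><td>0.874</td><td>0.934</td><td>0.821</td><td>0.185</td></tr>
              <tr><td>0.1</td><td>0.1</td><td>0.923</td><td>0.945</td><td>0.902</td><td>0.485</td></tr>
              <tr><td>0.1</td><td>0.25</td><td>0.930</td><td>0.936</td><td>0.925</td><td>0.646</td></tr>
              <tr><td>0.1</td><td>0.33</td><td>0.928</td><td>0.928</td><td>0.927</td><td>0.643</td></tr>
              <tr><td>0.1</td><td>0.5</td><td>0.924</td><td>0.915</td><td>0.933</td><td>0.668</td></tr>
              <tr><td>0.1</td><td>0.9</td><td>0.898</td><td>0.845</td><td>0.957</td><td>0.703</td></tr>
              <tr><td>0.25</td><td>0</td><td>0.908</td><td>0.932</td><td>0.885</td><td>0.307</td></tr>
              <tr><td>0.25</td><td>0.1</td><td>0.927</td><td>0.939</td><td>0.914</td><td>0.496</td></tr>
              <tr><td>0.25</td><td>0.25</td><td>0.928</td><td>0.917</td><td>0.938</td><td>0.647</td></tr>
              <tr><td>0.25</td><td>0.33</td><td>0.926</td><td>0.908</td><td>0.945</td><td>0.692</td></tr>
              <tr><td>0.25</td><td>0.5</td><td>0.924</td><td>0.900</td><td>0.950</td><td>0.700</td></tr>
              <tr><td>0.25</td><td>0.9</td><td>0.887</td><td>0.815</td><td>0.974</td><td>0.740</td></tr>
              <tr><td>0.5</td><td>0</td><td>0.920</td><td>0.918</td><td>0.923</td><td>0.339</td></tr>
              <tr><td>0.5</td><td>0.1</td><td>0.927</td><td>0.880</td><td>0.980</td><td>0.590</td></tr>
              <tr><td>0.5</td><td>0.25</td><td>0.917</td><td>0.858</td><td>0.985</td><td>0.810</td></tr>
              <tr><td>0.5</td><td>0.33</td><td>0.871</td><td>0.776</td><td>0.992</td><td>0.153</td></tr>
              <tr><td>0.5</td><td>0.5</td><td>0.846</td><td>0.735</td><td>0.995</td><td>0.105</td></tr>
              <tr><td>0.5</td><td>0.9</td><td>0.794</td><td>0.659</td><td>0.999</td><td>0.035</td></tr>
            </tbody>
          </table>
        </table-wrap>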
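        <p>We finally analyzed the biggest dataset (HG002), comprising 37 million reads. Following the results on
the ALZ dataset, we ran SolidClust setting the similarity threshold to 0.25 and the solidity threshold to 0.1 and 0.25. Unfortunately, we did not
manage to run isOnClust3 on our cluster due to, we suspect, some I/O issue (the process got stuck in
uninterruptible sleep state). For this reason, we report the results (accuracy and efficiency) presented
in the original paper [23]. Results of this analysis are presented in Table 4. SolidClust achieved
the best clustering accuracy, but this time there is no clear winner between the two tested
combinations of thresholds. Indeed, one of the two achieved the highest V-measure whereas the other
achieved the highest ARI. In any case, both combinations achieved a very high ARI compared to the
original isOnClust3 (+0.233/+0.257). However, similarly to the ALZ dataset, setting the solidity threshold to 0.25 produced
a very high number of singleton clusters and increased the running time of SolidClust, making
it slower than isOnClust3. For this reason, we believe that the best combination of parameters is a similarity
threshold of 0.25 and a solidity threshold of 0.1. In this setting, SolidClust achieved very high clustering accuracy
while being 1.6 times faster and requiring 4.85 times less memory than isOnClust3. We note that
increasing the value of the two thresholds substantially impacts the running time of SolidClust but
not its memory requirements.</p>
        <table-wrap id="tab-3">
          <label>Table 3</label>
          <caption><p>V-measure (V), completeness (c), and homogeneity (h) of SolidClust on the ALZ dataset for each tested combination of similarity and solidity thresholds.</p></caption>
          <table>
            <thead>
              <tr><th>Similarity threshold</th><th>Solidity threshold</th><th>V</th><th>c</th><th>h</th></tr>
            </thead>
            <tbody>
              <tr><td>0.25</td><td>0.1</td><td>0.878</td><td>0.787</td><td>0.994</td></tr>
              <tr><td>0.25</td><td>0.25</td><td>0.843</td><td>0.731</td><td>0.996</td></tr>
              <tr><td>0.25</td><td>0.5</td><td>0.771</td><td>0.628</td><td>0.999</td></tr>
              <tr><td>0.5</td><td>0.1</td><td>0.804</td><td>0.673</td><td>0.999</td></tr>
              <tr><td>0.5</td><td>0.25</td><td>0.733</td><td>0.579</td><td></td></tr>
              <tr><td>0.5</td><td>0.5</td><td>0.696</td><td>0.534</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Table 4: Clustering accuracy and efficiency (time and RAM) on the HG002 dataset. The two SolidClust runs achieved a V-measure of 0.881 and 0.868.</p>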
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>We presented SolidClust, an approach for the de novo clustering of third-generation transcriptomic
datasets. Heavily inspired by its predecessor (isOnClust3), SolidClust implements a greedy algorithm
that iteratively merges reads based on the fraction of minimizers shared between a read and the already
created clusters. The main novelty of SolidClust is the usage of solid minimizers: when a read is
compared to a cluster, SolidClust does not use all minimizers of the read but it filters out all minimizers
that potentially come from sequencing errors, thus considered non solid. As shown in our experimental
evaluation, SolidClust is able to achieve very high clustering accuracy and is as fast as its predecessor
for a comparable number of clusters while requiring less memory.</p>
      <p>In this work, we considered 3 PacBio datasets and we analyzed the impact of two of the SolidClust
parameters (the similarity and solidity thresholds) on its accuracy and efficiency. Future work will be devoted to evaluating SolidClust
on ONT datasets, which comprise longer and (usually) less accurate reads than the PacBio datasets used
in this evaluation. An additional compelling future direction consists in devising an automated way to
select the best values for these two parameters. Although we believe that this choice mainly depends
on the sequencing technology (hence the expected error rate), the coverage of the datasets, and the
expected number of clusters (i.e., the number of sequenced genes), being able to provide automatic
parameter selection requires further investigation and additional experiments.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research work has received funding from the European Union’s Horizon programme under
the Horizon Europe grant agreement (ASVA-CGR No. 101180581 to L.D.). This work has also been
supported by the European Union’s Horizon 2020 research and innovation programme under the Marie
Skłodowska-Curie grant agreement PANGAIA No. 872539 (Y.S.).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Writefull and GPT-4o mini in order to: (i)
Paraphrase and reword and (ii) Grammar and spelling check. After using these tool(s)/service(s), the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sveen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kilpinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ruusulehto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Skotheim</surname>
          </string-name>
          ,
          <article-title>Aberrant RNA splicing in cancer; expression changes and driver mutations of splicing factor genes</article-title>
          ,
          <source>Oncogene</source>
          <volume>35</volume>
          (
          <year>2016</year>
          )
          <fpage>2413</fpage>
          -
          <lpage>2427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Bonnal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>López-Oreja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Valcárcel</surname>
          </string-name>
          ,
          <article-title>Roles and mechanisms of alternative splicing in cancer-implications for care</article-title>
          ,
          <source>Nature Reviews Clinical Oncology</source>
          <volume>17</volume>
          (
          <year>2020</year>
          )
          <fpage>457</fpage>
          -
          <lpage>474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Biamonti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Belloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Di Matteo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Infantino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pradella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ghigna</surname>
          </string-name>
          ,
          <article-title>Alternative splicing in Alzheimer's disease</article-title>
          ,
          <source>Aging Clinical and Experimental Research</source>
          <volume>33</volume>
          (
          <year>2021</year>
          )
          <fpage>747</fpage>
          -
          <lpage>758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhadra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Howell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heintz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Mair</surname>
          </string-name>
          ,
          <article-title>Alternative splicing in aging and longevity</article-title>
          ,
          <source>Human Genetics</source>
          <volume>139</volume>
          (
          <year>2020</year>
          )
          <fpage>357</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Yeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Holste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kreiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Burge</surname>
          </string-name>
          ,
          <article-title>Variation in alternative splicing across human tissues</article-title>
          ,
          <source>Genome Biology</source>
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Dewey</surname>
          </string-name>
          ,
          <article-title>RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Trapnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pertea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kelley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pimentel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Rinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pachter</surname>
          </string-name>
          ,
          <article-title>Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks</article-title>
          ,
          <source>Nature Protocols</source>
          <volume>7</volume>
          (
          <year>2012</year>
          )
          <fpage>562</fpage>
          -
          <lpage>578</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-x.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Henry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>111</volume>
          (
          <year>2014</year>
          )
          <fpage>E5593</fpage>
          -
          <lpage>E5601</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pertea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Pertea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Antonescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Mendell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          ,
          <article-title>StringTie enables improved reconstruction of a transcriptome from RNA-seq reads</article-title>
          ,
          <source>Nature Biotechnology</source>
          <volume>33</volume>
          (
          <year>2015</year>
          )
          <fpage>290</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Denti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beretta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Della Vedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Previtali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonizzoni</surname>
          </string-name>
          ,
          <article-title>ASGAL: aligning RNA-Seq data to a splicing graph to detect novel alternative splicing events</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>19</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Sibbesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Eizenga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Novak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sirén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Garrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Paten</surname>
          </string-name>
          ,
          <article-title>Haplotype-aware pantranscriptome analyses using spliced pangenome graphs</article-title>
          ,
          <source>Nature Methods</source>
          <volume>20</volume>
          (
          <year>2023</year>
          )
          <fpage>239</fpage>
          -
          <lpage>247</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ciccolella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Della Vedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Kuria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Denti</surname>
          </string-name>
          ,
          <article-title>Differential quantification of alternative splicing events on spliced pangenome graphs</article-title>
          ,
          <source>PLOS Computational Biology</source>
          <volume>20</volume>
          (
          <year>2024</year>
          )
          <fpage>e1012665</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Byrne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Volden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vollmers</surname>
          </string-name>
          ,
          <article-title>Realizing the potential of full-length transcriptome sequencing</article-title>
          ,
          <source>Philosophical Transactions of the Royal Society B</source>
          <volume>374</volume>
          (
          <year>2019</year>
          )
          <fpage>20190097</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kovaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Zimin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Pertea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Razaghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pertea</surname>
          </string-name>
          ,
          <article-title>Transcriptome assembly from long-read RNA-seq alignments with StringTie2</article-title>
          ,
          <source>Genome Biology</source>
          <volume>20</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kingsford</surname>
          </string-name>
          ,
          <article-title>Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads</article-title>
          ,
          <source>Genome Biology</source>
          <volume>20</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Orabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McConeghy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chauve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hach</surname>
          </string-name>
          ,
          <article-title>Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>51</volume>
          (
          <year>2023</year>
          )
          <fpage>e11</fpage>
          -
          <lpage>e11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Prjibelski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikheenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joglekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smetanin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jarroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Lapidus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. U.</given-names>
            <surname>Tilgner</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>