<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SequenceAligner: A High-Performance Tool for Large-Scale All-Versus-All Pairwise Sequence Alignment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jakov Dragičević</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Otović</string-name>
          <email>erik.otovic@uniri.hr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Goran Mauša</string-name>
          <email>goran.mausa@uniri.hr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Maribor, Slovenia</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Rijeka, Center for Artificial Intelligence and Cybersecurity</institution>
          ,
          <addr-line>Rijeka</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Rijeka, Faculty of Engineering</institution>
          ,
          <addr-line>Rijeka</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Sequence alignment is an indispensable technique in bioinformatics. It facilitates the comparative analysis of biological sequences for evolutionary studies, drug discovery, protein function prediction, as well as data filtering based on similarity and results analysis in machine learning-based methodologies. Despite the range of available alignment libraries and tools, researchers still face significant challenges when implementing large-scale all-versus-all pairwise sequence alignment workflows, particularly when a complete solution is required to process datasets that exceed available memory. The SequenceAligner software offers an open-source solution tailored for high-performance all-versus-all sequence alignment, emphasizing convenience, efficiency, scalability, and accessibility. It provides exact dynamic programming solutions for the Needleman-Wunsch, Smith-Waterman, and Gotoh algorithms with configurable substitution matrices and gap penalty models delivered as a complete workflow rather than merely a library. Moreover, the algorithms are implemented for both CPUs and CUDA-compatible GPUs, enabling efficient parallel sequence alignment. Performance benchmarks on three peptide datasets demonstrate that SequenceAligner achieves alignment rates ranging from 3.2 to 22.9 million sequence pairs per second on a consumer-grade AMD Ryzen 7 5700X3D CPU, and from 27.4 to 80.8 million pairs per second with CUDA acceleration on a consumer-grade NVIDIA GeForce RTX 4060 GPU. Notably, the software maintains memory efficiency even when the similarity matrix exceeds the available RAM or VRAM. Finally, the software underscores modularity and extensibility through its simple C99 codebase, rendering it accessible for both research applications and educational purposes. SequenceAligner is publicly available in a GitHub repository at https://github.com/jakovdev/SequenceAligner.</p>
      </abstract>
      <kwd-group>
        <kwd>Biological sequences</kwd>
        <kwd>Sequence alignment</kwd>
        <kwd>Sequence similarity</kwd>
        <kwd>High-performance computing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sequence alignment forms the cornerstone of computational biology, enabling researchers to identify
evolutionary relationships, predict protein structure and function, and discover conserved motifs across
biological sequences [1, 2]. As biological databases continue to expand exponentially, with repositories
like UniProt [3] containing over 200 million protein sequences and GenBank [4] housing billions of
nucleotide sequences, the computational demands for comprehensive sequence analysis have grown
proportionally. This growth necessitates efficient computational workflows capable of processing
millions of sequence comparisons while maintaining accuracy and algorithmic flexibility.</p>
      <p>Contemporary bioinformatics research increasingly requires a comprehensive analysis of sequence
similarity through pairwise sequence alignment, which involves aligning and comparing two biological
sequences to identify regions that may indicate functional, structural, or evolutionary relationships.
Proteomics research requires all-versus-all comparisons for protein family classification and evolutionary
relationship identification [5]. Drug discovery applications utilize all-versus-all similarity matrices for
target identification and side effect prediction [6]. Comparative genomics studies depend on precise
alignment scores for ortholog detection across species [7].</p>
      <sec id="sec-1-1">
        <p>Sequence alignment also plays a crucial role in data preparation for machine learning-based peptide
activity prediction. To prevent biased predictions caused by large clusters of highly similar sequences,
the dataset is filtered prior to model training by removing redundant peptides, ensuring that no two
peptides exceed a customary similarity threshold [8, 9, 10, 11, 12]. Percentage sequence identity (<inline-formula><tex-math>\mathit{PSI}</tex-math></inline-formula>),
shown in Equation 1, is often used to quantify the similarity between two sequences by aligning them
and computing the percentage of matching residues (<inline-formula><tex-math>\mathit{match}</tex-math></inline-formula>) with respect to the length of the shorter
sequence, where sequence lengths are indicated by <inline-formula><tex-math>L_1</tex-math></inline-formula> and <inline-formula><tex-math>L_2</tex-math></inline-formula>. Furthermore, the application of
pairwise sequence alignment is not limited only to the data preparation phase, as it can also support the
analysis of in silico generated sequences. For example, recently it has been used in a pipeline combining
machine learning and a genetic algorithm to analyze generated peptide sequences and identify motifs
within them [13].</p>
        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>\mathit{PSI} = \frac{\mathit{match}}{\min(L_1, L_2)}</tex-math>
          </disp-formula>
        </p>
        <p>Although some applications, such as dataset filtration for machine learning, can tolerate heuristic
approximations to avoid exhaustive all-versus-all alignments and improve execution time, exact
alignments remain essential in other areas of bioinformatics [2]. The fundamental computational challenge
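in all-versus-all pairwise sequence alignment lies in its inherent complexity. To align sequences of
lengths <inline-formula><tex-math>m</tex-math></inline-formula> and <inline-formula><tex-math>n</tex-math></inline-formula>, commonly used dynamic programming algorithms require <inline-formula><tex-math>O(mn)</tex-math></inline-formula> time and space.
When extended to all-versus-all comparisons of <inline-formula><tex-math>N</tex-math></inline-formula> sequences, this complexity scales to
<inline-formula><tex-math>O(N^2 \cdot \bar{L}^2)</tex-math></inline-formula>, where <inline-formula><tex-math>\bar{L}</tex-math></inline-formula> represents the average sequence length. For large datasets, such implementations
quickly become computationally prohibitive when no optimization strategies are implemented.</p>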
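        <p>As a rough worked example under illustrative assumptions (the values below are hypothetical and are not taken from the benchmark datasets), consider N = 100,000 sequences with an average length of 20 residues. The number of unordered sequence pairs and the approximate number of dynamic programming cells are then:</p>
        <p>
          <disp-formula>
            <tex-math><![CDATA[
\frac{N(N-1)}{2} = \frac{10^{5} \cdot (10^{5} - 1)}{2} = 4\,999\,950\,000 \approx 5.0 \times 10^{9} \ \text{pairs},
\qquad
\frac{N(N-1)}{2} \cdot \bar{L}^{2} \approx 5.0 \times 10^{9} \cdot 400 = 2.0 \times 10^{12} \ \text{cells}
]]></tex-math>
          </disp-formula>
        </p>
        <p>Even for short peptides, the quadratic growth in the number of pairs dominates the cost, which is why the per-pair overhead and the storage of the resulting similarity matrix matter as much as the alignment kernel itself.</p>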
        <p>
          While modern CPU and GPU architectures offer substantial performance potential through
vectorization, parallelization, and advanced memory hierarchies, taking advantage of these features requires
low-level optimization and detailed knowledge of both hardware and algorithm design [14, 15], which
many researchers lack the time or expertise to develop independently. The barrier to entry for
implementing high-performance all-versus-all alignment workflows therefore remains
substantial. Researchers must navigate complex decisions regarding algorithm selection, optimization
strategies, memory management, and result storage while ensuring correctness and reproducibility.
This complexity often forces compromises between computational feasibility and analysis completeness,
limiting the scope of possible research investigations.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>The landscape of sequence alignment tools presents researchers with a fragmented ecosystem in
which no single solution addresses the complete workflow requirements for large-scale all-versus-all
analyses. This fragmentation highlights the critical importance of open-source software development
in computational biology, where transparency, reproducibility, and community-driven improvements
are essential for scientific progress.</p>
      <p>Available alignment libraries offer robust foundational components, but integrating them into
complete and efficient workflows often requires substantial additional programming effort from users.
Parasail [16], for example, is a high-quality implementation of vectorized pairwise alignment
algorithms, featuring numerous substitution matrices and multi-language APIs. While Parasail delivers
high performance for individual alignments, it focuses solely on core alignment routines.</p>
      <p>BioPython [17] and Rust-Bio [18] provide alignment functionality as part of broader bioinformatics
frameworks, offering seamless integration with existing computational pipelines. The Python library
scikit-bio [19] follows a similar approach, providing data structures, algorithms, and educational
resources as part of a general-purpose bioinformatics toolkit. While these libraries offer convenient
interfaces and broad functionality, they prioritize versatility over the specialized performance required
for large-scale, all-versus-all analyses.</p>
      <p>The challenge extends beyond algorithmic implementation to encompass the complete computational
workflow. Researchers implementing all-versus-all alignment workflows must address sequence file
parsing and validation, memory management for sequences, multithreading the all-versus-all workflow,
progress monitoring for long-running computations and similarity matrix storage that may exceed
available RAM. Building these components from scratch using existing libraries requires substantial
software development expertise and time that could be better spent on biological research. Existing
libraries also lack comprehensive support for heterogeneous computing environments. While some
libraries provide CPU optimizations, GPU acceleration is limited only to pairwise alignments and as
such it does not scale well with large all-versus-all operations. This limitation prevents researchers from
fully utilizing available computational resources without implementing complex device management
logic.</p>
      <p>The algorithmic flexibility requirements in research often necessitate modifications to scoring schemes,
gap penalty models, or alignment strategies. While high-performance libraries may provide extensive
configuration options, they typically cannot accommodate novel algorithmic variants without significant
modification to library internals. This limitation creates a trade-off between performance and flexibility
that constrains research possibilities.</p>
      <p>Memory management represents another critical challenge for large-scale analyses. Libraries
optimized for individual alignments may not provide strategies for handling similarity matrices that exceed
available system memory. Researchers must implement custom solutions for memory-mapped storage,
result streaming, or distributed computation approaches to handle datasets that produce gigabytes or
terabytes of results.</p>
      <p>The fragmented nature of current tools creates additional barriers through inconsistent interfaces,
conflicting dependencies, and complex build processes. Researchers must navigate multiple APIs,
manage dependency compatibility issues, and often require specialized build environments to compile and
deploy their workflows. These technical barriers particularly affect researchers in resource-constrained
environments or those without dedicated computational support.</p>
    </sec>
    <sec id="sec-3">
      <title>3. SequenceAligner Tool</title>
      <p>SequenceAligner addresses the limitations of the aforementioned tools through a comprehensive
approach that prioritizes four core design principles: convenience through complete workflow
integration, efficiency through specialized all-versus-all optimizations, scalability through adaptive
memory management, and adaptability through simple codebase design. The implementation provides a
ready-to-use solution capable of utilizing modern CPU and CUDA-compatible GPU architectures that
eliminates the need for researchers to assemble complex toolchains while maintaining the flexibility
necessary for scientific research. Even though the developed tool is primarily intended for sequence
similarity–based dataset filtration in machine learning pipelines for peptide activity prediction, where
peptide sequences are typically short and consist of up to 50 amino acid residues, it imposes no
limitation on sequence length and can also be employed in other applications. SequenceAligner and
scripts necessary to replicate the results from this study are publicly available in a GitHub repository
at https://github.com/jakovdev/SequenceAligner.</p>
      <sec id="sec-3-1">
        <title>3.1. Convenience: Complete Workflow Integration</title>
        <p>Unlike software libraries that provide only algorithmic components, SequenceAligner delivers a complete
end-to-end workflow optimized for all-versus-all sequence alignment. The software handles sequence
file parsing with automatic validation, memory-efficient sequence storage using custom pool allocators,
comprehensive result management with HDF5 [20] output format, progress monitoring with detailed
performance metrics, and cross-platform deployment with automated build scripts.</p>
        <p>The software provides comprehensive substitution matrix support with 65 amino acid matrices and 2
nucleotide matrices sourced from Parasail [16]. A Python utility script automates the extraction and
conversion of these matrices into C-compatible format, enabling easy integration and potential reuse in
other implementations. The supported matrices include the complete BLOSUM series [21] (BLOSUM30
through BLOSUM100), the PAM series [22] (PAM10 through PAM500), as well as nucleotide matrices
DNAFULL and NUC44 from BLAST [23]. The tool also offers full support for configurable gap penalty
models. Its flexible implementation can serve as a foundation for researchers wishing to extend, modify,
or implement custom scoring schemes or alignment variants.</p>
        <p>In addition to alignment, the software includes built-in functionality for similarity-based dataset
filtering, enabling the removal of redundant sequences to reduce bias in machine learning models for
peptide activity prediction.</p>
        <p>The filtering algorithm operates by iteratively processing sequences and comparing each candidate
sequence against all previously accepted sequences in the filtered set. For each comparison, <inline-formula><tex-math>\mathit{PSI}</tex-math></inline-formula> is
computed on-demand by counting exact character matches between the sequences and dividing by the
length of the shorter sequence. A candidate sequence is added to the filtered set only if its similarity
to all existing sequences in the set remains below a user-defined threshold. This guarantees that the
resulting dataset contains no pair of sequences with similarity above the threshold. The algorithm has a
time complexity of <inline-formula><tex-math>O(N^2)</tex-math></inline-formula> in the worst case, where <inline-formula><tex-math>N</tex-math></inline-formula> is the number of input sequences. The procedure
is inherently sequential, as the decision for each sequence depends on the previously selected sequences.
The simple character-by-character comparison is efficiently vectorized using SIMD instructions when
available, maintaining reasonable performance for practical dataset sizes.</p>
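        <p>The greedy filtering procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's C implementation: the function names, the expression of the threshold as a fraction between 0 and 1, and the position-by-position comparison without prior alignment are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def percent_identity(a: str, b: str) -> float:
    """Position-wise exact matches divided by the length of the shorter sequence."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def filter_redundant(sequences, threshold):
    """Keep a candidate only if it stays below the threshold against
    every sequence accepted so far (hypothetical helper)."""
    kept = []
    for candidate in sequences:
        if all(percent_identity(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


# Toy input: the second and fourth peptides are too similar to the first one.
peptides = ["ACDEFG", "ACDEFH", "KLMNPQ", "ACD"]
filtered = filter_redundant(peptides, threshold=0.8)
```
]]></preformat>
        <p>Because each decision depends on the sequences already accepted, the outer loop cannot be parallelized directly, and the result depends on the input order.</p>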
        <p>The command-line interface enables flexible parameter configuration, with sensible default settings
for typical use cases and well-documented options for advanced customization. Built-in parameter
validation helps ensure correct configuration, and detailed runtime reporting provides insight into
computational performance and resource usage.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Efficiency: Specialized All-Versus-All Optimizations</title>
        <p>SequenceAligner implements optimizations tailored for all-versus-all comparison scenarios, rather than
generic pairwise alignment tasks. The CPU implementation features targeted SIMD enhancements with
support for AVX512, AVX2, and SSE2, primarily optimizing vectorized conversion of character sequences
into pre-mapped integer representations and efficient matrix initialization. This preprocessing step
transforms raw nucleotide or amino acid sequences into index arrays using lookup tables, eliminating
per-character translation during alignment and reducing runtime overhead. Although the core dynamic
programming loops rely on compiler auto-vectorization rather than explicit SIMD intrinsics, the
implementation incorporates memory prefetching and stack-based allocation for small matrices to
improve cache efficiency. This preprocessing is particularly advantageous when the same sequences are
involved in numerous alignment comparisons, as the time required for retrieval from cache is negligible
in comparison to the alignment process.</p>
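        <p>The idea of pre-mapping residues to integer indices can be illustrated with a small Python sketch. The alphabet order, the helper names, and the toy identity-style scoring table are hypothetical and serve only to show why the lookup is done once per sequence instead of once per comparison.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical residue order; a real substitution matrix defines its own.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
LOOKUP = {residue: index for index, residue in enumerate(ALPHABET)}

# Toy scoring table indexed by residue index: +1 on the diagonal, -1 elsewhere.
SIZE = len(ALPHABET)
TOY_MATRIX = [[1 if a == b else -1 for b in range(SIZE)] for a in range(SIZE)]


def encode(sequence: str) -> list:
    """Translate characters to matrix indices once, before any alignment."""
    return [LOOKUP[residue] for residue in sequence]


def score(a_index: int, b_index: int) -> int:
    """Score lookup during alignment needs no character translation."""
    return TOY_MATRIX[a_index][b_index]


encoded = encode("ACD")
```
]]></preformat>
        <p>In an all-versus-all run each sequence takes part in many alignments, so the one-time encoding cost is amortized over all of them.</p>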
        <p>A key efficiency enhancement is the implementation of a specialized memory pool allocator designed
specifically for sequence storage. The allocator uses a linked-list structure of memory blocks, each
sized at 4 MiB, to provide cache-friendly allocation patterns while minimizing memory fragmentation.
Each 4 MiB block is allocated using huge page allocation when available on Linux systems, leveraging
the madvise(MADV_HUGEPAGE) system call to request transparent huge pages that reduce translation
lookaside buffer misses and improve memory access latency [24].</p>
        <p>The pool allocator operates through a bump-pointer strategy within each block, where allocations
advance a pointer through the available space without complex bookkeeping overhead. When a block
becomes full, a new 4 MiB block is allocated and linked to the chain. All sequence string data is allocated
contiguously within these blocks with 8-byte alignment, ensuring that sequences accessed sequentially
during alignment operations exhibit improved spatial locality and cache utilization across different
processor architectures.</p>
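        <p>The bump-pointer strategy can be modeled conceptually as follows. This Python sketch only mirrors the bookkeeping (fixed-size blocks, an advancing offset, 8-byte alignment); the class and method names are hypothetical, a Python list stands in for the linked list of blocks, and huge-page allocation is not represented.</p>
        <preformat preformat-type="code"><![CDATA[
```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, as described in the text
ALIGNMENT = 8                 # bytes


class SequencePool:
    """Conceptual model of a bump-pointer pool (hypothetical names)."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = [bytearray(block_size)]
        self.offset = 0  # bump pointer within the newest block

    def store(self, data: bytes):
        # Round the reserved size up to the next multiple of ALIGNMENT.
        size = (len(data) + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT
        if size > self.block_size:
            raise ValueError("sequence does not fit into a single block")
        if self.offset + size > self.block_size:
            # Current block is full: chain a new block and reset the pointer.
            self.blocks.append(bytearray(self.block_size))
            self.offset = 0
        start = self.offset
        self.blocks[-1][start:start + len(data)] = data
        self.offset += size  # allocation is just a pointer advance
        return len(self.blocks) - 1, start

    def load(self, handle, length: int) -> bytes:
        block, start = handle
        return bytes(self.blocks[block][start:start + length])


pool = SequencePool()
handle = pool.store(b"ACDEFGHIK")
```
]]></preformat>
        <p>There is no per-allocation free operation in this model; the whole pool is released at once when the computation ends, which is what keeps the bookkeeping trivial.</p>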
        <p>This design benefits all-versus-all alignment workloads where the same sequences are repeatedly
accessed throughout the computation. Since sequence data remains in the pool throughout the entire
alignment process, subsequent accesses benefit from cache residency, reducing memory bandwidth
requirements and improving computational throughput. The pool allocator eliminates the overhead
of frequent malloc/free operations while providing deterministic memory access patterns suited for
high-performance computing applications.</p>
        <p>The multithreaded implementation employs an adaptive batch-based work distribution strategy
that balances load distribution with cache efficiency considerations. Rather than assigning individual
sequence pairs to threads, which would incur significant synchronization overhead, the system
distributes work in batches of consecutive rows from the upper triangular alignment matrix. Each thread
acquires a mutex-protected batch of rows to process, with batch sizes dynamically adjusted based on
the remaining work and number of active threads.</p>
        <p>The work distribution algorithm implements adaptive batch sizing with two distinct phases. In the
initial phase, covering the first 90% of sequence pairs, batch sizes equal the number of threads to ensure
balanced workload distribution across all available cores. As the computation approaches completion
and fewer rows remain, batch sizes are halved to prevent thread starvation and ensure that all threads
remain productive until completion. This adaptive approach addresses the inherent load imbalance in
triangular matrix computations, where later rows contain progressively fewer elements to process.</p>
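        <p>One possible reading of the two-phase batching scheme is sketched below in Python. It is a single-threaded model of the schedule only: the function name is hypothetical, the mutex-protected hand-out to worker threads is omitted, and it is an assumption of this sketch that the batch size is halved once (not repeatedly) after 90% of the pairs have been assigned.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def row_batches(n_sequences: int, n_threads: int):
    """Yield (first_row, end_row_exclusive) batches over the upper triangle."""
    total_pairs = n_sequences * (n_sequences - 1) // 2
    pairs_assigned = 0
    row = 0
    last_row = n_sequences - 1  # row i holds n - 1 - i pairs; the final row is empty
    while row < last_row:
        if pairs_assigned < 0.9 * total_pairs:
            batch = n_threads                # initial phase
        else:
            batch = max(1, n_threads // 2)   # final phase: smaller batches
        end = min(row + batch, last_row)
        pairs_assigned += sum(n_sequences - 1 - r for r in range(row, end))
        yield row, end
        row = end


batches = list(row_batches(10, 4))
```
]]></preformat>
        <p>For 10 sequences and 4 threads, the first two batches cover four rows each, and the remaining short row is handed out in a smaller batch, so late-arriving threads are less likely to wait on one large final batch.</p>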
        <p>Thread-local progress tracking minimizes synchronization overhead by accumulating alignment
counts locally within each thread and periodically updating shared progress counters through atomic
operations. This approach reduces contention on shared variables while maintaining accurate progress
reporting for user feedback. The implementation balances update frequency to provide responsive
progress indicators without excessive synchronization overhead.</p>
        <p>The sequence access pattern is optimized for both spatial and temporal locality through strategic
prefetching of sequence data into processor caches. When retrieving sequence pointers, the
implementation explicitly prefetches the corresponding character data using processor-specific prefetch
instructions. This anticipates imminent access during alignment computation, reducing cache miss
penalties.</p>
        <p>The sequence storage layout maintains arrays of sequence lengths and pointers separate from the
actual sequence character data stored in the memory pools. This separation enables efficient metadata
access during batch processing while ensuring that sequence character data benefits from the
cache-friendly allocation patterns of the pool allocator. The memory pool’s contiguous allocation strategy
ensures that sequences loaded during the initial phases of computation remain cache-resident for
subsequent accesses, which is particularly beneficial given the all-versus-all comparison pattern where
each sequence participates in multiple alignment operations.</p>
        <p>Memory prefetching directives are strategically placed to load sequence data ahead of the alignment
computation, taking advantage of the predictable access patterns inherent in all-versus-all alignment
scenarios. Additional prefetching occurs within the similarity computation functions, where
vectorized comparison operations prefetch upcoming sequence segments to maintain memory bandwidth
utilization during SIMD-accelerated character matching operations.</p>
        <p>The optional GPU acceleration through CUDA provides substantial performance improvements
across all implemented algorithms. Modern GPU architectures demonstrate consistent acceleration
capabilities for sequence alignment tasks, with particularly pronounced benefits for computationally
intensive algorithms such as Smith-Waterman [15].</p>
        <p>The software includes an optimization that automatically detects when the parameters of the affine
gap Gotoh algorithm reduce to a linear gap case (i.e., when the gap open and extension penalties are
equal, <inline-formula><tex-math>g_o = g_e</tex-math></inline-formula>) and transparently switches to a more efficient Needleman–Wunsch implementation.
This not only preserves correctness, as the recurrence relations become equivalent, but also improves
performance by avoiding unnecessary affine penalty overhead. A warning is also issued to inform
the user that the affine model is configured redundantly, encouraging parameter clarity and optimal
performance.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Scalability: Adaptive Memory Management</title>
        <p>SequenceAligner addresses the critical challenge of processing datasets that produce results larger than
available system memory through sophisticated memory management strategies. The implementation
automatically detects when the similarity matrix exceeds available RAM and seamlessly transitions
to memory-mapped file storage with chunked organization optimized for both storage efficiency and
subsequent analysis access patterns.</p>
        <p>
          The triangular matrix storage option reduces memory requirements by 50% for symmetric results
while maintaining computational efficiency through optimized index calculation. This optimization is
crucial for large datasets where memory efficiency directly impacts computational feasibility. Result
compression with configurable levels (0-9) provides additional space-time trade-offs, allowing users to
optimize storage requirements based on available disk space and analysis timeline constraints. After
analysis, the chunked storage organization enables efficient partial result access without loading entire
matrices into memory.
        </p>
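        <p>A common way to index a symmetric result matrix stored as a flat upper triangle is shown below. This is a generic sketch, not necessarily the layout used by the tool: the function names are hypothetical, and the sketch assumes that the diagonal (each sequence against itself) is not stored.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def tri_size(n: int) -> int:
    """Number of stored entries for n sequences (strict upper triangle)."""
    return n * (n - 1) // 2


def tri_index(i: int, j: int, n: int) -> int:
    """Flat position of the pair (i, j); the order of i and j does not matter."""
    if i == j:
        raise ValueError("diagonal entries are not stored in this layout")
    if i > j:
        i, j = j, i
    # Entries of rows 0..i-1 come first, then the offset inside row i.
    return i * n - i * (i + 1) // 2 + (j - i - 1)


n = 4
flat = [0.0] * tri_size(n)
flat[tri_index(3, 1, n)] = 0.75   # same slot as the pair (1, 3)
value = flat[tri_index(1, 3, n)]
```
]]></preformat>
        <p>For n = 4 the six pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) map to positions 0 through 5, so the flat array holds n(n-1)/2 values instead of n squared.</p>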
        <p>For GPU computations, the software implements automatic batching when results exceed GPU
memory capacity, maintaining high computational throughput while managing memory constraints
transparently.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Accessibility: Simple and Extensible Codebase</title>
        <p>SequenceAligner emphasizes code clarity and modularity to facilitate algorithmic extensions and
educational use. The implementation adheres to standard C99 and employs a clean, modular design
with minimal third-party dependencies, namely HDF5 and the optional CUDA toolkit. This design
avoids complex dependency chains that can hinder deployment across diverse research environments.</p>
        <p>Core alignment routines are implemented as self-contained modules with well-defined interfaces,
allowing researchers to adapt or extend functionality with minimal code modification. For example,
adding a custom substitution matrix requires defining the matrix as a 2D array and adding a single line
to the substitution matrix collection. No additional changes to the code are necessary. This enables
user-defined configurations without additional programming overhead. The substitution matrix abstraction
and algorithm selection system support the integration of new alignment models without the need for
major structural changes. This flexibility enables rapid prototyping of alternative scoring schemes or
novel algorithmic variants.</p>
        <p>The software’s modular architecture separates algorithmic computation from input and output
operations, memory management, and result handling. This separation supports independent optimization
of each component and facilitates the incorporation of external libraries, such as employing
Parasail’s optimized alignment functions for specific use cases while maintaining the complete workflow
infrastructure.</p>
        <p>Platform abstraction layers unify system-specific operations such as thread management, file input
and output, and hardware-specific optimizations. This approach ensures performance and portability
across both Linux and Windows environments. The build system detects available CPU and GPU
features and automatically applies the appropriate compilation optimizations.</p>
        <p>The implementation is structured to preserve educational clarity while maintaining high performance.
The codebase employs clear abstraction layers that encapsulate low-level optimizations such as SIMD
vectorization, ensuring that performance enhancements do not compromise code readability. The linear
and modular design with clear separation of concerns makes the codebase well-suited not only for
research applications, but also as a resource for students and practitioners interested in understanding
or modifying alignment algorithms.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Open Source Development Model</title>
        <p>The open-source nature of SequenceAligner facilitates community-driven improvements and
specialization for diverse research needs. Researchers can easily fork the project to implement domain-specific
modifications while maintaining the robust workflow infrastructure. This approach leverages the
collaborative development model that has proven successful in computational biology, where community
contributions drive innovation and ensure broad applicability across research domains.</p>
        <p>The simple codebase structure enables researchers to contribute improvements, optimizations, or
specialized variants back to the community, fostering a collaborative development environment that
benefits the entire bioinformatics research community.</p>
        <p>Open-source development principles are fundamental to advancing computational biology research.
They enable algorithmic verification, customization for specific research needs, and collaborative
improvement of methods. The ability to examine, modify, and extend alignment algorithms is crucial
for researchers who need to adapt methods to novel biological questions or incorporate domain-specific
knowledge into their analyses. Furthermore, open-source tools facilitate reproducible research by
providing transparent implementations that can be verified and replicated by the scientific community.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Alignment Algorithms</title>
        <p>represent two sequences of lengths  and  , respectively.</p>
        <p>SequenceAligner implements three commonly used pairwise sequence alignment algorithms, all based
on dynamic programming for computing optimal alignments between two biological sequences. In
the following descriptions of these algorithms, the notation  1 =  11 12 …  1 and  2 =  21 22 …  2 is used to
3.6.1. Needleman-Wunsch Algorithm with Linear Gap Penalties
The Needleman-Wunsch algorithm [25] computes global alignment using the recurrence relation:
 (, ) =
max  ( − 1, ) −</p>
        <p>( − 1,  − 1) + (
⎧
⎨
⎩ (,  − 1) −</p>
        <p>1,  2) (match/mismatch)
(deletion)
(insertion)
where (, )</p>
        <p>is the scoring function from the substitution matrix and  is the linear gap penalty
(specified as a positive value). The boundary conditions are:
 (, 0) =  ⋅ (−)
 (0, ) =  ⋅ (−)
for  = 0, 1, … , 
for  = 0, 1, … ,</p>
        <sec id="sec-3-6-1">
          <title>The optimal global alignment score is  (, )</title>
          <p>3.6.2. Gotoh Algorithm with Afine Gap Penalties
The Gotoh algorithm [26] extends global alignment with afine gap penalties using three matrices:
for matches/mismatches,   (, ) for horizontal gaps, and   (, ) for vertical gaps:
 (, ) = (</p>
          <p>1,  2) + max{ ( − 1,  − 1), 
  (, ) = max{ (,  − 1) − , 
  (, ) = max{ ( − 1, ) − , 
where  is the gap opening penalty and  is the gap extension penalty.</p>
          <p>The boundary conditions are:</p>
          <p>(0, 0) = 0,   (0, 0) =   (0, 0) = −∞/2
For  &gt; 0 ∶
  (0, ) = max{ (0,  − 1) − ,</p>
          <p>(0,  − 1) − }
For  &gt; 0 ∶
  (, 0) = max{ ( − 1, 0) − ,</p>
          <p>
            ( − 1, 0) − }
 (0, ) =   (0, ),   (0, ) = −∞/2
 (, 0) =   (, 0),   (, 0) = −∞/2
The optimal global alignment score is  (, )
(
            <xref ref-type="bibr" rid="ref9">9</xref>
            )
(
            <xref ref-type="bibr" rid="ref10">10</xref>
            )
(
            <xref ref-type="bibr" rid="ref11">11</xref>
            )
(12)
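          </p>
        </sec>
        <sec id="sec-3-6-2">
          <title>3.6.3. Smith-Waterman Algorithm with Affine Gap Penalties</title>
          <p>The Smith-Waterman algorithm [27] identifies optimal local alignments using the same three-matrix
approach as Gotoh, but with a zero lower bound to allow alignment termination at any position:
<disp-formula><tex-math>M(i, j) = \max\{ 0,\ s(s_{1i}, s_{2j}) + \max\{ M(i-1, j-1),\ I_x(i-1, j-1),\ I_y(i-1, j-1) \} \}</tex-math></disp-formula>
<disp-formula><tex-math>I_x(i, j) = \max\{ 0,\ M(i, j-1) - g_o,\ I_x(i, j-1) - g_e \}</tex-math></disp-formula>
<disp-formula><tex-math>I_y(i, j) = \max\{ 0,\ M(i-1, j) - g_o,\ I_y(i-1, j) - g_e \}</tex-math></disp-formula></p>
          <p>The optimal local alignment score is the maximum value found in the <inline-formula><tex-math>M</tex-math></inline-formula> matrix during computation:
<disp-formula><tex-math>\max_{0 \le i \le n,\ 0 \le j \le m} M(i, j)</tex-math></disp-formula></p>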
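          <p>The three-matrix scheme shared by the Gotoh and Smith-Waterman algorithms can be sketched in Python as follows. This is an illustrative reading, not SequenceAligner’s code, and it rests on several assumptions: the function and variable names are hypothetical, floating-point negative infinity replaces the −∞/2 sentinel, boundary cells are set to zero in local mode, and the global score is taken as the maximum over the three matrices at the final cell so that alignments ending in a gap are covered.</p>
          <preformat preformat-type="code"><![CDATA[
```python
NEG_INF = float("-inf")  # stands in for the -inf/2 sentinel; floats cannot overflow


def match_score(a, b):
    # Hypothetical stand-in for a substitution-matrix lookup.
    return 1 if a == b else -1


def affine_score(seq1, seq2, gap_open, gap_extend, score=match_score, local=False):
    """Three-matrix alignment score: global (Gotoh) or, with local=True, Smith-Waterman."""
    n, m = len(seq1), len(seq2)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    def floor(value):
        # Local alignment clamps every cell at zero; global leaves it unchanged.
        return max(0, value) if local else value

    M[0][0] = 0
    Ix[0][0] = Iy[0][0] = floor(NEG_INF)
    for j in range(1, m + 1):  # first row: only horizontal gaps are possible
        Ix[0][j] = floor(max(M[0][j - 1] - gap_open, Ix[0][j - 1] - gap_extend))
        M[0][j] = Ix[0][j]
        Iy[0][j] = floor(NEG_INF)
    for i in range(1, n + 1):  # first column: only vertical gaps are possible
        Iy[i][0] = floor(max(M[i - 1][0] - gap_open, Iy[i - 1][0] - gap_extend))
        M[i][0] = Iy[i][0]
        Ix[i][0] = floor(NEG_INF)

    best_local = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            M[i][j] = floor(score(seq1[i - 1], seq2[j - 1]) + diag)
            Ix[i][j] = floor(max(M[i][j - 1] - gap_open, Ix[i][j - 1] - gap_extend))
            Iy[i][j] = floor(max(M[i - 1][j] - gap_open, Iy[i - 1][j] - gap_extend))
            best_local = max(best_local, M[i][j])

    if local:
        return best_local  # maximum over the M matrix
    return max(M[n][m], Ix[n][m], Iy[n][m])
```
]]></preformat>
          <p>With a gap opening penalty of 3 and an extension penalty of 1, aligning ACGT against ACG in global mode yields three matches followed by one opened gap; in local mode the same call returns the score of the best-matching substring pair.</p>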
          <p>Although all three algorithms share the same space and time complexity of <inline-formula><tex-math>O(nm)</tex-math></inline-formula>, Smith-Waterman
and Gotoh require more computationally intensive operations per matrix cell, due to local alignment
scoring and affine gap penalty calculations, respectively, resulting in higher practical runtime compared
to Needleman-Wunsch with linear gap penalties.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Performance Evaluation and Benchmarking</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>A comprehensive performance evaluation was conducted to assess SequenceAligner’s effectiveness
across various computational scenarios and to compare it against Parasail, which has been shown to be
one of the fastest libraries for pairwise sequence alignment [16]. The evaluation used three
datasets representing different scales of bioinformatics analysis: AVPPred [28] (small-scale),
AMP [8] (medium-scale), and Drosophila (large-scale) taken from the PeptideAtlas Project [29]. These
datasets were chosen to benchmark performance with respect to dataset size and average
sequence length, and to assess the computational requirements. The statistical characteristics of the
datasets are given in Table 1.</p>
        <p>Table 1: Statistical characteristics of the datasets used for performance evaluation.</p>
        <p>[…] as the primary SIMD backend. The system featured 32 GB of DDR4-3200 memory in a dual-channel
configuration. The CUDA implementation was tested on an NVIDIA GeForce RTX 4060 with 1024 CUDA
threads.</p>
        <p>For comparative analysis, a Parasail-based implementation was written using Python multiprocessing
with minimal features to enable a fair all-versus-all comparison. This implementation discards
alignment results to minimize I/O overhead and is meant to simulate a researcher who needs a quick
all-versus-all implementation while still being aware of the fastest available tools. SequenceAligner,
on the other hand, stores results on the fly, which is the more realistic usage scenario;
replicating this in Python would incur a more significant overhead, which could skew results in
SequenceAligner’s favor.</p>
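        <p>The shape of such a baseline can be sketched as below. This is not the authors’ benchmark script: <monospace>toy_align</monospace> is a hypothetical placeholder for the call into an alignment library, and the function name, worker count, and chunk size are assumptions chosen for illustration. As in the described setup, scores are consumed and discarded rather than stored.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations
from multiprocessing import Pool


def toy_align(pair):
    """Hypothetical stand-in for a real alignment call; returns a score."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b))


def all_versus_all(sequences, align=toy_align, workers=1, chunksize=256):
    """Align every unordered pair once, discard the scores, return the pair count."""
    pairs = combinations(sequences, 2)
    if workers == 1:
        # Serial path, useful for small inputs and for debugging.
        return sum(1 for _ in map(align, pairs))
    with Pool(workers) as pool:
        # Results arrive in arbitrary order; they are only counted, never stored.
        scores = pool.imap_unordered(align, pairs, chunksize=chunksize)
        return sum(1 for _ in scores)
```
]]></preformat>
        <p>When more than one worker is used, the call should be made from a script guarded by <monospace>if __name__ == "__main__":</monospace> so that worker processes can import the alignment function.</p>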
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset Scale and Sequence Length Impact</title>
        <p>The performance evaluation demonstrates clear scaling patterns related to both dataset size and average
sequence length. Table 2 shows the computational time and throughput for SequenceAligner across
different dataset characteristics using 16 CPU threads.</p>
        <p>The results reveal that average sequence length significantly impacts per-alignment computational
cost. Although AMP requires 81 times more alignments than AVPPred, it
achieves only 55% of the throughput due to its 41% longer average sequence length. Conversely, the
Drosophila dataset, with sequences 17% shorter than AVPPred, achieves 46% higher throughput despite
its substantially larger scale of 3,182 times more alignments.</p>
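        <p>The ratios quoted here can be combined into relative runtimes with a few lines of arithmetic. The sketch below uses only figures stated in this section (the 81-fold and 3,182-fold alignment counts, the 55% and 146% relative throughputs, and the 44.26M AMP alignments reported in Section 4.5); the function name is hypothetical and the outputs are derived estimates, not measured values.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def relative_runtime(alignment_ratio, throughput_ratio):
    """Runtime relative to a baseline: more work divided by relative speed."""
    return alignment_ratio / throughput_ratio


# AVPPred is the baseline (runtime 1.0).
amp_vs_avppred = relative_runtime(81, 0.55)            # 81x work at 55% throughput
drosophila_vs_avppred = relative_runtime(3182, 1.46)   # 3,182x work at 146% throughput

# Consistency check of the alignment counts quoted in the text.
amp_alignments = 44.26e6
avppred_alignments = amp_alignments / 81
drosophila_alignments = avppred_alignments * 3182
```
]]></preformat>
        <p>Under these figures AMP takes roughly 147 times as long as AVPPred and Drosophila roughly 2,179 times as long, while the implied Drosophila alignment count lands within one percent of the 1.73 billion comparisons quoted in Section 4.5.</p>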
      </sec>
      <sec id="sec-4-3">
        <title>4.3. CPU Threading Performance Analysis</title>
        <p>Table 3 presents a threading efficiency analysis for different algorithms on the AMP dataset. Threading
efficiency varies significantly with algorithm complexity. The Needleman-Wunsch algorithm, optimized
for linear gap penalties in SequenceAligner, achieves a speedup of 7.91 times with 16 threads. The more
computationally intensive Gotoh and Smith-Waterman algorithms scale better, with speedup
factors of 12.49 and 12.67, respectively. This indicates that complex algorithms benefit more from
parallelization because thread management overhead is smaller relative to computation.</p>
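        <p>Parallel efficiency, defined as speedup divided by thread count, makes the difference between the algorithms explicit. The short sketch below applies that standard definition to the three speedups reported above; the variable names are illustrative.</p>
        <preformat preformat-type="code"><![CDATA[
```python
THREADS = 16

# Speedups with 16 threads on the AMP dataset, as reported in the text.
speedups = {
    "Needleman-Wunsch": 7.91,
    "Gotoh": 12.49,
    "Smith-Waterman": 12.67,
}

# Parallel efficiency: fraction of ideal linear scaling that is achieved.
efficiency = {name: s / THREADS for name, s in speedups.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.1%}")
```
]]></preformat>
        <p>This gives roughly 49% efficiency for Needleman-Wunsch against roughly 78% and 79% for Gotoh and Smith-Waterman.</p>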
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Comparison of SequenceAligner and Parasail for All-Versus-All Sequence Alignment</title>
        <p>A comprehensive comparison between SequenceAligner and Parasail demonstrates significant
performance advantages for SequenceAligner across all tested configurations. Table 4 presents the performance
comparison for 16-thread CPU implementations.</p>
        <p>SequenceAligner consistently outperforms Parasail across all datasets and algorithms. The most
dramatic improvements occur with the Needleman-Wunsch algorithm, where SequenceAligner achieves
speedups of 7.05 to 16.65 times due to the separation of linear and affine gap penalty models. For the affine gap
penalty algorithms (Gotoh and Smith-Waterman), SequenceAligner maintains substantial advantages, with
speedups of 1.78 to 5.31 times.</p>
        <p>The performance gap increases with dataset scale, suggesting SequenceAligner’s optimizations
become more effective with larger computational workloads. This is particularly evident in the Drosophila
dataset where SequenceAligner achieves its highest relative performance gains. The results indicate that
Parasail performs consistently across different sequence lengths, maintaining relatively stable
throughput across datasets, while SequenceAligner’s performance varies more significantly with sequence
characteristics, achieving optimal performance on shorter sequences.</p>
        <p>It is important to note that the purpose of these comparisons is not to suggest that SequenceAligner
should replace Parasail entirely. Rather, these results demonstrate that SequenceAligner can compete
with and even surpass established implementations for short to medium-length sequences in applications
where all-versus-all sequence alignments are needed. Because both Parasail and SequenceAligner
are C99 codebases, Parasail can be integrated as an alternative alignment backend within
SequenceAligner, providing users with the best of both worlds: optimized performance for short
sequences through SequenceAligner’s native implementation, and Parasail’s proven effectiveness for
very long sequences where its optimizations become more pronounced.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.5. CUDA Acceleration Analysis</title>
        <p>The CUDA implementation delivers large performance improvements across all datasets and
algorithms. Table 5 compares CUDA against the best CPU performance (16 threads) for SequenceAligner.</p>
        <p>CUDA acceleration provides substantial performance improvements across all scenarios, with
speedups ranging from 2.40 to 8.59 times. The Gotoh and Smith-Waterman algorithms consistently
achieve strong GPU acceleration (5.12 to 8.59 times speedups), while Needleman-Wunsch shows varied
results with speedup ranging from 2.40 to 5.24 depending on dataset characteristics.</p>
        <p>CUDA throughput is influenced by both sequence
length and dataset scale. The Drosophila dataset achieves the highest absolute throughput of 80.82M
alignments per second (APS) for Needleman-Wunsch, benefiting from both its shorter average sequences (17.9 amino acids) and
massive scale (1.73 billion pairwise comparisons). This combination allows GPU parallelization to excel
through two complementary mechanisms: shorter sequences reduce per-alignment computational
overhead, while the enormous number of alignments enables effective utilization of the GPU’s massive
parallelization capabilities. The AMP dataset, with moderate sequence lengths (30.5 amino acids) and
fewer total alignments (44.26M), still demonstrates strong GPU acceleration across all algorithms,
particularly for the more computationally intensive affine gap penalty
methods, where the increased computational complexity better offsets GPU memory management
overhead.</p>
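        <p>To put the peak throughput in perspective, the implied alignment time for the Drosophila dataset can be estimated from the two figures quoted above. The sketch assumes that APS counts completed alignments per second of wall-clock time; the function name is hypothetical and the result is a derived estimate, not a reported measurement.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def runtime_seconds(num_alignments, alignments_per_second):
    """Estimated time to finish a workload at a constant throughput."""
    return num_alignments / alignments_per_second


# 1.73 billion pairwise comparisons at 80.82 million alignments per second.
drosophila_nw_seconds = runtime_seconds(1.73e9, 80.82e6)
print(f"{drosophila_nw_seconds:.1f} s")
```
]]></preformat>
        <p>At that rate the full all-versus-all comparison completes in a little over 21 seconds.</p>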
      </sec>
      <sec id="sec-4-7">
        <title>4.6. Computational Efficiency and Resource Utilization</title>
        <p>The evaluation reveals distinct performance characteristics for different computational approaches.
CPU implementations excel at sustained throughput with predictable scaling, while CUDA provides
superior absolute performance with algorithm-dependent acceleration patterns.</p>
        <p>SequenceAligner’s CPU implementation achieves high efficiency through several optimizations:
SIMD vectorization using AVX2 instructions, cache-aware memory access patterns, and optimized
algorithmic implementations for specific gap penalty models. The threading model employs dynamic
work distribution to maintain load balancing across cores.</p>
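        <p>One common way to realize dynamic work distribution for an all-versus-all workload is to let each worker claim the next block of rows from a shared counter, so that faster workers simply claim more blocks. The Python sketch below illustrates that idea only; the text does not describe SequenceAligner’s actual scheduler, so the row-block scheme, the function name, and the parameters are all assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import threading


def dynamic_all_pairs(n, num_workers=4, chunk_rows=2):
    """Distribute pairs (i, j) with i < j by letting workers claim row blocks on demand."""
    next_row = 0
    lock = threading.Lock()
    claimed = [[] for _ in range(num_workers)]

    def worker(worker_id):
        nonlocal next_row
        while True:
            with lock:  # claim the next block of rows atomically
                start = next_row
                next_row += chunk_rows
            if start >= n:
                return
            for i in range(start, min(start + chunk_rows, n)):
                for j in range(i + 1, n):
                    # A real worker would align sequences i and j here.
                    claimed[worker_id].append((i, j))

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```
]]></preformat>
        <p>Because early rows contain more pairs than late rows, on-demand claiming tends to balance load better than a fixed split of rows per worker; every pair is still processed exactly once.</p>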
        <p>The CUDA implementation benefits from massive parallelization capabilities, processing thousands of
alignment pairs simultaneously. Despite minimal optimization effort (1-2 days compared to weeks for
CPU), CUDA consistently outperforms highly optimized CPU code, indicating significant untapped
potential for GPU-accelerated sequence alignment.</p>
      </sec>
      <sec id="sec-4-8">
        <title>4.7. Performance Summary and Future Directions</title>
        <p>The comprehensive benchmarking demonstrates SequenceAligner’s strong performance across all
evaluated configurations. The CPU implementation achieves speedups of 1.78 to 16.65 times over the all-versus-all Parasail
Python script, while CUDA acceleration provides further speedups of 2.40 to 8.59 times over the 16-thread CPU
implementation.</p>
        <p>These results strongly suggest that future development should prioritize CUDA implementations for
all-versus-all sequence alignment, with CPU implementations serving as essential fallbacks for specific
use cases. While GPU acceleration provides superior throughput for the majority of bioinformatics
applications, CPU implementations remain valuable for very long sequences that may not fit efficiently
in GPU memory, or for users without access to high-performance GPU hardware. Current CPU
implementations, while highly optimized and effective, cannot match the raw computational throughput
achievable with GPU acceleration for typical sequence analysis workflows. This performance advantage
is particularly relevant for machine learning applications where all-versus-all comparisons are commonly
required.</p>
        <p>The absence of publicly available CUDA kernels for all-versus-all sequence alignment represents a
significant gap in the bioinformatics software ecosystem. The demonstrated performance advantages,
combined with the increasing prevalence of GPU hardware in computational biology, indicate that
GPU-accelerated implementations should become the standard for large-scale sequence analysis workflows.</p>
        <p>SequenceAligner establishes new performance benchmarks for sequence alignment software while
maintaining algorithmic accuracy and providing practical storage solutions through integrated HDF5
output. The substantial performance improvements, particularly for large-scale datasets, enable
previously computationally prohibitive analyses to be performed on standard hardware configurations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>SequenceAligner provides an integrated solution for all-versus-all pairwise sequence alignment,
addressing key workflow and scalability challenges commonly encountered in large-scale bioinformatics
analyses. By providing a complete, ready-to-use solution rather than requiring assembly from multiple
libraries, SequenceAligner eliminates substantial barriers to entry for researchers seeking to conduct
comprehensive sequence analyses. The tool also supports similarity-based filtration, enabling the
creation of non-redundant datasets suitable for machine learning-based peptide activity prediction,
thereby helping to reduce bias and improve model generalization.</p>
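      <p>Similarity-based filtration of this kind is often implemented greedily: a sequence is kept only if it is not too similar to any sequence already kept. The sketch below shows that generic strategy under stated assumptions; it is not SequenceAligner’s filtering code, and the function names, the identity-based similarity measure, and the threshold are hypothetical choices for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def identity(a, b):
    """Hypothetical similarity: identical positions divided by the longer length."""
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b), 1)


def filter_redundant(sequences, similarity=identity, threshold=0.7):
    """Greedy filter: keep a sequence only if it is below the threshold to all kept ones."""
    kept = []
    for seq in sequences:
        if all(similarity(seq, other) < threshold for other in kept):
            kept.append(seq)
    return kept
```
]]></preformat>
      <p>In practice the similarity function would be derived from alignment scores, and the outcome of a greedy filter depends on the order in which sequences are visited.</p>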
      <p>The software’s emphasis on four core principles - convenience, efficiency, scalability, and accessibility
- ensures that researchers can obtain rigorous, reproducible results without requiring specialized
computational expertise or extensive software development. The complete workflow integration, from
sequence parsing to result storage, democratizes access to high-performance sequence alignment
capabilities across diverse research environments.</p>
      <p>The modular architecture and transparent algorithms of SequenceAligner make it suitable for both
high-performance research and educational use. Its clear component separation supports easy
integration of alternative methods or optimizations, such as Parasail’s functions, while preserving the
benefits of a unified workflow. Combined with an open-source, minimal-dependency C99 codebase,
this design encourages community-driven development and facilitates customization, aligning with the
collaborative, reproducible ethos of computational biology.</p>
      <p>Performance benchmarks on three peptide datasets demonstrated the effectiveness of specialized
all-versus-all optimizations, with alignment rates ranging from 27.4 to 80.8 million sequence pairs per
second on a consumer-grade NVIDIA GeForce RTX 4060. Furthermore, due to specialized optimizations,
SequenceAligner achieves up to 16.65 times the alignment throughput of Parasail when
all-versus-all sequence alignments are needed. The CUDA-accelerated GPU implementation outperformed
a 16-thread CPU implementation by a factor of up to 8.59, highlighting the substantial benefits of
hardware-specific optimization.</p>
      <p>The adaptive memory management capabilities enable processing of datasets with results that
exceed available system memory, making previously computationally prohibitive analyses feasible
for routine research applications. The demonstrated scalability across different dataset sizes and
hardware configurations, combined with cross-platform compatibility and straightforward deployment,
makes SequenceAligner accessible to researchers with common computational resources, which is
crucial for democratizing advanced sequence analysis capabilities and enabling broader participation in
computational biology research.</p>
      <p>SequenceAligner successfully bridges the gap between high-performance computing requirements
and practical usability, providing the bioinformatics community with a tool that handles the scale and
complexity demands of modern biological research while maintaining the accuracy, reproducibility, and
extensibility essential for scientific applications. The software represents a contribution to the
open-source computational biology ecosystem that prioritizes both performance and accessibility, fostering
collaborative development and community-driven innovation in sequence analysis methodologies.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>This project was supported by the Croatian Science Foundation/Hrvatska zaklada za znanost (grant no:
UIP-2019-04-7999) and by the University of Rijeka (grant no: uniri-23-78 and uniri-23-16). The authors
would like to thank the Parasail development team for providing substitution matrices that enabled
comprehensive algorithm validation.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT to help improve sentence conciseness
and check for grammatical errors. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the final content of the publication.</p>
      <p>[12] S. Gull, N. Shamim, F. Minhas, Amap: Hierarchical multi-label prediction of biologically active and antimicrobial peptides, Computers in biology and medicine 107 (2019) 172–181.</p>
      <p>[13] M. Njirjak, L. Žužić, M. Babić, P. Janković, E. Otović, D. Kalafatovic, G. Mauša, Reshaping the discovery of self-assembling peptides with generative ai guided by hybrid deep learning, Nature machine intelligence (2024) 1–14.</p>
      <p>[14] M. Farrar, Striped smith–waterman speeds database searches six times over other simd implementations, Bioinformatics 23 (2006) 156–161. URL: https://doi.org/10.1093/bioinformatics/btl582. doi:10.1093/bioinformatics/btl582.</p>
      <p>[15] S. A. Manavski, G. Valle, Cuda compatible gpu cards as efficient hardware accelerators for smith-waterman sequence alignment, BMC bioinformatics 9 (2008) 1–9.</p>
      <p>[16] J. Daily, Parasail: Simd c library for global, semi-global, and local pairwise sequence alignments, BMC bioinformatics 17 (2016) 1–11.</p>
      <p>[17] P. J. Cock, T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, et al., Biopython: freely available python tools for computational molecular biology and bioinformatics, Bioinformatics 25 (2009) 1422.</p>
      <p>[18] J. Köster, Rust-bio: a fast and safe bioinformatics library, Bioinformatics 32 (2016) 444–446.</p>
      <p>[19] J. R. Rideout, G. Caporaso, E. Bolyen, D. McDonald, Y. V. Baeza, J. C. Alastuey, A. Pitman, J. Morton, Q. Zhu, J. Navas, K. Gorlick, J. Debelius, Z. Xu, M. Aton, llcooljohn, J. Shorenstein, L. Luce, W. V. Treuren, J. Chase, charudatta navare, A. Gonzalez, C. J. Brislawn, W. Patena, K. Schwarzberg, teravest, J. Reeder, I. Sfiligoi, shifer1, nbresnick, D. K. D. Murray, scikit-bio/scikit-bio: scikit-bio 0.6.3, 2025. URL: https://doi.org/10.5281/zenodo.14640761. doi:10.5281/zenodo.14640761.</p>
      <p>[20] The HDF5® Library &amp; File Format - version 1.14.6., https://www.hdfgroup.org/solutions/hdf5/, Accessed 27-05-2025.</p>
      <p>[21] S. Henikoff, J. G. Henikoff, Amino acid substitution matrices from protein blocks, Proceedings of the National Academy of Sciences 89 (1992) 10915–10919.</p>
      <p>[22] M. O. Dayhoff, A model of evolutionary change in protein, Atlas of protein sequence and structure, vol 5, suppl 3 (1978) 345–352.</p>
      <p>[23] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool, Journal of Molecular Biology 215 (1990) 403–410. URL: https://www.sciencedirect.com/science/article/pii/S0022283605803602. doi:10.1016/S0022-2836(05)80360-2.</p>
      <p>[24] M. Gorman, Understanding the Linux Virtual Memory Manager, Prentice Hall PTR, USA, 2004.</p>
      <p>[25] S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of molecular biology 48 (1970) 443–453.</p>
      <p>[26] O. Gotoh, An improved algorithm for matching biological sequences, Journal of molecular biology 162 (1982) 705–708.</p>
      <p>[27] T. F. Smith, M. S. Waterman, et al., Identification of common molecular subsequences, Journal of molecular biology 147 (1981) 195–197.</p>
      <p>[28] N. Thakur, A. Qureshi, M. Kumar, Avppred: collection and prediction of highly effective antiviral peptides, Nucleic Acids Research 40 (2012) W199–W204. URL: https://doi.org/10.1093/nar/gks450. doi:10.1093/nar/gks450.</p>
      <p>[29] F. Desiere, E. W. Deutsch, N. L. King, A. I. Nesvizhskii, P. Mallick, J. Eng, S. Chen, J. Eddes, S. N. Loevenich, R. Aebersold, The peptideatlas project, Nucleic Acids Research 34 (2006) D655–D658. URL: https://doi.org/10.1093/nar/gkj040. doi:10.1093/nar/gkj040.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Teng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wuyun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>The historical evolution and significance of multiple sequence alignment in molecular structure and function prediction</article-title>
          ,
          <source>Biomolecules</source>
          <volume>14</volume>
          (
          <year>2024</year>
          ). URL: https://www.mdpi.com/2218-273X/14/12/1531. doi:10.3390/biom14121531.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Developments in algorithms for sequence alignment: A review</article-title>
          ,
          <source>Biomolecules</source>
          <volume>12</volume>
          (
          <year>2022</year>
          ). URL: https://www.mdpi.com/2218-273X/12/4/546. doi:10.3390/biom12040546.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] Uniprot: the universal protein knowledgebase in 2023</article-title>
          , Nucleic acids research
          <volume>51</volume>
          (
          <year>2023</year>
          )
          <fpage>D523</fpage>
          -
          <lpage>D531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Benson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Karsch-Mizrachi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Lipman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ostell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Sayers</surname>
          </string-name>
          , Genbank,
          <source>Nucleic acids research</source>
          <volume>42</volume>
          (
          <year>2013</year>
          )
          <fpage>D32</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Suzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McGarvey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mazumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Uniref: comprehensive and non-redundant uniprot reference clusters</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>23</volume>
          (
          <year>2007</year>
          )
          <fpage>1282</fpage>
          -
          <lpage>1288</lpage>
          . URL: https://doi.org/10.1093/bioinformatics/btm098. doi:10.1093/bioinformatics/btm098.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Campillos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Gavin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <article-title>Drug target identification using side-effect similarity</article-title>
          ,
          <source>Science</source>
          <volume>321</volume>
          (
          <year>2008</year>
          )
          <fpage>263</fpage>
          -
          <lpage>266</lpage>
          . URL: https://www.science.org/doi/abs/10.1126/science.1158140. doi:10.1126/science.1158140.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Altenhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zarowiecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tomiczek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Warwick</given-names>
            <surname>Vesztrocy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Dalquen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Telford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Glover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dylus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dessimoz</surname>
          </string-name>
          ,
          <article-title>Oma standalone: orthology inference among public and custom genomes and transcriptomes</article-title>
          ,
          <source>Genome Research</source>
          <volume>29</volume>
          (
          <year>2019</year>
          )
          <fpage>1152</fpage>
          -
          <lpage>1163</lpage>
          . URL: http://genome.cshlp.org/content/29/7/1152.abstract. doi:10.1101/gr.243212.118.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Otovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Njirjak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalafatovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mausa</surname>
          </string-name>
          ,
          <article-title>Sequential properties representation scheme for recurrent neural network-based prediction of therapeutic peptides</article-title>
          ,
          <source>Journal of chemical information and modeling</source>
          <volume>62</volume>
          (
          <year>2022</year>
          )
          <fpage>2961</fpage>
          -
          <lpage>2972</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Acpred-fl: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>34</volume>
          (
          <year>2018</year>
          )
          <fpage>4007</fpage>
          -
          <lpage>4016</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Boopathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramaniyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Manavalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>macppred: a support vector machine-based meta-predictor for identification of anticancer peptides</article-title>
          ,
          <source>International journal of molecular sciences</source>
          <volume>20</volume>
          (
          <year>2019</year>
          )
          <fpage>1964</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Deepacp: a novel computational approach for accurate identification of anticancer peptides by deep learning algorithm</article-title>
          ,
          <source>Molecular Therapy Nucleic Acids</source>
          <volume>22</volume>
          (
          <year>2020</year>
          )
          <fpage>862</fpage>
          -
          <lpage>870</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>