<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Flag Index Tool for Faster Repeated Queries of Mapped Reads</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Kotrs</string-name>
          <email>honza@kotrs.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matej Lexa</string-name>
          <email>lexa@fi.muni.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Health Monk s.r.o.</institution>
          ,
          <addr-line>Korunni 2569/108, 10100 Praha</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Masaryk University, Faculty of Informatics</institution>
          ,
          <addr-line>Botanicka 68a, 60200 Brno</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We present bafiq, a command-line Rust tool for repeated filtering and counting of sequencing reads stored in the BAM file format. The software can be easily incorporated into NGS computational workflows, especially where mapped reads in large files must be processed repeatedly based on their FLAG values. Commands in bafiq follow the query language familiar from samtools and similar software, in which, however, indexing is based only on the location of reads in a reference genome. By leveraging a flag-based index, our tool allows significant savings of computational time when a single BAM file is queried repeatedly with different FLAG combinations. We made an effort to match or outperform commonly used tools in filtering tasks as well as in sequence counting tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>NGS workflow</kwd>
        <kwd>BAM flags</kwd>
        <kwd>read alignment filtering</kwd>
        <kwd>BAM index</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        wrapping these in Java [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Python [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], or R [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Other BAM processors include Picard [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], samblaster
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], biobambam [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and Scramble [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. They often focus on extra flexibility, as is the case of samql [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or on speed, which was the priority of sambamba [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The flexibility of samql comes in the form of an SQL-like language for querying BAM files. The speed improvements are mostly based on parallel execution and code optimizations.
      </p>
      <p>CEUR Workshop, ISSN 1613-0073</p>
      <p>
        A well-known speed-centered addition to BAM/CRAM files is an index based on their coordinate-based sorting. This functionality made its way into samtools and many other BAM processors as well.
The index divides the mapped and sorted reads in a BAM file into bins, which are then referenced in
the index for fast data retrieval from a given range. Genome browsers such as IGV [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] rely on this
mechanism to modify visualized browser content quickly even when working on very large BAM files.
      </p>
      <p>In our work with BAM files we often needed to process them in ways that, unlike the example above, depended on components of the read data other than the coordinates. The BAM format allows reads to be marked using FLAGs and TAGs. For example, we may want to eliminate read pairs that do not map concordantly, or reads whose mate did not map. Or we may want to work only with reads mapped with a certain minimum mapping quality (MAPQ value). While existing tools offer arguments to select specific flags or tags, these operations do not take advantage of indexing, as if queries and filtering operations were never done more than once within a given workflow. We noticed that this is not always the case. One may want to count certain types of reads in the alignment files, repeat the counting for different settings of a variable, and then filter the file for a subset of reads to create a "cleaned" version for further work. Also, in certain educational settings, an instructor may require students to query the same BAM file as part of a structured exercise.</p>
      <p>To support higher speeds in these scenarios without compromising on real-world data volume, we propose a new type of index file that can be stored along with the original alignment file, similarly to the coordinate-based (.bai) index created by samtools. The bafiq tool described here is a command-line program written in Rust that takes a BAM file as input along with a number of filtering arguments. It then either counts the reads that pass the query criteria or extracts them in SAM format. When queried, bafiq first looks for an existing flag index (.bfi) of the input file. If none exists, it is created upon the first query. Any subsequent call to bafiq on the same input BAM file takes advantage of the index and returns results faster. Counts are returned immediately, as they are part of the index. Sequence retrieval follows standard I/O limitations, with the added benefit that the index allows selective skipping of compressed blocks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BAM Flag Index (.bfi)</title>
      <sec id="sec-2-1">
        <title>2.1. Read counts</title>
        <p>
          The SAM format [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] defines the FLAG field as a 16-bit integer, so the total space of FLAG combinations is 2<sup>16</sup> = 65,536. However, not all bits are used: bits 12-15 are reserved, so the more realistic upper bound is much smaller, 2<sup>12</sup> = 4,096. Furthermore, many of the flag combinations do not make biological sense. For example, a single read cannot be both first in pair and second in pair. Thus in a typical human NGS BAM file we observe up to 100 flag combinations actually in use, of which just 4 account for over 90% (Table 1).
        </p>
        <p>This observation makes it feasible to pre-compute all flag combinations together with the number of corresponding reads and to embed them in the .bfi index. All count queries are then resolved immediately, because bafiq retrieves the count directly from the index. It also means that the time of the very first bafiq count query is spent almost exclusively on building the index.</p>
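        <p>A minimal sketch of how a count query could be answered from pre-computed per-flag counts, assuming samtools-style semantics (every required bit set, no forbidden bit set). The dictionary layout, the function name count_matching and the toy counts are illustrative assumptions, not the actual .bfi format or measured data.</p>
        <preformat><![CDATA[
```python
def count_matching(flag_counts, required=0, forbidden=0):
    """Sum the counts of all FLAG values that satisfy the query.

    flag_counts maps a FLAG value to the number of reads carrying it
    (hypothetical layout). A flag matches when every required bit is set
    and no forbidden bit is set.
    """
    total = 0
    for flag, count in flag_counts.items():
        if (flag & required) == required and (flag & forbidden) == 0:
            total += count
    return total


# Toy counts for illustration only (not measured data).
toy_counts = {99: 500, 147: 480, 83: 510, 163: 490, 77: 12, 141: 8}

# Unmapped reads (0x4): only flags 77 and 141 have that bit set.
unmapped = count_matching(toy_counts, required=0x4)

# Properly paired reads (0x2) that are not on the reverse strand (0x10).
forward_proper = count_matching(toy_counts, required=0x2, forbidden=0x10)
```
]]></preformat>
        <p>Because the loop visits at most a few dozen distinct flag values rather than millions of reads, the cost of such a query does not depend on the size of the BAM file.</p>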
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Read sequences</title>
        <p>Knowing the number of sequences with a given flag combination in a BAM file is useful in itself; however, downstream analysis often also requires extracting the sequences. For that reason the .bfi index stores the byte offsets of the gzip blocks that contain at least one sequence with the given flag combination. This speeds up extraction of the actual sequences, because entire blocks can be skipped during retrieval instead of processing the whole file from start to finish.</p>
        <p>Table 1. Flag combinations observed (counts not shown): paired, mapped, read rev. strand, first in pair; paired, mapped, mate rev. strand, second in pair; paired, mapped, mate rev. strand, first in pair; paired, mapped, read rev. strand, second in pair; paired, mapped, mate rev. strand, second in pair, PCR/optical dup.; paired, mapped, read rev. strand, first in pair, PCR/optical dup.; paired, mapped, read rev. strand, second in pair, PCR/optical dup.; paired, mapped, mate rev. strand, first in pair, PCR/optical dup.; paired, read rev. strand, first in pair; paired, mate rev. strand, second in pair.</p>
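        <p>The block-skipping idea can be sketched as follows: given per-flag lists of block offsets, a query only needs the union of the lists of the matching flags. The layout of bins, the name blocks_for_query and all offsets below are hypothetical and chosen for illustration.</p>
        <preformat><![CDATA[
```python
def blocks_for_query(bins, required=0, forbidden=0):
    """Return the sorted, de-duplicated block offsets worth decompressing.

    bins maps a FLAG value to the ascending list of byte offsets of the
    compressed blocks that contain at least one read with that flag
    (hypothetical layout). Blocks outside the result can be skipped.
    """
    selected = set()
    for flag, offsets in bins.items():
        if (flag & required) == required and (flag & forbidden) == 0:
            selected.update(offsets)
    return sorted(selected)


# Toy bins: the offsets are invented for illustration.
toy_bins = {
    99: [0, 18_000, 36_500],
    147: [0, 18_000, 36_500, 52_000],
    77: [36_500],
    141: [36_500, 52_000],
}

# Unmapped reads (0x4) live in only two of the four blocks.
unmapped_blocks = blocks_for_query(toy_bins, required=0x4)
```
]]></preformat>
        <p>A reader would then seek to each returned offset, decompress that block, and still filter the individual records inside it, since a block can mix several flag combinations.</p>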
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Index Compression</title>
        <p>The BAM file is compressed using gzip, which stores sequences in blocks. For reference, a BAM file containing human chromosome 1 sequences obtained from a WGS (whole-genome sequencing) experiment at 30x read depth (a common scenario) contains close to 70M sequences. A typical BAM block contains about 180 sequences on average (see Section 6.1). That is roughly 380,000 blocks potentially (in the worst case) distributed among the flag combinations. The lists of blocks containing sequences with a given flag combination are referred to as bins. As each block can host sequences with different flag combinations, it can be part of multiple bins. As a result, the total number of block offsets stored across all bins can grow quickly. We therefore briefly explored ways of compressing the index itself to keep its size reasonable, ideally without compromising bafiq’s speed.</p>
        <sec id="sec-2-3-1">
          <title>2.3.1. Bin Sparsity</title>
          <p>As established earlier, the list of bins is quite sparse (100 out of 4,096). Stripping empty bins is thus the first obvious step. That in itself, however, does not provide a large enough space saving, so we explored further.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Delta Encoding</title>
          <p>Each bin contains an ordered sequence of byte offsets, which by nature only grow. Storing the offsets as deltas from the first therefore saves a lot of space and is trivial to decode at runtime.</p>
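          <p>A minimal sketch of delta encoding for one bin, under the assumption that each stored value is the gap to the preceding offset (the description could also be read as gaps relative to the first offset). The function names and offsets are illustrative only.</p>
          <preformat><![CDATA[
```python
def delta_encode(offsets):
    """Turn ascending absolute offsets into [first, gap, gap, ...]."""
    if not offsets:
        return []
    deltas = [offsets[0]]
    for previous, current in zip(offsets, offsets[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the absolute offsets with a running sum."""
    offsets = []
    running = 0
    for delta in deltas:
        running += delta
        offsets.append(running)
    return offsets


# Invented offsets for illustration.
offsets = [1_000_000, 1_018_400, 1_036_900, 1_055_100]
encoded = delta_encode(offsets)
```
]]></preformat>
          <p>The saving comes from the gaps being much smaller numbers than the absolute offsets, so they can be written with fewer bytes, for example with a variable-length integer encoding; which on-disk integer encoding bafiq uses is not stated here.</p>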
        </sec>
        <sec id="sec-2-3-3">
          <title>2.3.3. Compression Gains</title>
          <p>For the referenced example of chromosome 1, the .bfi index compression gains amount to ∼30% using just these two methods. Original size: 6,496,336 bytes (6.2 MB); compressed size: 4,471,326 bytes (4.3 MB); compression ratio: 1.45x; space saved: 31.2%.</p>
          <p>Broken down by technique, the split is as expected. Bin sparsity: 129,280 bytes saved (6.0%); delta encoding: 2,025,458 bytes saved (94.0%).</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>2.3.4. Dictionary Compression</title>
          <p>We have also explored dictionary compression, which would replace repeated offset subsequences in the index with a literal to be expanded upon decompression. However, the initial implementation did not scale well (O(n<sup>2</sup>) in complexity) and was not the primary focus, so we decided not to include it. It might be a subject of future optimization should the index size become an issue.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Index Building Strategies</title>
      <p>Since our main goal was to enable speedy exploratory analysis using flags, we have explored and compared multiple strategies of reading the input BAM file and building the index (Table 2).<sup>1</sup></p>
      <p>
        The index build process boils down to (1) read, (2) decompress, (3) index and optionally (4) compress. Each of the steps can be performed sequentially, in parallel, or in some combination of the two, with various benefits and tradeoffs. Considering our goals, we decided to implement the following 3 strategies: Channel Producer-Consumer (CPC) (utilizing crossbeam-channels [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]), Work Stealing (WS) (utilizing rayon [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]) and Constant Memory (CM). The first two (CPC and WS) leverage memory mapping and different types of parallel orchestration, accepting potentially large memory requirements in favor of speed. The third is more memory-conservative, keeping a consistently low footprint of about 100 MB. It does not memory-map the whole input file; rather, it streams the file in chunks and drains processed chunks from memory as fast as possible.
      </p>
      <p>
        The Work Stealing strategy leverages the rayon [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] crate, an implementation of a parallel computation scheduling concept dating back to 1999 from the Cilk runtime [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The core idea is that worker threads that become available "steal" tasks from the queues of other, busy worker threads. This requires less synchronization than the contention on a central channel used by the Channel Producer-Consumer strategy.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Discovery Phase</title>
        <p>The first computational challenge is fast discovery of the gzip blocks in the BAM file, given the I/O limitations of the underlying OS. Two approaches were explored: (1) Discover-All-First and (2) Streaming. The Discover-All-First approach uses a single thread to read the whole file (mapped into memory) and organize it into gzip blocks for downstream processing, whereas the Streaming approach feeds the downstream gzip block processors continuously as blocks are discovered.</p>
        <p>
          By the nature of memory mapping (leveraging the memmap2 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] crate), bafiq eventually loads the whole input file into memory. As that could cause pressure in low-memory environments (although not completely, since the memory is reclaimed as needed over time), the Constant Memory strategy streams the input from disk and uses a capped buffer so that memory usage stays constant with respect to the input size.
        </p>
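        <p>Block discovery can be illustrated with a single-threaded sketch. It relies on BGZF, the blocked gzip variant defined in the SAM specification, where every block header carries a "BC" extra subfield holding the total block size minus one. The function names are hypothetical, make_block is only a toy writer for the demonstration, and none of this is bafiq's actual code.</p>
        <preformat><![CDATA[
```python
import struct
import zlib


def discover_blocks(data):
    """Yield (offset, size) for each BGZF block in an in-memory buffer."""
    offset = 0
    while offset + 18 <= len(data):
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
            "<BBBBIBBH", data, offset
        )
        if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
            raise ValueError(f"not a BGZF block at offset {offset}")
        # Walk the gzip extra subfields looking for the 'BC' entry.
        pos, end, bsize = offset + 12, offset + 12 + xlen, None
        while pos + 4 <= end:
            si1, si2, slen = struct.unpack_from("<BBH", data, pos)
            if (si1, si2, slen) == (66, 67, 2):
                bsize = struct.unpack_from("<H", data, pos + 4)[0]
                break
            pos += 4 + slen
        if bsize is None:
            raise ValueError(f"missing BC subfield at offset {offset}")
        size = bsize + 1  # BSIZE stores the total block size minus one
        yield offset, size
        offset += size


def make_block(payload):
    """Toy BGZF writer used only to produce demonstration input."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(payload) + comp.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


demo = make_block(b"A" * 100) + make_block(b"hello")
found = list(discover_blocks(demo))
```
]]></preformat>
        <p>Because each step only reads a small header and jumps ahead by the block size, discovery touches very little data; the decompression of the discovered blocks can then be handed to parallel workers.</p>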
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Decompression Phase</title>
        <p>The next phase takes the discovered gzip blocks and decompresses them in memory for downstream processing of BAM records. Here the challenge is how to efficiently utilize existing decompression libraries (the decompression itself is outside the scope of this work).</p>
        <p>
          We leverage the libdeflater [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] crate for fast decompression in all 3 strategies, each using a thread-local buffer for processing.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Flags Extraction Phase</title>
        <p>
          In this phase the freshly decompressed block becomes addressable for its contents. As the operation amounts to reading just the flag field, we opted out of using the htslib-rust [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] crate to avoid unnecessary abstraction (however, we leverage it later for writing SAM format output in the sequence extraction task).
        </p>
        <p>
          <sup>1</sup> As the naïve single-core, non-parallel strategy of scanning the BAM file from start to finish was under-performing even on the smallest test file (1.3 GB, over 1m 30s), we decided not to include it in further performance exploration.
        </p>
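        <p>Reading the flag without a full BAM parser can be sketched as follows. In the BAM record layout of the SAM specification, the 16-bit FLAG sits 14 bytes into the record body, after refID, pos, l_read_name, mapq, bin and n_cigar_op. The sketch assumes a buffer of whole records; it ignores the BAM header and records that straddle block boundaries, which a real reader has to handle. Function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import struct


def iter_flags(buf):
    """Yield the FLAG of each complete BAM alignment record in buf."""
    offset = 0
    while offset + 4 <= len(buf):
        (block_size,) = struct.unpack_from("<i", buf, offset)
        body = offset + 4
        if block_size < 32 or body + block_size > len(buf):
            break  # truncated or malformed; a real reader would carry over
        # Body layout: refID(4) pos(4) l_read_name(1) mapq(1) bin(2)
        # n_cigar_op(2) flag(2) ..., so FLAG starts at byte 14.
        (flag,) = struct.unpack_from("<H", buf, body + 14)
        yield flag
        offset = body + block_size


def make_record(flag, name=b"r1"):
    """Build a minimal, sequence-less BAM record for demonstration."""
    read_name = name + b"\x00"
    body = struct.pack(
        "<iiBBHHHiiii",
        -1, -1,              # refID, pos (unplaced read)
        len(read_name), 0,   # l_read_name, mapq
        4680, 0,             # bin, n_cigar_op
        flag,
        0, -1, -1, 0,        # l_seq, next_refID, next_pos, tlen
    ) + read_name
    return struct.pack("<i", len(body)) + body


demo = make_record(77) + make_record(141, b"read2") + make_record(99)
flags = list(iter_flags(demo))
```
]]></preformat>
        <p>Skipping from record to record by block_size means the read name, CIGAR, sequence and qualities are never decoded, which is the point of avoiding a general-purpose parsing layer for this step.</p>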
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Index Assembly Phase</title>
        <p>Both the WS and CPC strategies build local indices within scoped threads and merge them at the end of the decompression and extraction phases. The CM strategy uses the same merge-tree algorithm, but merges into the main index after every processed micro-batch to minimize memory impact.</p>
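        <p>The merge step can be pictured with a simplified sketch that folds one thread-local partial index into the main index. It shows a plain pairwise merge rather than a merge tree, and the per-flag layout (a count plus a sorted list of block offsets) and the name merge_into are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
def merge_into(main, local):
    """Merge a thread-local partial index into the main index in place.

    Both map FLAG -> {"count": int, "offsets": sorted block offsets}
    (hypothetical layout).
    """
    for flag, part in local.items():
        entry = main.setdefault(flag, {"count": 0, "offsets": []})
        entry["count"] += part["count"]
        # Union keeps each block once even if two parts report it.
        entry["offsets"] = sorted(set(entry["offsets"]) | set(part["offsets"]))
    return main


# Toy partial indices from two hypothetical workers.
part_a = {99: {"count": 3, "offsets": [0, 100]},
          77: {"count": 1, "offsets": [100]}}
part_b = {99: {"count": 2, "offsets": [100, 200]}}

index = {}
merge_into(index, part_a)
merge_into(index, part_b)
```
]]></preformat>
        <p>Merging after every micro-batch, as the CM strategy does, keeps only one small partial index alive at a time, at the cost of more frequent merges.</p>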
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Querying</title>
      <p>To define the flag combination for querying BAM sequences, we opted for the samtools specification using -f (required flags) and -F (forbidden flags) to keep familiarity.</p>
      <p>
        The SAM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] specification defines the following flags (bits), which can be combined to query for sequences (table columns: Bit, Hex, Read Flag).
      </p>
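      <p>For orientation, the flag bits defined by the SAM specification can be written down and combined into the integer masks passed to -f and -F. The short labels and the helper names describe and mask_of are our own illustrative choices, not part of bafiq or samtools.</p>
      <preformat><![CDATA[
```python
# Bit values as defined in the SAM specification; labels are paraphrased.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def describe(flag):
    """List the labels of all bits set in a FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def mask_of(*names):
    """Combine labelled flags into one integer mask (for -f or -F)."""
    by_name = {name: bit for bit, name in SAM_FLAGS.items()}
    mask = 0
    for name in names:
        mask |= by_name[name]
    return mask


required = mask_of("paired", "proper pair")       # value for -f
forbidden = mask_of("PCR or optical duplicate")   # value for -F
labels = describe(99)
```
]]></preformat>
      <p>With these masks, a query for properly paired, non-duplicate reads would pass 3 as the required value and 1024 as the forbidden value.</p>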
    </sec>
    <sec id="sec-5">
      <title>5. Benchmarking</title>
      <p>We measured CPU and memory utilization as well as the overall time to finish a task for all 3 selected bafiq strategies, alongside samtools as a reference. As bafiq’s advantage comes primarily after the index is built, we also added our own version of quick counting and filtering (bafiq fast-count), which more closely resembles what samtools view -c does, to demonstrate the underlying performance.</p>
      <p>Apart from the actual index building, we benchmarked sequence retrieval using the .bfi index.</p>
      <p>Two short-read BAM files originating from a typical NGS WGS 30x pipeline were selected for benchmarking: chr22 (1.3 GB) and chr1 (8.2 GB), aligned to the hg38 reference. As many bafiq features build on parallel processing, all benchmarking runs had an explicit --threads setting aligned with samtools for a fair comparison.</p>
      <p>All benchmark runs either produced or consumed uncompressed indices, as the final index size was not an issue (∼0.5% of the input file).</p>
      <sec id="sec-5-1">
        <title>5.1. Bench 1: first query (bafiq index)</title>
        <p>As seen in the index building benchmarks (Table 3), the best performing strategy was Work Stealing, with an index building time close to that of samtools view -c across the different numbers of available threads, and even slightly faster when both were constrained to 2 threads. Because the subsequent count query is served pre-computed from the index and resolves in under 20 ms, we consider index building time-equivalent to bafiq index + bafiq query.</p>
        <p>We can clearly see that the channel management overhead and work organization required by the Channel Producer-Consumer strategy are very costly, especially with limited threads, but become less significant as the number of available threads grows.</p>
        <p>Due to memory mapping, both the CPC and WS strategies can consume memory up to the original BAM file size, which is not practical for full-genome alignments. Because of that, we have included a more memory-conservative strategy, Constant Memory (CM), which, although not as performant as WS, keeps a constant memory footprint (∼100 MB) independent of the input file size.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Bench 2: sequence extraction (bafiq view)</title>
        <p>Sequence retrieval with bafiq view (Table 4) leverages the stored byte offsets of gzip blocks (indexed for each flag combination), so that reading from the original BAM file no longer depends on a pre-scanning step and can be limited to the relevant blocks. As each block stores 180 sequences on average, traversing it is a relatively cheap operation.</p>
        <p>The .bfi index performs particularly well when retrieving flag combinations that are rare relative to the bulk of records (chr22 0x4, ∼3,922). However, its utility as a performance aid decreases with the abundance of matching reads (as seen for the chr22 0x2 and 0x10 flags).</p>
        <p>Despite that observed decrease in performance, index-based retrieval was still faster than retrieval without the index in most measured scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Supplementary Tools</title>
      <sec id="sec-6-1">
        <title>6.1. Index Viewer (bafiq-viewer)</title>
        <p>As the .bfi index file is stored in binary form to save space, we have developed a tool to quickly glance over its contents, including basic index statistics. For example, a .bfi index of chromosome 1 can look as follows:</p>
        <preformat>Total records:   69095506
Total bins:      56
Non-empty bins:  56
Total blocks:    379169
Reads per block: min=35, max=310, avg=182.2</preformat>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>We have shown that exploratory analysis of aligned reads based on FLAG combinations can benefit greatly from a dedicated index, both for obtaining sequence counts (from seconds to milliseconds) and for sequence retrieval when the flag combination matches a smaller subset of reads. However, in cases where the flag combination matches a large proportion of the sequences in the BAM file, the benefits of the stored offsets diminish in the current implementation.</p>
      <p>Nevertheless, the code base and the insights from the individual strategies presented can serve as a demonstrator of the utility of building narrowly focused indices. The scope of the present FLAG index could also be extended to include other fields of the alignment records (such as TAG or MAPQ).</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Claude Sonnet 4 in order to: Assist in implementing parts of the Rust code base. After using the tool, the authors reviewed and edited the content as needed and take full responsibility for the code content.</p>
      <p>The authors have not employed any Generative AI tools for the publication content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Online Resources</title>
      <p>The source code for the bafiq tool is available via GitHub (https://github.com/honzakotrs/bafiq).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schbath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zytnicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fayolle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Loux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Gibrat</surname>
          </string-name>
          ,
          <article-title>Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis</article-title>
          ,
          <source>Journal of Computational Biology</source>
          <volume>19</volume>
          (
          <year>2012</year>
          )
          <fpage>796</fpage>
          -
          <lpage>813</lpage>
          . doi:10.1089/cmb.2012.0022.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Handsaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wysoker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fennell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Homer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Abecasis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Durbin</surname>
          </string-name>
          , 1000 Genome Project Data Processing Subgroup,
          <article-title>The Sequence Alignment/Map format and SAMtools</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>25</volume>
          (
          <year>2009</year>
          )
          <fpage>2078</fpage>
          -
          <lpage>2079</lpage>
          . doi:10.1093/bioinformatics/btp352.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cochrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Alako</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Amid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bower</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cerdeño-Tárraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cleland</surname>
          </string-name>
          , et al.,
          <article-title>Facing growth in the European Nucleotide Archive</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>41</volume>
          (
          <year>2012</year>
          )
          <fpage>D30</fpage>
          -
          <lpage>D35</lpage>
          . doi:10.1093/nar/gks1175.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cechova</surname>
          </string-name>
          ,
          <article-title>Probably correct: Rescuing repeats with short and long reads</article-title>
          ,
          <source>Genes</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <elocation-id>48</elocation-id>
          . doi:10.3390/genes12010048.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pfeiffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gröber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Händler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Schultze</surname>
          </string-name>
          , G. Mayer,
          <article-title>Systematic evaluation of error rates and causes in short samples in next-generation sequencing</article-title>
          .,
          <source>Sci Rep</source>
          <volume>8</volume>
          (
          <year>2018</year>
          ). URL: https://doi.org/10.1038/s41598-018-29325-6.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] HTSJDK, https://github.com/samtools/htsjdk,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Heger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          , et al.,
          Pysam
          , https://github.com/pysam-developers/pysam,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Morgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pagès</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Obenchain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hayden</surname>
          </string-name>
          ,
          <article-title>Rsamtools: binary alignment (BAM), FASTA, variant call (BCF), and tabix file import</article-title>
          .
          <source>R package version 2.24.0</source>
          ,
          <year>2025</year>
          . doi:10.18129/B9.bioc.Rsamtools.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          Picard Toolkit
          , https://broadinstitute.github.io/picard/,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Faust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <article-title>SAMBLASTER: fast duplicate marking and structural variant read extraction</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>30</volume>
          (
          <year>2014</year>
          )
          <fpage>2503</fpage>
          -
          <lpage>2505</lpage>
          . doi:10.1093/bioinformatics/btu314.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tischler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leonard</surname>
          </string-name>
          ,
          <article-title>biobambam: tools for read pair collation based algorithms on BAM files</article-title>
          ,
          <source>Source Code for Biology and Medicine</source>
          <volume>9</volume>
          (
          <year>2014</year>
          )
          <elocation-id>13</elocation-id>
          . doi:10.1186/1751-0473-9-13.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Bonfield</surname>
          </string-name>
          ,
          <article-title>The Scramble conversion tool</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>30</volume>
          (
          <year>2014</year>
          )
          <fpage>2818</fpage>
          -
          <lpage>2819</lpage>
          . doi:10.1093/bioinformatics/btu390.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maragkakis</surname>
          </string-name>
          ,
          <article-title>SamQL: a structured query language and filtering tool for the SAM/BAM file format</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>474</fpage>
          . doi:10.1186/s12859-021-04390-3.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tarasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Vilella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cuppen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Nijman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prins</surname>
          </string-name>
          ,
          <article-title>Sambamba: fast processing of NGS alignment formats</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>31</volume>
          (
          <year>2015</year>
          )
          <fpage>2032</fpage>
          -
          <lpage>2034</lpage>
          . doi:10.1093/bioinformatics/btv098.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Thorvaldsdóttir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Mesirov</surname>
          </string-name>
          ,
          <article-title>Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>14</volume>
          (
          <year>2012</year>
          )
          <fpage>178</fpage>
          -
          <lpage>192</lpage>
          . doi:10.1093/bib/bbs017.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Byrska-Bishop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Evani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Abel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Regier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Corvelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Musunuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagulapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fairley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Runnels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Winterkorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lowy</surname>
          </string-name>
          ,
          <collab>Human Genome Structural Variation Consortium</collab>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flicek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Germer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Brand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Talkowski</surname>
          </string-name>
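          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Narzisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Zody</surname>
          </string-name>
          ,
          <article-title>High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios</article-title>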
          ,
          <source>Cell</source>
          <volume>185</volume>
          (
          <year>2022</year>
          )
          <fpage>3426</fpage>
          -
          <lpage>3440.e19</lpage>
          . doi:10.1016/j.cell.2022.08.004.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Antras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Crichton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Klock II</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kazlauskas</surname>
          </string-name>
          ,
          <article-title>crossbeam: Tools for concurrent programming in Rust</article-title>
          , https://github.com/crossbeam-rs/crossbeam,
          <year>2025</year>
          . Accessed: 2025-08-20.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>N.</given-names>
            <surname>Matsakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <article-title>rayon: Data parallelism in Rust</article-title>
          , https://crates.io/crates/rayon,
          <year>2025</year>
          . Accessed: 2025-08-20.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Blumofe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Leiserson</surname>
          </string-name>
          ,
          <article-title>Scheduling multithreaded computations by work stealing</article-title>
          ,
          <source>Journal of the ACM</source>
          <volume>46</volume>
          (
          <year>1999</year>
          )
          <fpage>720</fpage>
          -
          <lpage>748</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Levien</surname>
          </string-name>
          , contributors,
          <article-title>memmap2: Memory-mapped file IO for Rust</article-title>
          , https://crates.io/crates/memmap2,
          <year>2025</year>
          . Accessed: 2025-08-20.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kewley</surname>
          </string-name>
          ,
          <article-title>libdeflater: Fast DEFLATE/zlib/gzip compression and decompression</article-title>
          , https://github.com/libdeflater/libdeflater,
          <year>2025</year>
          . Accessed: 2025-08-20.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Köster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schröder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lähnemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Holtgrewe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehring</surname>
          </string-name>
          ,
          <article-title>rust-htslib: Rust bindings to HTSlib</article-title>
          , https://github.com/rust-bio/rust-htslib,
          <year>2025</year>
          . Accessed: 2025-08-20.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>