<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SnakeLines Workflow for SARS-CoV-2 Variant Detection from Next-Generation Sequencing Reads</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrián Goga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miroslav Böhmer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rastislav Hekel</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Werner Krampl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bronislava Brejová</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Vinarˇ</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaroslav Budiš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Szemes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Comenius University Science Park in Bratislava</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics and Informatics</addr-line>
          ,
          <institution>Comenius University in Bratislava</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Natural Sciences, Comenius University in Bratislava</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Geneton Ltd.</institution>
          ,
          <addr-line>Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Slovak Centre of Scientific and Technical Information</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Slovak Centre of Scientific and Technical Information</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ongoing SARS-CoV-2 pandemic, which emerged in December 2019, revolutionized genomic surveillance, leading to new means of tracking viral spread and monitoring genetic changes in their genomes over time. One of the key sequencing methods used during the pandemic is based on massively parallel short read sequencing based on Illumina technology. In this work, we present a highly scalable and easily deployable computational pipeline for the analysis of Illumina sequencing data, which is used in Slovak SARS-CoV-2 genomic surveillance efforts. We discuss several issues that arose during the pipeline design, and which could both provide useful insight into the analysis processes and serve as a guideline for optimized future outbreak surveillance projects.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The Severe Acute Respiratory Syndrome Coronavirus type
2 (SARS-CoV-2) pandemic first emerging in Wuhan in
December 2019, also erupting in Slovakia in March of
2020, led to a revolution in genomic surveillance.
Multiple DNA sequencing library preparation protocols and
data analysis pipelines have been utilized to keep track
of the rapidly evolving virus, for example ARTIC [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
SARS-CoV-2 RECoVERY [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], V-pipe [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], etc. Their use
resulted in gathering millions of genomic sequences in the
GISAID database [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        We present a computational pipeline, which was
crucial for the analysis of Illumina sequencing data of
SARSCoV-2 genomes in Slovakia. The pipeline was developed
using the SnakeLines framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is built upon
the widely used Snakemake workflow engine [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The
flexibility of this framework allowed us to deploy our
solution within national computational infrastructures
supporting sequencing laboratories of Comenius University
Science Park in Bratislava and the Public Health Authority
of the Slovak Republic. Up to now (from mid-February to
late July of 2021), we have successfully sequenced,
analysed and released 3055 viral samples with sufficient
coverage (at least 3bp across more than 90% of the genome).
      </p>
      <p>The rest of this paper is organized as follows. In
section 2, we give a brief overview of the DNA
sequencing process and data analysis with focus on variant
detection of the SARS-CoV-2. In section 3, we describe the
SnakeLines-based variant detection pipeline we designed
for the SARS-CoV-2. Section 4 is focused on discussing
several selected problems that occurred during the pipeline
design. Finally, in section 5 we conclude our findings,
propose several recommendations based on our experience,
and outline directions for future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Background - Sequencing and Data</title>
    </sec>
    <sec id="sec-3">
      <title>Analysis</title>
      <p>
        The reference genome of the SARS-CoV-2 has 29 903
base pairs (bp) and was sequenced in the early days of
the epidemic [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. The goal of sequencing of additional
samples is to identify differences from this reference
sequence, called genomic variants. Continuous monitoring
of viral genomes is crucial for the early detection of new
mutants that may lead to more severe virulence
characteristics, such as the rate of infection.
      </p>
      <p>A typical analysis of genomic data can be divided into
three consecutive steps. At first, the input genomic
material is fragmented and digitalized using sequencers. The
fragments are then compared to a related genome to
identify genomic variations. Finally, the effect of detected
variations is estimated.</p>
      <p>SARS-CoV-2 genomes are typically sequenced from
samples collected for RT-qPCR diagnostic tests. The next
step is the library preparation process, which we have done
using the Illumina COVIDSeq Test. The Illumina
COVIDSeq Test leverages a modified version of the validated,
publicly available ARTIC multiplex PCR protocol. It
involves a two-step multiplex PCR approach in which the
whole genome is amplified using 98 amplicon primers.
Multiplexing (or barcoding) is often used to enable
simultaneous sequencing of multiple samples by ligating unique
identifier sequences, called barcodes, to the reads that
belong to a particular sample.</p>
      <p>
        The amplified DNA fragments are then subjected to
a sequencing process according to the technology being
used, which in our case is the Illumina Next
Generation Sequencing (Illumina NGS). Illumina NGS sequences
a predefined number of bases from both ends of each
DNA fragment (we have used 2 36bp and 2 74bp read
lengths) to form ‘paired-end’ reads (see the left of Fig. 1).
The Illumina NGS reads, albeit much shorter than those of
the Third Generation Sequencing technologies (for
example the Oxford Nanopore Technologies), are significantly
more accurate on the base level, which makes them
generally more suitable for calling lower frequency variants
that may indicate co-infection with multiple strains in a
patient, evolution of the virus in a patient, or laboratory
contamination [
        <xref ref-type="bibr" rid="ref23 ref27">23, 27</xref>
        ].
      </p>
      <p>
        Even though the fragments are not sequenced
completely, the length of the unsequenced middle part is
known, which proves useful in locating the read’s
position of origin. The output of the sequencing process are
the signal intensities of individual dNTPs
(deoxynucleoside triphosphates) measured through laser excitation and
imaging [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], from which the nucleotide sequences are
extracted in the base calling procedure. The base called reads
occasionally contain errors that complicate further
analyses. The measure of these errors is reflected in the
quality scores (Q-scores), which are generated for each base
and represent the probability of erroneous base calls [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Upon obtaining the base called reads, demultiplexing - the
inverse step to multiplexing can be executed so that the
reads are assigned to the sequenced samples on the basis
of their barcodes.
      </p>
      <p>The output of the sequencing process is then subjected
to a series of steps according to the type of analysis
being performed. At the top level, we distinguish between
reference-based and reference-free analysis on the basis of
whether it constructs the sequenced genome by comparing
it to a known reference genome or it does it using solely
the sequenced data.</p>
      <p>In the reference-based analysis, the sequenced reads are
directly compared to the reference genome to identify
differences in their sequences, called genomic variants. From
a higher perspective, the individual steps of the
referencebased analysis can be described as follows:</p>
      <sec id="sec-3-1">
        <title>1. Reads preprocessing</title>
      </sec>
      <sec id="sec-3-2">
        <title>2. Mapping to a reference</title>
      </sec>
      <sec id="sec-3-3">
        <title>3. Variant calling</title>
      </sec>
      <sec id="sec-3-4">
        <title>4. Consensus construction</title>
      </sec>
      <sec id="sec-3-5">
        <title>5. Generation of summary reports</title>
        <p>Step 1 comprises a set of procedures that clear the reads
of any laboratory artifacts that could degrade the results of
a downstream analysis. These procedures include, e.g. the
removal of low quality parts of the reads, elimination of
foreign DNA sources (e.g. human), removal of the
fragments duplicated by the PCR process in case of the
Illumina sequencing technology, etc.</p>
        <p>
          The genomic regions from which the reads originate
are located by performing alignments to the reference
sequence in step 2. This is conducted using one of the
reference-mapping tools such as the widely used Bowtie2
or BWA [
          <xref ref-type="bibr" rid="ref15 ref17">15, 17</xref>
          ]. The data describing the mappings
(usually in the SAM format) is subsequently sorted by the
reference position and indexed for quick access using the
SAMtools program [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Further postprocessing is
possible, e.g. discarding the reads that do not meet the mapping
quality threshold, etc.
        </p>
        <p>
          The step 3 - variant calling - is the identification of
variations that are present in the sequenced samples. These
variations of interest consist mainly of single-nucleotide
polymorphisms (SNPs) and small indels, and are
detected using tools like FreeBayes, GATK, or BCFtools
[
          <xref ref-type="bibr" rid="ref18 ref24 ref8">8, 24, 18</xref>
          ]. The goal of these tools is to distinguish
sequencing errors in reads from real variants based on the
number of reads containing the variant, their quality, read
mapping quality and other data. The identified variants are
incorporated into the consensus sequence of the sequenced
sample in the next step 4. The positions in which the
sequencing coverage (i.e. the number of mapped reads
covering the position) is less than a predefined threshold get
marked by the ‘N’ symbol, which stands for any base. This
consensus can be then used to identify the most likely
phylogenetic lineage by matching it to a database of known
lineages using, for example, the Pangolin tool [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The
steps 2-4 are depicted in Figure 1.
        </p>
        <p>
          Finally, comprehensible reports in the PDF/HTML
format describing the summary statistics and plotted graphs
are generated for each individual step of the analysis.
These reports are used to monitor the quality of a given
step or to detect a potential problem with the analysed
sample, caused by laboratory or computational
processing. This is usually done using a combination of
specialized reporting tools, such as FastQC, MultiQC or
Qualimap [
          <xref ref-type="bibr" rid="ref21 ref3 ref7">3, 7, 21</xref>
          ].
        </p>
        <p>
          The reference-free approach shares the first and last step
with the reference-based one, but the middle step consists
of a genome assembly, in which the reads are joined into
long sequences called contigs. This is done by observing
the extent of overlaps in individual read pairs. The
contigs can be further assembled into ‘scaffolds’ in which the
distances between contig pairs are known. The tools used
for the assembly include Spades, Unicycler and Megahit
[
          <xref ref-type="bibr" rid="ref1 ref16 ref32">1, 32, 16</xref>
          ], with the summary reports of the assembly
quality reported by Quast [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and the assembled graph
visualized by Bandage [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. The reference-free approach is
suitable for analyses of viruses without a sufficiently similar
reference sequence. Furthermore, it is able to detect larger
structural changes, e.g., longer insertions.
        </p>
        <p>The MALVIRUS web application demonstrates a novel
approach to reference-free variant calling, which uses a</p>
        <sec id="sec-3-5-1">
          <title>GTGCGTTTCT GTGCGTTTCT</title>
        </sec>
        <sec id="sec-3-5-2">
          <title>ATGGCTGTGC TGGCTGTGCG</title>
        </sec>
        <sec id="sec-3-5-3">
          <title>CGTTTCCCAC CTGTGCGTTT</title>
        </sec>
        <sec id="sec-3-5-4">
          <title>GTGCGTTTCT GGCTGAGCGT</title>
        </sec>
        <sec id="sec-3-5-5">
          <title>TGTGCGTTTC TGCGTTTCTC</title>
        </sec>
        <sec id="sec-3-5-6">
          <title>GAGCGTTTCC</title>
        </sec>
        <sec id="sec-3-5-7">
          <title>CTGTGCGTTT</title>
        </sec>
        <sec id="sec-3-5-8">
          <title>GGCTGAGCGT</title>
        </sec>
        <sec id="sec-3-5-9">
          <title>TGGCTGTGCG</title>
        </sec>
        <sec id="sec-3-5-10">
          <title>GAGCGTTTCC</title>
        </sec>
        <sec id="sec-3-5-11">
          <title>ATGGCTGTGC</title>
        </sec>
        <sec id="sec-3-5-12">
          <title>CGTTTCCCAC</title>
        </sec>
        <sec id="sec-3-5-13">
          <title>TGTGCGTTTC</title>
        </sec>
        <sec id="sec-3-5-14">
          <title>GAGCGTTTCT</title>
        </sec>
        <sec id="sec-3-5-15">
          <title>AGCGTTTCCC</title>
        </sec>
        <sec id="sec-3-5-16">
          <title>ATGGCTGTGCGTTTCTCACC</title>
        </sec>
        <sec id="sec-3-5-17">
          <title>NTGGCTGAGCGTTTCCCANN</title>
          <p>
            catalog of known variants from the viral population (built
utilizing databases such as GISAID) to genotype
newlysequenced samples in an alignment-free fashion [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].
3
3.1
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Implementation of Our Pipeline</title>
      <p>SnakeLines Workflow Framework
Analyses of DNA sequencing datasets typically consist
of a number of steps performed by individual tools in
a particular order, which is very time demanding when
executing each tool manually by an operator. This
naturally leads to development of automated computational
pipelines, many of which also support the isolation of used
tools in individual environments or parallel execution of
independent computations. Another desirable property of
computational pipelines is the simplification of analysis
reproducibility between different data analysis centers.</p>
      <p>
        An increasingly popular choice for pipeline building in
bioinformatics is the Snakemake [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] workflow engine.
The development of a pipeline in this engine consists of
defining a set of basic building blocks called rules in a
Python-based language [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Each rule defines commands
for transforming a set of inputs into a set of outputs, where
both inputs and outputs are described using the same set
of wildcards that will be substituted upon execution. The
Snakemake engine builds the rules into computational
directed acyclic graphs (DAGs), whose vertices are the
instantiated rules and a directed edge (a; b) represents the
fact that the input of the rule b requires the output of the
rule a [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The rule instances are subsequently executed
in the order given by a topological sort of the DAG.
      </p>
      <p>
        SnakeLines [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a collection of configurable
computational pipelines that extends the Snakemake workflow
engine to allow the user (possibly a laboratory scientist
without the ability to script its own pipelines) to define
a sequence of steps of the analysis in a comprehensible
YAML configuration file. SnakeLines then loads the rules
required for each step, lets Snakemake build and execute
the computational DAG and in the end, generates the
summarizing reports and archives them together with the
output log, configuration file and other necessary metadata.
The archivation step ensures full reproducibility and is
especially useful when the number of different analyses on
various datasets grows, as it becomes much more difficult
to manually keep track of the data paths and executed
commands.
      </p>
      <p>Each of the SnakeLines rules describes a particular
atomic operation of the analysis, e.g., deduplication,
mapping, etc., following the convention that each rule should
use a single third-party tool (after which it is called, see
Fig. 2) or have a clear, easily describable purpose. As the
environmental dependencies of the tools differ and often
are in conflict, each rule operates in its own Conda
environment. Conda (https://conda.io) is a package
management tool for Python supported by Snakemake to take
care of installation of the desired package version and its
execution in an isolated environment, which greatly
simplifies its deployment.</p>
      <p>The Snakemake engine natively supports parallel
execution on any number of provided cores. Each rule can
specify the number of cores required for its execution
according to which Snakemake engine schedules the run. As
a consequence, SnakeLines can be quickly deployed on
a wide range of computational infrastructures from
regular personal computers to different high-performance
computers (HPC).</p>
      <p>To deploy our pipeline, one must download the
SnakeLines framework (preferably using Conda, alternatively by
cloning the Git repository) and the YAML configuration
file that defines the analysis. After specifying the set of
input files to be analyzed using a regular expression, the
Snakemake engine is initiated with the number of provided
cores.
3.2</p>
      <p>
        SARS-CoV-2 Pipeline: Tools and Settings
As SARS-CoV-2 genomes typically contain mostly short
mutations, we have decided to pursue a reference-based
approach. For the analysis of the Illumina COVIDSeq
samples, we devised the following pipeline. The reads
preprocessing starts with the trimming step using the
Cutadapt tool [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which is configured to first remove the
PCR primers from both sides of the read, then cut off those
bases from both ends of the reads whose quality is lower
than 20. The reads that end up with fewer than 20 bases
after the trimming are deemed too short and will be removed.
The set of primers together with the cutting thresholds is
fully configurable.
      </p>
      <p>
        Subsequently, the reads are subjected to a
decontamination process, in which the fragments of human RNA
are eliminated. This is done by aligning the reads to the
human genome using the Bowtie2 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] tool and
removing those that were successfully mapped. Afterwards, the
reads that contain the same fragments, presumably due to
the PCR amplification, are removed in the deduplication
step using the FastUniq [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] tool. The reads preprocessing
is finalized by the generation of summary reports assessing
the reads characteristics using the FastQC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] toolset.
      </p>
      <p>
        In the mapping step, we employed the BWA [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] tool to
map the reads to the SARS-CoV-2 reference genome and
SAMtools [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to sort and index the generated SAM/BAM
files. Reads that originate from the same DNA fragment
and were duplicated by PCR are removed by the Picard
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] tool. Again, the final step is the quality statistics
summarization with the use of the Qualimap 2 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] tool.
Variant calling is performed by the BCFtools [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] call
command with the default parameters, using only reads with
minimum mapping quality of 13, but without a restriction
on the base quality. The quality of the variant call set is
then evaluated by the GATK [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] toolset. BCFtools is also
used to construct the consensus sequence, in which the ‘N’
base is placed wherever the coverage is less than 3. Figure
2 displays our pipeline in the form of a DAG of Snakemake
jobs.
3.3
      </p>
      <p>
        Comparison to Selected SARS-CoV-2 Pipelines
As mentioned earlier, several computational pipelines
have been developed or adjusted to process SARS-CoV-2
samples. The ARTIC pipeline [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] specializes in Oxford
Nanopore Technologies sequencing and offers two
workflows: Medaka [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] workflow and Nanopolish [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], where
the latter is aided by also using raw sequencing signals
instead of only the base called reads. ARTIC covers all the
steps from base calling the raw sequencing signals to
variant calling and subsequent consensus building. Similarly
to our SnakeLines pipeline, ARTIC utilizes Conda for its
deployment.
      </p>
      <p>
        The RECoVERY (REconstruction of COronaVirus
gEnomes &amp; Rapid analYsis) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] pipeline operates on raw
reads, regardless of the sequencing platform being used.
Apart from the variant identification, RECoVERY is also
able to annotate the open reading frames (ORFs) of the
sequenced sample. As opposed to the command line usage
of SnakeLines or ARTIC, the RECoVERY pipeline is built
into the ARIES (Advanced Research Infrastructure for
Experimentation in genomicS) web portal of the Galaxy
platform (https://usegalaxy.org/), which offers a
userfriendly interface with the possibility to run the analyses
on the server side.
      </p>
      <p>
        Of the mentioned pipelines, the most technologically
similar to ours is the V-pipe [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], which also utilizes
Conda for its deployment and the Snakemake engine for
execution of its components. V-pipe accepts both
singleend and paired-end Illumina NGS reads. To align the
reads, V-pipe proposes a novel method called
‘ngshmmalign’, which is based on the profile hidden Markov models
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and uses LoFreq [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], ShoRAH [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] variant callers to
detect single nucleotide variants.
4
      </p>
    </sec>
    <sec id="sec-5">
      <title>Pipeline Design Issues</title>
      <p>Modern DNA sequencing technologies opened up new
possibilities for systematic characterization of viruses
circulating in the population. However, the analysed data
are affected by specific biases that have to be accounted
for to rule out incorrect conclusions caused by laboratory
artefacts. Thorough bioinformatics processing and
quality control of sequenced data are therefore crucial to
obtain reliable genomic sequences. Upon examination of
the results that were produced by an early version of our
pipeline, we needed to iteratively revisit the development
to account for several identified issues. Our solutions for
these problems contributed to the current version of the
pipeline presented in section 3.
4.1</p>
      <p>
        Primer Binding Sites
The Illumina COVIDSeq Test is designed to amplify
SARS-CoV-2 virus-specific sequences with 98 amplicons,
designed to cover a nearly entire viral genome. However,
these primers have proven to be one of the main problems
of such an approach. They remain in the reads while
covering the true variability of the virus. In Figure 3 we present
an example of a short genomic region containing a
mutation typical for a given subtype of SARS-CoV-2 virus.
Reads whose beginning overlaps the position with the
mutation do not contain the mutated base because the primers
were sequenced. In contrast, we see the mutation in the
reads that have the position in middle or at the end (these
were sequenced from an overlapping amplicon with
different primer pairs). It may happen that the number of
reads with a primer is much higher than the number of
reads with the mutation, and as a result the mutation is
not called. This is a typical problem of protocols aimed at
sequencing the whole genome by amplifying sections
using a pool of primers. We solved this problem by
replacing the originally employed Trimmomatic tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] by the
Cutadapt tool [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to which we provided a list of ARTIC
primers that were then removed from the reads.
4.2
      </p>
      <p>
        Genome Coverage
Amplification of the SARS-CoV-2 genome by a large
number of PCR reactions operating in two pools can lead
to the formation of primer dimers. This may in turn result
in sections of the genome with low coverage (Fig. 4). One
solution to this problem was proposed by Itokawa et al.
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. They introduced 12 alternative primers in the
ARTIC primer set to replace primers that were predicted to
be involved in 14 primer interactions which improved the
results. Another solution, which we have used, is to mark
the non-sequenced sites by the ‘N’ symbol in the
consensus sequence, which means that this site had low coverage.
The proportion of genome bases masked as ‘N’ in reported
samples varied between 0:01%–9:69% with the median of
0:25%.
The third issue represents problematic regions for
mapping and variant calling. Most mappers and variant callers
had a huge problem to correctly determine these
problematic sites. For example, region 11288–11296 (Fig. 5, top)
was tricky as it contains quite a long deletion (9 bp) in the
B.1.1.7 clade with four T’s inside the deleted area followed
by three T’s and similar bases as in this deleted area. In
another example, the region 28881–28884 was also
problematic, as it had four mutated bases in a row in many B.1.1.7
samples. Some mappers soft-clipped the start or the end of
the reads rather than mapped these mutated bases to this
reference region (Fig. 5, bottom), so the variant callers
also had a problem detecting these mutations. In the third
example, which we see in Fig. 6, Bowtie2 mapper had a
huge issue with mapping reads around the low coverage
area in comparison to BWA. We resolved all the problems
mentioned above with the appropriate selection of
mappers and variant callers. The performance of individual
variant callers at the problematic regions is outlined in
Table 1.
5
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this paper we present a computational pipeline for the
analysis of the SARS-CoV-2 sequencing data obtained by
Illumina NGS. The pipeline is highly flexible due to the
full interchangeability of its components and
parametrization of individual tools through a simple YAML
configuration file. All the required tools are installed automatically
in an isolated virtual environment, which enables rapid
setup on fresh systems and reproducibility across
different Unix-based systems.</p>
      <p>We have successfully deployed our pipeline on two
computational clusters within Comenius University in
Bratislava (CU), namely, one at the CU Science Park and
one at the Slovak Centre of Scientific and Technical
Information. To this day (from mid-February to late July of
2021), we have successfully sequenced and analyzed 3055
samples with sufficient coverage (at least 3 bp across more
than 90% of the genome) which were then released in
dedicated public repositories - European Nucleotide Archive
(ENA) and GISAID. Our pipeline continues to be
routinely used for national genomic surveillance of
SARSCoV-2 in Slovakia orchestrated by the Public Health
Authority of Slovak Republic.</p>
      <p>We also cover our experience with the design and tuning
of the pipeline development. Each of the design choices
discussed in section 4 originated in response to the
particular problems that arose during the testing. For future work
of this kind, we recommend to include a systematic
evaluation of individual tools and thoroughly validate pipelines
in isolated environments before their deployment.</p>
      <p>The pipeline we presented was originally designed
exclusively for the Illumina paired-end reads. However,
SnakeLines was recently extended to support the
singleend reads from ONT sequencers, which paved the way for
us to design an ONT version of our pipeline, which is
currently in development.</p>
    </sec>
    <sec id="sec-7">
      <title>Data Availability</title>
      <p>The SnakeLines framework is freely available for
non-commercial users at https://github.com/
jbudis/snakelines along with Anaconda repository
https://anaconda.org/bioconda/snakelines.
Instructions for installation, running, and extending
the framework, are accessible from the online
documentation https://snakelines.readthedocs.io/.
The configuration and an example of the presented
SARS-CoV-2 Variant Detection pipeline is
described in https://snakelines.readthedocs.
io/en/latest/pipelines/covidseq.html.
Real-world sequencing data for the analysis can
be downloaded from the ENA portal (https:
//www.ebi.ac.uk/ena/browser/view/PRJEB43444).</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgment</title>
      <p>This publication has been published with the support of
the Operational Program Integrated Infrastructure within
projects: Pangenomics for personalized clinical
management of infected persons based on identified viral
genome and human exome (Code ITMS:313011ATL7,
80%); and “Horizontal ICT support and central
infrastructure for research and development institutions” (Code
ITMS: 313011F988, 10%), both co-financed by the
European Regional Development Fund. It was also supported
by Comenius University Grant UK/405/2021 (5%) and a
grant from VEGA 1/0458/18 (5%).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Anton</given-names>
            <surname>Bankevich</surname>
          </string-name>
          , Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin,
          <string-name>
            <given-names>Sergey I Nikolenko</given-names>
            , Son Pham,
            <surname>Andrey D Prjibelski</surname>
          </string-name>
          , et al.
          <article-title>SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing</article-title>
          .
          <source>Journal of Computational Biology</source>
          ,
          <volume>19</volume>
          (
          <issue>5</issue>
          ):
          <fpage>455</fpage>
          -
          <lpage>477</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Anthony</surname>
            <given-names>M Bolger</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Lohse</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bjoern</given-names>
            <surname>Usadel</surname>
          </string-name>
          .
          <article-title>Trimmomatic: a flexible trimmer for illumina sequence data</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>30</volume>
          (
          <issue>15</issue>
          ):
          <fpage>2114</fpage>
          -
          <lpage>2120</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Brown</surname>
          </string-name>
          , Meg Pirrung, and Lee Ann McCue.
          <article-title>FQC Dashboard: integrates FastQC results into a web-based, interactive, and extensible FASTQ quality control tool</article-title>
          . Bioinformatics,
          <volume>33</volume>
          (
          <issue>19</issue>
          ):
          <fpage>3137</fpage>
          -
          <lpage>3139</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jaroslav</given-names>
            <surname>Budis</surname>
          </string-name>
          , Werner Krampl, Marcel Kucharik, Rastislav Hekel, Adrian Goga, Michal Lichvar, David Smolak,
          <string-name>
            <given-names>Miroslav</given-names>
            <surname>Bohmer</surname>
          </string-name>
          , Andrej Balaz, Frantisek Duris, Juraj Gazdarica, Katarina Soltys, Jan Turna, Jan Radvanszky, and Tomas Szemes.
          <article-title>SnakeLines: integrated set of computational pipelines for sequencing reads</article-title>
          .
          <source>arXiv 2106.13649</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Simone</given-names>
            <surname>Ciccolella</surname>
          </string-name>
          , Luca Denti, Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Previtali</surname>
          </string-name>
          .
          <article-title>MALVIRUS: an integrated web application for viral variant calling</article-title>
          .
          <source>bioRxiv</source>
          <year>2020</year>
          .
          <volume>05</volume>
          .05.076992,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Luca</given-names>
            <surname>De Sabato</surname>
          </string-name>
          , Gabriele Vaccari, Arnold Knijn, Giovanni Ianiro, Ilaria Di Bartolo, and Stefano Morabito. SARSCoV-2
          <string-name>
            <surname>RECoVERY:</surname>
          </string-name>
          <article-title>a multi-platform open-source bioinformatic pipeline for the automatic construction and analysis of SARS-CoV-2 genomes from NGS sequencing data</article-title>
          .
          <source>bioRxiv</source>
          <year>2021</year>
          .
          <volume>01</volume>
          .16.425365,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Ewels</surname>
          </string-name>
          , Måns Magnusson, Sverker Lundin, and Max Käller.
          <article-title>MultiQC: summarize analysis results for multiple tools and samples in a single report</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>32</volume>
          (
          <issue>19</issue>
          ):
          <fpage>3047</fpage>
          -
          <lpage>3048</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Erik</given-names>
            <surname>Garrison</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gabor</given-names>
            <surname>Marth</surname>
          </string-name>
          .
          <article-title>Haplotype-based variant detection from short-read sequencing</article-title>
          .
          <source>arXiv 1207.3907</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Gurevich</surname>
          </string-name>
          , Vladislav Saveliev, Nikolay Vyahhi, and
          <string-name>
            <given-names>Glenn</given-names>
            <surname>Tesler</surname>
          </string-name>
          .
          <article-title>QUAST: quality assessment tool for genome assemblies</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>29</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1072</fpage>
          -
          <lpage>1075</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Illumina</surname>
          </string-name>
          .
          <article-title>Understanding Illumina Quality Scores</article-title>
          .
          <source>Technical Note: Informatics</source>
          ,
          <volume>23</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Illumina</surname>
          </string-name>
          .
          <article-title>An Introduction to Next-generation Sequencing Technology</article-title>
          . https://www.illumina.com/content/ dam/illumina-marketing/documents/products/ illumina_sequencing_introduction.pdf,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Broad</given-names>
            <surname>Institute</surname>
          </string-name>
          . Picard toolkit. http:// broadinstitute.github.io/picard/,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Kentaro</surname>
            <given-names>Itokawa</given-names>
          </string-name>
          , Tsuyoshi Sekizuka, Masanori Hashino, Rina Tanaka, and
          <string-name>
            <given-names>Makoto</given-names>
            <surname>Kuroda</surname>
          </string-name>
          .
          <article-title>Disentangling primer interactions improves SARS-CoV-2 genome sequencing by multiplex tiling PCR</article-title>
          .
          <source>PLOS One</source>
          ,
          <volume>15</volume>
          (
          <issue>9</issue>
          ):e0239403,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Köster</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sven</given-names>
            <surname>Rahmann</surname>
          </string-name>
          .
          <article-title>Snakemake - a scalable bioinformatics workflow engine</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>28</volume>
          (
          <issue>19</issue>
          ):
          <fpage>2520</fpage>
          -
          <lpage>2522</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ben</given-names>
            <surname>Langmead and Steven L Salzberg</surname>
          </string-name>
          .
          <article-title>Fast gapped-read alignment with Bowtie 2</article-title>
          .
          <string-name>
            <given-names>Nature</given-names>
            <surname>Methods</surname>
          </string-name>
          ,
          <volume>9</volume>
          (
          <issue>4</issue>
          ):
          <fpage>357</fpage>
          -
          <lpage>359</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Dinghua</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chi-Man</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Ruibang Luo, Kunihiko Sadakane, and
          <string-name>
            <surname>Tak-Wah Lam</surname>
          </string-name>
          .
          <article-title>MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>31</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1674</fpage>
          -
          <lpage>1676</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Heng</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM</article-title>
          .
          <source>arXiv 1303.3997</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Heng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bob</given-names>
            <surname>Handsaker</surname>
          </string-name>
          , Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin.
          <article-title>The sequence alignment/map format and SAMtools</article-title>
          . Bioinformatics,
          <volume>25</volume>
          (
          <issue>16</issue>
          ):
          <fpage>2078</fpage>
          -
          <lpage>2079</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Marcel</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>Cutadapt removes adapter sequences from high-throughput sequencing reads</article-title>
          .
          <source>EMBnet. journal</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>ARTIC</given-names>
            <surname>Network</surname>
          </string-name>
          .
          <article-title>The ARTIC field bioinformatics pipeline</article-title>
          . https://github.com/artic-network/ fieldbioinformatics.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Konstantin</surname>
            <given-names>Okonechnikov</given-names>
          </string-name>
          , Ana Conesa, and
          <string-name>
            <surname>Fernando</surname>
          </string-name>
          García-Alcalde.
          <article-title>Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>32</volume>
          (
          <issue>2</issue>
          ):
          <fpage>292</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Áine O'Toole</surname>
            , Emily Scher, Verity Hill, and
            <given-names>Andrew</given-names>
          </string-name>
          <string-name>
            <surname>Rambaut</surname>
          </string-name>
          .
          <article-title>Pangolin - Phylogenetic Assignment of Named Global Outbreak LINeages</article-title>
          . https://github.com/ cov-lineages/pangolin,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Nicole</surname>
            <given-names>Pedro</given-names>
          </string-name>
          , Cláudio N Silva, Ana C Magalhães, Bruno Cavadas, Ana M Rocha,
          <string-name>
            <surname>Ana C Moreira</surname>
          </string-name>
          , Maria Salomé Gomes, Diogo Silva, Joana Sobrinho-Simões,
          <string-name>
            <given-names>Angélica</given-names>
            <surname>Ramos</surname>
          </string-name>
          , et al.
          <article-title>Dynamics of a Dual SARS-CoV-2 Lineage Co-infection on a Prolonged Viral Shedding COVID-</article-title>
          19
          <source>Case: Insights into Clinical Severity and Disease Duration. Microorganisms</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ):
          <fpage>300</fpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Ryan</surname>
            <given-names>Poplin</given-names>
          </string-name>
          , Valentin Ruano-Rubio,
          <article-title>Mark A DePristo, Tim</article-title>
          J Fennell, Mauricio O Carneiro,
          <string-name>
            <surname>Geraldine A Van der Auwera</surname>
          </string-name>
          , David E Kling, Laura D Gauthier, Ami LevyMoonshine,
          <string-name>
            <surname>David Roazen</surname>
          </string-name>
          , et al.
          <article-title>Scaling accurate genetic variant discovery to tens of thousands of samples</article-title>
          .
          <source>bioRxiv 201178</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Susana</given-names>
            <surname>Posada-Céspedes</surname>
          </string-name>
          , David Seifert, Ivan Topolsky, Kim Philipp Jablonski,
          <string-name>
            <surname>Karin J Metzner</surname>
            , and
            <given-names>Niko</given-names>
          </string-name>
          <string-name>
            <surname>Beerenwinkel</surname>
          </string-name>
          .
          <article-title>V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>James</surname>
            <given-names>T Robinson</given-names>
          </string-name>
          , Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S Lander, Gad Getz, and Jill P Mesirov.
          <article-title>Integrative genomics viewer</article-title>
          .
          <source>Nature Biotechnology</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <fpage>24</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>AE</given-names>
            <surname>Samoilov</surname>
          </string-name>
          ,
          <article-title>VV Kaptelova, AY Bukharina, OY Shipulina, EV Korneenko, AV Lukyanov, AA Grishaeva, AA Ploskireva, Anna S Speranskaya,</article-title>
          and
          <string-name>
            <given-names>VG</given-names>
            <surname>Akimkin</surname>
          </string-name>
          .
          <article-title>Change of dominant strain during dual SARS-CoV-2 infection</article-title>
          . medRxiv
          <year>2020</year>
          .
          <volume>11</volume>
          .29.20238402,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Yuelong</given-names>
            <surname>Shu and John McCauley</surname>
          </string-name>
          . GISAID:
          <article-title>Global initiative on sharing all influenza data-from vision to reality</article-title>
          . Eurosurveillance,
          <volume>22</volume>
          (
          <issue>13</issue>
          ):
          <fpage>30494</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Jared</given-names>
            <surname>Simpson</surname>
          </string-name>
          . Nanopolish:
          <article-title>Signal-level algorithms for minION data</article-title>
          . https://github.com/jts/nanopolish,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>[30] Oxford Nanopore Technologies. Medaka</source>
          . https:// github.com/nanoporetech/medaka,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Changtai</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Zhongping Liu, Zixiang Chen, Xin Huang, Mengyuan Xu,
          <string-name>
            <given-names>Tengfei</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>and Zhenhua Zhang.</surname>
          </string-name>
          <article-title>The establishment of reference sequence for SARS-CoV-2 and variation analysis</article-title>
          .
          <source>Journal of Medical Virology</source>
          ,
          <volume>92</volume>
          (
          <issue>6</issue>
          ):
          <fpage>667</fpage>
          -
          <lpage>674</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Ryan</surname>
            <given-names>R Wick</given-names>
          </string-name>
          , Louise M Judd, Claire L Gorrie, and
          <string-name>
            <given-names>Kathryn E</given-names>
            <surname>Holt.</surname>
          </string-name>
          <article-title>Unicycler: resolving bacterial genome assemblies from short and long sequencing reads</article-title>
          .
          <source>PLOS Computational Biology</source>
          ,
          <volume>13</volume>
          (
          <issue>6</issue>
          ):e1005595,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Ryan</surname>
            <given-names>R Wick</given-names>
          </string-name>
          , Mark B Schultz,
          <string-name>
            <surname>Justin Zobel</surname>
            , and
            <given-names>Kathryn E</given-names>
          </string-name>
          <string-name>
            <surname>Holt.</surname>
          </string-name>
          <article-title>Bandage: interactive visualization of de novo genome assemblies</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>31</volume>
          (
          <issue>20</issue>
          ):
          <fpage>3350</fpage>
          -
          <lpage>3352</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Andreas</surname>
            <given-names>Wilm</given-names>
          </string-name>
          , Pauline Poh Kim Aw, Denis Bertrand, Grace Hui Ting Yeo, Swee Hoe Ong, Chang Hua Wong, Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hibberd, and Niranjan Nagarajan.
          <article-title>LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cellpopulation heterogeneity from high-throughput sequencing datasets</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>40</volume>
          (
          <issue>22</issue>
          ):
          <fpage>11189</fpage>
          -
          <lpage>11201</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Haibin</surname>
            <given-names>Xu</given-names>
          </string-name>
          , Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui Qian, Jinhui Chen, and Shilin Chen.
          <article-title>FastUniq: a fast de novo duplicates removal tool for paired short reads</article-title>
          .
          <source>PLOS One</source>
          ,
          <volume>7</volume>
          (
          <issue>12</issue>
          ):
          <fpage>e52249</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <surname>Osvaldo</surname>
            <given-names>Zagordi</given-names>
          </string-name>
          , Arnab Bhattacharya, Nicholas Eriksson, and Niko Beerenwinkel.
          <article-title>ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>