=Paper= {{Paper |id=Vol-2962/paper15 |storemode=property |title=SnakeLines Workflow for SARS-CoV-2 Variant Detection from Next-Generation Sequencing Reads |pdfUrl=https://ceur-ws.org/Vol-2962/paper15.pdf |volume=Vol-2962 |authors=Adrián Goga,Miroslav Böhmer,Rastislav Hekel,Werner Krampl,Bronislava Brejová,Tomáš Vinař,Jaroslav Budiš,Tomáš Szemes |dblpUrl=https://dblp.org/rec/conf/itat/GogaBHKBVBS21 }} ==SnakeLines Workflow for SARS-CoV-2 Variant Detection from Next-Generation Sequencing Reads== https://ceur-ws.org/Vol-2962/paper15.pdf
              SnakeLines Workflow for SARS-CoV-2 Variant Detection from
                          Next-Generation Sequencing Reads

    Adrián Goga1,2 , Miroslav Böhmer2,3,4,5 , Rastislav Hekel3 , Werner Krampl2,3,5 , Bronislava Brejová1 , Tomáš Vinař1 ,
                                       Jaroslav Budiš2,3,6 , and Tomáš Szemes2,3,5
                    1   Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Slovakia,
                                       2 Comenius University Science Park in Bratislava, Slovakia,
                                                  3 Geneton Ltd., Bratislava, Slovakia,
                                    4 Slovak Centre of Scientific and Technical Information, Slovakia,
                                5 Faculty of Natural Sciences, Comenius University in Bratislava, Slovakia
                                    6 Slovak Centre of Scientific and Technical Information, Slovakia


Abstract: The ongoing SARS-CoV-2 pandemic, which                          late July of 2021), we have successfully sequenced, anal-
emerged in December 2019, revolutionized genomic                          ysed and released 3055 viral samples with sufficient cov-
surveillance, leading to new means of tracking viral spread               erage (at least 3bp across more than 90% of the genome).
and monitoring genetic changes in their genomes over                         The rest of this paper is organized as follows. In sec-
time. One of the key sequencing methods used during                       tion 2, we give a brief overview of the DNA sequenc-
the pandemic is based on massively parallel short read                    ing process and data analysis with focus on variant detec-
sequencing based on Illumina technology. In this work,                    tion of the SARS-CoV-2. In section 3, we describe the
we present a highly scalable and easily deployable com-                   SnakeLines-based variant detection pipeline we designed
putational pipeline for the analysis of Illumina sequenc-                 for the SARS-CoV-2. Section 4 is focused on discussing
ing data, which is used in Slovak SARS-CoV-2 genomic                      several selected problems that occurred during the pipeline
surveillance efforts. We discuss several issues that arose                design. Finally, in section 5 we conclude our findings, pro-
during the pipeline design, and which could both pro-                     pose several recommendations based on our experience,
vide useful insight into the analysis processes and serve                 and outline directions for future work.
as a guideline for optimized future outbreak surveillance
projects.
                                                                          2   Background - Sequencing and Data
1    Introduction                                                             Analysis

The Severe Acute Respiratory Syndrome Coronavirus type                    The reference genome of the SARS-CoV-2 has 29 903
2 (SARS-CoV-2) pandemic first emerging in Wuhan in                        base pairs (bp) and was sequenced in the early days of
December 2019, also erupting in Slovakia in March of                      the epidemic [31]. The goal of sequencing of additional
2020, led to a revolution in genomic surveillance. Mul-                   samples is to identify differences from this reference se-
tiple DNA sequencing library preparation protocols and                    quence, called genomic variants. Continuous monitoring
data analysis pipelines have been utilized to keep track                  of viral genomes is crucial for the early detection of new
of the rapidly evolving virus, for example ARTIC [20],                    mutants that may lead to more severe virulence character-
SARS-CoV-2 RECoVERY [6], V-pipe [25], etc. Their use                      istics, such as the rate of infection.
resulted in gathering millions of genomic sequences in the                   A typical analysis of genomic data can be divided into
GISAID database [28].                                                     three consecutive steps. At first, the input genomic mate-
   We present a computational pipeline, which was cru-                    rial is fragmented and digitalized using sequencers. The
cial for the analysis of Illumina sequencing data of SARS-                fragments are then compared to a related genome to iden-
CoV-2 genomes in Slovakia. The pipeline was developed                     tify genomic variations. Finally, the effect of detected vari-
using the SnakeLines framework [4], which is built upon                   ations is estimated.
the widely used Snakemake workflow engine [14]. The                          SARS-CoV-2 genomes are typically sequenced from
flexibility of this framework allowed us to deploy our so-                samples collected for RT-qPCR diagnostic tests. The next
lution within national computational infrastructures sup-                 step is the library preparation process, which we have done
porting sequencing laboratories of Comenius University                    using the Illumina COVIDSeq Test. The Illumina COVID-
Science Park in Bratislava and the Public Health Authority                Seq Test leverages a modified version of the validated,
of the Slovak Republic. Up to now (from mid-February to                   publicly available ARTIC multiplex PCR protocol. It in-
                                                                          volves a two-step multiplex PCR approach in which the
      Copyright c 2021 for this paper by its authors. Use permitted un-
                                                                          whole genome is amplified using 98 amplicon primers.
der Creative Commons License Attribution 4.0 International (CC BY         Multiplexing (or barcoding) is often used to enable simul-
4.0).                                                                     taneous sequencing of multiple samples by ligating unique
identifier sequences, called barcodes, to the reads that be-    removal of low quality parts of the reads, elimination of
long to a particular sample.                                    foreign DNA sources (e.g. human), removal of the frag-
   The amplified DNA fragments are then subjected to            ments duplicated by the PCR process in case of the Illu-
a sequencing process according to the technology being          mina sequencing technology, etc.
used, which in our case is the Illumina Next Genera-               The genomic regions from which the reads originate
tion Sequencing (Illumina NGS). Illumina NGS sequences          are located by performing alignments to the reference se-
a predefined number of bases from both ends of each             quence in step 2. This is conducted using one of the
DNA fragment (we have used 2 × 36bp and 2 × 74bp read           reference-mapping tools such as the widely used Bowtie2
lengths) to form ‘paired-end’ reads (see the left of Fig. 1).   or BWA [15, 17]. The data describing the mappings (usu-
The Illumina NGS reads, albeit much shorter than those of       ally in the SAM format) is subsequently sorted by the ref-
the Third Generation Sequencing technologies (for exam-         erence position and indexed for quick access using the
ple the Oxford Nanopore Technologies), are significantly        SAMtools program [18]. Further postprocessing is possi-
more accurate on the base level, which makes them gen-          ble, e.g. discarding the reads that do not meet the mapping
erally more suitable for calling lower frequency variants       quality threshold, etc.
that may indicate co-infection with multiple strains in a          The step 3 - variant calling - is the identification of vari-
patient, evolution of the virus in a patient, or laboratory     ations that are present in the sequenced samples. These
contamination [23, 27].                                         variations of interest consist mainly of single-nucleotide
   Even though the fragments are not sequenced com-             polymorphisms (SNPs) and small indels, and are de-
pletely, the length of the unsequenced middle part is           tected using tools like FreeBayes, GATK, or BCFtools
known, which proves useful in locating the read’s posi-         [8, 24, 18]. The goal of these tools is to distinguish se-
tion of origin. The output of the sequencing process are        quencing errors in reads from real variants based on the
the signal intensities of individual dNTPs (deoxynucleo-        number of reads containing the variant, their quality, read
side triphosphates) measured through laser excitation and       mapping quality and other data. The identified variants are
imaging [11], from which the nucleotide sequences are ex-       incorporated into the consensus sequence of the sequenced
tracted in the base calling procedure. The base called reads    sample in the next step 4. The positions in which the se-
occasionally contain errors that complicate further analy-      quencing coverage (i.e. the number of mapped reads cov-
ses. The measure of these errors is reflected in the qual-      ering the position) is less than a predefined threshold get
ity scores (Q-scores), which are generated for each base        marked by the ‘N’ symbol, which stands for any base. This
and represent the probability of erroneous base calls [10].     consensus can be then used to identify the most likely phy-
Upon obtaining the base called reads, demultiplexing - the      logenetic lineage by matching it to a database of known
inverse step to multiplexing can be executed so that the        lineages using, for example, the Pangolin tool [22]. The
reads are assigned to the sequenced samples on the basis        steps 2-4 are depicted in Figure 1.
of their barcodes.                                                 Finally, comprehensible reports in the PDF/HTML for-
   The output of the sequencing process is then subjected       mat describing the summary statistics and plotted graphs
to a series of steps according to the type of analysis be-      are generated for each individual step of the analysis.
ing performed. At the top level, we distinguish between         These reports are used to monitor the quality of a given
reference-based and reference-free analysis on the basis of     step or to detect a potential problem with the analysed
whether it constructs the sequenced genome by comparing         sample, caused by laboratory or computational process-
it to a known reference genome or it does it using solely       ing. This is usually done using a combination of special-
the sequenced data.                                             ized reporting tools, such as FastQC, MultiQC or Qual-
   In the reference-based analysis, the sequenced reads are     imap [3, 7, 21].
directly compared to the reference genome to identify dif-         The reference-free approach shares the first and last step
ferences in their sequences, called genomic variants. From      with the reference-based one, but the middle step consists
a higher perspective, the individual steps of the reference-    of a genome assembly, in which the reads are joined into
based analysis can be described as follows:                     long sequences called contigs. This is done by observing
                                                                the extent of overlaps in individual read pairs. The con-
 1. Reads preprocessing                                         tigs can be further assembled into ‘scaffolds’ in which the
 2. Mapping to a reference                                      distances between contig pairs are known. The tools used
                                                                for the assembly include Spades, Unicycler and Megahit
 3. Variant calling                                             [1, 32, 16], with the summary reports of the assembly qual-
                                                                ity reported by Quast [9] and the assembled graph visual-
 4. Consensus construction                                      ized by Bandage [33]. The reference-free approach is suit-
 5. Generation of summary reports                               able for analyses of viruses without a sufficiently similar
                                                                reference sequence. Furthermore, it is able to detect larger
   Step 1 comprises a set of procedures that clear the reads    structural changes, e.g., longer insertions.
of any laboratory artifacts that could degrade the results of      The MALVIRUS web application demonstrates a novel
a downstream analysis. These procedures include, e.g. the       approach to reference-free variant calling, which uses a
                                                     GAGCGTTTCC
GTGCGTTTCT GTGCGTTTCT                              CTGTGCGTTT
                                                 GGCTGAGCGT
ATGGCTGTGC TGGCTGTGCG                           TGGCTGTGCG
CGTTTCCCAC CTGTGCGTTT                                GAGCGTTTCC
                                               ATGGCTGTGC
GTGCGTTTCT GGCTGAGCGT                                   CGTTTCCCAC
TGTGCGTTTC TGCGTTTCTC                               TGTGCGTTTC
                                                     GAGCGTTTCT
                                                      AGCGTTTCCC
                                               ATGGCTGTGCGTTTCTCACC                      NTGGCTGAGCGTTTCCCANN
Figure 1: A schematic visualization of the Illumina NGS data analysis. Left: a set of fragments, each having 3 bp
sequenced from its ends. Middle: the reads mapped to the reference sequence with the identified variants highlighted
(red). Right: the consensus sequence of this example where the positions with coverage lower than 2 were marked by ‘N’.


catalog of known variants from the viral population (built      The archivation step ensures full reproducibility and is es-
utilizing databases such as GISAID) to genotype newly-          pecially useful when the number of different analyses on
sequenced samples in an alignment-free fashion [5].             various datasets grows, as it becomes much more difficult
                                                                to manually keep track of the data paths and executed com-
                                                                mands.
3     Implementation of Our Pipeline                               Each of the SnakeLines rules describes a particular
                                                                atomic operation of the analysis, e.g., deduplication, map-
3.1   SnakeLines Workflow Framework
                                                                ping, etc., following the convention that each rule should
Analyses of DNA sequencing datasets typically consist           use a single third-party tool (after which it is called, see
of a number of steps performed by individual tools in           Fig. 2) or have a clear, easily describable purpose. As the
a particular order, which is very time demanding when           environmental dependencies of the tools differ and often
executing each tool manually by an operator. This nat-          are in conflict, each rule operates in its own Conda envi-
urally leads to development of automated computational          ronment. Conda (https://conda.io) is a package man-
pipelines, many of which also support the isolation of used     agement tool for Python supported by Snakemake to take
tools in individual environments or parallel execution of       care of installation of the desired package version and its
independent computations. Another desirable property of         execution in an isolated environment, which greatly sim-
computational pipelines is the simplification of analysis       plifies its deployment.
reproducibility between different data analysis centers.           The Snakemake engine natively supports parallel exe-
   An increasingly popular choice for pipeline building in      cution on any number of provided cores. Each rule can
bioinformatics is the Snakemake [14] workflow engine.           specify the number of cores required for its execution ac-
The development of a pipeline in this engine consists of        cording to which Snakemake engine schedules the run. As
defining a set of basic building blocks called rules in a       a consequence, SnakeLines can be quickly deployed on
Python-based language [14]. Each rule defines commands          a wide range of computational infrastructures from regu-
for transforming a set of inputs into a set of outputs, where   lar personal computers to different high-performance com-
both inputs and outputs are described using the same set        puters (HPC).
of wildcards that will be substituted upon execution. The          To deploy our pipeline, one must download the Snake-
Snakemake engine builds the rules into computational di-        Lines framework (preferably using Conda, alternatively by
rected acyclic graphs (DAGs), whose vertices are the in-        cloning the Git repository) and the YAML configuration
stantiated rules and a directed edge (a, b) represents the      file that defines the analysis. After specifying the set of
fact that the input of the rule b requires the output of the    input files to be analyzed using a regular expression, the
rule a [14]. The rule instances are subsequently executed       Snakemake engine is initiated with the number of provided
in the order given by a topological sort of the DAG.            cores.
   SnakeLines [4] is a collection of configurable compu-
tational pipelines that extends the Snakemake workflow          3.2   SARS-CoV-2 Pipeline: Tools and Settings
engine to allow the user (possibly a laboratory scientist
without the ability to script its own pipelines) to define      As SARS-CoV-2 genomes typically contain mostly short
a sequence of steps of the analysis in a comprehensible         mutations, we have decided to pursue a reference-based
YAML configuration file. SnakeLines then loads the rules        approach. For the analysis of the Illumina COVIDSeq
required for each step, lets Snakemake build and execute        samples, we devised the following pipeline. The reads
the computational DAG and in the end, generates the sum-        preprocessing starts with the trimming step using the Cu-
marizing reports and archives them together with the out-       tadapt tool [19], which is configured to first remove the
put log, configuration file and other necessary metadata.       PCR primers from both sides of the read, then cut off those
  cutadapt             bowtie2
 trim_reads         prepare_index

                         bowtie2
              filter_reads_from_reference

                                          custom                   BWA
                                      save_read_group          prepare_index

                                                                 BWA
                                                         map_reads_to_reference

                                                                  samtools
                                                             sort_mapped_reads

                                                                  samtools                                           samtools
                                                                 bam_index                                       prepare_fai_index

                                                                            picard                           annotation                       picard
                                                                        mark_duplicates                  create_wgs_bed_file            prepare_dict_index

                       fastqc                                  qualimap                                     BCFtools                         picard
                    quality_report              mapping_quality_report_across_reference              call_germline_variants            bed_to_interval_list

                                               custom                                                        gatk
                                            summary_report                                                prepare_vcf

                                                                                  bedtools                         bgzip                          gatk
                                                                        generate_coverage_bedgraph              compress_vcf         collect_variant_calling_metrics

                                                                                                                    tabix            bedtools
                                                                                                                  index_vcf     create_genome_file

                   fastqc                                                                                         BCFtools
         html_summary_for_paired_reads                                                                     create_consensus_fasta

                                                                                                                          cat
                                                                                                              concatenate_consensus_fasta

                                                             pipeline


Figure 2: The computational DAG of the SARS-CoV-2 variant calling pipeline implemented in the SnakeLines engine.
Each vertex corresponds to a Snakemake rule, while a directed edge represents input/output dependence. The ‘pipeline’
rule serves to unite the files that should be generated.


bases from both ends of the reads whose quality is lower                                  3.3   Comparison to Selected SARS-CoV-2 Pipelines
than 20. The reads that end up with fewer than 20 bases af-
ter the trimming are deemed too short and will be removed.                                As mentioned earlier, several computational pipelines
The set of primers together with the cutting thresholds is                                have been developed or adjusted to process SARS-CoV-2
fully configurable.                                                                       samples. The ARTIC pipeline [20] specializes in Oxford
                                                                                          Nanopore Technologies sequencing and offers two work-
   Subsequently, the reads are subjected to a decontami-                                  flows: Medaka [30] workflow and Nanopolish [29], where
nation process, in which the fragments of human RNA                                       the latter is aided by also using raw sequencing signals in-
are eliminated. This is done by aligning the reads to the                                 stead of only the base called reads. ARTIC covers all the
human genome using the Bowtie2 [15] tool and remov-                                       steps from base calling the raw sequencing signals to vari-
ing those that were successfully mapped. Afterwards, the                                  ant calling and subsequent consensus building. Similarly
reads that contain the same fragments, presumably due to                                  to our SnakeLines pipeline, ARTIC utilizes Conda for its
the PCR amplification, are removed in the deduplication                                   deployment.
step using the FastUniq [35] tool. The reads preprocessing                                   The RECoVERY (REconstruction of COronaVirus
is finalized by the generation of summary reports assessing                               gEnomes & Rapid analYsis) [6] pipeline operates on raw
the reads characteristics using the FastQC [3] toolset.                                   reads, regardless of the sequencing platform being used.
   In the mapping step, we employed the BWA [17] tool to                                  Apart from the variant identification, RECoVERY is also
map the reads to the SARS-CoV-2 reference genome and                                      able to annotate the open reading frames (ORFs) of the se-
SAMtools [18] to sort and index the generated SAM/BAM                                     quenced sample. As opposed to the command line usage
files. Reads that originate from the same DNA fragment                                    of SnakeLines or ARTIC, the RECoVERY pipeline is built
and were duplicated by PCR are removed by the Picard                                      into the ARIES (Advanced Research Infrastructure for Ex-
[12] tool. Again, the final step is the quality statistics sum-                           perimentation in genomicS) web portal of the Galaxy plat-
marization with the use of the Qualimap 2 [21] tool. Vari-                                form (https://usegalaxy.org/), which offers a user-
ant calling is performed by the BCFtools [18] call com-                                   friendly interface with the possibility to run the analyses
mand with the default parameters, using only reads with                                   on the server side.
minimum mapping quality of 13, but without a restriction                                     Of the mentioned pipelines, the most technologically
on the base quality. The quality of the variant call set is                               similar to ours is the V-pipe [25], which also utilizes
then evaluated by the GATK [24] toolset. BCFtools is also                                 Conda for its deployment and the Snakemake engine for
used to construct the consensus sequence, in which the ‘N’                                execution of its components. V-pipe accepts both single-
base is placed wherever the coverage is less than 3. Figure                               end and paired-end Illumina NGS reads. To align the
2 displays our pipeline in the form of a DAG of Snakemake                                 reads, V-pipe proposes a novel method called ‘ngshmma-
jobs.                                                                                     lign’, which is based on the profile hidden Markov models
[25] and uses LoFreq [34], ShoRAH [36] variant callers to
detect single nucleotide variants.


4     Pipeline Design Issues

Modern DNA sequencing technologies opened up new
possibilities for systematic characterization of viruses cir-
culating in the population. However, the analysed data
are affected by specific biases that have to be accounted
for to rule out incorrect conclusions caused by laboratory
artefacts. Thorough bioinformatics processing and qual-
ity control of sequenced data are therefore crucial to ob-
tain reliable genomic sequences. Upon examination of
the results that were produced by an early version of our
pipeline, we needed to iteratively revisit the development
to account for several identified issues. Our solutions for
these problems contributed to the current version of the
pipeline presented in section 3.


4.1   Primer Binding Sites

The Illumina COVIDSeq Test is designed to amplify
SARS-CoV-2 virus-specific sequences with 98 amplicons,
designed to cover a nearly entire viral genome. However,
these primers have proven to be one of the main problems
of such an approach. They remain in the reads while cover-
ing the true variability of the virus. In Figure 3 we present
an example of a short genomic region containing a mu-
tation typical for a given subtype of SARS-CoV-2 virus.
Reads whose beginning overlaps the position with the mu-
tation do not contain the mutated base because the primers
were sequenced. In contrast, we see the mutation in the
reads that have the position in middle or at the end (these
were sequenced from an overlapping amplicon with dif-
ferent primer pairs). It may happen that the number of
                                                                Figure 3: Example of a mutation (A28095T) that was un-
reads with a primer is much higher than the number of
                                                                detected while using Trimmomatic [2] (top) and the same
reads with the mutation, and as a result the mutation is
                                                                mutation after employing Cutadapt [19] (bottom). Nearby
not called. This is a typical problem of protocols aimed at
                                                                mutation does not overlap a primer and was detected in
sequencing the whole genome by amplifying sections us-
                                                                both cases. The visualization was performed with IGV
ing a pool of primers. We solved this problem by replac-
                                                                software [26].
ing the originally employed Trimmomatic tool [2] by the
Cutadapt tool [19] to which we provided a list of ARTIC
primers that were then removed from the reads.                  the non-sequenced sites by the ‘N’ symbol in the consen-
                                                                sus sequence, which means that this site had low coverage.
4.2   Genome Coverage                                           The proportion of genome bases masked as ‘N’ in reported
                                                                samples varied between 0.01%–9.69% with the median of
Amplification of the SARS-CoV-2 genome by a large               0.25%.
number of PCR reactions operating in two pools can lead
to the formation of primer dimers. This may in turn result      4.3   Problematic Regions for Mapping and Variant
in sections of the genome with low coverage (Fig. 4). One             Calling
solution to this problem was proposed by Itokawa et al.
[13]. They introduced 12 alternative primers in the AR-         The third issue represents problematic regions for map-
TIC primer set to replace primers that were predicted to        ping and variant calling. Most mappers and variant callers
be involved in 14 primer interactions which improved the        had a huge problem to correctly determine these problem-
results. Another solution, which we have used, is to mark       atic sites. For example, region 11288–11296 (Fig. 5, top)
Figure 4: Example of a genome coverage across a ref-
erence of SARS-CoV-2 sample (top) and an example of              Figure 5: Visualization of the problematic regions
genome coverage histogram of SARS-CoV-2 sample (bot-             11288–11296 (top) and 28881–28884 (bottom) performed
tom). The plots were generated by Qualimap 2 [21].               with IGV software [26].

was tricky as it contains quite a long deletion (9 bp) in the                    Pos. 11288-11296      Pos. 28881-28884
B.1.1.7 clade with four T’s inside the deleted area followed       Freebayes             5                     3
by three T’s and similar bases as in this deleted area. In an-     GATK                  3                     5
other example, the region 28881–28884 was also problem-            BCFtools              3                     3
atic, as it had four mutated bases in a row in many B.1.1.7
                                                                 Table 1: Comparison of called variants in the problem-
samples. Some mappers soft-clipped the start or the end of
                                                                 atic regions with the use of different variant callers. Posi-
the reads rather than mapped these mutated bases to this
                                                                 tions marked with 3 were called correctly, while positions
reference region (Fig. 5, bottom), so the variant callers
                                                                 marked with 5 were called incorrectly.
also had a problem detecting these mutations. In the third
example, which we see in Fig. 6, Bowtie2 mapper had a
huge issue with mapping reads around the low coverage
                                                                 tion file. All the required tools are installed automatically
area in comparison to BWA. We resolved all the problems
                                                                 in an isolated virtual environment, which enables rapid
mentioned above with the appropriate selection of map-
                                                                 setup on fresh systems and reproducibility across differ-
pers and variant callers. The performance of individual
                                                                 ent Unix-based systems.
variant callers at the problematic regions is outlined in Ta-
                                                                    We have successfully deployed our pipeline on two
ble 1.
                                                                 computational clusters within Comenius University in
                                                                 Bratislava (CU), namely, one at the CU Science Park and
5   Conclusions and Future Work                                  one at the Slovak Centre of Scientific and Technical In-
                                                                 formation. To this day (from mid-February to late July of
In this paper we present a computational pipeline for the        2021), we have successfully sequenced and analyzed 3055
analysis of the SARS-CoV-2 sequencing data obtained by           samples with sufficient coverage (at least 3 bp across more
Illumina NGS. The pipeline is highly flexible due to the         than 90% of the genome) which were then released in ded-
full interchangeability of its components and parametriza-       icated public repositories - European Nucleotide Archive
tion of individual tools through a simple YAML configura-        (ENA) and GISAID. Our pipeline continues to be rou-
                                                                The configuration and an example of the presented
                                                                SARS-CoV-2 Variant Detection pipeline is de-
                                                                scribed   in    https://snakelines.readthedocs.
                                                                io/en/latest/pipelines/covidseq.html.
                                                                Real-world sequencing data for the analysis can
                                                                be downloaded from the ENA portal (https:
                                                                //www.ebi.ac.uk/ena/browser/view/PRJEB43444).


                                                                Acknowledgment
                                                                This publication has been published with the support of
                                                                the Operational Program Integrated Infrastructure within
                                                                projects: Pangenomics for personalized clinical man-
                                                                agement of infected persons based on identified viral
                                                                genome and human exome (Code ITMS:313011ATL7,
                                                                80%); and “Horizontal ICT support and central infras-
                                                                tructure for research and development institutions” (Code
                                                                ITMS: 313011F988, 10%), both co-financed by the Euro-
                                                                pean Regional Development Fund. It was also supported
                                                                by Comenius University Grant UK/405/2021 (5%) and a
                                                                grant from VEGA 1/0458/18 (5%).


                                                                References
Figure 6: Visualization of the problematic region with           [1] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A
low sequencing coverage (21546–21552) mapped with                    Gurevich, Mikhail Dvorkin, Alexander S Kulikov,
Bowtie2 (top) and BWA (bottom) performed with IGV                    Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D
software [26].                                                       Prjibelski, et al. SPAdes: a new genome assembly algo-
                                                                     rithm and its applications to single-cell sequencing. Jour-
                                                                     nal of Computational Biology, 19(5):455–477, 2012.
tinely used for national genomic surveillance of SARS-           [2] Anthony M Bolger, Marc Lohse, and Bjoern Usadel. Trim-
CoV-2 in Slovakia orchestrated by the Public Health Au-              momatic: a flexible trimmer for illumina sequence data.
thority of Slovak Republic.                                          Bioinformatics, 30(15):2114–2120, 2014.
   We also cover our experience with the design and tuning       [3] Joseph Brown, Meg Pirrung, and Lee Ann McCue. FQC
of the pipeline development. Each of the design choices              Dashboard: integrates FastQC results into a web-based, in-
discussed in section 4 originated in response to the particu-        teractive, and extensible FASTQ quality control tool. Bioin-
lar problems that arose during the testing. For future work          formatics, 33(19):3137–3139, 2017.
of this kind, we recommend to include a systematic evalu-        [4] Jaroslav Budis, Werner Krampl, Marcel Kucharik,
ation of individual tools and thoroughly validate pipelines          Rastislav Hekel, Adrian Goga, Michal Lichvar, David
                                                                     Smolak, Miroslav Bohmer, Andrej Balaz, Frantisek Duris,
in isolated environments before their deployment.
                                                                     Juraj Gazdarica, Katarina Soltys, Jan Turna, Jan Radvan-
   The pipeline we presented was originally designed ex-
                                                                     szky, and Tomas Szemes. SnakeLines: integrated set
clusively for the Illumina paired-end reads. However,                of computational pipelines for sequencing reads. arXiv
SnakeLines was recently extended to support the single-              2106.13649, 2021.
end reads from ONT sequencers, which paved the way for           [5] Simone Ciccolella, Luca Denti, Paola Bonizzoni, Gi-
us to design an ONT version of our pipeline, which is cur-           anluca Della Vedova, Yuri Pirola, and Marco Previtali.
rently in development.                                               MALVIRUS: an integrated web application for viral vari-
                                                                     ant calling. bioRxiv 2020.05.05.076992, 2020.
                                                                 [6] Luca De Sabato, Gabriele Vaccari, Arnold Knijn, Giovanni
Data Availability                                                    Ianiro, Ilaria Di Bartolo, and Stefano Morabito. SARS-
                                                                     CoV-2 RECoVERY: a multi-platform open-source bioin-
The SnakeLines framework is freely available for                     formatic pipeline for the automatic construction and analy-
non-commercial users at https://github.com/                          sis of SARS-CoV-2 genomes from NGS sequencing data.
jbudis/snakelines along with Anaconda repository                     bioRxiv 2021.01.16.425365, 2021.
https://anaconda.org/bioconda/snakelines.                        [7] Philip Ewels, Måns Magnusson, Sverker Lundin, and Max
Instructions for installation, running, and extending                Käller. MultiQC: summarize analysis results for multi-
the framework, are accessible from the online docu-                  ple tools and samples in a single report. Bioinformatics,
mentation https://snakelines.readthedocs.io/.                        32(19):3047–3048, 2016.
 [8] Erik Garrison and Gabor Marth. Haplotype-based variant             Moonshine, David Roazen, et al. Scaling accurate genetic
     detection from short-read sequencing. arXiv 1207.3907,             variant discovery to tens of thousands of samples. bioRxiv
     2012.                                                              201178, 2018.
 [9] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and      [25] Susana Posada-Céspedes, David Seifert, Ivan Topolsky,
     Glenn Tesler. QUAST: quality assessment tool for genome            Kim Philipp Jablonski, Karin J Metzner, and Niko Beeren-
     assemblies. Bioinformatics, 29(8):1072–1075, 2013.                 winkel. V-pipe: a computational pipeline for assessing viral
[10] Illumina. Understanding Illumina Quality Scores. Techni-           genetic diversity from high-throughput data. Bioinformat-
     cal Note: Informatics, 23, 2014.                                   ics, 2021.
[11] Illumina. An Introduction to Next-generation Sequencing       [26] James T Robinson, Helga Thorvaldsdóttir, Wendy Winck-
     Technology. https://www.illumina.com/content/                      ler, Mitchell Guttman, Eric S Lander, Gad Getz, and Jill P
     dam/illumina-marketing/documents/products/                         Mesirov. Integrative genomics viewer. Nature Biotechnol-
     illumina_sequencing_introduction.pdf, 2015.                        ogy, 29(1):24–26, 2011.
[12] Broad Institute.          Picard toolkit.         http://     [27] AE Samoilov, VV Kaptelova, AY Bukharina, OY Ship-
     broadinstitute.github.io/picard/, 2019.                            ulina, EV Korneenko, AV Lukyanov, AA Grishaeva,
[13] Kentaro Itokawa, Tsuyoshi Sekizuka, Masanori Hashino,              AA Ploskireva, Anna S Speranskaya, and VG Akimkin.
     Rina Tanaka, and Makoto Kuroda. Disentangling primer               Change of dominant strain during dual SARS-CoV-2 in-
     interactions improves SARS-CoV-2 genome sequencing by              fection. medRxiv 2020.11.29.20238402, 2020.
     multiplex tiling PCR. PLOS One, 15(9):e0239403, 2020.         [28] Yuelong Shu and John McCauley. GISAID: Global ini-
[14] Johannes Köster and Sven Rahmann. Snakemake — a                    tiative on sharing all influenza data–from vision to reality.
     scalable bioinformatics workflow engine. Bioinformatics,           Eurosurveillance, 22(13):30494, 2017.
     28(19):2520–2522, 2012.                                       [29] Jared Simpson. Nanopolish: Signal-level algorithms for
[15] Ben Langmead and Steven L Salzberg. Fast gapped-read               minION data. https://github.com/jts/nanopolish,
     alignment with Bowtie 2. Nature Methods, 9(4):357–359,             2018.
     2012.                                                         [30] Oxford Nanopore Technologies. Medaka. https://
[16] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko                     github.com/nanoporetech/medaka, 2018.
     Sadakane, and Tak-Wah Lam. MEGAHIT: an ultra-fast             [31] Changtai Wang, Zhongping Liu, Zixiang Chen, Xin Huang,
     single-node solution for large and complex metagenomics            Mengyuan Xu, Tengfei He, and Zhenhua Zhang. The estab-
     assembly via succinct de Bruijn graph. Bioinformatics,             lishment of reference sequence for SARS-CoV-2 and varia-
     31(10):1674–1676, 2015.                                            tion analysis. Journal of Medical Virology, 92(6):667–674,
[17] Heng Li. Aligning sequence reads, clone sequences and as-          2020.
     sembly contigs with BWA-MEM. arXiv 1303.3997, 2013.           [32] Ryan R Wick, Louise M Judd, Claire L Gorrie, and
[18] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue             Kathryn E Holt. Unicycler: resolving bacterial genome
     Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and               assemblies from short and long sequencing reads. PLOS
     Richard Durbin. The sequence alignment/map format and              Computational Biology, 13(6):e1005595, 2017.
     SAMtools. Bioinformatics, 25(16):2078–2079, 2009.             [33] Ryan R Wick, Mark B Schultz, Justin Zobel, and Kathryn E
[19] Marcel Martin. Cutadapt removes adapter sequences                  Holt. Bandage: interactive visualization of de novo genome
     from high-throughput sequencing reads. EMBnet. journal,            assemblies. Bioinformatics, 31(20):3350–3352, 2015.
     17(1):10–12, 2011.                                            [34] Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand,
[20] ARTIC Network.           The ARTIC field bioinformat-              Grace Hui Ting Yeo, Swee Hoe Ong, Chang Hua Wong,
     ics pipeline. https://github.com/artic-network/                    Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hib-
     fieldbioinformatics.                                               berd, and Niranjan Nagarajan. LoFreq: a sequence-quality
                                                                        aware, ultra-sensitive variant caller for uncovering cell-
[21] Konstantin Okonechnikov, Ana Conesa, and Fernando
                                                                        population heterogeneity from high-throughput sequencing
     García-Alcalde. Qualimap 2: advanced multi-sample qual-
                                                                        datasets. Nucleic Acids Research, 40(22):11189–11201,
     ity control for high-throughput sequencing data. Bioinfor-
                                                                        2012.
     matics, 32(2):292–294, 2016.
                                                                   [35] Haibin Xu, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan
[22] Áine O’Toole, Emily Scher, Verity Hill, and Andrew Ram-
                                                                        Song, Guangrui Qian, Jinhui Chen, and Shilin Chen. Fas-
     baut. Pangolin - Phylogenetic Assignment of Named
                                                                        tUniq: a fast de novo duplicates removal tool for paired
     Global Outbreak LINeages.          https://github.com/
                                                                        short reads. PLOS One, 7(12):e52249, 2012.
     cov-lineages/pangolin, 2020.
                                                                   [36] Osvaldo Zagordi, Arnab Bhattacharya, Nicholas Eriksson,
[23] Nicole Pedro, Cláudio N Silva, Ana C Magalhães, Bruno
                                                                        and Niko Beerenwinkel. ShoRAH: estimating the genetic
     Cavadas, Ana M Rocha, Ana C Moreira, Maria Salomé
                                                                        diversity of a mixed sample from next-generation sequenc-
     Gomes, Diogo Silva, Joana Sobrinho-Simões, Angélica
                                                                        ing data. BMC Bioinformatics, 12(1):1–5, 2011.
     Ramos, et al. Dynamics of a Dual SARS-CoV-2 Lineage
     Co-infection on a Prolonged Viral Shedding COVID-19
     Case: Insights into Clinical Severity and Disease Duration.
     Microorganisms, 9(2):300, 2021.
[24] Ryan Poplin, Valentin Ruano-Rubio, Mark A DePristo,
     Tim J Fennell, Mauricio O Carneiro, Geraldine A Van der
     Auwera, David E Kling, Laura D Gauthier, Ami Levy-