SnakeLines Workflow for SARS-CoV-2 Variant Detection from Next-Generation Sequencing Reads Adrián Goga1,2 , Miroslav Böhmer2,3,4,5 , Rastislav Hekel3 , Werner Krampl2,3,5 , Bronislava Brejová1 , Tomáš Vinař1 , Jaroslav Budiš2,3,6 , and Tomáš Szemes2,3,5 1 Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Slovakia, 2 Comenius University Science Park in Bratislava, Slovakia, 3 Geneton Ltd., Bratislava, Slovakia, 4 Slovak Centre of Scientific and Technical Information, Slovakia, 5 Faculty of Natural Sciences, Comenius University in Bratislava, Slovakia 6 Slovak Centre of Scientific and Technical Information, Slovakia Abstract: The ongoing SARS-CoV-2 pandemic, which late July of 2021), we have successfully sequenced, anal- emerged in December 2019, revolutionized genomic ysed and released 3055 viral samples with sufficient cov- surveillance, leading to new means of tracking viral spread erage (at least 3bp across more than 90% of the genome). and monitoring genetic changes in their genomes over The rest of this paper is organized as follows. In sec- time. One of the key sequencing methods used during tion 2, we give a brief overview of the DNA sequenc- the pandemic is based on massively parallel short read ing process and data analysis with focus on variant detec- sequencing based on Illumina technology. In this work, tion of the SARS-CoV-2. In section 3, we describe the we present a highly scalable and easily deployable com- SnakeLines-based variant detection pipeline we designed putational pipeline for the analysis of Illumina sequenc- for the SARS-CoV-2. Section 4 is focused on discussing ing data, which is used in Slovak SARS-CoV-2 genomic several selected problems that occurred during the pipeline surveillance efforts. We discuss several issues that arose design. Finally, in section 5 we conclude our findings, pro- during the pipeline design, and which could both pro- pose several recommendations based on our experience, vide useful insight into the analysis processes and serve and outline directions for future work. as a guideline for optimized future outbreak surveillance projects. 2 Background - Sequencing and Data 1 Introduction Analysis The Severe Acute Respiratory Syndrome Coronavirus type The reference genome of the SARS-CoV-2 has 29 903 2 (SARS-CoV-2) pandemic first emerging in Wuhan in base pairs (bp) and was sequenced in the early days of December 2019, also erupting in Slovakia in March of the epidemic [31]. The goal of sequencing of additional 2020, led to a revolution in genomic surveillance. Mul- samples is to identify differences from this reference se- tiple DNA sequencing library preparation protocols and quence, called genomic variants. Continuous monitoring data analysis pipelines have been utilized to keep track of viral genomes is crucial for the early detection of new of the rapidly evolving virus, for example ARTIC [20], mutants that may lead to more severe virulence character- SARS-CoV-2 RECoVERY [6], V-pipe [25], etc. Their use istics, such as the rate of infection. resulted in gathering millions of genomic sequences in the A typical analysis of genomic data can be divided into GISAID database [28]. three consecutive steps. At first, the input genomic mate- We present a computational pipeline, which was cru- rial is fragmented and digitalized using sequencers. The cial for the analysis of Illumina sequencing data of SARS- fragments are then compared to a related genome to iden- CoV-2 genomes in Slovakia. The pipeline was developed tify genomic variations. Finally, the effect of detected vari- using the SnakeLines framework [4], which is built upon ations is estimated. the widely used Snakemake workflow engine [14]. The SARS-CoV-2 genomes are typically sequenced from flexibility of this framework allowed us to deploy our so- samples collected for RT-qPCR diagnostic tests. The next lution within national computational infrastructures sup- step is the library preparation process, which we have done porting sequencing laboratories of Comenius University using the Illumina COVIDSeq Test. The Illumina COVID- Science Park in Bratislava and the Public Health Authority Seq Test leverages a modified version of the validated, of the Slovak Republic. Up to now (from mid-February to publicly available ARTIC multiplex PCR protocol. It in- volves a two-step multiplex PCR approach in which the Copyright c 2021 for this paper by its authors. Use permitted un- whole genome is amplified using 98 amplicon primers. der Creative Commons License Attribution 4.0 International (CC BY Multiplexing (or barcoding) is often used to enable simul- 4.0). taneous sequencing of multiple samples by ligating unique identifier sequences, called barcodes, to the reads that be- removal of low quality parts of the reads, elimination of long to a particular sample. foreign DNA sources (e.g. human), removal of the frag- The amplified DNA fragments are then subjected to ments duplicated by the PCR process in case of the Illu- a sequencing process according to the technology being mina sequencing technology, etc. used, which in our case is the Illumina Next Genera- The genomic regions from which the reads originate tion Sequencing (Illumina NGS). Illumina NGS sequences are located by performing alignments to the reference se- a predefined number of bases from both ends of each quence in step 2. This is conducted using one of the DNA fragment (we have used 2 × 36bp and 2 × 74bp read reference-mapping tools such as the widely used Bowtie2 lengths) to form ‘paired-end’ reads (see the left of Fig. 1). or BWA [15, 17]. The data describing the mappings (usu- The Illumina NGS reads, albeit much shorter than those of ally in the SAM format) is subsequently sorted by the ref- the Third Generation Sequencing technologies (for exam- erence position and indexed for quick access using the ple the Oxford Nanopore Technologies), are significantly SAMtools program [18]. Further postprocessing is possi- more accurate on the base level, which makes them gen- ble, e.g. discarding the reads that do not meet the mapping erally more suitable for calling lower frequency variants quality threshold, etc. that may indicate co-infection with multiple strains in a The step 3 - variant calling - is the identification of vari- patient, evolution of the virus in a patient, or laboratory ations that are present in the sequenced samples. These contamination [23, 27]. variations of interest consist mainly of single-nucleotide Even though the fragments are not sequenced com- polymorphisms (SNPs) and small indels, and are de- pletely, the length of the unsequenced middle part is tected using tools like FreeBayes, GATK, or BCFtools known, which proves useful in locating the read’s posi- [8, 24, 18]. The goal of these tools is to distinguish se- tion of origin. The output of the sequencing process are quencing errors in reads from real variants based on the the signal intensities of individual dNTPs (deoxynucleo- number of reads containing the variant, their quality, read side triphosphates) measured through laser excitation and mapping quality and other data. The identified variants are imaging [11], from which the nucleotide sequences are ex- incorporated into the consensus sequence of the sequenced tracted in the base calling procedure. The base called reads sample in the next step 4. The positions in which the se- occasionally contain errors that complicate further analy- quencing coverage (i.e. the number of mapped reads cov- ses. The measure of these errors is reflected in the qual- ering the position) is less than a predefined threshold get ity scores (Q-scores), which are generated for each base marked by the ‘N’ symbol, which stands for any base. This and represent the probability of erroneous base calls [10]. consensus can be then used to identify the most likely phy- Upon obtaining the base called reads, demultiplexing - the logenetic lineage by matching it to a database of known inverse step to multiplexing can be executed so that the lineages using, for example, the Pangolin tool [22]. The reads are assigned to the sequenced samples on the basis steps 2-4 are depicted in Figure 1. of their barcodes. Finally, comprehensible reports in the PDF/HTML for- The output of the sequencing process is then subjected mat describing the summary statistics and plotted graphs to a series of steps according to the type of analysis be- are generated for each individual step of the analysis. ing performed. At the top level, we distinguish between These reports are used to monitor the quality of a given reference-based and reference-free analysis on the basis of step or to detect a potential problem with the analysed whether it constructs the sequenced genome by comparing sample, caused by laboratory or computational process- it to a known reference genome or it does it using solely ing. This is usually done using a combination of special- the sequenced data. ized reporting tools, such as FastQC, MultiQC or Qual- In the reference-based analysis, the sequenced reads are imap [3, 7, 21]. directly compared to the reference genome to identify dif- The reference-free approach shares the first and last step ferences in their sequences, called genomic variants. From with the reference-based one, but the middle step consists a higher perspective, the individual steps of the reference- of a genome assembly, in which the reads are joined into based analysis can be described as follows: long sequences called contigs. This is done by observing the extent of overlaps in individual read pairs. The con- 1. Reads preprocessing tigs can be further assembled into ‘scaffolds’ in which the 2. Mapping to a reference distances between contig pairs are known. The tools used for the assembly include Spades, Unicycler and Megahit 3. Variant calling [1, 32, 16], with the summary reports of the assembly qual- ity reported by Quast [9] and the assembled graph visual- 4. Consensus construction ized by Bandage [33]. The reference-free approach is suit- 5. Generation of summary reports able for analyses of viruses without a sufficiently similar reference sequence. Furthermore, it is able to detect larger Step 1 comprises a set of procedures that clear the reads structural changes, e.g., longer insertions. of any laboratory artifacts that could degrade the results of The MALVIRUS web application demonstrates a novel a downstream analysis. These procedures include, e.g. the approach to reference-free variant calling, which uses a GAGCGTTTCC GTGCGTTTCT GTGCGTTTCT CTGTGCGTTT GGCTGAGCGT ATGGCTGTGC TGGCTGTGCG TGGCTGTGCG CGTTTCCCAC CTGTGCGTTT GAGCGTTTCC ATGGCTGTGC GTGCGTTTCT GGCTGAGCGT CGTTTCCCAC TGTGCGTTTC TGCGTTTCTC TGTGCGTTTC GAGCGTTTCT AGCGTTTCCC ATGGCTGTGCGTTTCTCACC NTGGCTGAGCGTTTCCCANN Figure 1: A schematic visualization of the Illumina NGS data analysis. Left: a set of fragments, each having 3 bp sequenced from its ends. Middle: the reads mapped to the reference sequence with the identified variants highlighted (red). Right: the consensus sequence of this example where the positions with coverage lower than 2 were marked by ‘N’. catalog of known variants from the viral population (built The archivation step ensures full reproducibility and is es- utilizing databases such as GISAID) to genotype newly- pecially useful when the number of different analyses on sequenced samples in an alignment-free fashion [5]. various datasets grows, as it becomes much more difficult to manually keep track of the data paths and executed com- mands. 3 Implementation of Our Pipeline Each of the SnakeLines rules describes a particular atomic operation of the analysis, e.g., deduplication, map- 3.1 SnakeLines Workflow Framework ping, etc., following the convention that each rule should Analyses of DNA sequencing datasets typically consist use a single third-party tool (after which it is called, see of a number of steps performed by individual tools in Fig. 2) or have a clear, easily describable purpose. As the a particular order, which is very time demanding when environmental dependencies of the tools differ and often executing each tool manually by an operator. This nat- are in conflict, each rule operates in its own Conda envi- urally leads to development of automated computational ronment. Conda (https://conda.io) is a package man- pipelines, many of which also support the isolation of used agement tool for Python supported by Snakemake to take tools in individual environments or parallel execution of care of installation of the desired package version and its independent computations. Another desirable property of execution in an isolated environment, which greatly sim- computational pipelines is the simplification of analysis plifies its deployment. reproducibility between different data analysis centers. The Snakemake engine natively supports parallel exe- An increasingly popular choice for pipeline building in cution on any number of provided cores. Each rule can bioinformatics is the Snakemake [14] workflow engine. specify the number of cores required for its execution ac- The development of a pipeline in this engine consists of cording to which Snakemake engine schedules the run. As defining a set of basic building blocks called rules in a a consequence, SnakeLines can be quickly deployed on Python-based language [14]. Each rule defines commands a wide range of computational infrastructures from regu- for transforming a set of inputs into a set of outputs, where lar personal computers to different high-performance com- both inputs and outputs are described using the same set puters (HPC). of wildcards that will be substituted upon execution. The To deploy our pipeline, one must download the Snake- Snakemake engine builds the rules into computational di- Lines framework (preferably using Conda, alternatively by rected acyclic graphs (DAGs), whose vertices are the in- cloning the Git repository) and the YAML configuration stantiated rules and a directed edge (a, b) represents the file that defines the analysis. After specifying the set of fact that the input of the rule b requires the output of the input files to be analyzed using a regular expression, the rule a [14]. The rule instances are subsequently executed Snakemake engine is initiated with the number of provided in the order given by a topological sort of the DAG. cores. SnakeLines [4] is a collection of configurable compu- tational pipelines that extends the Snakemake workflow 3.2 SARS-CoV-2 Pipeline: Tools and Settings engine to allow the user (possibly a laboratory scientist without the ability to script its own pipelines) to define As SARS-CoV-2 genomes typically contain mostly short a sequence of steps of the analysis in a comprehensible mutations, we have decided to pursue a reference-based YAML configuration file. SnakeLines then loads the rules approach. For the analysis of the Illumina COVIDSeq required for each step, lets Snakemake build and execute samples, we devised the following pipeline. The reads the computational DAG and in the end, generates the sum- preprocessing starts with the trimming step using the Cu- marizing reports and archives them together with the out- tadapt tool [19], which is configured to first remove the put log, configuration file and other necessary metadata. PCR primers from both sides of the read, then cut off those cutadapt bowtie2 trim_reads prepare_index bowtie2 filter_reads_from_reference custom BWA save_read_group prepare_index BWA map_reads_to_reference samtools sort_mapped_reads samtools samtools bam_index prepare_fai_index picard annotation picard mark_duplicates create_wgs_bed_file prepare_dict_index fastqc qualimap BCFtools picard quality_report mapping_quality_report_across_reference call_germline_variants bed_to_interval_list custom gatk summary_report prepare_vcf bedtools bgzip gatk generate_coverage_bedgraph compress_vcf collect_variant_calling_metrics tabix bedtools index_vcf create_genome_file fastqc BCFtools html_summary_for_paired_reads create_consensus_fasta cat concatenate_consensus_fasta pipeline Figure 2: The computational DAG of the SARS-CoV-2 variant calling pipeline implemented in the SnakeLines engine. Each vertex corresponds to a Snakemake rule, while a directed edge represents input/output dependence. The ‘pipeline’ rule serves to unite the files that should be generated. bases from both ends of the reads whose quality is lower 3.3 Comparison to Selected SARS-CoV-2 Pipelines than 20. The reads that end up with fewer than 20 bases af- ter the trimming are deemed too short and will be removed. As mentioned earlier, several computational pipelines The set of primers together with the cutting thresholds is have been developed or adjusted to process SARS-CoV-2 fully configurable. samples. The ARTIC pipeline [20] specializes in Oxford Nanopore Technologies sequencing and offers two work- Subsequently, the reads are subjected to a decontami- flows: Medaka [30] workflow and Nanopolish [29], where nation process, in which the fragments of human RNA the latter is aided by also using raw sequencing signals in- are eliminated. This is done by aligning the reads to the stead of only the base called reads. ARTIC covers all the human genome using the Bowtie2 [15] tool and remov- steps from base calling the raw sequencing signals to vari- ing those that were successfully mapped. Afterwards, the ant calling and subsequent consensus building. Similarly reads that contain the same fragments, presumably due to to our SnakeLines pipeline, ARTIC utilizes Conda for its the PCR amplification, are removed in the deduplication deployment. step using the FastUniq [35] tool. The reads preprocessing The RECoVERY (REconstruction of COronaVirus is finalized by the generation of summary reports assessing gEnomes & Rapid analYsis) [6] pipeline operates on raw the reads characteristics using the FastQC [3] toolset. reads, regardless of the sequencing platform being used. In the mapping step, we employed the BWA [17] tool to Apart from the variant identification, RECoVERY is also map the reads to the SARS-CoV-2 reference genome and able to annotate the open reading frames (ORFs) of the se- SAMtools [18] to sort and index the generated SAM/BAM quenced sample. As opposed to the command line usage files. Reads that originate from the same DNA fragment of SnakeLines or ARTIC, the RECoVERY pipeline is built and were duplicated by PCR are removed by the Picard into the ARIES (Advanced Research Infrastructure for Ex- [12] tool. Again, the final step is the quality statistics sum- perimentation in genomicS) web portal of the Galaxy plat- marization with the use of the Qualimap 2 [21] tool. Vari- form (https://usegalaxy.org/), which offers a user- ant calling is performed by the BCFtools [18] call com- friendly interface with the possibility to run the analyses mand with the default parameters, using only reads with on the server side. minimum mapping quality of 13, but without a restriction Of the mentioned pipelines, the most technologically on the base quality. The quality of the variant call set is similar to ours is the V-pipe [25], which also utilizes then evaluated by the GATK [24] toolset. BCFtools is also Conda for its deployment and the Snakemake engine for used to construct the consensus sequence, in which the ‘N’ execution of its components. V-pipe accepts both single- base is placed wherever the coverage is less than 3. Figure end and paired-end Illumina NGS reads. To align the 2 displays our pipeline in the form of a DAG of Snakemake reads, V-pipe proposes a novel method called ‘ngshmma- jobs. lign’, which is based on the profile hidden Markov models [25] and uses LoFreq [34], ShoRAH [36] variant callers to detect single nucleotide variants. 4 Pipeline Design Issues Modern DNA sequencing technologies opened up new possibilities for systematic characterization of viruses cir- culating in the population. However, the analysed data are affected by specific biases that have to be accounted for to rule out incorrect conclusions caused by laboratory artefacts. Thorough bioinformatics processing and qual- ity control of sequenced data are therefore crucial to ob- tain reliable genomic sequences. Upon examination of the results that were produced by an early version of our pipeline, we needed to iteratively revisit the development to account for several identified issues. Our solutions for these problems contributed to the current version of the pipeline presented in section 3. 4.1 Primer Binding Sites The Illumina COVIDSeq Test is designed to amplify SARS-CoV-2 virus-specific sequences with 98 amplicons, designed to cover a nearly entire viral genome. However, these primers have proven to be one of the main problems of such an approach. They remain in the reads while cover- ing the true variability of the virus. In Figure 3 we present an example of a short genomic region containing a mu- tation typical for a given subtype of SARS-CoV-2 virus. Reads whose beginning overlaps the position with the mu- tation do not contain the mutated base because the primers were sequenced. In contrast, we see the mutation in the reads that have the position in middle or at the end (these were sequenced from an overlapping amplicon with dif- ferent primer pairs). It may happen that the number of Figure 3: Example of a mutation (A28095T) that was un- reads with a primer is much higher than the number of detected while using Trimmomatic [2] (top) and the same reads with the mutation, and as a result the mutation is mutation after employing Cutadapt [19] (bottom). Nearby not called. This is a typical problem of protocols aimed at mutation does not overlap a primer and was detected in sequencing the whole genome by amplifying sections us- both cases. The visualization was performed with IGV ing a pool of primers. We solved this problem by replac- software [26]. ing the originally employed Trimmomatic tool [2] by the Cutadapt tool [19] to which we provided a list of ARTIC primers that were then removed from the reads. the non-sequenced sites by the ‘N’ symbol in the consen- sus sequence, which means that this site had low coverage. 4.2 Genome Coverage The proportion of genome bases masked as ‘N’ in reported samples varied between 0.01%–9.69% with the median of Amplification of the SARS-CoV-2 genome by a large 0.25%. number of PCR reactions operating in two pools can lead to the formation of primer dimers. This may in turn result 4.3 Problematic Regions for Mapping and Variant in sections of the genome with low coverage (Fig. 4). One Calling solution to this problem was proposed by Itokawa et al. [13]. They introduced 12 alternative primers in the AR- The third issue represents problematic regions for map- TIC primer set to replace primers that were predicted to ping and variant calling. Most mappers and variant callers be involved in 14 primer interactions which improved the had a huge problem to correctly determine these problem- results. Another solution, which we have used, is to mark atic sites. For example, region 11288–11296 (Fig. 5, top) Figure 4: Example of a genome coverage across a ref- erence of SARS-CoV-2 sample (top) and an example of Figure 5: Visualization of the problematic regions genome coverage histogram of SARS-CoV-2 sample (bot- 11288–11296 (top) and 28881–28884 (bottom) performed tom). The plots were generated by Qualimap 2 [21]. with IGV software [26]. was tricky as it contains quite a long deletion (9 bp) in the Pos. 11288-11296 Pos. 28881-28884 B.1.1.7 clade with four T’s inside the deleted area followed Freebayes 5 3 by three T’s and similar bases as in this deleted area. In an- GATK 3 5 other example, the region 28881–28884 was also problem- BCFtools 3 3 atic, as it had four mutated bases in a row in many B.1.1.7 Table 1: Comparison of called variants in the problem- samples. Some mappers soft-clipped the start or the end of atic regions with the use of different variant callers. Posi- the reads rather than mapped these mutated bases to this tions marked with 3 were called correctly, while positions reference region (Fig. 5, bottom), so the variant callers marked with 5 were called incorrectly. also had a problem detecting these mutations. In the third example, which we see in Fig. 6, Bowtie2 mapper had a huge issue with mapping reads around the low coverage tion file. All the required tools are installed automatically area in comparison to BWA. We resolved all the problems in an isolated virtual environment, which enables rapid mentioned above with the appropriate selection of map- setup on fresh systems and reproducibility across differ- pers and variant callers. The performance of individual ent Unix-based systems. variant callers at the problematic regions is outlined in Ta- We have successfully deployed our pipeline on two ble 1. computational clusters within Comenius University in Bratislava (CU), namely, one at the CU Science Park and 5 Conclusions and Future Work one at the Slovak Centre of Scientific and Technical In- formation. To this day (from mid-February to late July of In this paper we present a computational pipeline for the 2021), we have successfully sequenced and analyzed 3055 analysis of the SARS-CoV-2 sequencing data obtained by samples with sufficient coverage (at least 3 bp across more Illumina NGS. The pipeline is highly flexible due to the than 90% of the genome) which were then released in ded- full interchangeability of its components and parametriza- icated public repositories - European Nucleotide Archive tion of individual tools through a simple YAML configura- (ENA) and GISAID. Our pipeline continues to be rou- The configuration and an example of the presented SARS-CoV-2 Variant Detection pipeline is de- scribed in https://snakelines.readthedocs. io/en/latest/pipelines/covidseq.html. Real-world sequencing data for the analysis can be downloaded from the ENA portal (https: //www.ebi.ac.uk/ena/browser/view/PRJEB43444). Acknowledgment This publication has been published with the support of the Operational Program Integrated Infrastructure within projects: Pangenomics for personalized clinical man- agement of infected persons based on identified viral genome and human exome (Code ITMS:313011ATL7, 80%); and “Horizontal ICT support and central infras- tructure for research and development institutions” (Code ITMS: 313011F988, 10%), both co-financed by the Euro- pean Regional Development Fund. It was also supported by Comenius University Grant UK/405/2021 (5%) and a grant from VEGA 1/0458/18 (5%). References Figure 6: Visualization of the problematic region with [1] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A low sequencing coverage (21546–21552) mapped with Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Bowtie2 (top) and BWA (bottom) performed with IGV Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D software [26]. Prjibelski, et al. SPAdes: a new genome assembly algo- rithm and its applications to single-cell sequencing. Jour- nal of Computational Biology, 19(5):455–477, 2012. tinely used for national genomic surveillance of SARS- [2] Anthony M Bolger, Marc Lohse, and Bjoern Usadel. Trim- CoV-2 in Slovakia orchestrated by the Public Health Au- momatic: a flexible trimmer for illumina sequence data. thority of Slovak Republic. Bioinformatics, 30(15):2114–2120, 2014. We also cover our experience with the design and tuning [3] Joseph Brown, Meg Pirrung, and Lee Ann McCue. FQC of the pipeline development. Each of the design choices Dashboard: integrates FastQC results into a web-based, in- discussed in section 4 originated in response to the particu- teractive, and extensible FASTQ quality control tool. Bioin- lar problems that arose during the testing. For future work formatics, 33(19):3137–3139, 2017. of this kind, we recommend to include a systematic evalu- [4] Jaroslav Budis, Werner Krampl, Marcel Kucharik, ation of individual tools and thoroughly validate pipelines Rastislav Hekel, Adrian Goga, Michal Lichvar, David Smolak, Miroslav Bohmer, Andrej Balaz, Frantisek Duris, in isolated environments before their deployment. Juraj Gazdarica, Katarina Soltys, Jan Turna, Jan Radvan- The pipeline we presented was originally designed ex- szky, and Tomas Szemes. SnakeLines: integrated set clusively for the Illumina paired-end reads. However, of computational pipelines for sequencing reads. arXiv SnakeLines was recently extended to support the single- 2106.13649, 2021. end reads from ONT sequencers, which paved the way for [5] Simone Ciccolella, Luca Denti, Paola Bonizzoni, Gi- us to design an ONT version of our pipeline, which is cur- anluca Della Vedova, Yuri Pirola, and Marco Previtali. rently in development. MALVIRUS: an integrated web application for viral vari- ant calling. bioRxiv 2020.05.05.076992, 2020. [6] Luca De Sabato, Gabriele Vaccari, Arnold Knijn, Giovanni Data Availability Ianiro, Ilaria Di Bartolo, and Stefano Morabito. SARS- CoV-2 RECoVERY: a multi-platform open-source bioin- The SnakeLines framework is freely available for formatic pipeline for the automatic construction and analy- non-commercial users at https://github.com/ sis of SARS-CoV-2 genomes from NGS sequencing data. jbudis/snakelines along with Anaconda repository bioRxiv 2021.01.16.425365, 2021. https://anaconda.org/bioconda/snakelines. [7] Philip Ewels, Måns Magnusson, Sverker Lundin, and Max Instructions for installation, running, and extending Käller. MultiQC: summarize analysis results for multi- the framework, are accessible from the online docu- ple tools and samples in a single report. Bioinformatics, mentation https://snakelines.readthedocs.io/. 32(19):3047–3048, 2016. [8] Erik Garrison and Gabor Marth. Haplotype-based variant Moonshine, David Roazen, et al. Scaling accurate genetic detection from short-read sequencing. arXiv 1207.3907, variant discovery to tens of thousands of samples. bioRxiv 2012. 201178, 2018. [9] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and [25] Susana Posada-Céspedes, David Seifert, Ivan Topolsky, Glenn Tesler. QUAST: quality assessment tool for genome Kim Philipp Jablonski, Karin J Metzner, and Niko Beeren- assemblies. Bioinformatics, 29(8):1072–1075, 2013. winkel. V-pipe: a computational pipeline for assessing viral [10] Illumina. Understanding Illumina Quality Scores. Techni- genetic diversity from high-throughput data. Bioinformat- cal Note: Informatics, 23, 2014. ics, 2021. [11] Illumina. An Introduction to Next-generation Sequencing [26] James T Robinson, Helga Thorvaldsdóttir, Wendy Winck- Technology. https://www.illumina.com/content/ ler, Mitchell Guttman, Eric S Lander, Gad Getz, and Jill P dam/illumina-marketing/documents/products/ Mesirov. Integrative genomics viewer. Nature Biotechnol- illumina_sequencing_introduction.pdf, 2015. ogy, 29(1):24–26, 2011. [12] Broad Institute. Picard toolkit. http:// [27] AE Samoilov, VV Kaptelova, AY Bukharina, OY Ship- broadinstitute.github.io/picard/, 2019. ulina, EV Korneenko, AV Lukyanov, AA Grishaeva, [13] Kentaro Itokawa, Tsuyoshi Sekizuka, Masanori Hashino, AA Ploskireva, Anna S Speranskaya, and VG Akimkin. Rina Tanaka, and Makoto Kuroda. Disentangling primer Change of dominant strain during dual SARS-CoV-2 in- interactions improves SARS-CoV-2 genome sequencing by fection. medRxiv 2020.11.29.20238402, 2020. multiplex tiling PCR. PLOS One, 15(9):e0239403, 2020. [28] Yuelong Shu and John McCauley. GISAID: Global ini- [14] Johannes Köster and Sven Rahmann. Snakemake — a tiative on sharing all influenza data–from vision to reality. scalable bioinformatics workflow engine. Bioinformatics, Eurosurveillance, 22(13):30494, 2017. 28(19):2520–2522, 2012. [29] Jared Simpson. Nanopolish: Signal-level algorithms for [15] Ben Langmead and Steven L Salzberg. Fast gapped-read minION data. https://github.com/jts/nanopolish, alignment with Bowtie 2. Nature Methods, 9(4):357–359, 2018. 2012. [30] Oxford Nanopore Technologies. Medaka. https:// [16] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko github.com/nanoporetech/medaka, 2018. Sadakane, and Tak-Wah Lam. MEGAHIT: an ultra-fast [31] Changtai Wang, Zhongping Liu, Zixiang Chen, Xin Huang, single-node solution for large and complex metagenomics Mengyuan Xu, Tengfei He, and Zhenhua Zhang. The estab- assembly via succinct de Bruijn graph. Bioinformatics, lishment of reference sequence for SARS-CoV-2 and varia- 31(10):1674–1676, 2015. tion analysis. Journal of Medical Virology, 92(6):667–674, [17] Heng Li. Aligning sequence reads, clone sequences and as- 2020. sembly contigs with BWA-MEM. arXiv 1303.3997, 2013. [32] Ryan R Wick, Louise M Judd, Claire L Gorrie, and [18] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Kathryn E Holt. Unicycler: resolving bacterial genome Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and assemblies from short and long sequencing reads. PLOS Richard Durbin. The sequence alignment/map format and Computational Biology, 13(6):e1005595, 2017. SAMtools. Bioinformatics, 25(16):2078–2079, 2009. [33] Ryan R Wick, Mark B Schultz, Justin Zobel, and Kathryn E [19] Marcel Martin. Cutadapt removes adapter sequences Holt. Bandage: interactive visualization of de novo genome from high-throughput sequencing reads. EMBnet. journal, assemblies. Bioinformatics, 31(20):3350–3352, 2015. 17(1):10–12, 2011. [34] Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand, [20] ARTIC Network. The ARTIC field bioinformat- Grace Hui Ting Yeo, Swee Hoe Ong, Chang Hua Wong, ics pipeline. https://github.com/artic-network/ Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hib- fieldbioinformatics. berd, and Niranjan Nagarajan. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell- [21] Konstantin Okonechnikov, Ana Conesa, and Fernando population heterogeneity from high-throughput sequencing García-Alcalde. Qualimap 2: advanced multi-sample qual- datasets. Nucleic Acids Research, 40(22):11189–11201, ity control for high-throughput sequencing data. Bioinfor- 2012. matics, 32(2):292–294, 2016. [35] Haibin Xu, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan [22] Áine O’Toole, Emily Scher, Verity Hill, and Andrew Ram- Song, Guangrui Qian, Jinhui Chen, and Shilin Chen. Fas- baut. Pangolin - Phylogenetic Assignment of Named tUniq: a fast de novo duplicates removal tool for paired Global Outbreak LINeages. https://github.com/ short reads. PLOS One, 7(12):e52249, 2012. cov-lineages/pangolin, 2020. [36] Osvaldo Zagordi, Arnab Bhattacharya, Nicholas Eriksson, [23] Nicole Pedro, Cláudio N Silva, Ana C Magalhães, Bruno and Niko Beerenwinkel. ShoRAH: estimating the genetic Cavadas, Ana M Rocha, Ana C Moreira, Maria Salomé diversity of a mixed sample from next-generation sequenc- Gomes, Diogo Silva, Joana Sobrinho-Simões, Angélica ing data. BMC Bioinformatics, 12(1):1–5, 2011. Ramos, et al. Dynamics of a Dual SARS-CoV-2 Lineage Co-infection on a Prolonged Viral Shedding COVID-19 Case: Insights into Clinical Severity and Disease Duration. Microorganisms, 9(2):300, 2021. [24] Ryan Poplin, Valentin Ruano-Rubio, Mark A DePristo, Tim J Fennell, Mauricio O Carneiro, Geraldine A Van der Auwera, David E Kling, Laura D Gauthier, Ami Levy-