=Paper=
{{Paper
|id=Vol-2962/paper10
|storemode=property
|title=Estimation of Proportions of SARS-CoV-2 Variants in a Mixed Sequencing Sample
|pdfUrl=https://ceur-ws.org/Vol-2962/paper10.pdf
|volume=Vol-2962
|authors=Askar Gafurov,Andrej Baláž,Tomáš Vinař,Broňa Brejová
|dblpUrl=https://dblp.org/rec/conf/itat/GafurovBVB21
}}
==Estimation of Proportions of SARS-CoV-2 Variants in a Mixed Sequencing Sample==
<pdf width="1500px">https://ceur-ws.org/Vol-2962/paper10.pdf</pdf>
<pre>
                         Estimation of Proportions of SARS-CoV-2 Variants
                                   in a Mixed Sequencing Sample

                                Askar Gafurov1 , Andrej Baláž2 , Tomáš Vinař2 , and Broňa Brejová1
 1    Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Slovakia
2    Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Slovakia

Abstract: Estimation of proportions of SARS-CoV-2                         2     Problem description
genome variants (e.g. variant B.1.1.7 originating from
Britain, variant B.1.351 originating from South-Africa) in                2.1   Variants
a population is currently done by sequencing individual
                                                                          In our experiments, we consider several variants of the
samples, which demands individual laboratory processing
                                                                          SARS-CoV-2 virus, which are characterized by specific
of each sample. This labor can be significantly reduced
                                                                          mutations described in the Table 1. For instance, if a par-
by mixing several samples together and processing them
                                                                          ticular SARS-CoV-2 genome has the nucleotide at posi-
in one batch. Our project aims to estimate the proportion
                                                                          tion 3267 mutated from cytosine to thymine, it has one
of samples with given variants from such mixtures using
                                                                          of the mutations characteristic for the variant which origi-
probabilistic modeling.
                                                                          nated in the United Kingdom [14] and is known as a vari-
                                                                          ant of concern alpha [16]. In the Pangolin SARS-CoV-2
1      Introduction                                                       lineage classification [15], it is denoted B.1.1.7. In this
                                                                          paper, we will denote this variant by acronym UK listed
Since December 2019, the severe acute respiratory syn-                    in the table. In our list of variants, we have included two
drome coronavirus 2 (SARS-CoV-2) is rapidly spreading                     additional variant of concern (beta and gamma) as well as
across the world, causing the coronavirus infectious dis-                 several variants that had high prevalence in Slovakia in the
ease (COVID-19) pandemic. Up to April 5, 2021, there                      fall of 2020 and early 2021, which is the time from which
were 132 mil. cases of COVID-19 infection. Among these,                   our data originate. For each variant, the table also con-
109 mil. cases have already had an outcome from which                     tains the minimum number of these characteristic muta-
there was 2.8 mil. deaths, resulting in an estimate of the                tions which need to be observed in the genome in order to
mortality rate around 2.5% [18].                                          be considered as belonging to a given variant in our study.
   Due to its worldwide spread and a mutation rate of about               If a genome did not reach the number of characteristic mu-
2 mutations per month [9], the SARS-CoV-2 developed                       tations for any considered variant, it was assigned the label
multiple variants. Some emerging variants have accumu-                    “other”.
lated significantly more mutations and proved to be more                     This characterization was used on all samples from the
dangerous, causing concerns around the globe [4]. Scien-                  GISAID database (downloaded on Mar 4, 2021) [5], which
tists estimate that continued transmission of SARS-CoV-2                  provides a comprehensive resource of more than 650 thou-
and selective pressures, such as vaccines, are creating ideal             sands fully assembled SARS-CoV-2 genomes. Individ-
conditions for additional significant virus evolution [11].               ual genomes were aligned to the SARS-CoV-2 reference
Therefore, it is critical to characterize the virus strains fur-          Wuhan/Hu-1/2019 by minimap2 [10]. Then, each se-
ther and monitor the spread of the variants in the popula-                quence was assigned a label as described earlier. Fi-
tion in order to inform public policies and assess the effec-             nally, a matrix P was constructed, which consisted of
tiveness of containment strategies.                                       entries pk,i,a representing the relative frequency of nu-
   Monitoring of the SARS-CoV-2 variants requires con-                    cleotide a at genomic position i of variant k among aligned
ducting many sequencing experiments to determine the                      genomes.
genome of the virus and to identify its mutations compared                   This matrix thus characterizes each variant not only
to a reference genome. Based on this, it is then possi-                   by the selected mutations listed in Table 1, but also cap-
ble to decide which variant was sequenced. Some part of                   tures any additional significant mutations present in the
the laboratory work is done individually for each sample,                 genomic sequences classified to the variant. This approach
which is time-consuming and resource-intensive. In this                   can be easily applied to a different set of variants by simply
work, we propose a model capable of estimating the pro-                   modifying the input table of variants and their characteris-
portions of different variants in a pooled sample, which                  tic mutations.
allows to combine multiple samples into one sequencing
experiment.
                                                                          2.2   Sequencing data
      Copyright c 2021 for this paper by its authors. Use permitted un-
der Creative Commons License Attribution 4.0 International (CC BY         In order to identify the SARS-CoV-2 variant of a new
4.0).                                                                     patient, it is necessary to sequence an obtained sam-
           Variant                Acronym         Required       Mutations
           P.1 (gamma)             Brazil            8           T733C, C2749T, C3828T, A5648C, C12778T,
           B.1.258                   CZ              4           G12988T, G15598A, G18028T, T24910C, T26972C
           B.1.177                  EU1              4           C22227T, C28932T, G29645T, G21255C, C26801G
           B.1.160                  EU2              3           C4543T, G5629T, G22992A, T26876C
           B.1.351 (beta)         South.Afr          3           G23012A, A21801C, A23063T, G22813T
           B.1.1.7 (alpha)          UK              11           C3267T, C5388A, T6954C, A23063T, C23271A,
                                                                 C23604A, C23709T, T24506G, G24914C, C27972T,
                                                                 G28048T, A28111G, C28977T
                                                                 C13860T, G17259T, C21614T, C21621A
           B.1.1.170             UKBA318               3         G12824A, C25777T, G26062T, C29754T

                         Table 1: Characteristic mutations of individual variants considered in this study.


ple. In this work, the used samples were sequenced us-                    5 months after the analysis. This suggests that sewage
ing the ARTIC amplicon protocol [17] and the Oxford                       samples may provide a more comprehensive snapshot of
Nanopore MinION sequencer.                                                currently circulating SARS-CoV-2 variants in comparison
   In the ARTIC protocol, the RNA of the virus is first                   with solely clinical cases.
reverse transcribed into DNA and specific regions of
the DNA are amplified. These amplicons cover almost                          A wastewater study conducted in Spain [13] identified
the entire length of the genome with slight overlaps be-                  238 SNVs and 6 deletions in comparison with the refer-
tween adjacent regions. In our data, the amplicons had                    ence genome of SARS-CoV-2 isolate Wuhan-Hu-1. The
length about 1700 bp. Amplified DNA was sequenced                         study used 40 samples, which were sequenced with AR-
by the MinION sequencer, and the resulting sequences are                  TIC protocol v.3, analysed using the iVAR software.
called reads. The reads are then mapped to the reference
genome by tool minimap2 [10] and stored in the BAM for-                      In the wastewater study from Switzerland [7], the au-
mat.                                                                      thors showed that it is possible to use Illumina reads to de-
   Because each read comes from a genome of a specific                    tect the variants before they appear in clinical cases. The
variant, it is expected that it contains mutations that are               48 samples were collected as 24-hours composite or grab
characteristic for this variant. When analyzing a mixture                 samples. Amplicons were created using the ARTIC v.3
of reads from different patient samples, the task at hand is              protocol and sequenced using Illumina NovaSeq 6000, re-
to estimate the proportions of reads coming from individ-                 sulting in paired reads of length 250bp. The mutations
ual variants, based on the presence of mutations character-               were identified from these reads using the V-pipe bioin-
istic for these variants.                                                 formatics pipeline. These mutations were then compared
                                                                          with the clinical cases from Switzerland in the GISAID
3    Related work                                                         database for the presence of characteristic mutations of
                                                                          the B.1.1.7 and B.1.351 variants using Fisher’s exact test.
There had been several attempts to identify SARS-CoV-2                    The authors also looked at mutations co-occurring in the
variants from mixed samples. Most commonly, these were                    same read and found further evidence of the presence of
wastewater-based epidemiological studies.                                 the B.1.1.7 variant in Switzerland already in early Decem-
   Crits-Christoph et al. [3] showed that single nucleotide               ber.
variants (SNVs)1 detected from the sewage water from
San Francisco Bay Area were significantly similar to local                   Mixed samples can also originate from environmental
California-based patient-derived genotypes, thus demon-                   sources other than sewage water. The first report on re-
strating the possibility of identifying local SARS-CoV-2                  covering near-complete SARS-CoV-2 genome sequences
variants in the wastewater samples. The SNVs were ob-                     from environmental surface swabs assessed the contami-
tained via SNV caller inStrain v1.3.2 [12] and the similar-               nation with the virus in a hospital [2]. The authors of this
ity of genotypes was established via Fisher’s exact test.                 study confirmed the low likelihood that SARS-CoV-2 con-
   Another study [6] analyzed 91 wastewater samples from                  tamination on hospital surfaces contains infectious virus.
11 states in the USA and identified 7973 SNVs, of which
5680 were “novel” at the time of the analysis with respect                   Although many studies have identified SARS-CoV-2
to the global clinically derived data. Interestingly, almost              variants from mixed environmental samples, all have qual-
half of the “novel” variants were confirmed within the next               itative results. In this paper, we estimate the proportions
    1 Single nucleotide variants are simple mutations substituting one    of the variants in the sample, which we believe is corre-
DNA base for another, not to be confused with virus variants, which are   lated with the proportions of the variants circulating in the
groups of evolutionarily related viruses.                                 community.
4     Model description                                        do the memory requirements of data preprocessing, allow-
                                                               ing for the efficient processing of gigabytes of sequencing
In our model, we assume that in a set of aligned reads         data even on common computers. All processes are easily
coming from one variant, the expected observed symbol          parallelizable and could be potentially accelerated using
counts at any given position in genome are proportional        a GPU.
to the symbol frequencies among all sequences in a given
variant, which are captured in matrix P described in sub-
                                                               4.3   Adding error model
section 2.1. The observed symbol counts at a particular
position can be thus described by a multinomial distribu-      Sequencing reads contain errors, where the symbol in the
tion. To simplify the model, we further assume that indi-      read differs from the actual genome being sequenced. We
vidual positions in the genome are independent. When se-       assume that these errors occur uniformly at random and
quencing a mixed sample, we assume that individual vari-       add them to the model as follows. Let ε ∈ (0, 1) be the
ants are present at some unknown proportions which are         substitution error rate. Let m̄i,a (W ) := 1 − mi,a (W ) denote
represented as weights in a mixture model. We assume           the probability of observing a base different from base a
that these proportions stay the same at all positions in the   at position i in a mixture sample without sequencing er-
genome. A more detailed description of the model is given      rors. The probability of observing symbol a at position i
in the next subsection.                                        in a mixture sample sequenced with errors is then equal to:
                                                                                                            ε
                                                                      qi,a (W, ε) = (1 − ε) · mi,a (W ) +     · m̄i,a (W ).
4.1   Basic model                                                                                           3
Let Σ = {A,C, G, T } be the symbol alphabet. Let K de-           The likelihood of observing O given mixture weights W
note the number of virus variants and L denote the length      and substitution rate ε is then proportional to:
of the reference genome. Let W = (w1 , . . . , wK ) denote                                  L
the (unknown) weights of individual variants in a mixture.                  Pr[O|W, ε] ∼ ∏ ∏ qi,a (W, ε)Oi,a .
Let pk,i,a denote the probability of observing symbol a at                                 i=1 a∈Σ
position i in the k-th variant and Oi,a denote the count of
                                                                  The error rate ε can be set to a particular value, or be
symbol a at position i in the observed reads.
                                                               inferred simultaneously with the mixture weights W :
   The probability of observing symbol a at position i
in a mixture with weights W is then equal to mi,a (W ) :=                                       L
∑Kk=1 wk · pk,i,a .                                                  (W ∗ , ε ∗ ) := arg min − ∑ ∑ Oi,a · log qi,a (W, ε)
                                                                                    W,ε      i=1 a∈Σ
   The total probability of observations O given mixture
weights W is then proportional to:                               In our experiments, we used the latter option.
                              L
              Pr[O|W ] ∼ ∏ ∏ mi,a (W )Oi,a .                   4.4   Evaluation of a posteriori probability of a given
                          i=1 a∈Σ
                                                                     read belonging to a particular variant
  The inference of the mixture weights from observations
                                                               We can represent a read as a data set consisting of only one
can be then solved via maximisation of the log-likelihood
                                                               read, i.e. Oi,a is equal to 1 if read has symbol a aligned to
function:
                                                               reference’s position i, and 0 otherwise. Let V ∈ {1, . . . , K}
                          L                                    be a random variable representing the variant that the read
         W ∗ := arg max ∑ ∑ Oi,a · log mi,a (W ).              belongs to. The likelihood of a read belonging to variant
                    W    i=1 a∈Σ                               k is equal to Pr[O|V = k] = Pr[O|W = ek ], where ek is the
This task is equivalent to minimisation of cross-entropy       vector of length K with value 1 at position k and value 0
between m and O:                                               elsewhere. The probability of the read belonging to variant
                                                               k given its sequence is then, by the Bayes theorem, equal
                              L                                to:
        W ∗ := arg min − ∑ ∑ Oi,a · log mi,a (W )
                    W     i=1 a∈Σ                                                         Pr[O|V = k] · Pr[V = k]
                                                                     Pr[V = k|O] =                                            .
                                                                                       ∑Kj=1 Pr[O|V = j] · Pr[V = j]
4.2   Minimisation and efficiency
                                                                                                 1
                                                               Assuming a uniform prior Pr[V ] = , the formula reduces
The minimisation is done using the L-BFGS-B algorithm                                            K
[19] implemented in Python library scipy [8]. Since the        to:
                                                                                            Pr[O|V = k]
input data for the minimisation process are essentially two              Pr[V = k|O] = K                   .
tables of fixed sizes L × |Σ| and K × L × |Σ|, the time and                              ∑ j=1 Pr[O|V = j]
memory requirements of the optimisation process itself do         The notion of posterior probability enables us to clas-
not depend on the amount of sequencing reads and neither       sify individual reads into variants by choosing the variant
      Variant      Isolate                                            The true proportions of variants in the samples is calcu-
      CZ           UKBA-702, UKBA-722, UKBA-809,                   lated as the proportion of the sums of read lengths for each
                   UKBA-818                                        variant.
      EU1          UKBA-716, UKBA-717
      EU2          UKBA-701                                        5.2   Results on all reads
      UK           UKBA-703, UKBA-704, UKBA-705,                   The results on the simulated mixes are shown in Table 3.
                   UKBA-706, UKBA-707, UKBA-708,                   We can see that the model is able to distinguish between
                   UKBA-713, UKBA-714, UKBA-718,                   present and absent variants, while significantly overesti-
                   UKBA-719, UKBA-720, UKBA-801,                   mating the “other” variant in all samples.
                   UKBA-802, UKBA-803, UKBA-804,
                                                                       In order to find an explanation for this discrepancy, we
                   UKBA-805, UKBA-806, UKBA-807,
                                                                   analysed the posterior probabilities (see subsection 4.4) of
                   UKBA-808, UKBA-814, UKBA-815,
                                                                   individual reads (see Figure 1). The analysis shows that
                   UKBA-816, UKBA-817
                                                                   there are many “uncertain” reads (i.e. reads with no strong
      UKBA318      UKBA-709, UKBA-710, UKBA-711,                   affinity towards any variant), particularly so among reads,
                   UKBA-712, UKBA-723, UKBA-724                    classified into “other” variant.
                                                                       We decided to formalize the notion of “uncertainty” and
      other        UKBA-715
                                                                   filter out such reads from the data set before running the
                Table 2: Table of used samples                     model.


with the maximum a posteriori probability (MAP). We use            5.3   Filtering out uncertain reads
this read classification for visualization and filtering as dis-
cussed in the next section.                                        Let k-certainty of a read be defined as a sum of its k highest
                                                                   posterior probabilities of belonging to a particular variant.
   Unfortunately, this approach requires to set the substi-
tution rate ε in advance. In our experiments, we estimated            We decided to use 1-certainty criterion in our experi-
the substitution rate by first running the minimisation al-        ments.
gorithm on the available data.                                        The pipeline works as follows: first we estimate the sub-
                                                                   stitution error rate by running the model on the full data
                                                                   set. Then, using the estimated error rate, we evaluate the
5      Experiments                                                 1-certainty for each read. Reads with 1-certainty below
                                                                   a given threshold are then removed from the data set, and
5.1     Simulated data                                             the observed counts Oi,a are calculated. Finally, we run
                                                                   the model on these counts, reporting the mixture weights
To evaluate the model, we created simulated data sets,             W as the final answer.
where the correct proportions of the variants were known.             We ran our algorithm on the simulated mixes with dif-
We have used 37 FASTQ files from the ARTIC sequenc-                ferent filtering thresholds. The estimated weights at dif-
ing experiments, conducted at Biomedical Research Cen-             ferent filtration thresholds are shown in Figure 2. The
ter of the Slovak Academy of Sciences, using protocols             KL-divergence between the true and estimated weights are
described in [1], where each SARS-CoV-2 sample was se-             shown in Figure 3.
quenced separately (using a different barcode).                       This heuristic works reasonably well on data sets with-
   The list of samples and their classification to variants is     out reads from “other” variant (Figures 2a and 2b). How-
shown in Table 2. Each file was mapped to the reference            ever, it fails on data sets with “other” variant present
and the resulting BAM files were merged to create four             (Figures 2c and 2d) because reads from “other” variant
mixed samples as follows:                                          have generally lower 1-certainty (Figure 1), and are con-
                                                                   sequently filtered out.
    1. 1x CZ, 1x EU1, 1x EU2, 1x UK (92 840 reads, aver-
       age base coverage 5 858)
                                                                   6     Discussion
    2. 4x CZ, 2x EU1, 1x EU2, 3x UK, 1x UKBA318
       (264 113 reads, average base coverage 16 562)
                                                                   We have demonstrated that our method can predict the
    3. 2x EU1, 1x EU2, 1x other (99 711 reads, average base        proportions of individual SARS-CoV-2 variants in mixed
       coverage 6 273)                                             samples when all variants in a sample are known to the
                                                                   model (i.e. there are no reads from “other” variant).
    4. 3x CZ, 2x EU1, 9x UK, 1x other (335 601 reads, av-            In our future work on this topic, we would like to ad-
       erage base coverage 21 044)                                 dress the following issues:
                  variant           Mix #1               Mix #2              Mix #3               Mix #4
                               true   estimated     true   estimated    true   estimated     true   estimated
                  Brazil      0.000     0.000      0.000     0.000     0.000     0.001      0.000     0.000
                    CZ        0.327     0.245      0.322     0.249     0.000     0.004      0.163     0.128
                   EU1        0.225     0.230      0.189     0.179     0.506     0.434      0.149     0.142
                   EU2        0.252     0.200      0.088     0.070     0.235     0.178      0.000     0.000
                 South.Afr    0.000     0.000      0.000     0.000     0.000     0.003      0.000     0.000
                    UK        0.196     0.158      0.246     0.209     0.000     0.002      0.612     0.546
                 UKBA318      0.000     0.000      0.155     0.121     0.000     0.005      0.000     0.000
                   other      0.000     0.168      0.000     0.173     0.259     0.372      0.076     0.183

Table 3: Estimated weights for the simulated samples without any filtration. Note that the “other” variant is significantly
overestimated in all samples. Aside from that, the method is able to distinguish between present and absent variants.


Figure 1: The base coverage of a sequence by reads classified into individual variants by maximum a posteriori predictor
(MAP). The first simulated mix is shown in the left, the third in the right. The blue line represents the coverage by all reads
classified into a particular variant, the orange and red lines only reads with 1-certainty above 50% and 90%, respectively.
Reads that have been, both erroneously (first mix) and correctly (third mix) classified into “other” variant, have mostly
low 1-certainty. Even for correct variants, there are regions in the reference where the model cannot capture any reads.
These regions correspond well with the regions without characteristic mutations (see Table 1).
                  (a) Simulated mix #1                                                (b) Simulated mix #2


                  (c) Simulated mix #3                                                (d) Simulated mix #4

Figure 2: The estimated weights for each simulated mix at different thresholds of filtration of uncertain reads. The ver-
tical axis shows the filtration threshold. Lengths of individual color stripes represent their estimated weight for a given
threshold of filtration. The last row (denoted “real”) shows the true weights for each simulated mix. The lowest filtration
threshold does not filter out any reads (because it’s equal to 1/K, which is the lowest possible value for the 1-certainty).


                                                                   • We would like to be able to evaluate confidence in-
                                                                     tervals on our estimations. That would give us, for
                                                                     example, the ability to rule out the presence of a vari-
                                                                     ant with a low estimated weight.

                                                                   • There are many non-informative reads coming from
                                                                     the areas of the genome with no significant alter-
                                                                     ations between individual variants. The proportion of
                                                                     such reads is even higher when the reads are shorter
                                                                     (e.g. obtained by Illumina sequencing). The goal is
                                                                     to extend the model so that it could handle such data.

                                                                   • The sequencing data shows severe non-uniformity of
                                                                     the coverage. This issue may be addressed by a selec-
                                                                     tive weighting of individual reads based on the cov-
                                                                     erage at their position.
Figure 3: The KL-divergence between the true and esti-
mated weights for different read filtration thresholds. The        • The current model does not take into account inser-
horizontal axis shows the filtration threshold. The vertical         tions and deletions, both in the individual variants and
axis shows the resulting KL-divergence (lower is better).            during the sequencing process.
It can be seen that the filtration method works reasonably
well for the mixes without “other” variant present (the first      • The way we characterize the individual variants using
and second mixes).                                                   the averaging of all available data from the GISAID
                                                                     database may incur a bias toward specific subvariants
                                                                     (e.g. some countries, such as the UK, submit a much
  • As we discussed in the previous section, our model               higher number of sequence samples).
    tends to overestimate the “other” variant, and our cur-
    rent filtering heuristics is not a robust solution. The        • Our approach splits all reads into individual bases,
    filtering may introduce bias into the estimation. This           thus potentially losing all long-range information.
    issue may be addressed by a more systematic ap-                  In the future, we would like to adjust the model, so
    proach to defining the “other” variant.                          that it would keep the reads intact.
  • In our model we assume that the individual positions           [8] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source
    are independent. But, the way the variant profiles                 scientific tools for Python, 2001–.
    are created (by demanding at least n mutations to be           [9] K. Kupferschmidt. The pandemic virus is slowly mutating.
    present in a sequence) introduces some dependency                  but does it matter?, 2020.
    between them. In the future, we would like to resolve         [10] H. Li. Minimap2: pairwise alignment for nucleotide se-
    that inconsistency either by changing of the way the               quences. Bioinformatics, 34(18):3094–3100, 2018.
    variant profiles are estimated or by relaxing the as-         [11] NCIRD. Science brief: Emerging SARS-CoV-2 variants,
    sumption in the model.                                             2021.
                                                                  [12] M. R. Olm, A. Crits-Christoph, K. Bouma-Gregson, B. A.
Code availability.       The code is available               at        Firek, M. J. Morowitz, and J. F. Banfield. inStrain profiles
https://github.com/fmfi-compbio/covid-pooling.                         population microdiversity from metagenomic data and sen-
                                                                       sitively detects shared microbial strains. Nature Biotech-
                                                                       nology, pages 1–10, 2021.
Acknowledgements. We would like to thank the reviewers            [13] A. Pérez Cataluña, Á. Chiner-Oms, E. Cuevas Ferrando,
for their thorough reviews and helpful suggestions.                    A. Díaz-Reolid, I. Falcó, W. Randazzo, I. Girón-Guzmán,
                                                                       A. Allende, M. A. Bracho, I. Comas, et al. Detection of
Funding. This research was supported by the Operational                genomic variants of SARS-CoV-2 circulating in wastewater
Program Integrated Infrastructure within project: Pange-               by high-throughput sequencing. 2021.
nomics for personalized clinical management of infected           [14] A. Rambaut et al. Preliminary genomic characterisation of
persons based on identified viral genome and human ex-                 an emergent SARS-CoV-2 lineage in the UK defined by a
ome (Code ITMS:313011ATL7, 80%) co-financed by the                     novel set of spike mutations, 2020.
European Regional Development Fund. It was also sup-              [15] A. Rambaut, E. C. Holmes, Á. O’Toole, V. Hill, J. T. Mc-
ported by VEGA 1/0458/18 (10%) and Comenius Univer-                    Crone, C. Ruis, L. du Plessis, and O. G. Pybus. A dynamic
sity Grant UK/336/2021 (10%).                                          nomenclature proposal for SARS-CoV-2 lineages to assist
                                                                       genomic epidemiology. Nature microbiology, 5(11):1403–
                                                                       1407, 2020.
References                                                        [16] The World Health Organization.              Tracking SARS-
                                                                       CoV-2 variants, 2021.         https://www.who.int/en/
                                                                       activities/tracking-SARS-CoV-2-variants/.
 [1] B. Brejova, K. Borsova, V. Hodorova, V. Cabanova, A. Ga-
                                                                  [17] J. R. Tyson, P. James, D. Stoddart, N. Sparks, A. Wick-
     furov, D. Fricova, M. Nebohacova, T. Vinar, B. Klempa,
                                                                       enhagen, G. Hall, J. H. Choi, H. Lapointe, K. Kamelian,
     and J. Nosek. Nanopore sequencing of SARS-CoV-2:
                                                                       A. D. Smith, et al. Improvements to the ARTIC multiplex
     Comparison of short and long PCR-tiling amplicon proto-
                                                                       PCR method for SARS-CoV-2 genome sequencing using
     cols. medRxiv, 2021.
                                                                       nanopore. bioRxiv, 2020.
 [2] D. A. Coil, T. Albertson, S. Banerjee, G. Brennan, A. J.
                                                                  [18] M. Wasiuta. Worldometers, 2009.
     Campbell, S. H. Cohen, S. Dandekar, S. L. Díaz-Muñoz,
     J. A. Eisen, T. Goldstein, et al. SARS-CoV-2 detection       [19] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778:
     and genomic sequencing from hospital surface samples col-         L-BFGS-B: Fortran Subroutines for Large-Scale Bound-
     lected at UC Davis. medRxiv, 2021.                                Constrained Optimization. ACM Transactions on Mathe-
                                                                       matical Software, 23(4):550–560, 1997.
 [3] A. Crits-Christoph, R. S. Kantor, M. R. Olm, O. N. Whit-
     ney, B. Al-Shayeb, Y. C. Lou, A. Flamholz, L. C. Kennedy,
     H. Greenwald, A. Hinkle, et al. Genome sequencing of
     sewage detects regionally prevalent SARS-CoV-2 variants.
     medRxiv, 2020.
 [4] ECDC. Rapid increase of a SARS-CoV-2 variant with mul-
     tiple spike protein mutations observed in the United King-
     dom—20 december 2020. ECDC: Stockholm, 2020.
 [5] S. Elbe and G. Buckland-Merrett. Data, disease and diplo-
     macy: GISAID’s innovative contribution to global health.
     Global Challenges, 1(1):33–46, 2017.
 [6] R. S. Fontenele, S. Kraberger, J. Hadfield, E. M. Driver,
     D. Bowes, L. A. Holland, T. O. Faleye, S. Adhikari, R. Ku-
     mar, R. Inchausti, et al. High-throughput sequencing of
     SARS-CoV-2 in wastewater provides insights into circulat-
     ing variants. medRxiv, 2021.
 [7] K. Jahn, D. Dreifuss, I. Topolsky, A. Kull, P. Ganesanan-
     damoorthy, X. Fernandez-Cassi, C. Bänziger, E. Stach-
     ler, L. Fuhrmann, K. P. Jablonski, et al. Detection of
     SARS-CoV-2 variants in Switzerland by genomic analysis
     of wastewater samples. medRxiv, 2021.

</pre>