<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>PhyloTransformer: A Self-supervised Discriminative Model for SARS-CoV-2 Viral Mutation Prediction Based on a Multi-head Self-attention Mechanism</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yingying Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shusheng Xu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shing-Tung Yau</string-name>
          <email>yau@math.harvard.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Harvard University, Center of Mathematical Sciences and Applications</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shanghai Qi Zhi Institute</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Tsinghua University, Institute for Interdisciplinary Information Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Houston</institution>
          ,
<addr-line>Houston</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>20</lpage>
      <abstract>
<p>In this article, we developed PhyloTransformer, a Transformer-based self-supervised discriminative model, which can model genetic mutations that may lead to viral reproductive advantage. We trained PhyloTransformer on 1,765,297 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences to infer fitness advantages by directly modeling the nucleic acid sequence mutations. PhyloTransformer utilizes advanced techniques from natural language processing to enable efficient and accurate intra-sequence dependency modeling over the entire RNA sequence. We compared the prediction accuracy of novel mutations and novel combinations between our method and baseline models that only take local segments as input, and found that PhyloTransformer outperformed every baseline method with statistical significance. We also predicted the occurrence of mutations in each nucleotide of the receptor binding motif (RBM) and predicted modifications of N-glycosylation sites. We anticipate that the viral mutations predicted by PhyloTransformer may identify potential mutations of threat to guide therapeutics and vaccine design for effective targeting of future SARS-CoV-2 variants. Keywords: COVID-19, SARS-CoV-2, spike protein, variants of concern, PhyloTransformer, self-supervised neural network. KDH@IJCAI 2023: The 6th international workshop on knowledge. †These authors contributed equally.</p>
      </abstract>
      <kwd-group>
<kwd>COVID-19</kwd>
        <kwd>SARS-CoV-2</kwd>
        <kwd>spike protein</kwd>
        <kwd>variants of concern</kwd>
        <kwd>PhyloTransformer</kwd>
        <kwd>self-supervised neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>CoV-2) is the causative agent of Coronavirus disease 2019
(COVID-19). The unprecedented COVID-19 pandemic
is one of three major pathogenic zoonotic disease
outbreaks caused by  -coronaviruses in the past two decades
[1, 2]. Severe acute respiratory syndrome coronavirus
(SARS-CoV) emerged in 2002, infecting 8,000 people with
drome coronavirus (MERS-CoV) emerged in 2012 with
2,300 cases and a 35% mortality rate [5]. The third
outbreak, mediated by SARS-CoV-2, emerged in 2019 with
a mortality rate of 3.6% [6] and 219 million cases have
been reported as of October 2021.</p>
      <sec id="sec-1-1">
<p>After the emergence of SARS-CoV-2 in late 2019, the virus exhibited relative evolutionary stasis for approximately 11 months. Since the end of 2020, SARS-CoV-2 has consistently acquired approximately two mutations per month [7], resulting in novel variants of concern (VOCs).</p>
<p>As more individuals became vaccinated against SARS-CoV-2 [10], the immune profile of the human population changed. Mutations can have an impact on fitness advantages [9]: they may alter transmissibility [11, 12], angiotensin converting enzyme 2 (ACE2) binding affinity [13], or antigenicity [14], and some introduce an optimized trade-off to improve overall fecundity. Heavily mutated lineages have also been reported, such as the lineage B.1.1.298, which harbors the following four amino acid substitutions: ΔH69–V70, Y453F, I692V, and M1229I [15]. Some mutations may amplify other mutations, providing an improved fitness advantage. For example, the combination of E484K, K417N, and N501Y results in the highest degree of conformational alteration compared to either E484K or N501Y alone [16].</p>
        <p>We used the hCoV-19/Wuhan/WIV04/2019 sequence (WIV04) as our reference sequence, which is the official reference sequence employed by GISAID (EPI_ISL_402124). WIV04 represented the consensus of several early submissions for the β-coronavirus responsible for COVID-19 [19], and was isolated by the Wuhan Institute of Virology from a clinical sample of bronchoalveolar lavage fluid for RNA extraction and metagenomic next-generation sequencing. The consensus sequence was obtained by de novo assembly [20].</p>
<p>Figure 1: PhyloTransformer prediction paradigm.</p>
        <p>Based on WIV04, we define a mutation as the change in a nucleotide at a particular position that is different from the reference sequence. We define a mutation at a particular position that only occurs in the testing set but does not occur within the training set as a novel mutation, which signifies a mutation that is novel for the training set. We define all the novel mutations over an RNA sequence as a novel combination, i.e., a combination of mutations that does not occur in the training data. The prediction of novel mutations aims to predict single mutations, while the prediction of novel combinations aims to predict a collection of single mutations that jointly occur in a mutated sequence.</p>
        <p>Variants requiring immediate attention are circulating, which highlights the urgent need to develop effective prevention and treatment strategies. While vaccination has been the most important and effective preventive measure, it is also facing challenges. The mRNA vaccine BNT162b2 (Pfizer–BioNTech) has 95% efficacy against COVID-19 [17]. However, the estimated effectiveness of the vaccine against the B.1.1.7 variant was 89.5% (95% CI, 85.9 to 92.3) at 14 or more days after the second dose, and 75.0% (95% CI, 70.5 to 78.9) against the B.1.351 variant [18] at 14 or more days after the second dose. Several studies have characterized multiple mutations that change the antigenic phenotype, and thus elucidate how these mutations affect antibody-mediated neutralization. Variants containing these mutations are potentially highly virulent and have received much recent attention. However, it remains unknown whether more infectious variants exist, along with the likelihood that they will appear and transmit. Designing vaccines after a novel variant has emerged is not optimal because the variant could potentially compromise existing vaccines and spread among the population. Thus, more infections might generate further variants, leading to a never-ending pandemic.</p>
        <p>In order to win the race against the rapidly evolving SARS-CoV-2, an intelligent system capable of forecasting potential VOCs before they actually appear is urgently required. Therefore, in order to infer fitness advantages, we proposed PhyloTransformer, which models constraints from natural sequences, including long-range dependencies between positions. We hope that PhyloTransformer can be used to predict novel mutations and novel combinations of mutations in SARS-CoV-2, as depicted in Fig. 1. Thus, we anticipate that when variants of high consequence arise, existing vaccines based on PhyloTransformer predictions will have already been developed that target those strains.</p>
        <p>The prediction accuracies of novel mutations and novel combinations were evaluated after the predicting models PhyloTransformer, Local Transformer, and ResNet-18 converged. We first performed lag 1 autocorrelation to test the correlation between accuracy scores obtained from models that are one checkpoint apart. The autocorrelation tests were performed on the small, medium, and large datasets for predicting novel mutations and novel combinations, with a total of 18 tests. We found no time dependency between the 10 accuracy scores in each of these 18 tests. For the other classical machine learning models, we repeated the experiment 10 times for each dataset. The details are reported in Box 1C.</p>
        <p>In this section, we first evaluated PhyloTransformer-generated predictions of novel mutations and novel combinations. Next, we compared the accuracy of each prediction with those obtained from baseline models. We then reported our predictions in the receptor binding motif (RBM). Finally, we predicted modifications of N-glycosylation sites to help identify mutations associated with altered glycosylation that might be favored during viral evolution. The detailed model architecture and training process are reported in the Methodology section.</p>
<sec id="sec-1-1-2">
          <title>Predicting Novel Mutations</title>
          <p>We evaluated the efficacy of PhyloTransformer in predicting novel mutations and compared it to baseline model predictions from three datasets of different sizes spanning different time frames. The prediction results are reported in Box 1.</p>
          <p>Box 1 | Prediction Accuracy. A. Prediction accuracy of novel mutations from the small, medium, and large datasets based on PhyloTransformer and the best baseline methods. B. Prediction accuracy of novel combinations trained with the small, medium, and large datasets based on PhyloTransformer and the best baseline methods. The accuracy improvement for each indicated model was calculated by dividing the number of correct predictions by the expected number of correct random guesses. C. Prediction accuracy of PhyloTransformer- and baseline method-generated predictions of novel mutations and novel combinations. Sig. Phylo: p-value with respect to PhyloTransformer. Sig. Local: p-value with respect to Local Transformer.</p>
          <p>For each mutation, we masked the nucleotide in the reference sequence, predicted which nucleotide it would mutate to, and selected the nucleotide with the highest confidence as our prediction. The prediction accuracy is the proportion of positions that are predicted correctly among all novel positions in the testing set. The prediction accuracy of random guessing is exactly 1/3. We evaluated the prediction efficacy averaged over 10 checkpoints after the convergence of PhyloTransformer, Local Transformer, and our baseline models on the three datasets, with the variance marked either below or above. Next, we reported the model predictions from each dataset, which are displayed in Box 1A.</p>
          <p>We performed a two-sample z-test of proportions and found that, for each baseline model, the best prediction accuracy of novel mutations from the large dataset among the 10 checkpoints was significantly less than that of PhyloTransformer. Local Transformer had the best performance among the baseline models, but its average over 10 checkpoints was still 11% lower than that of PhyloTransformer on the large dataset, with statistical significance, as shown in Box 1C. Table 1 reports the 20 novel mutations predicted with the greatest probability by PhyloTransformer trained on the large dataset.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>Predicting Novel Combinations</title>
          <p>If a sequence in the testing set does not exist in the training set, we compared it to the reference sequence, then masked the mutated positions and generated predictions at these positions. If the model predicts all the mutations in this sequence correctly, we say that it predicted a novel combination correctly. The accuracy of predicting novel combinations is the proportion of sequences whose combinations are predicted correctly among all the sequences in the testing set.</p>
          <p>The difficulty of predicting novel combinations changes as the size of the dataset changes, so we measure our prediction efficacy by accuracy improvement, defined as the following: Improvement := Model Acc. / Random Guessing Acc.</p>
          <p>For the small dataset, there were 2.26 mutations on average with a standard deviation (SD) = 5.06; for the medium dataset, there were 3.06 mutations on average with an SD = 2.56; and for the large dataset, there were 8.75 mutations on average with an SD = 2.87. For the small dataset, random guessing resulted in an accuracy of 13.30% with an SD = 1.12%; for the medium dataset, random guessing resulted in an accuracy of 5.42% with an SD = 0.12%; and for the large dataset, random guessing resulted in an accuracy of 0.26% with an SD = 0.012%. The prediction results are summarized in Box 1B, where the accuracy improvement value was defined as follows: given the dataset (small, medium, or large), take the number of correct predictions generated by the indicated model and divide that value by the expected number of correct random guesses.</p>
          <p>We performed a two-sample z-test of proportions to determine whether the accuracy of predicting novel combinations by the baseline models was significantly less than that of PhyloTransformer on the large dataset. The prediction accuracy of PhyloTransformer among the 10 checkpoints was higher than that generated by all of the baseline models, with statistical significance. Local Transformer was no longer the best baseline model: ResNet-18 and random forest outperformed Local Transformer for the task of predicting novel combinations.</p>
        </sec>
        <sec id="sec-1-1-4">
          <title>Predictions in the Spike Protein RBM</title>
          <p>SARS-CoV-2 infects human cells through binding of the viral surface spike protein to its receptor on human cells, the ACE2 protein. Because of its role in viral entry, the RBD is a dominant determinant of zoonotic cross-species transmission. Although SARS-CoV-2 does not cluster within SARS and SARS-related coronaviruses, the RBDs of SARS-CoV and SARS-CoV-2 share structural similarities, probably due to their shared zoonotic ancestry. This similarity implies convergent evolution for improved binding to ACE2 between the SARS-CoV and SARS-CoV-2 RBDs. Therefore, we focused our predictions on the spike protein RBD. The total length of the SARS-CoV-2 spike protein is 1,273 amino acids, and its structural features are listed below:</p>
          <p>• A signal peptide is located at the N-terminus (residues 1–13).
• The S1 subunit (residues 14–685) is responsible for receptor binding. The S1 subunit contains an N-terminal domain (residues 14–305), a C-terminal domain 0 (residues 306–330), an RBD (residues 331–527), a C-terminal domain 1 (residues 528–590), and a C-terminal domain 2 (residues 591–685).
• The S2 subunit (residues 686–1273) is responsible for receptor binding and membrane fusion. The S2 subunit contains cleavage sites (residues 686–815) at S1/S2 and S2′, a fusion peptide (residues 816–855), a fusion peptide region (residues 856–911), a heptapeptide repeat sequence 1 (residues 912–984), a center helix (residues 985–1034), a connector domain (residues 1035–1080), a connector domain 1 (residues 1081–1147), a heptapeptide repeat sequence 2 (residues 1163–1213), a transmembrane domain (residues 1213–1237), and a cytoplasmic domain (residues 1237–1273) [21].</p>
          <p>The spike protein RBM comprises amino acids 438 to 506. Yi et al. [22] compared the SARS-CoV-2 and SARS-CoV RBD affinity for hACE2 by creating single amino acid substitution mutations in the SARS-CoV and SARS-CoV-2 RBM sequences. The authors found that receptor binding was enhanced by introducing amino acid changes at P499, Q493, F486, A475, and L455, which are all localized to the RBM. PhyloTransformer trained with the large dataset predicted only two mutations. The first mutation was predicted at amino acid 488, changing it from C to R, which is closely adjacent to F486. The second mutation was predicted at amino acid 497, changing it from F to S, once again right next to P499. The close proximity of the introduced mutations and the predicted mutations indicates that PhyloTransformer is potentially capable of capturing meaningful genetic phenomena and can generate effective predictions. Our prediction results are reported in Table 2.</p>
        </sec>
        <sec id="sec-1-1-5">
          <title>Prediction of Glycosylation Site Modifications</title>
          <p>The SARS-CoV-2 spike protein is heavily glycosylated. Viral glycosylation plays a vital role in viral pathobiology, including antibody resistance, target recognition, viral entry, and host immune modulation [23]. Glycosylation sites facilitate immune evasion by shielding epitopes from antibody neutralization; therefore, they are under selective pressure. Since glycosylation site modifications of the SARS-CoV-2 spike protein will likely impact the overall activities of SARS-CoV-2 replication and escape from immune surveillance [24], we examined glycosylation site model predictions. We reported our results on the N-glycosylation sites to help identify mutations associated with altered glycosylation that are favored during viral evolution. PhyloTransformer predicted mutations of the following three glycosylation sites: N122, N331, and N343. Table 3 shows the predicted mutations in the spike protein changing N to a different amino acid. Figure 3 summarizes the predicted mutations, including existing mutations (left) and novel mutations (right), with predictions mutating away from amino acid N highlighted.</p>
        </sec>
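<p>The accuracy improvement metric used in Box 1B can be illustrated with a short sketch (the prediction counts are hypothetical; 0.26% is the large-dataset random-guessing accuracy quoted above):</p>

```python
def accuracy_improvement(n_correct, n_sequences, random_guess_acc):
    """Improvement = model accuracy divided by random-guessing accuracy,
    i.e., correct predictions over the expected number of correct random guesses."""
    return (n_correct / n_sequences) / random_guess_acc

# Hypothetical counts: 130 of 1,000 test sequences predicted correctly,
# against a random-guessing accuracy of 0.26% (large dataset).
improvement = accuracy_improvement(130, 1000, 0.0026)
print(round(improvement, 2))  # 50.0
```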
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>Technical Background
In this section, we will briefly review the history of
sequence models that led to the development of
Transformer and then introduce our PhyloTransformer model.</p>
      <p>The recurrent neural network (RNN) is the standard
neural sequence model which extends the conventional
feedforward neural network with a recurrent hidden state
dependent on the previous timestep. RNN and its variants,
such as the long short-term memory (LSTM) [25] and the
gated recurrent unit (GRU) [26], have been widely
applied to important AI tasks, including language modeling
[27], speech recognition [28], handwriting recognition
[29], and machine translation [30]. However, RNNs are
difficult to train in practice since the gradients tend to
either vanish or explode as the sequence length increases
[31]. In addition, these models encode a source sequence
into a fixed-length vector, which becomes a bottleneck
when tackling particularly long sequences. Therefore,
the attention mechanism was introduced [32] to augment
RNNs with an additional variable-length representation
when encoding the input sequence. The attention
mechanism allows the model to only focus on a subset of the
input sequence for decoding. The Transformer model
comprises a purely attention-based network architecture
without RNN backbones to directly capture intra-position
dependencies via the self-attention mechanism [33]. In
self-attention, each sequence item has direct access to all
the other positions, which yields a more powerful global
representation of the sequence. This feature also inspires
biological applications due to the long-range interactions
of genetic sequences. However, the following challenges
in modeling mutations on RNA sequences remain:
• Length adaptation: most natural language
processing (NLP) models deal with sequence lengths
of a few hundred to a thousand, but the RNA
sequence of SARS-CoV-2 is much longer: the
genome of SARS-CoV-2 is 29,903 nucleotides in
length [34], and the spike protein has 3,819
nucleotides.
• Mutation sparsity: due to the proofreading
functions of coronaviruses [8], mutations in the
SARS-CoV-2 genome are rare. Our dataset shows
consistency in this regard.</p>
<p>The regular Transformer scales quadratically with respect
to the input sequence length, and the sparsity of
mutations might lead to the generative Transformer model
overfitting the identical parts while ignoring the
mutations. Therefore, to adapt to biological problems and
address issues regarding genetic mutations, a new model
that tackles the length and sparsity issues commonly
encountered in existing deep neural network architectures
is required. To address these two challenges, we propose
PhyloTransformer, which is a linear time complexity
discriminative model based on the Transformer architecture.</p>
      <p>The time and space linearity are achieved by adopting
FAVOR+ from Performer [35], which performs an
unbiased fast attention approximation with low variance. The
mutation sparsity issue is addressed by directly modeling
the mutations using the MLM training objective from
BERT [36], which is a discriminative variant of
Transformer for supervised NLP tasks. A detailed description
of PhyloTransformer architecture is presented in the next
section.</p>
<p>Model Development
We adopted a discriminative approach to model the mutation probability at a particular position in the RNA sequence. Let P(x_i = n | r) denote the probability of the i-th nucleotide changing to n given the reference sequence r. We will demonstrate how to predict P(x_i | r) with PhyloTransformer and the baseline models in this section.</p>
<p>The PhyloTransformer Model
The PhyloTransformer model adopts a Transformer-based network, which utilizes the full spike sequence of 3,819 nucleotides as input and generates output mutation probabilities at particular positions. We followed the MLM pre-training objective from BERT [36]. Note that the attention mechanism in Transformer [33] calculates attention matrices with a shape of L × L (where L is the length of the sequence) to capture the relationships between nucleotides. In order to reduce the computational complexity of the attention matrix, we adopted the FAVOR+ technique from Performer [35], which performs approximate attention computation in linear time. In the following content, we first present the network architecture of PhyloTransformer. Next, we introduce FAVOR+ for fast low-rank approximation of the regular full-rank attention computation in linear time. Finally, the overall training process is discussed in detail.</p>
<p>Bidirectional Transformer Encoder:
Let r = (r_1, r_2, ..., r_L) denote the reference sequence, where r_i is the nucleotide at position i in the RNA sequence. We first applied trainable projections to map each r_i, together with its position information, to three embedding vectors, q_i, k_i, and v_i, for attention computation. Suppose the dimension of each embedding is d. The output of the attention layer is computed by the following equation:</p>
      <p>
Attention(Q, K, V) = A · V = softmax(Q K^T / √d) V,
        (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
where A ∈ ℝ^(L×L) is the attention matrix, and Q = [q_1; q_2; ...; q_L], K = [k_1; k_2; ...; k_L], and V = [v_1; v_2; ...; v_L] are embedding matrices in ℝ^(L×d) whose rows q_i, k_i, and v_i are the three embeddings. After the attention layer is computed, we further applied a feed-forward layer with a residual connection. An attention layer and a feed-forward layer compose a single Transformer module. We stacked the Transformer modules to form the overall network architecture of our PhyloTransformer model.</p>
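<p>As a concrete illustration, Equation (1) can be sketched in NumPy (a toy example with random embeddings and illustrative dimensions, not the paper's implementation):</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Full attention: A = softmax(Q K^T / sqrt(d)); output = A V."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # attention matrix, shape (L, L)
    return A @ V

rng = np.random.default_rng(0)
L, d = 8, 4  # toy sequence length and embedding dimension
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (8, 4)
```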
<p>FAVOR+:
In the original attention mechanism, the time complexity of computing the attention layer by Equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) is O(L^2), which becomes computationally intractable when L is large. The Performer [35] model proposed kernelizable attention by deriving a mapping φ to decouple the attention matrix A into Q′ and K′, where Q′ = φ(Q), K′ = φ(K), and Q′, K′ ∈ ℝ^(L×r) with r ≪ L. In this case, the attention layer can be computed by the following equations:</p>
      <p>
Attention(Q, K, V) = D^(−1) (Q′ ((K′)^T V)),
        (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
D = diag(Q′ ((K′)^T 1_L)),
        (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
where 1_L is an all-ones vector of length L. Since Q′, K′ ∈ ℝ^(L×r) and V ∈ ℝ^(L×d), the computational complexity decreases to O(L) with respect to the small constant r, making it computationally feasible to handle particularly long sequences such as RNA data.</p>
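<p>The kernelized computation of Equations (2) and (3) can be sketched as follows; for brevity, this sketch substitutes a simple deterministic positive feature map (elu(x) + 1) for Performer's random-feature mapping φ, so it illustrates the linear-time computation rather than FAVOR+ itself:</p>

```python
import numpy as np

def feature_map(X):
    """Positive feature map standing in for Performer's random-feature phi."""
    return np.where(X > 0, X + 1.0, np.exp(X))  # elu(x) + 1, always positive

def linear_attention(Q, K, V):
    """Attention via Equations (2)-(3): D^-1 (Q' ((K')^T V)),
    with D = diag(Q' ((K')^T 1_L)); cost is linear in the sequence length L."""
    Qp, Kp = feature_map(Q), feature_map(K)
    KV = Kp.T @ V                     # (K')^T V, computed without any L x L matrix
    normalizer = Qp @ Kp.sum(axis=0)  # Q' ((K')^T 1_L), one entry per position
    return (Qp @ KV) / normalizer[:, None]

rng = np.random.default_rng(0)
L, d = 8, 4  # toy dimensions
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```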
      <p>Training process:</p>
<p>We denoted the reference sequence as r = (r_1, r_2, ..., r_L) and the mutated sequence as x = (x_1, x_2, ..., x_L), where r_i and x_i refer to the nucleotide at position i. On average, there were 0.0592% mutations in the small dataset, 0.0801% mutations in the medium dataset, and 0.2291% mutations in the large dataset. These numbers refer to the average fraction of positions at which x_i differs from r_i in the respective dataset.</p>
      <p>During the training process, we masked certain positions in x and used the model to predict the nucleotides of x at those masked positions. Fig. 4 shows the workflow of our model: the spike RNA sequence is masked at mutated positions and some random positions, encoded with six stacked Transformer layers, and decoded into predictions for the mutated sequence given the reference sequence. Specifically, we first identified the set of mutated positions M = (m_1, ..., m_k), where m_1, ..., m_k are the positions at which x differs from r, and selected several unchanged positions N = (n_1′, ..., n_j′) such that |M ∪ N| covers 1.5% of the sequence. Next, we applied a masking function f_m(x_i) to each nucleotide x_i at the masked positions. Namely, ∀ i ∈ M ∪ N, the masking function f_m changes x_i as determined by the equation:</p>
      <p>
f_m(x_i) = &lt;mask&gt; in 80% of cases; Random({A, T, C, G}) in 10% of cases; x_i in 10% of cases,
        (
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
where &lt;mask&gt; is a special masking token. The masking function f_m acts on 1.5% of the entire set of nucleotides and randomly maps each nucleotide from this masking subset to (1) the special token &lt;mask&gt; (80% chance), (2) a random substitution (10%), or (3) itself (10%).</p>
      <p>Denoting the masked sequence as x̃, we encode x̃ with the stacked Transformer modules and represent each nucleotide as a hidden vector h_i from the model output. Next, the probability distribution of the i-th nucleotide position over {A, T, C, G} is computed as follows:</p>
      <p>
p(x_i | x̃) = softmax(W h_i), ∀ i ∈ M ∪ N,
        (
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
where W are trainable parameters. The probability of all the masked nucleotides is given by the following equation:</p>
      <p>
p(x | x̃) = ∏_{i ∈ M ∪ N} p(x_i | x̃).
        (
        <xref ref-type="bibr" rid="ref6">6</xref>
        )
      </p>
      <p>The model is optimized to minimize the negative log probability over all the mutated sequences from the training set D with respect to different masking positions, as determined by the equation:</p>
      <p>
L(θ) = − Σ_{x ∈ D} E [log p(x | x̃)] = − Σ_{x ∈ D} E [ Σ_{i ∈ M ∪ N} log p(x_i | x̃) ].
        (
        <xref ref-type="bibr" rid="ref7">7</xref>
        )
      </p>
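<p>The masking rule of Equation (4) can be sketched as follows (the token string "[MASK]", the function names, and the random seed are illustrative assumptions, not the paper's code):</p>

```python
import random

NUCLEOTIDES = ["A", "T", "C", "G"]
MASK = "[MASK]"  # stands in for the special masking token

def mask_sequence(x, mutated_positions, mask_fraction=0.015, seed=0):
    """Cover all mutated positions plus random unchanged positions until
    1.5% of the sequence is selected, then apply the 80%/10%/10% rule."""
    rng = random.Random(seed)
    target = max(len(mutated_positions), int(round(mask_fraction * len(x))))
    positions = set(mutated_positions)
    while len(positions) != target:
        positions.add(rng.randrange(len(x)))
    x_tilde = list(x)
    for i in positions:
        roll = rng.random()
        if roll >= 0.9:            # 10% of cases: keep the nucleotide itself
            pass
        elif roll >= 0.8:          # 10% of cases: a random substitution
            x_tilde[i] = rng.choice(NUCLEOTIDES)
        else:                      # 80% of cases: the special mask token
            x_tilde[i] = MASK
    return x_tilde

base = random.Random(1)
seq = [base.choice(NUCLEOTIDES) for _ in range(1000)]
masked = mask_sequence(seq, mutated_positions=[10, 250, 777])
print(len(masked), masked.count(MASK))
```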
<p>Since most of the masked positions are mutated positions, our model is trained to concentrate on mutation prediction. Meanwhile, the randomly chosen positions (i.e., those in N) also improved the robustness of our model.</p>
<p>Local models
In addition to PhyloTransformer, which considers the full sequence, we also examined baseline methods, which predict P(x_i | r) based on local segments from the spike RNA sequence. There are a total of 3,819 nucleotides in the spike sequence. We can obtain a local segment of 15 nucleotides centered around each nucleotide with sequence padding. Thus, we can obtain 3,819 segments of 15 nucleotides from the full spike RNA sequence. The center position of each segment is masked. We adopted various classification methods (including neural models and non-neural methods) to predict the center nucleotide based on the other nearby nucleotides. During the training phase, we split all training spike RNA sequences into segments and generated a local dataset with repeated segments filtered out. The training process is shown in Appendix A, where any classification method could be used, such as the standard Transformer, ResNet-18, MLP, logistic regression, KNN, random forest, and gradient boosting.</p>
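<p>The local-segment construction described above can be sketched as follows (the padding character "N" and the function name are illustrative assumptions):</p>

```python
def local_segments(seq, window=15, pad="N"):
    """One window-sized segment per position, centered on that position and
    padded past the sequence ends; the center nucleotide is the target that
    the local baseline models are trained to predict from its neighbors."""
    half = window // 2
    padded = pad * half + seq + pad * half
    segments = []
    for i in range(len(seq)):
        w = padded[i:i + window]
        center = w[half]                    # the masked prediction target
        context = w[:half] + w[half + 1:]   # the 14 surrounding nucleotides
        segments.append((context, center))
    return segments

seq = "ATGGTCATTGCCGC"  # toy stand-in for the 3,819-nucleotide spike sequence
segs = local_segments(seq)
print(len(segs), segs[0])  # one (context, center) pair per nucleotide
```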
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
<p>The overall goal of our research is to train a state-of-the-art sequence model using existing viral genetic sequence data to identify SARS-CoV-2 variants that may have evolutionary advantages and become emerging VOCs.</p>
      <p>In this paper, we developed the PhyloTransformer model, a novel deep neural network with a multi-head self-attention mechanism. PhyloTransformer was subjected to an advanced training methodology to predict potential mutations that may lead to enhanced virus transmissibility or resistance to antisera. Our computational platform may be helpful in guiding the design of therapeutics and vaccines for effective targeting of emerging SARS-CoV-2 VOCs, as well as novel mutants of other viruses that may cause pandemics.</p>
      <p>Ethics Statement: This research was based on the SARS-CoV-2 sequences in the Global Initiative for Sharing All Influenza Data (GISAID) database (https://www.gisaid.org/). No human subject information is involved in the data.</p>
      <p>[13] T. N. Starr, A. J. Greaney, S. K. Hilton, D. Ellis, K. H. Crawford, A. S. Dingens, M. J. Navarro, J. E. Bowen, M. A. Tortorici, A. C. Walls, et al., Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding, Cell 182 (2020) 1295–1310.</p>
      <p>[14] E. C. Thomson, L. E. Rosen, J. G. Shepherd, R. Spreafico, A. da Silva Filipe, J. A. Wojcechowskyj, C. Davis, L. Piccoli, D. J. Pascall, J. Dillen, et al., Circulating SARS-CoV-2 spike N439K variants maintain fitness while evading antibody-mediated immunity, Cell 184 (2021) 1171–1187.</p>
      <p>[15] R. Lassaunière, J. Fonager, M. Rasmussen, A. Frische, C. Polacek Strandh, T. B. Rasmussen, A. Bøtner, A. Fomsgaard, Working paper on SARS-CoV-2 spike mutations arising in Danish mink, their spread to humans and neutralization data, 2020. URL: https://files.ssi.dk/Mink-cluster-5-short-report_AFO2.</p>
      <p>[16] G. Nelson, O. Buzko, P. R. Spilman, K. Niazi, S. Rabizadeh, P. R. Soon-Shiong, Molecular dynamic simulation reveals E484K mutation enhances spike RBD-ACE2 affinity and the combination of E484K, K417N and N501Y mutations (501Y.V2 variant) induces conformational change greater than N501Y mutant alone, potentially resulting in an escape mutant, bioRxiv (2021).</p>
      <p>[17] F. P. Polack, S. J. Thomas, N. Kitchin, J. Absalon, A. Gurtman, S. Lockhart, J. L. Perez, G. P. Marc, E. D. Moreira, C. Zerbini, et al., Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine, New England Journal of Medicine (2020).</p>
      <p>[18] L. J. Abu-Raddad, H. Chemaitelly, A. A. Butt, Effectiveness of the BNT162b2 Covid-19 vaccine against the B.1.1.7 and B.1.351 variants, New England Journal of Medicine (2021).</p>
      <p>[19] P. Okada, R. Buathong, S. Phuygun, T. Thanadachakul, S. Parnmen, W. Wongboot, S. Waicharoen, S. Wacharapluesadee, S. Uttayamakul, A. Vachiraphan, et al., Early transmission patterns of coronavirus disease 2019 (COVID-19) in travellers from Wuhan to Thailand, January 2020, Eurosurveillance 25 (2020) 2000097.</p>
      <p>[20] P. Zhou, X.-L. Yang, X.-G. Wang, B. Hu, L. Zhang, W. Zhang, H.-R. Si, Y. Zhu, B. Li, C.-L. Huang, et al., A pneumonia outbreak associated with a new coronavirus of probable bat origin, Nature 579 (2020) 270–273.</p>
      <p>[21] Y. Huang, C. Yang, X.-f. Xu, W. Xu, S.-w. Liu, Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19, Acta Pharmacologica Sinica 41 (2020) 1141–1149.</p>
      <p>[22] C. Yi, X. Sun, J. Ye, L. Ding, M. Liu, Z. Yang, X. Lu, Y. Zhang, L. Ma, W. Gu, et al., Key residues of the receptor binding motif in the spike protein of SARS-CoV-2 that interact with ACE2 and neutralizing antibodies, Cellular &amp; Molecular Immunology 17 (2020) 621–630.</p>
      <p>[23] K. J. Doores, The HIV glycan shield as a target for broadly neutralizing antibodies, The FEBS Journal 282 (2015) 4679–4691.</p>
      <p>[24] D. Hoffmann, S. Mereiter, Y. J. Oh, V. Monteil, R. Zhu, D. Canena, L. Hain, E. Laurent, C. Gruber, M. Novatchkova, et al., Identification of lectin receptors for conserved SARS-CoV-2 glycosylation sites, bioRxiv (2021).</p>
      <p>[25] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.</p>
      <p>[26] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259 (2014).</p>
      <p>[27] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, S. Khudanpur, Recurrent neural network based language model, in: Eleventh Annual Conference of the International Speech Communication Association, 2010.</p>
      <p>[28] A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 6645–6649.</p>
      <p>[29] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2008) 855–868.</p>
      <p>[30] N. Kalchbrenner, P. Blunsom, Recurrent continuous translation models, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1700–1709.</p>
      <p>[31] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks 5 (1994) 157–166.</p>
      <p>[32] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).</p>
      <p>[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.</p>
      <p>[34] D. Kim, J.-Y. Lee, J.-S. Yang, J. W. Kim, V. N. Kim, H. Chang, The architecture of SARS-CoV-2 transcriptome, Cell 181 (2020) 914–921.</p>
      <p>[35] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, A. Weller, Rethinking attention with performers, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=Ua6zuk0WRH.</p>
      <p>[36] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Dataset Details</title>
      <p>We list the details of our three datasets in Table 4.</p>
    </sec>
    <sec id="sec-5">
      <title>B. Local Models</title>
    </sec>
    <sec id="sec-6">
      <title>C. Training details</title>
      <p>[Table 4 fragment, flattened by extraction: all three datasets start on 01/01/2020; the remaining columns give per-dataset percentages of VOC mutations and other mutations, whose row labels were not recovered.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-L.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Origin and evolution of pathogenic coronaviruses</article-title>
          ,
          <source>Nature Reviews Microbiology</source>
          <volume>17</volume>
          (
          <year>2019</year>
          )
          <fpage>181</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>De Wit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Van</given-names>
            <surname>Doremalen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Falzarano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Munster</surname>
          </string-name>
          ,
          <article-title>SARS and MERS: recent insights into emerging coronaviruses</article-title>
          ,
          <source>Nature Reviews Microbiology</source>
          <volume>14</volume>
          (
          <year>2016</year>
          )
          <fpage>523</fpage>
          -
          <lpage>534</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Gutman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Abboud</surname>
          </string-name>
          ,
          <article-title>Orthopaedic considerations following COVID-19: lessons from the 2003 SARS outbreak</article-title>
          ,
          <source>JBJS Reviews</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <elocation-id>e20</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. I.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Madani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ntoumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Mchugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Memish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Drosten</surname>
          </string-name>
          , et al.,
          <article-title>The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health: the latest 2019 novel coronavirus outbreak in Wuhan, China</article-title>
          ,
          <source>International Journal of Infectious Diseases</source>
          <volume>91</volume>
          (
          <year>2020</year>
          )
          <fpage>264</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Baric</surname>
          </string-name>
          ,
          <article-title>Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission</article-title>
          ,
          <source>Journal of Virology</source>
          <volume>84</volume>
          (
          <year>2010</year>
          )
          <fpage>3134</fpage>
          -
          <lpage>3146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Baud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nielsen-Saines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Musso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <article-title>Real estimates of mortality following COVID-19 infection</article-title>
          ,
          <source>The Lancet Infectious Diseases</source>
          <volume>20</volume>
          (
          <year>2020</year>
          )
          <fpage>773</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Worobey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Nelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Joy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rambaut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Suchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Wertheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lemey</surname>
          </string-name>
          ,
          <article-title>The emergence of SARS-CoV-2 in Europe and North America</article-title>
          ,
          <source>Science</source>
          <volume>370</volume>
          (
          <year>2020</year>
          )
          <fpage>564</fpage>
          -
          <lpage>570</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Blanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vignuzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Denison</surname>
          </string-name>
          ,
          <article-title>Coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics</article-title>
          ,
          <source>PLoS Pathogens</source>
          <volume>9</volume>
          (
          <year>2013</year>
          )
          <elocation-id>e1003565</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O. A.</given-names>
            <surname>MacLean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Orton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <article-title>No evidence for distinct types in the evolution of SARS-CoV-2</article-title>
          ,
          <source>Virus Evolution</source>
          <volume>6</volume>
          (
          <year>2020</year>
          )
          <elocation-id>veaa034</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yurkovetskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Pascal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tomkins-Tinch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Nyalile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Diehl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carbone</surname>
          </string-name>
          , et al.,
          <article-title>Structural and functional analysis of the D614G SARS-CoV-2 spike protein variant</article-title>
          ,
          <source>Cell</source>
          <volume>183</volume>
          (
          <year>2020</year>
          )
          <fpage>739</fpage>
          -
          <lpage>751</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halfmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuroda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Dinnon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Leist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nakajima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takahashi</surname>
          </string-name>
          , et al.,
          <article-title>SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo</article-title>
          ,
          <source>Science</source>
          <volume>370</volume>
          (
          <year>2020</year>
          )
          <fpage>1464</fpage>
          -
          <lpage>1468</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>McCrone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jorgensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Á.</given-names>
            <surname>O'Toole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Southgate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Nascimento</surname>
          </string-name>
          , et al.,
          <article-title>Evaluating the effects of SARS-CoV-2 spike mutation D614G on transmissibility and pathogenicity</article-title>
          ,
          <source>Cell</source>
          <volume>184</volume>
          (
          <year>2021</year>
          )
          <fpage>64</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>