
Acceleration For Bioinformatics-Based Machine Learning

Anderson Acceleration, SVM, Sequence Analysis

Georgia State University, Atlanta, USA

2023

Anderson acceleration (AA) is a well-known method for accelerating the convergence of iterative algorithms with applications in various fields, including deep learning and optimization. Despite its popularity in these areas, the effectiveness of AA in classical machine learning classifiers has not been thoroughly studied. Tabular data, in particular, presents a unique challenge for deep learning models, and classical machine learning models are known to perform better in these scenarios. However, the convergence analysis of these models has received limited attention. To address this gap in research, we implement a support vector machine (SVM) classifier variant incorporating AA to speed up convergence. We evaluate the performance of our SVM with and without Anderson acceleration on several datasets from the biology domain and demonstrate that the use of AA significantly improves convergence and reduces the training loss as the number of iterations increases. Our findings provide a promising perspective on the potential of Anderson acceleration in training simple machine learning classifiers and underscore the importance of further research in this area. By showing the effectiveness of AA in this setting, we aim to inspire more studies that explore the applications of AA in classical machine learning.

1. Introduction

Anderson acceleration is a method that can be used to accelerate the convergence of iterative optimization algorithms. Based on the difference between the current and prior weight vectors, a correction term is added to the weight vector updates at each iteration. When the gradients are changing quickly, or the optimization landscape is very non-convex, this correction term can aid in reducing oscillations and speeding convergence. To see why, consider the optimization as a trajectory in the weight space, where the weight vector reflects the position at each iteration. In plain gradient descent, the gradients at each location alone control the trajectory of the optimization process. With Anderson acceleration, the trajectory is also affected by the difference between the current and prior weight vectors, which can smooth out the trajectory and minimize oscillations.

Minimizing a convex objective with gradient descent is a typical problem in optimization. Newton's methods use the inverse Hessian matrix [1] to accelerate gradient descent, and they achieve a faster rate of convergence than gradient descent or accelerated gradient descent, but they are very expensive. Knowledge of the curvature of the loss function has also been utilized for deep learning [9]. Although it might not offer as big a gain in convergence speed as it does for more complex models, Anderson acceleration may still be effective for optimizing simpler classical ML models.

Ultimately, the precise characteristics of the optimization problem being solved will determine how effective Anderson acceleration is in each given scenario.

In this work, we propose a robust approach to perform Anderson acceleration (AA) to speed up the training of SVM classifier models for multi-dataset training from the domain of biological sequencing. We regularize AA by including it in the loss optimization of simple linear classifier models (SVMs) and classical ML training, in contrast to previous work in complex deep learning models. We numerically demonstrate the effectiveness of the proposed acceleration by comparing the training loss with an increasing number of iterations on different sets of biological sequences. The results show that using AA significantly improves convergence and efficiently accelerates the training of traditional ML models.

2. Related Work

Iterative optimization methods like gradient descent and its variants are widely used for training ML models, but convergence can be slow, especially for high-dimensional problems. Anderson acceleration (AA) is a technique for speeding up the convergence of these methods by exploiting the geometry of the search space. It was first introduced by Anderson [10] as a way to accelerate the convergence of the conjugate gradient method and has since been applied to a variety of optimization techniques, such as Newton's method [11], stochastic gradient descent [12], and the Nelder-Mead simplex algorithm [13].

In recent years, there has been a growing interest in using Anderson acceleration for training deep neural networks, where it has been applied to a variety of tasks, such as image classification [14], natural language processing, and reinforcement learning [15]. Anderson acceleration is particularly well-suited for deep learning problems, where it has been shown to improve convergence and generalization performance [16, 17]. A method to estimate a sparse generalized linear model with convex or non-convex separable penalties using Anderson acceleration is also proposed in [17]. In these approaches, Anderson acceleration has been shown to improve convergence and generalization performance compared to traditional optimization methods, such as gradient descent. In addition, it has also been applied to logistic regression [18] and other ML models.

However, despite these advances, Anderson acceleration has not been widely applied to classical machine learning classifiers, such as support vector machines (SVM), despite the potential for improved convergence rates. This is particularly relevant for tabular data, where classical machine learning classifiers are widely used. The limited exploration of Anderson acceleration in classical machine learning classifiers is surprising, given that the technique is effective in improving convergence in other optimization problems [10, 11, 12, 13].

Another area where Anderson acceleration has shown promise is in training sparse models, such as sparse coding and dictionary learning [19]. In these applications, Anderson acceleration effectively improves convergence and achieves sparsity, an essential consideration in many machine-learning models.

In recent years, researchers have also explored the use of Anderson acceleration in the training of generative adversarial networks (GANs) [20]. In these applications, Anderson acceleration has been shown to improve convergence and stability and to produce high-quality synthesized images.

Finally, it is worth noting that Anderson acceleration has also been applied to the training of models that are robust to outlier examples and to adversarial attacks [21]. In these applications, Anderson acceleration effectively improves the robustness of machine learning models and defends against adversarial attacks.

3. Proposed Approach

This section first discusses the algorithm we use for the proposed method. Later, we discuss the theoretical understanding of Anderson Acceleration and the assumptions considered.

Anderson Acceleration (AA) attempts to make greater use of previous data than the fixed-point iteration, which only takes the most recent iterate to produce a new estimate, $x_{k+1} = g(x_k)$. The proposed method's algorithmic pseudocode is provided in Algorithm 1, and the model training flow chart is shown in Figure 1. For model training, given the feature embedding X made from SARS-CoV-2 sequences and its lineage (variants) as labels Y, the first step involves the embedding generation using the methods discussed in Section 3.2; the generated feature vectors and the labels for the sequences are then supplied to the algorithm. In the algorithm, the weight vector is first initialized with random values (1 × length of sequence). We then initialize the per-iteration values (lines 2-3 in Algorithm 1) for each iteration. Afterward, for each input sample X and its label y, we predict using the weight vector $\vec{w}$ (line 5 in Algorithm 1), the predicted value is normalized, and the gradient is updated (lines 5 and 6 in Algorithm 1). The sample loss is updated, and the iteration loss list is maintained (lines 8 to 9 in Algorithm 1). After every sample is processed, the gradient is averaged out, and the weight history is maintained for the iteration (lines 11 and 12 in Algorithm 1), also shown in Figure 1-e. Anderson acceleration is used to update the weight vector from the third iteration onward, since we need at least two weight histories. The difference between the last two weight histories is computed and multiplied by the Anderson factor $\beta$, as shown in lines 14 and 15 in Algorithm 1, also shown in Figure 1-ii. The loss and accuracy for the iteration are saved, and the next iteration repeats the same steps. Finally, after all iterations, the loss list is returned for the given input feature vectors. The loss for each iteration is captured and used to argue that Anderson acceleration is the better option for faster convergence.

The flowchart for training with Anderson Acceleration (AA) is shown in Figure 1. We provide the feature vectors X as input along with the labels Y. A few parameter initializations are required, such as the number of iterations and the weight vector, which is initialized with random values. We tried several values of the Anderson acceleration factor $\beta$ to study its impact and select the best value. An empty list for loss is also shown in Figure 1-b. For a given number of iterations, we process the samples to compute the gradient and loss for each sample. The gradient is averaged out, and we update the weight using Anderson Acceleration for that iteration, also shown in Figure 1-ii. The process is repeated for the given number of iterations.

3.1. Anderson Acceleration

One way to formally prove the convergence of Anderson acceleration is to use the concept of "linear convergence", which refers to the rate at which the optimization process approaches the optimal solution. Specifically, we can show that under certain conditions, the Anderson acceleration optimization process converges linearly, meaning that the error decreases by a constant factor at each iteration. This contrasts with standard gradient descent, which converges at a sublinear rate (i.e., the per-iteration reduction factor of the error is not bounded away from 1).

To prove this result, we can start by considering the optimization problem in the form of a series of updates to the weight vector, where the update at each iteration is given by:

$$w_{k+1} = w_k - \alpha \nabla f(w_k) \tag{1}$$

where $w_k$ is the weight vector at iteration $k$, $\alpha$ is the learning rate, and $\nabla f(w_k)$ is the gradient of the objective function at $w_k$. Now, we can add the Anderson acceleration term, weighted by the Anderson acceleration factor $\beta$, to the update, resulting in:

$$w_{k+1} = w_k - \beta (w_k - w_{k-1}) - \alpha \nabla f(w_k) \tag{2}$$

Next, we can define the error at each iteration as:

$$e_k = w_k - w^* \tag{3}$$

where $w^*$ is the optimal weight vector. Now, we can substitute the expression for the update into the expression for the error (subtracting $w^*$ from both sides of Eq. (2) and grouping terms) to get:

$$e_{k+1} = (1 - \beta) e_k + \beta e_{k-1} - \alpha \nabla f(w_k) \tag{4}$$

where we have used the fact that $w^* = w^* - \alpha \nabla f(w^*)$, i.e., the gradient vanishes at the optimum. Now, we can define the "damping factor" as:

$$\gamma = 1 - \beta \tag{5}$$

and, dropping the gradient term, which vanishes at the optimum, rewrite the expression for the error as:

$$e_{k+1} = \gamma e_k + (1 - \gamma) e_{k-1} \tag{6}$$

This expression has the form of a weighted average, where the weight of the current error is given by $\gamma$, and the weight of the previous error is given by $1 - \gamma$. Now, we can make the following assumptions:

1. The gradient of the objective function is Lipschitz continuous, i.e., there exists a constant $L$ such that

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y. \tag{7}$$

2. The objective function is bounded below, i.e., there exists a constant $f_{\min}$ such that $f(x) \ge f_{\min}$ for all $x$.

3. The optimization algorithm uses a fixed step size $\alpha$, and the sequence of points generated by the algorithm satisfies

$$\|x_{k+1} - x_k\| \le C \quad \text{for some constant } C \text{ and all } k. \tag{8}$$

These assumptions are typically made in the analysis of gradient descent algorithms. They allow us to establish certain convergence properties of the algorithm. Specifically, under these assumptions, it can be shown that the sequence of points generated by gradient descent with Anderson acceleration converges to a stationary point (a point where the gradient is zero) of the objective function at a rate of $O(1/k^2)$, where $k$ is the iteration number. This convergence rate is faster than the $O(1/k)$ achieved by plain gradient descent without Anderson acceleration.

Intuitively, Anderson acceleration can be thought of as a way to incorporate information from past iterations into the current iteration to improve the convergence rate of the optimization algorithm. This is achieved using a weighted combination of the current gradient and the difference between the current and previous iterates. The weights are chosen such that the resulting update direction better approximates the true gradient at the current iterate, leading to faster convergence.

To compute the loss, we use "cross-entropy loss" using the following expression:

$$\text{Cross Entropy Loss} = -\left(y \times \log\left(y_{\text{Pred}} + 10^{-10}\right)\right) \tag{9}$$

where $y$ is the true label, and $y_{\text{Pred}}$ is the predicted label. The 1e-10 is added to avoid the log of zero, which will cause an infinity error. The negative sign ensures the optimization problem is formulated as a minimization problem (hence, our loss can be negative).
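The training procedure described in this section can be sketched roughly as follows. This is a minimal illustration rather than the authors' Algorithm 1: it assumes the update $w_{k+1} = w_k - \beta (w_k - w_{k-1}) - \alpha \nabla f(w_k)$, a softmax as the normalization step, one-hot labels, and full-batch averaged gradients. The function names (`softmax`, `cross_entropy`, `aa_step`, `train`) and the default values of `alpha` and `beta` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax (assumed normalization of the predicted scores)."""
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(y_onehot, y_pred):
    """Cross-entropy averaged over samples; 1e-10 avoids log(0)."""
    return float(-np.mean(np.sum(y_onehot * np.log(y_pred + 1e-10), axis=1)))


def aa_step(w, w_prev, grad, alpha, beta):
    """Assumed update: w - beta * (w - w_prev) - alpha * grad."""
    return w - beta * (w - w_prev) - alpha * grad


def train(X, y_onehot, n_iter=100, alpha=0.1, beta=0.3, seed=0):
    """Train a linear classifier; returns (weights, per-iteration loss list)."""
    rng = np.random.default_rng(seed)
    # Random initial weights, one column per class (an assumption of this sketch)
    w = rng.normal(scale=0.01, size=(X.shape[1], y_onehot.shape[1]))
    history, losses = [], []
    for _ in range(n_iter):
        y_pred = softmax(X @ w)
        losses.append(cross_entropy(y_onehot, y_pred))
        grad = X.T @ (y_pred - y_onehot) / X.shape[0]  # gradient averaged over samples
        if len(history) >= 2:
            # Two stored weight vectors are available from the third iteration on
            w = aa_step(w, history[-2], grad, alpha, beta)
        else:
            w = w - alpha * grad  # plain gradient step for the first two iterations
        history.append(w.copy())
    return w, losses
```

Setting `beta=0` reduces every step to plain gradient descent, which matches the convention that a factor of 0 means no AA.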

The cross-entropy loss penalizes the predicted scores for the incorrect classes and rewards the predicted score for the correct class. During training, the goal is to minimize the cross-entropy loss so that the predicted scores for the correct class are as high as possible compared to those for incorrect classes.

We employ the three representation learning techniques below to convert the biological sequences into low-dimensional embeddings.

3.2.1. Spike2Vec [22]

This technique offers numerical embedding of the supplied input spike sequences to facilitate the use of ML models. Initially, it produces $k$-mers of the supplied spike sequence because $k$-mers are known to maintain the sequence's ordering information. For a sequence of length $N$, the total number of $k$-mers produced is $N - k + 1$.

For every particular sequence, the $k$-mers are the collection of contiguous amino-acid substrings (also known as mers) of length $k$ (also called n-grams in the NLP domain). To convert the alphabetical $k$-mers data into a numerical representation, Spike2Vec computes the frequency vector based on $k$-mers. This vector comprises the counts of each $k$-mer in the sequence. A fixed-length feature vector is then made using the generated $k$-mers and their frequencies in a sequence. The character alphabet $\Sigma$ and the length $k$ of the $k$-mers determine the length of this feature vector, which is $|\Sigma|^k$.

3.2.2. Minimizer [23]

3.2.3. Spaced k-mers

The performance of sequence classification is significantly impacted by the size and sparsity of feature vectors for sequences based on $k$-mer frequencies. Spaced $k$-mers employ non-contiguous length-$k$ sub-sequences ($g$-mers) to create compact feature vectors with reduced sparsity and size. The method first computes $g$-mers using a spike sequence as input. We then calculate $k$-mers, where $k < g$, from those $g$-mers. To conduct the trials, we used $k = 4$ and $g = 9$. The size of the gap is determined by $g - k$. However, this approach still involves bin scanning, which is computationally expensive and generates a very high-dimensional feature representation. We took 500 principal components by applying PCA [25] for high-dimensional embeddings (feature vector length > 1000).
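The $k$-mer frequency embedding described for Spike2Vec can be illustrated with a short sketch. The function names `kmers` and `kmer_frequency_vector` are illustrative and are not taken from the original Spike2Vec implementation; the sketch only assumes a sliding window over the sequence and one vector slot per possible $k$-mer over the alphabet.

```python
from itertools import product

import numpy as np


def kmers(seq, k):
    """All contiguous substrings of length k; there are len(seq) - k + 1 of them."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequency_vector(seq, k, alphabet):
    """Count vector of length len(alphabet)**k, one slot per possible k-mer."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = np.zeros(len(index), dtype=int)
    for mer in kmers(seq, k):
        if mer in index:  # skip k-mers containing characters outside the alphabet
            vec[index[mer]] += 1
    return vec
```

For example, `kmers("ACGTA", 3)` yields the three 3-mers `ACG`, `CGT`, and `GTA`, in line with the $N - k + 1$ count, and the vector length grows as $|\Sigma|^k$, which is why dimensionality reduction becomes attractive for larger $k$.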

4. Experimental Evaluation

To perform evaluation, we use datasets including Genome and Host. The details are as follows:

4.1. Dataset Statistics

4.1.1. Genome Dataset

Using the well-known and widely used database of SARS-CoV-2, GISAID [26], we retrieve the full-length nucleotide sequences of the coronavirus. Our dataset includes the COVID-19 variant information and 8220 nucleotide sequences. In our sample, there are 41 different lineages altogether. The goal is to classify the sequences and predict the lineage each one belongs to.

4.1.2. Host Dataset

The National Institute of Allergy and Infectious Diseases (NIAID) Virus Pathogen Database and Analysis Resource (ViPR) [27] and GISAID were used to retrieve the spike protein sequences from a collection of spike sequences from several clades of the Coronaviridae family, along with details about the hosts that each spike sequence has infected. The hostname is used as the class label in our classification tasks for this dataset. It displays the distribution of the dataset across the various host types (grouped by family).

4.2. Evaluation Metrics

For performance evaluation of SVM without and with Anderson acceleration, we use cross-entropy loss. The cross-entropy loss, also known as the negative log-likelihood loss, is commonly used in supervised learning problems with categorical targets. The cross-entropy loss for a single sample can be expressed mathematically as follows:

$$L = -\sum_{i=1}^{C} y_i \log(p_i),$$

where $y_i$ is 1 for the correct class and 0 otherwise, $p_i$ is the predicted score for class $i$ (so only the predicted score for the correct class contributes), and $C$ is the number of classes. The cross-entropy loss is averaged over the entire training set to obtain the final objective function optimized during training.
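As a small worked example under assumed values, take $C = 3$ classes, hypothetical predicted scores $p = (0.25, 0.5, 0.25)$, and the second class as the correct one, so that the label indicator is $y = (0, 1, 0)$. Only the correct-class term survives in the per-sample loss:

```latex
L = -\sum_{i=1}^{3} y_i \log(p_i)
  = -\left(0 \cdot \log 0.25 + 1 \cdot \log 0.5 + 0 \cdot \log 0.25\right)
  = -\log 0.5 \approx 0.693
```

A more confident correct prediction, for instance a score of 0.9 for the correct class, would give $-\log 0.9 \approx 0.105$, which shows how the loss rewards a high score for the correct class.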

5. Results And Discussion

In this section, we compare the results without and with Anderson acceleration using cross-entropy loss for different biological sequence datasets.

5.1. Results For Genome Data

The results for genome data using all embedding methods are reported in Figure 2 for the best value of the Anderson Acceleration (AA) factor $\beta$. We use cross-validation to select the best value of $\beta$ from (0, 0.1, 0.2, ⋯, 1.0) for the respective embeddings, where 0 implies no AA and 1.0 means maximum AA. For Spike2Vec embedding, we can observe that although the cross-entropy loss without Anderson acceleration is smaller with fewer iterations, the loss increases as we increase the iterations. On the other hand, the loss does not increase significantly while using Anderson acceleration in SVM. Moreover, with AA, the loss started to converge after 300 iterations, which is almost half the number of iterations needed for the loss to converge without AA (i.e., ≈ 600 iterations). For Minimizer-based embedding, although we can observe more fluctuation in loss compared to Spike2Vec, the loss (and convergence) is less when SVM is used along with AA. Similarly, the behavior of spaced $k$-mers-based embedding differs from both Spike2Vec and Minimizer-based embedding. Although we can see an overall increasing trend in loss with an increasing number of iterations, the SVM with AA loss is lower than without AA when the number of iterations increases. Overall, it is evident from all three embedding results that the loss with AA is less than the loss without AA for different embedding methods as we increase the number of iterations, showing the significance of using AA for the training of SVM.
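The selection of the AA factor over the grid (0, 0.1, ⋯, 1.0) by cross-validation could be organised along the following lines. This is only a sketch: `kfold_indices` and `select_beta` are hypothetical helper names, the number of folds is an assumption, and `fit_and_score` stands for a user-supplied callable that trains with a given factor and returns the held-out cross-entropy loss.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)


def select_beta(fit_and_score, X, y, betas, n_folds=5):
    """Return (best_beta, mean held-out loss per beta).

    fit_and_score(X_train, y_train, X_val, y_val, beta) is a hypothetical
    callable that trains with factor beta and returns the held-out loss.
    """
    folds = kfold_indices(len(X), n_folds)
    mean_loss = {}
    for beta in betas:
        fold_losses = []
        for i, val_idx in enumerate(folds):
            # Train on every fold except the i-th, validate on the i-th
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            fold_losses.append(
                fit_and_score(X[train_idx], y[train_idx], X[val_idx], y[val_idx], beta)
            )
        mean_loss[beta] = float(np.mean(fold_losses))
    best = min(mean_loss, key=mean_loss.get)
    return best, mean_loss


# Candidate grid: 0.0 (no AA), 0.1, ..., 1.0 (maximum AA)
betas = [round(0.1 * i, 1) for i in range(11)]
```

Keeping the training routine behind a callable means the same selection loop can be reused for each embedding method.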

[Figure: the x-axis shows the number of iterations, while the y-axis shows the cross entropy loss. The figure is best seen in color.]

5.2. Results For Host Data

The results for host data using all embedding methods are reported in Figure 3 for the best value of the Anderson Acceleration (AA) factor $\beta$. We use cross-validation to select the best value of $\beta$ from (0, 0.1, 0.2, ⋯, 1.0) for the respective embeddings, where 0 implies no AA and 1.0 means maximum AA. For Spike2Vec-based embedding, the behavior is not different from the same embedding in the case of Genome data. Although SVM without and with Anderson acceleration converges very fast (i.e., in < 100 iterations), the cross-entropy loss with AA is smaller than that of SVM without AA. We observed some improvement in the SVM without AA in the Minimizer and spaced $k$-mers-based embedding methods. However, when the number of iterations is smaller, we can observe some fluctuation in the cross-entropy loss for SVM without AA, compared to the smooth loss curve for SVM with AA, showing its significance in efficient training of the SVM classifier.

6. Conclusion

We apply Anderson acceleration to the training of a support vector machine (SVM) classifier. Our experiments on several sequence-based bioinformatics datasets show that Anderson acceleration results in a considerable decrease in training loss and improved convergence compared to the standard SVM. In the future, we will investigate more traditional linear classifier models, such as the Perceptron, and bigger biological data to assess their scalability and resilience. Moreover, evaluating the robustness and generalizability of the proposed Anderson acceleration method is also an interesting future extension.

References

[...]ference on Machine Learning, 2020, pp. 6620-6629.
[13] R. R. Barton, J. S. Ivey Jr, Modifications of the Nelder-[...] [...]sponse optimization, Technical Report, 1991.
[14] M. L. Pasini, J. Yin, V. Reshniak, M. K. Stoyanov, [...] deep learning models, in: SoutheastCon 2022, 2022, pp. 289-295.
[15] Zuo, Huang, Li, Gong, Offline rein[...] robotic tasks, Applied Intelligence (2022) 1-14.
[16] M. L. Pasini, J. Yin, V. Reshniak, M. Stoyanov, Sta[...] preprint arXiv:2110.14813 (2021).
[17] Bertrand, Klopfenstein, P.-A. Bannier, G. Gidel, Massias, Beyond l1: Faster and better [...] arXiv:2204.07826 (2022).
[18] Bertrand, Massias, Anderson acceleration of [...] on Artificial Intelligence and Statistics, 2021, pp. 1288-1296.
[19] Rodriguez, Computational assessment of the an[...] (STSIVA), 2021, pp. 1-5.
[20] He, Zhao, Xi, C. J. Ho, Saad, Solve min[...]tions, 2022.
[21] Garstka, Cannon, Goulart, Safeguarded [...] 2022, pp. 435-440.
[22] Ali, M. Patterson, Spike2vec: An efficient and scalable embedding approach for covid-19 spike [...] Big Data (Big Data), 2021, pp. 1533-1540.
[23] Roberts, Hayes, Hunt, Mount, J. Yorke, [...]parison, Bioinformatics 20 (2004) 3363-3369.
[24] Singh, Sekhon, et al., Gakco: a fast gapped [...] and Knowledge Discovery in Databases, 2017, pp. 356-373.
[25] Wold, Esbensen, Geladi, Principal compo[...] [...]ratory systems 2 (1987) 37-52.
[26] GISAID Website, https://www.gisaid.org/, 2021. [Online; accessed 17-October-2022].
[27] B. E. Pickett, E. L. Sadat, Zhang, J. M. Noronha, [...] acids research 40 (2012) D593-D598.