<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Preliminary statistical analysis of amino acid sequence embeddings of proteins</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krzysztof Fidelis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mykhailo Luchkevych</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaroslav Teplyi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Genome Center</institution>
          ,
          <addr-line>UC Davis, Davis, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera Street 12 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Recent extensive research in the field of bioinformatics aimed at predicting the 3D structure of proteins from their amino acid sequence using LLMs has generated large datasets of numerical data on the relative positioning of amino acids in sequences, known as embeddings. These data banks are publicly accessible, enabling their analysis and utilization, particularly for tasks such as identifying sets of typical elements of protein structures. Recognizing typical substructures could significantly simplify the protein analysis process, which involves more than 240 million proteins. This work explores the main statistical characteristics of amino acid sequence embeddings of protein pairs, both significantly similar and distinctly different in composition and structure, in order to identify patterns in their behavior (linearity, stationarity, probability distribution laws, and other parameters), ensuring the correctness of applying corresponding models and methods in the future.</p>
      </abstract>
      <kwd-group>
        <kwd>Protein structure analysis</kwd>
        <kwd>ESM-2 model</kwd>
        <kwd>sequence embeddings</kwd>
        <kwd>statistical analysis</kwd>
        <kwd>sequence alignment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The use of Large Language Models (LLMs) has revolutionized various fields of study, extending their
impact to the domain of protein amino acid sequence analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recent innovations have leveraged
LLMs to decode protein sequences, significantly advancing our understanding and capabilities in
constructing detailed spatial structure databases. Among these innovations, the ESM-2 database [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
stands out as a pivotal development. ESM-2, an open-access database, encapsulates an extensive array
of protein spatial structures, thus providing a valuable resource for biochemists and bioinformatics
researchers. This enables an in-depth exploration of the functional attributes of proteins in correlation
with their three-dimensional conformations.
      </p>
      <p>
        Utilizing LLMs for these purposes transforms the representation of proteins into a multidimensional
vector space where each amino acid’s embedding reflects its potential spatial relationships within the
protein’s folded structure. This approach not only enhances the precision of structural predictions
but also introduces a quantitative method to assess the likelihood of proximal interactions among
the amino acids in a given protein. The effectiveness and accuracy of these models are rigorously
evaluated through the Critical Assessment of protein Structure Prediction (CASP) project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where
computational predictions are juxtaposed with experimentally determined structures, validating the
reliability of the models.
      </p>
      <p>Central to the effectiveness of LLMs is the preliminary processing and statistical analysis of data.
The architecture of the system and the specific algorithms employed, particularly the deep learning
components of LLMs, critically influence the characteristics of the resulting embedding arrays. This
initial data processing phase is crucial, as it ensures that subsequent analyses and applications of the
data rest on a robust and reliable foundation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Analysis of recent research</title>
      <p>
        Protein language models (pLMs) have significantly advanced our understanding of the relationships
within protein sequences, providing a numerical representation of their structural and evolutionary
features. Recent developments such as Embedding-based Alignment (EBA), an approach
introduced in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], highlight the potential of using high-dimensional sequence embeddings from pLMs
in protein structure analysis. This approach was effectively used to detect distant homologies in the
so-called ’twilight zone’ [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where sequence similarities are not readily apparent.
      </p>
      <p>The authors demonstrate that EBA surpasses both traditional sequence alignment methods and other
pLM-based approaches in detecting structural similarities, without the need for training or parameter
optimization. The use of embeddings allows EBA to capture deeper evolutionary relationships, offering
a significant improvement in identifying structural similarities in proteins with low sequence identity.</p>
      <p>
        We adopt a similar approach: our research aims to expand on these results by exploring a
variation of the EBA method, focusing specifically on the statistical characterization of sequence
embeddings through autocorrelation and correlation analysis. Our methodology differs from the
EBA proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by analyzing the spatial relationships within protein sequences, which are encoded
in the embeddings generated by models like ESM-2.
      </p>
      <p>
        Findings from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] demonstrated that the statistical distribution of amino acid sequences supports
Darwinian evolution. Their research showed that certain peptide combinations occur rarely, suggesting
evolutionary constraints. These constraints may be reflected in the distribution of embeddings, which
could indicate evolutionary pressure shaping protein structures. The confirmation of the statistical
nature of amino acid distributions complements the statistical analysis of sequence embeddings.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Research purpose</title>
      <p>This study conducts a preliminary analysis of protein sequence embeddings using autocorrelation
functions. The goal is to detect repetitive patterns within these embeddings that may indicate underlying
structural or functional elements in the proteins. By employing a sliding window across the sequence
embedding dimensions, we assess the local repetitiveness of these patterns to identify motifs suggestive
of structural features.</p>
      <p>Additionally, this analysis evaluates the similarity between protein pairs by correlating their
autocorrelation outcomes, providing a detailed view of how similarities are distributed across the entire
sequence. This work sets the foundation for developing a variation of the Embedding-based Alignment
(EBA) method.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Problem formulation</title>
      <p>Consider a language model, denoted M, for predicting the three-dimensional structure of a protein
from its sequence S. The input data for the model is the sequence of amino acids in the protein, S =
{s₁, s₂, . . . , sₙ}, where sᵢ represents an individual element of the sequence (the letter corresponding
to an amino acid), and n indicates the number of amino acids in the sequence. The model M is defined
by a set of pre-learned parameters θ.</p>
      <p>After processing the sequence S, the model outputs a set of parameters O. The mapping of S to O can
be formally described as the function f_M(S) = O, where f_M summarizes the computational logic of
model M with parameters θ.</p>
      <p>Among the set of output parameters O, we focus on one specific parameter E, which is the subject
of this study. This parameter provides an internal representation of the protein’s sequence in the form of vectors,
also called embeddings. The parameter E is a matrix that represents the mapping of the input sequence
into a higher-dimensional space of shape (n, 1024), where n is the length of the sequence. Hence, E ∈ ℝ^(n×1024),
where each row in E corresponds to an element from S transformed into a 1024-dimensional vector
(embedding), which encodes contextual information about the given element, its properties, and its
interactions with other elements in the sequence.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Statistical Analysis of Sequence Embeddings</title>
      <sec id="sec-5-0">
        <title>5.0.1. Sequence Alignment and Similarity Metrics</title>
        <p>The proteins selected for this analysis are 101 and 1 1, which have been identified as similar,
as well as 11 and 161, which are considered dissimilar. The criteria for this categorization are
based on structural features and evolutionary relationships inferred from sequence homology.</p>
        <p>
          Multiple sequence alignment (MSA) is a crucial tool in bioinformatics and has been employed to
align the amino acid sequences of the chosen proteins. The quality of alignment is quantified by an
MSA score, which assesses the degree of conservation and similarity between sequences. A higher
score denotes a greater level of similarity. Alignments were generated using the T-Coffee program [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>The alignment results (Figure 1) for the similar proteins, 101 and 1 1, show a high degree of
conservation, as indicated by a score of 986. The corresponding MSA visual (left side of the attached
figure) shows that a significant number of residues are identical (marked by an asterisk ’*’) or have strong
similarities (marked by a colon ’:’ or a period ’.’). This suggests these proteins may share functional and
structural properties.</p>
        <p>In contrast, the MSA for the dissimilar proteins, 11 and 161, yields a score of 358, reflecting a low
level of similarity. The alignment (right side of the attached figure) has fewer conserved residues and
shows considerable variation between these sequences, suggesting that they have different functions
or structures.</p>
        <p>These MSA results provide a baseline for the subsequent statistical analysis. By establishing the
degree of similarity through MSA scores and visual inspection, we can set expectations for how these
similarities or differences might manifest in various statistical measures such as embeddings’ magnitude
distributions, distances, angles, etc.</p>
      </sec>
      <sec id="sec-5-1">
        <title>5.1. Outlier Normalization</title>
        <p>Prior to the application of statistical methods to analyze protein sequence embeddings, an initial
inspection of the embeddings revealed the presence of extreme outlier values. These outlier values are significantly
higher or lower than the rest of the data and are observed consistently at identical indices across all sequence
embeddings. Due to their magnitude, these outliers have the potential to influence subsequent statistical
computations, thereby skewing the analysis results.</p>
        <p>The visualization of the embeddings is depicted in the first set of plots, illustrating spikes (Figure 2).
These peaks are consistent across all embeddings of different proteins, suggesting a systematic anomaly
of the ESM-2 model rather than random or natural variation within the protein structure representation.</p>
        <p>To address this anomaly, a normalization method was applied, where the top five maximum and the
bottom five minimum values were adjusted by taking the average of the two adjacent values. This
threshold of five values was chosen based on empirical observations and analysis to effectively
remove the outliers without affecting the informational content of the embeddings.</p>
        <p>The second set of plots (Figure 3) displays the embeddings after normalization, where the previously
visible spikes have been truncated. The consistency across different embeddings indicates that
removing the top and bottom five values addresses the issue. This normalization step is critical, as
it ensures that the subsequent analytical methods reflect structural properties rather than
artifacts introduced by outlier data points.</p>
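        <p>As a concrete illustration, the adjacent-average replacement described above can be sketched in a few lines of NumPy. This is a minimal sketch under our own assumptions: the function name, the default of five values per side, and the edge handling are ours, not the authors’ exact routine.</p>
        <preformat>
```python
import numpy as np

def normalize_outliers(embedding, k=5):
    """Replace the k largest and k smallest entries of a 1-D embedding
    with the mean of their two adjacent entries (edge entries fall back
    to the single available neighbour). Hypothetical sketch, not the
    authors' exact routine."""
    emb = np.asarray(embedding, dtype=float).copy()
    order = np.argsort(emb)
    extreme = np.concatenate([order[:k], order[-k:]])  # k minima, k maxima
    for i in extreme:
        left = emb[i - 1] if i > 0 else emb[i + 1]
        right = emb[i + 1] if i + 1 != emb.size else emb[i - 1]
        emb[i] = 0.5 * (left + right)
    return emb
```
        </preformat>
        <p>Because the spikes sit at fixed, well-separated indices, averaging the two neighbours of each extreme point leaves the remaining signal essentially untouched.</p>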
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Distribution Analysis of Embedding Dimensions</title>
        <p>The visualization presented in Figure 4 demonstrates the distribution of normalized embedding values
across selected dimensions of protein sequences. Histograms are used to compare the distributions
between protein pairs that are considered similar and dissimilar, respectively.</p>
        <p>For the similar proteins (101 and 1 1), the first two histograms in the top row represent the
distribution of values at specific dimensions (index 10 and index 100). The distributions overlap
noticeably, indicating a high degree of similarity in these embedding dimensions. This suggests that
the embeddings capture similar structural or functional features within these dimensions.</p>
        <p>The third histogram in the top row depicts the mean value distribution across all dimensions. The
concentration of values around the center and the bell-shaped distribution is indicative of the embeddings
capturing a consistent pattern across dimensions.</p>
        <p>Conversely, the bottom row compares the dimension value distributions for proteins 11 and
161, which are dissimilar. Here, the first two histograms (index 10 and index 230) show a shift in the
frequency of values, suggesting a difference in the structural or functional properties encoded by these
dimensions, though this is not the case for every dimension.</p>
        <p>The mean value histogram for these dissimilar proteins shows a distribution that overlaps with
the one observed for the similar proteins. This indicates that while there is a commonality in the overall
embedding pattern (as shown by the shape of the distribution), the distributions across specific dimensions
may differ.</p>
        <p>The central, bell-shaped distributions seen across the mean histograms support the consistency and
validity of using distribution analyses in protein structure comparison studies. This consistency also
suggests that the embeddings may follow an underlying statistical distribution.</p>
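        <p>The degree of overlap seen in these histograms can also be quantified. One simple option, sketched below with NumPy only (the function name and bin count are our own choices, not from this work), is the overlap coefficient of two normalized histograms over a shared bin range:</p>
        <preformat>
```python
import numpy as np

def histogram_overlap(a, b, bins=30):
    """Overlap coefficient of two samples: the summed minimum of their
    normalized histograms over a shared bin range (1.0 = identical)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi))
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return float(np.minimum(ha / ha.sum(), hb / hb.sum()).sum())
```
        </preformat>
        <p>Under this measure, dimension cuts of similar proteins would score near 1, while shifted cuts would score lower.</p>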
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Embedding Magnitude Analysis and Statistical Measures</title>
        <p>The computation of embedding magnitudes serves to quantify the strength or intensity of the protein
sequence embeddings. We assess the stationarity of this magnitude distribution, as stationary processes
allow for the reliable application of statistical measures such as mean, median, and variance over time,
yielding consistent and interpretable results across different segments of the sequence.</p>
        <p>The histograms in Figure 5 provide the frequency distribution of embedding magnitudes for both
similar and dissimilar proteins. Contrary to initial expectations, the magnitude distributions of the
similar (101 and 1 1) and dissimilar (11 and 161) protein pairs are analogous, which
means that magnitude alone does not distinguish between the similarities or differences in protein
structures.</p>
        <p>For the similar proteins, the running means over the magnitudes show considerable overlap, and the
local fluctuations are largely aligned, indicating that the embedding magnitudes change similarly over
the course of the sequences. This alignment of local fluctuations suggests that the similar proteins have
analogous dynamic behavior in their structure along the sequence.</p>
        <p>In contrast, the running mean plots for the dissimilar proteins do not show the same degree of overlap or
alignment. While the overall trend lines appear to be stationary for both similar and dissimilar proteins,
the patterns of local fluctuation provide evidence that the embeddings are sensitive to differences in
protein structures.</p>
        <p>The trend lines in the running mean plots remain relatively flat and parallel to the x-axis for all
protein pairs, supporting the stationarity of the process. This confirms that the embedding magnitudes
do not display long-term trends or drifts, ensuring that subsequent statistical analyses like mean and
variance calculations are meaningful.</p>
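        <p>The running-mean check described above can be sketched compactly in NumPy. This is a hedged illustration: the window size, the tolerance, and the half-versus-half drift criterion are our own choices, not the paper’s.</p>
        <preformat>
```python
import numpy as np

def running_mean(x, window=20):
    """Trailing running mean computed with a cumulative sum."""
    c = np.cumsum(np.insert(np.asarray(x, dtype=float), 0, 0.0))
    return (c[window:] - c[:-window]) / window

def mean_is_stationary(x, window=20, tol=0.1):
    """Crude drift check: the running means of the two halves of the
    sequence should differ by much less than the overall spread."""
    x = np.asarray(x, dtype=float)
    rm = running_mean(x, window)
    half = rm.size // 2
    drift = abs(rm[:half].mean() - rm[half:].mean())
    return not drift > tol * (np.std(x) + 1e-12)
```
        </preformat>
        <p>A flat trend line passes this check; a drifting one, such as magnitudes with a systematic slope along the sequence, fails it.</p>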
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Probability Distribution</title>
        <p>As the last step of the embeddings analysis, we evaluate the probability distributions of
embedding features. We focus on two aspects: the distribution of dimension cuts and the magnitudes of
the embeddings.</p>
        <p>Our findings (Figure 6) reveal that both the distribution of embedding dimension cuts at a fixed index
and the distribution of mean values across all dimensions adhere to a normal distribution. For biological data, where
a multitude of factors contribute to the final observation, such a distribution is indicative of a robust
underlying model that produces a stable, predictable pattern.</p>
        <p>In contrast, the magnitudes of the embeddings follow a lognormal distribution, characteristic of
processes governed by multiplicative factors. The lognormal nature of the magnitudes could reflect the
exponential growth processes that underlie protein folding and development, where factors multiply,
leading to the right-skewed distribution observed in our results.</p>
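        <p>A quick diagnostic consistent with this observation: a lognormal sample is right-skewed, while its logarithm is approximately symmetric. The sketch below uses only NumPy; the skewness estimator is standard, and the function name is our own.</p>
        <preformat>
```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by the cubed
    standard deviation."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / (np.std(x) ** 3 + 1e-12))
```
        </preformat>
        <p>Applied to embedding magnitudes, a clearly positive skewness that vanishes after a log transform is consistent with the lognormal fit reported here.</p>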
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Methodology</title>
      <sec id="sec-6-1">
        <title>6.1. Autocorrelation Function</title>
        <p>We define the autocorrelation function for a vector v ∈ ℝⁿ, where n is the length of the vector, using
a sliding window of size w, and denote it ACF(v, w). By applying the autocorrelation function
to each window of size w sliding over the input vector v, we obtain a matrix of autocorrelation functions
A ∈ ℝ^((n−w+1)×w). We then normalize the matrix A using the normalization function N(A).</p>
        <p>ACF(v, w)ᵢ = acf(v[i : i+w]) (1)
where acf is the autocorrelation function, j ∈ {1, . . . , w} indexes the lags, and i ∈ {1, . . . , n − w + 1} indexes the windows.</p>
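        <p>A minimal NumPy sketch of this definition follows; the naming is our own. Dividing by the lag-0 value also realizes the row-wise maximum normalization of the next subsection, since the mean-centred autocorrelation peaks at lag 0.</p>
        <preformat>
```python
import numpy as np

def autocorrelate(v, window):
    """Matrix of windowed autocorrelations: row i holds the
    autocorrelation of v[i:i+window] at lags 0..window-1, divided by
    its lag-0 value."""
    v = np.asarray(v, dtype=float)
    rows = []
    for i in range(v.size - window + 1):
        seg = v[i:i + window] - v[i:i + window].mean()
        ac = np.correlate(seg, seg, mode="full")[window - 1:]  # lags 0..window-1
        rows.append(ac / (ac[0] + 1e-12))
    return np.vstack(rows)
```
        </preformat>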
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Normalization Function</title>
        <p>We define the normalization function N(A) for the autocorrelation matrix: it takes a matrix
A as input and normalizes each row by that row’s maximum value, resulting in a matrix A′ in which
each row is divided by its maximum value:</p>
        <p>N(A)ᵢ = Aᵢ / max(Aᵢ) (2)
where i ∈ {1, . . . , n − w + 1} indexes the rows.</p>
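        <p>Expressed in NumPy, this is a one-line operation (a sketch; it assumes each row’s maximum is positive, which holds when the rows are autocorrelations peaking at lag 0):</p>
        <preformat>
```python
import numpy as np

def normalize_rows(a):
    """Divide each row of a matrix by that row's maximum value."""
    a = np.asarray(a, dtype=float)
    return a / (a.max(axis=1, keepdims=True) + 1e-12)
```
        </preformat>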
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Self-Similarity of Embeddings</title>
        <p>Given the matrix E ∈ ℝ^(n×1024) that represents the embeddings of the protein sequence, we transpose
this matrix, Eᵀ ∈ ℝ^(1024×n), to compute the autocorrelation. Self-similarity is then computed between
corresponding dimensions of the embeddings across the entire sequence. We define the self-similarity
function for the sequence:</p>
        <p>SS(E, w) = ACF(Eᵀ, w) ∈ ℝ^(1024×(n−w+1)×w) (3)
where d ∈ {1, . . . , 1024} indexes the embedding dimensions.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Similarity of Two Sequences</title>
        <p>For the first sequence, we compute SS(E₁, w) = A₁ ∈ ℝ^(1024×(n₁−w+1)×w), and for the second sequence,
accordingly, SS(E₂, w) = A₂ ∈ ℝ^(1024×(n₂−w+1)×w). We then calculate the Pearson correlation coefficient
between each fragment of A₁ of length w from the first sequence and all fragments of A₂
from the second sequence in the given dimension. Let P be the set of fragments from A₁ of
length w, where i ∈ {1, . . . , n₁ − w + 1}, and Q be the set of fragments of length w from A₂, where
j ∈ {1, . . . , n₂ − w + 1}. For each fragment of P, we compute the Pearson correlation coefficient
with each fragment of Q. The result is a correlation matrix C ∈ ℝ^((n₁−w+1)×(n₂−w+1)), where each
element c_ij represents the correlation coefficient between fragment Pᵢ and fragment Qⱼ:</p>
        <p>c_ij = cov(Pᵢ, Qⱼ) / (σ(Pᵢ) × σ(Qⱼ)) (4)</p>
        <p>By applying this function to all dimensions, we calculate a correlation tensor of the two sequences,
R ∈ ℝ^(1024×(n₁−w+1)×(n₂−w+1)). The resulting tensor contains information about the mutual
similarity between the embedding dimensions of the sequences, describing the local similarity of the two protein sequences.</p>
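        <p>The fragment-by-fragment Pearson computation for a single dimension can be vectorized by z-scoring the rows of both autocorrelation matrices. A hedged NumPy sketch (the names are ours):</p>
        <preformat>
```python
import numpy as np

def fragment_correlation(A1, A2):
    """Pearson correlation between every row of A1 and every row of A2,
    where rows are windowed autocorrelation fragments of equal length."""
    z1 = (A1 - A1.mean(axis=1, keepdims=True)) / (A1.std(axis=1, keepdims=True) + 1e-12)
    z2 = (A2 - A2.mean(axis=1, keepdims=True)) / (A2.std(axis=1, keepdims=True) + 1e-12)
    return z1 @ z2.T / A1.shape[1]
```
        </preformat>
        <p>The matrix product over z-scored rows computes all pairwise correlations at once, avoiding an explicit double loop over fragments.</p>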
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Algorithm</title>
        <p>The algorithm described below provides a methodology for computing the correlation between two
sets of protein sequence embeddings. It is designed to be invariant to the outliers described in the previous
section, since it operates on relative values within each dimension of the embeddings.</p>
        <preformat>
function Autocorrelate(vector V, integer window_size)
    length = size of V
    initialize array results with size (length - window_size + 1)
    for i from 0 to (length - window_size) do
        segment = slice of V from i to i + window_size
        autocorrelation = correlate segment with itself
        autocorrelation = normalize(autocorrelation)
        results[i] = autocorrelation
    end for
    return results
end function
        </preformat>
        <p>Listing 1: Autocorrelation Computation</p>
        <p>The Autocorrelate function computes the autocorrelation of a given vector, segment by segment,
within a defined window size. Normalization can be applied to each autocorrelation result to further
ensure that the analysis is not skewed by extreme values.</p>
        <preformat>
function EmbeddingsCorrelation(matrix S1, matrix S2, integer window_size)
    smaller, larger = order matrices S1 and S2 by size
    autocorr_smaller = Autocorrelate(smaller, window_size)
    autocorr_larger = Autocorrelate(larger, window_size)
    initialize correlation matrix
    for i from 0 to size of autocorr_smaller do
        for j from 0 to size of autocorr_larger do
            correlation[i, j] = Pearson correlation of autocorr_smaller[i] and autocorr_larger[j]
        end for
    end for
    return correlation
end function
        </preformat>
        <p>Listing 2: Embedding Correlation Computation</p>
        <p>EmbeddingsCorrelation uses the autocorrelated data to compute the Pearson correlation
coefficients across all pairs of autocorrelated segments between the two input matrices. The process accounts
for the relative nature of the data, which is why the presence of outliers at specific indices does not
distort the analysis.</p>
        <p>Finally, the overall procedure iterates over each embedding dimension, applying
EmbeddingsCorrelation to build a correlation matrix for the entire set of embeddings. This
matrix contains a detailed view of the similarities between the two protein sequences across all
embedding dimensions, reflecting both local and global patterns in the data.</p>
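        <p>Putting the pieces together, the whole pipeline can be sketched compactly in NumPy. This is a self-contained illustration under our own assumptions about naming and normalization, not the authors’ exact implementation:</p>
        <preformat>
```python
import numpy as np

def embeddings_correlation(E1, E2, window):
    """For each embedding dimension, correlate every windowed-ACF
    fragment of sequence 1 with every fragment of sequence 2."""
    def acf_rows(v):
        # windowed autocorrelation, normalized by its lag-0 value
        rows = []
        for i in range(v.size - window + 1):
            seg = v[i:i + window] - v[i:i + window].mean()
            ac = np.correlate(seg, seg, mode="full")[window - 1:]
            rows.append(ac / (ac[0] + 1e-12))
        return np.vstack(rows)

    def pearson_all(A, B):
        # Pearson correlation of every row of A against every row of B
        za = (A - A.mean(axis=1, keepdims=True)) / (A.std(axis=1, keepdims=True) + 1e-12)
        zb = (B - B.mean(axis=1, keepdims=True)) / (B.std(axis=1, keepdims=True) + 1e-12)
        return za @ zb.T / A.shape[1]

    return np.stack([pearson_all(acf_rows(E1[:, d]), acf_rows(E2[:, d]))
                     for d in range(E1.shape[1])])
```
        </preformat>
        <p>Comparing a sequence with itself yields a perfect diagonal in every dimension slice, which is the pattern the heatmaps in the experimental section look for between different proteins.</p>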
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Experimental Results</title>
      <p>We analyze several pairs of protein sequences by evaluating the correlation between their embeddings
using the established methodology. We experimentally verify the ability of the ESM-2 model to learn
the dependencies and evolutionary context of sequences and to encode this information in sequence
embeddings. The resulting correlation matrix was visualized as a heatmap and compared with an MSA
alignment. We present three cases of protein sequence comparison:</p>
      <p>The left section of Figure 7 presents a heatmap generated by applying formula (4), which computes
the correlation between two protein sequences. The heatmap’s axes correspond to the sequences of
the two proteins 101 and 1 1, where a diagonally aligned signal indicates similarity or identity,
suggesting functional and structural parallels between the proteins, which corresponds to these proteins’
MSA alignment.</p>
      <p>The right section of Figure 1 displays the sequence alignment for 101 and 1 1. Each block’s
alignment includes a conservation score, where an asterisk ’*’ denotes identical amino acids at that
position, indicating a perfect match. A colon ’:’ marks positions with chemically similar, yet different,
amino acids, indicative of conservative substitutions. A space ’ ’ represents positions where the amino
acids differ significantly, termed non-conservative substitutions. A period ’.’ denotes semi-conservative
substitutions, where the amino acids are moderately similar. The color gradient from green to red across
the panel reflects the alignment’s varying quality, from low to high.</p>
      <p>In Figure 8, we observe a correlation heatmap for a pair of protein sequences with more complicated
alignment patterns. The heatmap’s primary diagonal shows areas where the sequences align, indicating
similarity. Notably, in the center, the alignment shifts and later realigns, suggesting a gap followed by a
return to similarity. This observation is reflected in the MSA on the right, where asterisks and colons
mark similar regions, and dashes ’-’ indicate sequence gaps, mirroring the heatmap’s diagonal shifts.</p>
      <p>Figure 9 showcases a scenario where protein sequences 161 and 11 exhibit minimal similarity.
The heatmap lacks distinct patterns, aligning with the infrequent and scattered matches in the MSA,
suggesting that the sequences share only isolated regions of structural or functional commonality.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <p>A preliminary statistical analysis of the embedding arrays of selected protein amino acid sequences has
been conducted. It was found that there are characteristic outliers in the numerical values of
certain embedding projections: dimensions 247 and 37 consistently hold the extreme maximum
and minimum outlier values, respectively. Normalizing these by replacing them with the average value
of the two adjacent readings allows for further processing and analysis of the data.</p>
      <p>In the statistical analysis of protein sequence embeddings, histograms were employed to examine the
distribution across various dimensions for both similar and dissimilar protein pairs. The results revealed
that similar proteins exhibited overlapping distribution patterns in specific dimensions, suggesting
shared structural or functional features through spatial proximity, while dissimilar proteins showed
shifted distributions, indicating varying structural characteristics. Further, the magnitude of these
embeddings was analyzed, and the stationarity of the process was confirmed using the running mean method,
which allowed us to compute statistical measures such as mean, median, and variance. Both similar and
dissimilar protein pairs displayed analogous statistical and magnitude characteristics. Additionally,
the probability distribution analysis showed that embedding dimensions generally follow a normal
distribution, whereas embedding magnitudes adhere to a log-normal distribution, which may reflect the
multiplicative biological processes inherent in protein folding. These findings enhance our
understanding of the protein folding process and support the initiative to use correlation and autocorrelation analysis
to develop an Embedding-based Alignment (EBA) method.</p>
      <p>The application of covariance and autocorrelation analysis to ESM-2 sequence embeddings showed
that the model can learn the evolutionary character of sequence development and sequence
interrelationships. By evaluating the correlation between embeddings of different protein pairs, we observed
clear patterns and experimentally verified the results against the MSA alignment of these sequences, which
further confirmed the proposed analysis method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Protchatgpt: Towards understanding proteins with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.09649</source>
          (
          <year>2024</year>
          ). doi:10.48550/arXiv.2402.09649.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Akin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Smetanin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verkuil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kabeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shmueli</surname>
          </string-name>
          , et al.,
          <article-title>Evolutionary-scale prediction of atomic-level protein structure with a language model</article-title>
          ,
          <source>Science</source>
          <volume>379</volume>
          (
          <year>2023</year>
          )
          <fpage>1123</fpage>
          -
          <lpage>1130</lpage>
          . doi:10.1126/science.ade2574.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kryshtafovych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schwede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Topf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fidelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moult</surname>
          </string-name>
          ,
          <article-title>Critical assessment of methods of protein structure prediction (CASP), Round XV</article-title>
          ,
          <source>Proteins: Structure, Function, and Bioinformatics</source>
          <volume>91</volume>
          (
          <year>2023</year>
          )
          <fpage>1539</fpage>
          -
          <lpage>1549</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1002/prot.26617</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pantolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Studer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tauriello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schwede</surname>
          </string-name>
          ,
          <article-title>Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>40</volume>
          (
          <year>2024</year>
          )
          <elocation-id>btad786</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1093/bioinformatics/btad786</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Rost</surname>
          </string-name>
          ,
          <article-title>Twilight zone of protein sequence alignments</article-title>
          ,
          <source>Protein Engineering, Design and Selection</source>
          <volume>12</volume>
          (
          <year>1999</year>
          )
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/protein/12.2.85</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Eitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gawęda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marciniak</surname>
          </string-name>
          ,
          <article-title>Statistical distribution of amino acid sequences: a proof of Darwinian evolution</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>26</volume>
          (
          <year>2010</year>
          )
          <fpage>2933</fpage>
          -
          <lpage>2935</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/bioinformatics/btq571</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Di Tommaso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Xenarios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Orobitg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montanyola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Taly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Notredame</surname>
          </string-name>
          ,
          <article-title>T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>39</volume>
          (
          <year>2011</year>
          )
          <fpage>W13</fpage>
          -
          <lpage>W17</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gkr245</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>