     Variation in the Structure of Cyanobacteria Genomes
            with the Length of the Sliding Window*

           Vladimir Gustov1[1111–2222–3333–599X], Maria Senashova2[0000–0002–1023–7103]
                        and Mikhail Sadovsky2[0000–0002–1807–0715]
            1 Siberian Federal University, 79 Svobodny st., Krasnoyarsk 660041, Russia
                   2 Institute of Computational Modelling of the Siberian Branch

    of the Russian Academy of Sciences, 50/44 Akademgorodok, Krasnoyarsk, 660036, Russia
                             {msen,msad}@icm.krasn.ru


         Abstract. A seven-cluster pattern for bacterial genomes has been reported. The
         pattern is revealed through the distribution of formally identified short
         fragments of a genome converted into triplet frequency dictionaries. The pattern
         is found to depend on the parameters of the genome fragmentation,
         namely the length of a formally identified fragment within the genome and the
         step with which the window is moved to obtain the ensemble of fragments that
         reveals the clustering pattern. Here, we present the results of studying the
         impact of the fragment length on the type and shape of the pattern.

         Keywords: Order, Distribution, Clustering, Evolution, Symmetry.


1        Introduction

The diversity of structures found in biological macromolecules grows steadily.
Their analysis is a key issue for researchers in classical biology as well as in
mathematics and computer science. Previously, some new structures were reported for
bacteria [2, 3], transcriptomes [6, 8, 9] and chloroplasts [5, 7, 10]. The structure
manifests in the mutual distribution of the formally identified fragments of a genome;
to reveal the structure, one must convert the fragments into triplet frequency
dictionaries $W_{(3,3)}$ and trace their distribution in the 64-dimensional metric space.
The fragments tend to gather into clusters, and this clustering is what we call structuredness.
   The choice of the biological material for such studies matters. Setting aside
sequencing, assembly, annotation, etc., one faces an extremely high complexity
of the objects under consideration. Prokaryotic organisms seem to be more convenient
for such studies than eukaryotic ones: prokaryotic genomes are significantly shorter
and (almost always) consist of a single circular chromosome. Organelle genomes are
even more advantageous than prokaryotic ones in this respect, since
they encode the same function. Thus, there is no functional impact on the study of the


*   Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
    License Attribution 4.0 International (CC BY 4.0).

interplay mentioned above, if a researcher studies the organelle genomes of the same
group.
   The papers mentioned above study the inner structuredness of
various genetic entities. The point is that all these studies were carried out with the same
tiling parameters; a tiling is a specific coverage of a genome with a set of (possibly
overlapping) windows that identify the fragments for further clustering and
for revealing the inner structuredness.
   Here we present some preliminary studies of the impact of variation in the tiling
parameters on the structure revealed in the genome. Essentially, we check what
happens to the structures reported previously as the length of a tile increases. This
growth means that one takes into consideration longer and longer correlations in the
features found in a genome. Here, both features and correlations should be understood
in a broad sense.


2      Material and Methods

A tiling is a set of (possibly overlapping) subsequences covering the genome under
consideration. To define it, consider a genetic sequence of length L over the
four-letter alphabet ℵ = {A, C, G, T}; no other symbols are supposed to occur in the
sequence. A window of length Δ identifying a fragment (a tile) within the sequence
moves with the step t; in the standard setting we take Δ = 603 and t = 202. Moving
the window along the sequence to the right with the given step t, one obtains the
tiling. The latter is unambiguously determined by two
parameters: Δ and t.
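
The tiling described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, positions are 0-based, the sequence is treated as linear (no wrap-around for a circular chromosome), and a trailing window shorter than Δ is simply dropped; none of these choices is specified in the text.

```python
def tile_starts(seq_len: int, window: int = 603, step: int = 202) -> list[int]:
    """Return 0-based start positions of all full windows of length `window`.

    Assumption: the sequence is handled as linear, so incomplete windows
    at the right end are discarded.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return list(range(0, seq_len - window + 1, step))


def tiles(sequence: str, window: int = 603, step: int = 202):
    """Yield (start, fragment) pairs covering `sequence` with the given tiling."""
    for start in tile_starts(len(sequence), window, step):
        yield start, sequence[start:start + window]
```

With Δ = 603 and t = 202, neighbouring tiles overlap by 401 nucleotides, so each nucleotide away from the sequence ends is covered by two or three tiles.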
    As soon as the tiling is developed, each tile is converted into the triplet frequency
dictionary $W_{(3,3)}(j)$; here j denotes the tile's location along the sequence. A frequency
dictionary is the list of all the triplets $\omega = \nu_1\nu_2\nu_3$, ranging from $\omega$ = AAA to
$\omega$ = TTT, each provided with its frequency $f_{\omega}$. First, a reading frame sliding
along the tile captures the triplets $\omega = \nu_1\nu_2\nu_3$, counting the number of copies $n_{\omega}$ of
each triplet. Next, the frequency $f_{\omega}$ of the triplet is derived by dividing the number $n_{\omega}$
by the total number of all the triplets:

\[ f_{\omega} = \frac{n_{\omega}}{M}, \tag{1} \]

where $M$ stands for the total number of the triplets. They are counted along the
fragment with the reading frame shift equal to 3, thus providing neither gaps nor
overlaps. Further, we omit the subscript in the dictionary notation, unless this causes
confusion. The conversion of the sequence into an ensemble of frequency
dictionaries transforms it into a set of points in the 64-dimensional metric space.
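
A minimal sketch of this conversion, following Eq. (1) and the frame shift of 3 stated above. The names `TRIPLETS` and `triplet_frequencies` are illustrative; the handling of a tile whose length is not a multiple of 3 (trailing nucleotides are ignored) is an assumption, which does not arise for Δ = 603.

```python
from itertools import product

# All 64 triplets in lexicographic order, from AAA to TTT.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(tile: str) -> dict[str, float]:
    """Frequency dictionary of non-overlapping triplets read with shift 3.

    Assumes the tile contains only A, C, G, T (as stipulated in the text);
    any other symbol raises KeyError.
    """
    usable = len(tile) - len(tile) % 3      # ignore an incomplete last triplet
    if usable == 0:
        raise ValueError("tile is shorter than one triplet")
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(0, usable, 3):           # frame shift 3: no gaps, no overlaps
        counts[tile[i:i + 3]] += 1
    total = usable // 3                     # M in Eq. (1)
    return {w: n / total for w, n in counts.items()}
```

Listing the values in the fixed order of `TRIPLETS` gives the coordinates of the tile in the 64-dimensional space.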
   For the purpose of the study, each point was labeled with the location coordinate j,
which is the position of the central nucleotide of the relevant tile, and with the relative
phase index. The latter represents the location of the tile with respect to the coding and
non-coding regions found in the genome. To begin with, we neglected the exon-intron
structure of the genes and considered each gene as a solid coding region. To identify the
coding regions, we followed the annotation of the genome.
    Seven labels of the phase index are introduced: $F_0$, $F_1$, $F_2$, $B_0$, $B_1$, $B_2$ and $J$. A
tile is labeled $F_k$, 0 ≤ k ≤ 2, if its central nucleotide falls inside a coding region
and the distance from the starting nucleotide of the coding region to the central one
has the remainder k when divided by 3. This labeling holds for the genes located on
the leading strand. Reciprocally, a tile is labeled $B_k$, 0 ≤ k ≤ 2, if the gene is
located on the lagging strand; the distance here is measured from the end of the coding
region and is counted in the opposite direction, since the genetic sequence is
deposited in the genetic bank as the leading strand only. Finally, a tile is labeled
$J$ if its central nucleotide falls outside any coding region.
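
The labeling rule can be illustrated as follows. This is a sketch under stated assumptions: coding regions are given as hypothetical `(start, end, strand)` triples with 0-based `start`, exclusive `end`, and strand `"+"` for the leading or `"-"` for the lagging strand; if annotated regions overlap, the first match wins. None of these conventions comes from the text.

```python
def phase_label(center: int, coding_regions: list[tuple[int, int, str]]) -> str:
    """Relative phase label (F0..F2, B0..B2 or J) of a tile's central nucleotide.

    coding_regions: (start, end, strand) with 0-based start, exclusive end.
    """
    for start, end, strand in coding_regions:
        if start <= center < end:
            if strand == "+":
                # distance from the first nucleotide of the coding region
                return f"F{(center - start) % 3}"
            # lagging strand: distance counted backwards from the last nucleotide
            return f"B{(end - 1 - center) % 3}"
    return "J"  # central nucleotide lies outside every coding region
```

For a tile starting at position s with Δ = 603, the central nucleotide would be at `s + 301` in this 0-based convention.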
    Everywhere below, the following color coding of the relative phase
indices is applied:

• $J$ phase tiles (corresponding to the non-coding regions) are colored brown;
• the tiles labeled $F_0$ and $B_0$ are colored rose and violet, respectively;
• the tiles labeled $F_1$ and $B_1$ are colored green and cyan, respectively;
• the tiles labeled $F_2$ and $B_2$ are colored orange and yellow, respectively.

   One point concerning the coloring scheme should be noted: it loses its meaning as the
length of the fragment Δ grows. Indeed, the longer the fragment, the greater the number of
coding regions falling within it. There is no reason to expect that all the
coding regions newly captured by the enlarged fragment yield the same
relative phase index. Thus, the fragment becomes a multi-phase entity, and the
mixture of coding regions makes the relative phase index ambiguous.
Nonetheless, we kept the coloring system by relating the relative phase index to
the very first coding region found in the fragment. In practice, this leads to mixing of
differently colored fragments within a cluster.


Fig. 1. Cyanobacteria genome structures observed for Δ = 603. The projections on 𝑃𝐶 , 𝑃𝐶
are shown in subfigs. 1(a), 1(c), 1(e) and 1(g); the projections on 𝑃𝐶 , 𝑃𝐶 are shown in
subfigs. 1(b), 1(d), 1(f) and 1(h). (See the genomes in the text).

We used the freely distributed software VidaExpert¹ to visualize the distribution of
the tiles converted into the triplet frequency dictionaries in the metric space. We
studied the distribution of the points corresponding to the tiling in the principal
component space. To do this, we used the Euclidean metric.
   Any triplet frequency dictionary maps a tile into a point in the 64-dimensional
metric space. The problem is that all the frequencies sum to one:

\[ \sum_{\omega=\mathrm{AAA}}^{\mathrm{TTT}} f_{\omega} = 1, \tag{2} \]

which makes the frequencies linearly dependent. This linear constraint may cause a false
signal in clustering, so one triplet must be excluded from the analysis. Formally
speaking, any triplet may be excluded; in practice, for each genome we excluded the
triplet with the least standard deviation over the set of tiles.
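
The exclusion step can be sketched with the standard library alone. The function name is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption; the sample standard deviation differs only by a constant factor and therefore selects the same triplet. Ties are resolved in favour of the first coordinate.

```python
from statistics import pstdev


def drop_least_variable(points: list[list[float]]) -> tuple[int, list[list[float]]]:
    """Remove the coordinate with the smallest standard deviation over all tiles.

    points: one frequency vector per tile, all of equal length (64 in the text).
    Returns the index of the dropped coordinate and the reduced vectors.
    """
    columns = list(zip(*points))                      # one column per triplet
    spreads = [pstdev(column) for column in columns]
    drop = min(range(len(spreads)), key=spreads.__getitem__)
    reduced = [[v for i, v in enumerate(row) if i != drop] for row in points]
    return drop, reduced
```

The reduced 63-dimensional vectors are then the input for the principal component projection and for Euclidean distances (for instance via `math.dist`).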
   We studied 38 genomes of cyanobacteria downloaded from the EMBL-bank². The
studied species belong to the orders Synechococcales, Nostocales, Chroococcales,
Pleurocapsales and Oscillatoriales. Cyanobacteria are supposed to have a common
ancestor with the extant chloroplasts; thus, we tried to figure out relations between
the structures observed in cyanobacteria and those reported for chloroplasts [5, 7, 10].


Fig. 2. Transformation of the structure, depending on Δ, for Synechococcus sp. WH 8102. Each
subfigure presents two projections: the former is (𝑃𝐶 , 𝑃𝐶 ) (left), and the latter is (𝑃𝐶 , 𝑃𝐶 )
(right).


1   http://bioinfo-out.curie.fr/projects/vidaexpert/
2   https://www.ebi.ac.uk/genomes/bacteria.html

3      Results and Discussion

To begin with, let us concentrate on the structures observed for the “standard”
tiling parameters, that is, Δ = 603 and t = 202. Fig. 1 shows these
patterns: subfigs. 1(a), 1(b) show Prochlorococcus sp. MIT 0604 (AC CP007753),
subfigs. 1(c), 1(d) show Nostoc sp. PCC 7107 (AC CP003548), subfigs. 1(e), 1(f)
show Synechococcus elongatus PCC 7942 (AC CP000100), and subfigs. 1(g), 1(h)
show Synechococcus sp. WH 8109 (AC CP006882).
   We analyzed the structures in the bacterial genomes mentioned above for Δ
varying from 603 to 60 000 nucleotides; the move step t was equal to 202 or 201. The
comparison of the structures observed at the two extreme fragment lengths was
of primary interest to us.
   For Δ = 603 and t = 202 we observed the four types of seven-cluster patterns
described in [2, 3], namely:

• “parallel triangles” for AT-rich genomes (with GC-content close to 0.25), see Fig. 1;
• “orthogonal triangles” for the genomes with GC-content close to 0.35;
• “coinciding triangles”, a pattern characteristic of the genomes with GC-content close
  to 0.50; and finally
• a pattern degenerated into a plane with the six-beam cluster structure, for
  GC-content close to 0.60 (see Fig. 1).

   Also, we investigated the variation of the pattern for the fragment lengths equal to
1206, 1809, 2412, 3015, 9015, 15015, 21015, 27015, 33015 and 45000 nucleotides.
Fig. 2 illustrates the changes occurring in the patterns, as the fragment length grows.
Evidently, the structure changes rather smoothly, as the window length increases.
This change of the pattern seems to be universal for any initial structure.


Fig. 3. Structures identified for Δ = 60 000 for cyanobacteria: subfig. 3(a)
shows Synechococcus elongatus PCC 6301, subfig. 3(b) shows Anabaena variabilis ATCC
29413, and subfig. 3(c) shows Synechococcus sp. CC9902.

We believe there exist two essentially different structures, and one substructure, for
Δ = 60 000. The two essentially different structures are called Ball (Fig. 3(a)) and
Two clusters (Fig. 3(c)); the substructure is called Snitch, see Fig.
3(b). The Ball structure resembles a ball, regardless of the initial type of the structure;
it is characterized by a rather chaotic distribution of the points
corresponding to all six coding phases, and to junk, over the ball. Careful examination
shows that the structure looks like a set of threads, which are clearly visible at the
periphery of the ball. The Snitch substructure resembles, more or less, the Ball
structure; it is distinguished by three clearly identifiable, strongly extended threads
resembling protuberances.

Table 1. Correspondence between the structure types for Δ = 603 and Δ = 60 000; 2C stands
for Two clusters, S stands for Snitch and B stands for Ball.

                  Structure at Δ = 603           2C           S          B
                  Coinciding triangles            8           0          1
                  Six beams                       2           0          3
                  Parallel triangles              0           0          2
                  Orthogonal triangles            0           5         17

The Two clusters structure consists of long threads which form not a ball but two
extended clusters connected by a bridge. In all cases, the clusters differ in length.
The clusters comprise the points belonging to two different non-overlapping
parts of the genome: one may mark the points of each cluster (e.g., by color)
and trace their location along the genome. Remarkably, the points of the same color
occupy separate regions of the genome, never mixing with those of the other color.
   Also, all these structures are composed of (relatively independent) threads in the
principal component space; the thread composition depends on the divisibility of the
window step t by 3. If the step is divisible by 3, the structure comprises a single thread;
otherwise, three threads are observed (see Fig. 4).


Fig. 4. Acetobacter pasteurianus genome structure observed for t divisible by 3 (Fig. 4(a))
and t not divisible by 3 (Fig. 4(b)). Only the first thousand points are shown, to simplify
visualization.

If t is divisible by 3, the points follow one another sequentially along the thread. Otherwise,
the points fall into three threads and are separated as follows:
each thread comprises the fragments whose location numbers give the same remainder
when divided by 3, and the fragments within a thread are, again, located
sequentially.
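
One way to read this observation (an interpretation offered here as an assumption, not a statement made in the text): since triplets are counted with a frame shift of 3 from the start of each tile, the tile with number j, starting at position j·t, reads the genome in the frame (j·t) mod 3. The hypothetical helper below lists these frame offsets and groups the tile numbers accordingly.

```python
def frame_offsets(step: int, n_tiles: int) -> list[int]:
    """Reading-frame offset (start position mod 3) of tiles 0 .. n_tiles - 1."""
    return [(j * step) % 3 for j in range(n_tiles)]


def group_by_offset(step: int, n_tiles: int) -> dict[int, list[int]]:
    """Group tile numbers by their reading-frame offset."""
    groups: dict[int, list[int]] = {}
    for j, offset in enumerate(frame_offsets(step, n_tiles)):
        groups.setdefault(offset, []).append(j)
    return groups
```

For t = 201 all tiles share the offset 0, which would be compatible with a single thread; for t = 202 the offsets cycle as 0, 1, 2, so that tiles whose numbers have the same remainder modulo 3 share a frame, in line with the three threads described above.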
   No relation was found between the structure type of a cyanobacterial genome
observed for Δ = 60 000 and either the structure type observed for Δ = 603 (Table 1)
or the taxonomy (Table 2). Similarly, no relation of GC-content to these structures
was revealed.

     Table 2. Correspondence between taxonomy and the structure types observed for Δ = 60 000.

                                        Ball            Snitch        Two clusters
           Chroococcales                 1                1                 0
           Pleurocapsales                1                0                 0
           Nostocales                    4                3                 0
           Oscillatoriales               5                0                 0
           Synechococcales               13               1                 10

Table 3 shows the GC-content values for the structures observed at Δ = 60 000. The
average GC-content for the Two clusters structure exceeds that observed for the other
structures. However, some genomes exhibiting a structure other than Two clusters
have GC-content values of up to 60 %.

            Table 3. Correspondence between the structure types and GC-content;
                               σ is the standard deviation.

                             GC_min, %       GC_mean, %       GC_max, %       σ, %
         Snitch                41.35            42.22             43.87          1.13
         Ball                  31.17            45.60             60.24          6.99
         Two clusters          52.45            58.44             61.37          2.93
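
The columns of Table 3 correspond to simple summary statistics, which can be sketched as below. The function names are illustrative; it is an assumption that one GC value is computed per genome, and the text does not say whether σ is the population or the sample standard deviation (the population form is used here).

```python
from statistics import mean, pstdev


def gc_content(sequence: str) -> float:
    """GC-content of a sequence, in percent."""
    sequence = sequence.upper()
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


def summarize(values: list[float]) -> dict[str, float]:
    """Minimum, mean, maximum and standard deviation of GC-content values."""
    return {
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
        "sigma": pstdev(values),  # assumption: population standard deviation
    }
```

Applying `summarize` to the GC values of the genomes assigned to one structure type would give one row of such a table.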


4      Conclusion

The majority of studies [1, 4, 11, 12] consider structures in terms of their functional
role or chemical properties. We focus exclusively on the statistical properties of
biological macromolecules, regardless of their physical nature. The data presented
here clearly show that inner structuredness in cyanobacteria genomes is
observed at the window length Δ = 603. Structuredness is observed for longer
windows (up to Δ = 60 000) as well. The Ball, Snitch and Two clusters structures
start to manifest themselves at Δ ≈ 30 000 (see Fig. 2); further growth of the window
length does not significantly change the structures.
   Another important issue is the pronounced impact of the triplet composition,
manifested at any window length Δ. Indeed, if the step t of the window is
divisible by 3, then a single-thread pattern is observed; otherwise, one can see three
threads.

References
 1. Dittmann, E., Gugger, M., Sivonen, K., Fewer, D.P.: Natural product biosynthetic diversity
    and comparative genomics of the cyanobacteria. Trends in Microbiology 23(10), 642–652
    (2015)
 2. Gorban, A.N., Popova, T.G., Zinovyev, A.Y.: Seven clusters in genomic triplet
    distributions. In Silico Biology 3(4), 471–482 (2003),
    http://content.iospress.com/articles/in-silico-biology/isb00110
 3. Gorban, A.N., Popova, T.G., Zinovyev, A.Y.: Four basic symmetry types in the universal
    7-cluster structure of microbial genomic sequences. In Silico Biology 5(3), 265–282 (2005),
    http://content.iospress.com/articles/in-silico-biology/isb00185
 4. Kaneko, T., Tabata, S.: Complete genome structure of the unicellular cyanobacterium
    Synechocystis sp. PCC6803. Plant and Cell Physiology 38(11), 1171–1176 (1997)
 5. Sadovsky, M., Senashova, M., Malyshev, A.: Eight-cluster structure of chloroplast
    genomes differs from similar one observed for bacteria. ArXiv e-prints (Feb 2018)
 6. Sadovsky, M., Putintseva, Y., Birukov, V., Novikova, S., Krutovsky, K.: De novo
    assembly and cluster analysis of Siberian larch transcriptome and genome. In: Ortuño, F.,
    Rojas, I. (eds.) Bioinformatics and Biomedical Engineering. pp. 455–464. Springer
    International Publishing, Cham (2016)
 7. Sadovsky, M., Senashova, M., Malyshev, A.: Chloroplast genomes exhibit eight cluster
    structuredness and mirror symmetry. In: Rojas, I., Ortuño, F. (eds.) Bioinformatics and
    Biomedical Engineering. pp. 186–196. Springer International Publishing, Cham (2018)
 8. Sadovsky, M.G., Birukov, V.V., Putintseva, Y.A., Oreshkova, N.V., Vaganov, E.A.,
    Krutovsky, K.V.: Symmetry of Siberian larch transcriptome. Journal of Siberian Federal
    University 8(6), 278–286 (2015)
 9. Sadovsky, M.G., Bondar, E.I., Putintseva, Y.A., Oreshkova, N.V., Vaganov, E.A.,
    Krutovsky, K.V.: Seven-cluster structure of larch chloroplast genome. Journal of Siberian
    Federal University 8(6), 268–277 (2015)
10. Sadovsky, M.G., Senashova, M.Y., Putintseva, Y.A.: Chloroplasts and Cytoplasm:
    Structure and Functions, chap. 2, pp. 25–95. Nova Science Publishers, Inc. (2018)
11. Todorova, A.K., Juettner, F., Linden, A., Pluess, T., von Philipsborn, W.: Nostocyclamide:
    a new macrocyclic, thiazole-containing allelochemical from Nostoc sp. 31 (cyanobacteria).
    The Journal of Organic Chemistry 60(24), 7891–7895 (1995)
12. Zervou, S.K., Kaloudis, T., Hiskia, A., Mazur-Marzec, H.: Fragmentation mass spectra
    dataset of linear cyanopeptides-microginins. Data in Brief p. 105825 (2020)