Algorithms for Placing Files in Tiered Storage Using Kohonen Map*

Tatiana M. Tatarnikova1[0000-0002-6419-0072], Ekaterina D. Poymanova2[0000-0002-7903-2480]

1 Russian State Hydrometeorological University, 79, Voronezhskaya st., 192007 St. Petersburg, Russia
tm-tatarn@yandex.ru
2 Saint-Petersburg State University of Aerospace Instrumentation, 67, Bolshaya Morskaia str., 190000 St. Petersburg, Russia
e.d.poymanova@gmail.com

Abstract. The data storage task is not limited to allocating volume for data placement. It calls for new storage resource management models with tiered storage and proactive migration of information. Present storage systems are still, in effect, one-tier: the storage administrator decides when data migrates to other media, and that decision is driven by a single metric, the time elapsed since the last access to the data. However, many more metrics can be taken into account, such as the rate at which requested data is delivered, the cost of data loss, and the period after which information is transferred to another storage tier. The paper proposes a sequence of algorithms for distributing files in tiered storage: first, an algorithm for the vertical placement of files across storage tiers; next, the horizontal placement of files on physical and logical media; and then a data migration algorithm. The result of applying the algorithms is visualized as a matrix whose size corresponds to the number of storage tiers and the number of physical or logical media. All storage resource management algorithms are based on the analysis of stored file metadata. Representing the storage system as a matrix makes it possible to use the Kohonen neural network to arrange files by tiers and by sections within a tier, and to move from sequential execution of the algorithms to placement in one step.
Keywords: Tiered Data Storage, Data Storage System, Metadata, Efficient Data Placement, File Metadata, Data Migration, Clustering, Kohonen Neural Network.

1 Introduction

Data storage is a necessity for enterprises, corporations, state structures, and individuals. For enterprises and the corporate sector, the need to store large amounts of data is determined by existing business processes; in the public sector, by the transition to interdepartmental electronic document management and the creation of departmental analytical resources. In addition, users who upload photos and videos to the Internet and actively exchange multimedia content in social networks create a powerful data stream.

* Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The engineering solution for implementing storage infrastructure is the data storage system. The data storage system is usually segregated into its own subsystem of a computing complex, for example, in a data center [1]. A data storage system is an architectural solution for connecting external data storage devices of different physical nature [2].

The main task in the design of storage infrastructure is the effective management of storage resources, primarily capacity. Its solution is complicated by the following circumstances:

- storage heterogeneity: storage devices in a storage system can differ in physical nature (for example, magnetic, optical, solid-state) and in architecture (direct-attached or network-attached storage) [3];
- different data storage requirements: critical transactional systems such as billing, processing, ERP, etc., require highly reliable and high-performance storage; analytical systems require high performance and low cost per storage unit; file workloads require functionality and low storage cost [4];
- the lack of efficient multi-tier data storage algorithms for data with different storage requirements.
The paper describes the problem of organizing tiered data storage, which assumes using different storage technologies for different files depending on the guaranteed storage time. Analysis of storage engineering solutions allows distinguishing three tiers of the data storage architecture:

- RAID (Redundant Array of Independent Disks);
- automated libraries;
- long-term storage media.

Each storage tier involves its own storage technologies [1]. Despite the fact that modern data storage systems (DSS) are multi-tier associations [2], the decision on the choice of the storage tier lies with the administrator. This decision is based on a single metric: the time elapsed since the last access to the information [3].

The purpose of the research is to develop a tiered data storage model and data distribution algorithms for storage systems that will partially solve the listed issues of data storage.

2 Tiered Storage Structure

It is proposed to distribute data files using file metadata analysis. The distribution process includes the following principles:

- the storage tier is selected depending on the data storage time;
- the local volume within a tier is selected depending on the file size and the length of the logical data block;
- data migrates across tiers depending on the frequency of access to the data file.

Before writing to a specific storage tier, it is necessary to analyze the data in order to select the optimal file system (FS) for the RAID tier or the type of archive media, which allows saving storage space [5]. Thus, the data storage takes on a matrix structure containing data with certain characteristics in each cell (Fig. 1):

1st tier – RAID:                    FS 1 / Volume 1 | FS 2 / Volume 2 | ... | FS n / Volume n
2nd tier – automated libraries:     Data Media 1 | Data Media 2 | ... | Data Media n
3rd tier – long-term storage media: Data Media 1 | Data Media 2 | ... | Data Media n

Fig. 1.
The matrix structure of the data storage

The storage tier selection principle is based on the analysis of organizational metadata containing data type information:

- ind (initial data) – raw data, placed on the RAID tier;
- bck (backups) – backups and archived data, stored at the automated libraries tier;
- ngd (next-generation data) – data of unlimited storage time, stored at the long-term storage tier.

The principle of selecting the logical volume within a tier is based on the choice of the file system for the RAID tier and of the type of storage media at the lower storage tiers:

If f ∈ (f_i, f_{i+1}], then F → Vol_{i+1} (a_{i+1}),   (1)

where f is the size of the saved file F; f_i, f_{i+1} are the size limits of the files with which the file system can work; a_{i+1} is the logical data block size with which the file system operates; Vol_{i+1} is the number of the RAID volume managed by the corresponding file system.

At the lower storage tiers, it is proposed to divide the capacity according to the types of storage media: tape drive, DVD, and BD for the automated libraries tier, and M-disc, glass disc, and DNA for the long-term storage media tier:

If f ∈ (f_i, f_{i+1}], then F → al_{i+1} (lt_{i+1}),   (2)

where al_{i+1} or lt_{i+1} is the type of media at the automated libraries tier (al) or at the long-term storage tier (lt).

The principle of data migration across storage tiers depends on the frequency of access to data files:

If λ_F ∈ (λ_i, λ_{i+1}], then F → l,   (3)

where λ_F is the request frequency of the file F; λ_i, λ_{i+1} are the limits of the file request frequency; l is the number of the storage tier to which the file F is migrated.

The migration principle overcomes the shortcomings of a subjective selection of the file saving type made at the first stage, data recording. Together, the principles above allow managing storage capacity and making rational use of the media. Note that all these principles are based on the analysis of file metadata.
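The three placement principles can be sketched in Python as follows. This is a minimal illustration: the tier numbering, the boundary values f_i and λ_i, and the function names are assumptions chosen for the example, not values from the paper.

```python
# Sketch of placement principles (1)-(3). Boundary values are illustrative
# assumptions; a real system would take them from the storage configuration.

# Size limits f_i (bytes) delimiting volumes/media within a tier.
SIZE_LIMITS = [1_000, 1_000_000, 1_000_000_000]

# Request-frequency limits lambda_i (requests per hour) for migration:
# below 10 -> tier 3, 10..100 -> tier 2, above 100 -> tier 1.
FREQ_LIMITS = [10.0, 100.0]

def select_tier(data_type: str) -> int:
    """Vertical placement: choose the tier from organizational metadata."""
    return {"ind": 1, "bck": 2, "ngd": 3}[data_type]

def select_volume(size: int) -> int:
    """Horizontal placement, principles (1)-(2): pick the volume/media
    whose size interval (f_i, f_{i+1}] contains the file size."""
    for i, limit in enumerate(SIZE_LIMITS):
        if size <= limit:
            return i + 1
    return len(SIZE_LIMITS) + 1  # largest files go to the last volume

def migration_tier(freq: float) -> int:
    """Principle (3): target tier chosen from the request frequency."""
    if freq > FREQ_LIMITS[1]:
        return 1
    if freq > FREQ_LIMITS[0]:
        return 2
    return 3

# Example: a 250 KB backup file goes to tier 2, volume 2.
tier, volume = select_tier("bck"), select_volume(250_000)
```

Applied sequentially, these three rules reproduce the manual walk through the storage matrix that the Kohonen network is later proposed to replace.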
The result of the proposed storage capacity management principles, in general, is a storage matrix of size m × n, where m is the number of storage tiers (M) and n is the number of physical or logical media (N). The elements of the matrix are sets of data files with certain characteristic values: file type (type) and file size (f) (Fig. 2). During the initial data distribution, the data access frequency (λ) is not taken into account, because no access statistics have been accumulated yet.

       N_1           ...   N_n
M_1    type_1, f_1   ...   type_1, f_n
M_2    type_2, f_1   ...   type_2, f_n
M_3    type_3, f_1   ...   type_3, f_n

Fig. 2. Storage matrix

Based on the analysis of storage technologies, three storage tiers were identified; accordingly, the matrix always contains three rows. The number of columns is selected based on the physical implementation of each storage tier.

A block diagram of the aggregate algorithms for placing files in the tiered storage is shown in Fig. 3: the metadata of an input file is analyzed to select the tier (ind → RAID, bck → automated libraries, ngd → long-term storage media); the file size f is then matched against the intervals (0, f_1], (f_1, f_2], ..., (f_n, ∞) to select the volume or media within the tier; finally, the file access frequency λ is analyzed against the intervals (λ_1, ∞), (λ_2, λ_1], [0, λ_2] to decide on migration.

The sequential implementation of the principles presented above is computationally expensive. In this regard, to distribute files among the cells of the matrix (Fig. 1), it is proposed to analyze the metadata of the input file stream using Kohonen neural networks. A trained Kohonen neural network is capable of solving this problem not by sequential application of the principles, but in one step [6]. It is noteworthy that the neural network, in this case, is used precisely as a mechanism for distributing data to the cells of the storage matrix.

Fig. 3.
Algorithm of file allocation in tiered data storage

The result of using a neural network is a topological map in which the input data are classified into groups (clusters). Thus, each cell of the resulting map must correspond to a cell of the storage matrix.

3 Kohonen Network as a Data File Clustering Tool

The Kohonen neural network, unlike many other types of neural networks, is trained without a teacher. The main purpose of the Kohonen network is to solve the problem of cluster analysis (clustering).

The Kohonen network includes two layers: input and output. Each neuron of the input layer is connected to each neuron of the output layer, and every connection has a weight that is corrected during learning. The output layer is also called the topological map layer; its neurons are arranged in a two-dimensional field (Fig. 4) [6].

Fig. 4. Kohonen network architecture: input neurons x_1, ..., x_n fully connected to output neurons y_1, ..., y_s, which form the topological map

Kohonen maps have a set of input elements, whose number coincides with the dimension of the input vectors, and a set of output elements, each of which corresponds to one cluster (group).

The essence of the Kohonen network is as follows. When some vector X is fed to the input, the network must determine to which of the clusters this vector is closest. As the proximity criterion, the squared Euclidean distance can be chosen:

d_ij = Σ_{k=1..n} (x_ik − y_jk)²,   (4)

where d_ij is the squared distance between the point X and the cluster Y. The coordinates of the point X are x_i1, x_i2, ..., x_in, and the coordinates of the cluster center Y are y_j1, y_j2, ..., y_jn.

Vector X is a point in n-dimensional space, where n is the number of vector coordinates fed to the input of the neural network. Calculating the distance between this point and the centers of the different clusters allows determining the cluster to which the distance is minimal.
This cluster and the corresponding output neuron are then declared the winner. The cluster center is defined as a point whose coordinates in n-dimensional space are the weights of all connections that arrive at the given output neuron from the input neurons.

Kohonen maps solve the clustering problem as follows: when a vector is submitted to the network input, one winning cluster is obtained, which determines the membership of the input vector in this cluster.

The Kohonen network learning algorithm is recurrent. Every training example is processed in this sequence [6]:

- take the winner neuron, that is, the one that is closest to the presented example;
- correct the winner neuron so that it becomes as close as possible to the presented example, using the squared Euclidean distance from the center of the winner neuron to the training example.

When correcting the center of the winner neuron, a learning rate factor is used. This coefficient gradually decreases, so that at each new stage the correction of the weights becomes smaller and smaller. As a result, the location of the center settles in a position that best reflects the examples for which this neuron is the winner.

Because of this recurrent learning procedure, the neural network organizes itself so that neurons responding to inputs that are close to each other in the input space are also located close to each other on the topological map.

The construction of the Kohonen map is based on the concept of a neighborhood. The neighborhood, characterized by a radius R, is the group of neurons that surround the winner neuron (Fig. 5). Like the learning rate, the radius of the neighborhood decreases with learning time, so that at first the neighborhood contains a significant number of neurons (perhaps almost all of those on the topological map).
Then, in the final stages, the neighborhood shrinks toward zero, until it includes only the winner neuron itself (Fig. 5).

Fig. 5. Cluster neighborhood of radius R around the winner neuron

If the changes in the weights become insignificant, the training ends. After the network has been trained to solve the clustering problem, it is used as a visualization tool in the form of a Kohonen map [7].

The Kohonen network learning algorithm has the following sequence of steps:

Step 1. For the input vector X, compute the distance d_ij from X to each of the neurons of the output layer.
Step 2. Find the minimum d_ij among the obtained values and thereby determine the winner neuron.
Step 3. For the winner neuron, as well as for the neurons within a neighborhood of radius R around it, adjust the weights:

w_ij(t + 1) = w_ij(t) + η(t) · (x_i − w_ij(t)),   (5)

where w_ij(t + 1) is the weight at step (t + 1); w_ij(t) is the weight at step t; η is the learning rate; x_i is a coordinate of the input vector.
Step 4. Update the values of the learning rate η and the neighborhood radius R.
Step 5. Continue learning until the stopping condition is fulfilled: learning stops when the weight changes become insignificant.

The choice of this method for solving the file allocation problem is due to the peculiarities of the algorithm that implements the Kohonen network:

1. Unsupervised learning is used, in which the learning rule of a neuron is based on information about its location;
2. There are no reference values of the training set;
3. The result of the algorithm is a topological map in which the input data are classified into groups (clusters).

In the case of the proposed distribution of files over the data storage cells, the selected file characteristics define the feature value vectors. Thus, each cell of the resulting map must correspond to an element of the matrix representing the storage system [8].
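Steps 1–5 above can be sketched as a minimal self-organizing map in Python with NumPy. The grid size, the linear decay schedules for η and R, and the function names are illustrative assumptions; the paper does not specify these details.

```python
import numpy as np

def train_som(data, grid=(3, 3), epochs=100, eta0=0.5, seed=0):
    """Minimal Kohonen map implementing learning steps 1-5.

    data: array of shape (num_examples, n) of normalized feature vectors.
    Returns the weight array of shape (grid_rows, grid_cols, n).
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n = data.shape[1]
    w = rng.random((rows, cols, n))          # random initial weights
    # Coordinates of each output neuron on the topological map.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    r0 = max(rows, cols) / 2                 # initial neighborhood radius R

    for t in range(epochs):
        eta = eta0 * (1 - t / epochs)        # Step 4: decaying learning rate
        r = max(r0 * (1 - t / epochs), 0.5)  # Step 4: decaying radius R
        for x in data:
            # Step 1: squared Euclidean distances d_ij, eq. (4).
            d = ((w - x) ** 2).sum(axis=-1)
            # Step 2: the winner neuron has minimal d_ij.
            win = np.unravel_index(np.argmin(d), d.shape)
            # Step 3: update the winner and its neighborhood, eq. (5).
            grid_dist = np.linalg.norm(coords - np.array(win), axis=-1)
            mask = grid_dist <= r
            w[mask] += eta * (x - w[mask])
    return w

def winner(w, x):
    """Map cell (cluster) to which the input vector x belongs."""
    d = ((w - x) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)
```

With file feature vectors such as (type, size class, frequency class) as input, each cell of the trained map plays the role of one cell of the storage matrix, so placement reduces to a single winner lookup.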
4 Experiment Description and Analysis of Results

The experiment involved 5,000 files with different characteristics: the type of file, its estimated storage time, file size, and frequency of data access. The stream was generated based on the analysis of 43,000 files taken from an experimental file server [9], [10]. The file size does not exceed 1 GB; the frequency of data access is a random variable in the interval [0, 300] requests per hour, considered in absolute units, the number of requests per hour.

The data file distribution was implemented in accordance with the structure of a storage matrix of size 3 × 3. Normalization of file types was chosen so as to obtain disjoint classes: the file type ind corresponds to the value 1, bck to 2, ngd to 3. Normalization of the file size is done by decimal order: a file size from 0 to 999 B is assigned the attribute value 1; from 1,000 to 999,999 B, the value 2; from 1,000,000 to 999,999,999 B, the value 3.

The results of the experiment show that the Kohonen neural network apparatus can serve as a mechanism for solving the problem of distributing files with different characteristics and storage time requirements. The main difficulty is the choice of the classification parameters and their normalization. An example of the Kohonen map in 3D built as a result of the experiment is shown in Fig. 6.

Fig. 6. The results of the experimental file distribution depending on the type, size, and frequency of requests (Kohonen maps in 3D)

The decision on the necessity of migrating data files to another storage tier is based on analyzing the value of the file access frequency. The storage administrator should determine the limit values of the request frequency at which files are migrated (Fig. 7): one limit value of the request frequency triggers data migration to tier 2, another triggers migration to tier 3.

Fig. 7.
An example of visualization of the limit values of the file access frequency

The results of the experiment showed that the Kohonen neural network can be a tool for solving the file placement problem.

5 Conclusion

The paper suggests a tiered data storage model. The file distribution in the data storage system is implemented through the consistent use of principles for vertical placement, horizontal placement, and migration of data.

The initial vertical and horizontal distribution of files in the tiered data storage system is formalized in the form of a matrix. Such a representation allows using the Kohonen neural network apparatus, whose main purpose is to solve the clustering problem. In the file distribution problem, clusters correspond to cells of the storage matrix. Before solving the clustering problem, the metadata that set the characteristics of the saved files were normalized.

Using the Kohonen neural network allows us to abandon the sequential implementation of the distribution algorithms and solve the file distribution problem in one step.

References

1. Information Storage and Management. 2nd Edition. New Jersey: John Wiley & Sons Inc. (2016).
2. Mesnier, M., Ganger, G., Riedel, E.: Object-Based Storage. IEEE Communications Magazine, vol. 41, no. 8, 84-90 (2003).
3. Farley, M.: Building Storage Networks. Osborne/McGraw-Hill (2001).
4. Sovetov, B. Ya., Tatarnikova, T. M., Poymanova, E. D.: Organization of multi-level data storage. Informatsionno-Upravliaiushchie Sistemy, no. 2, 74-81 (2019). DOI: 10.31799/1684-8853-2019-2-68-75
5. Bogatyrev, V. A., Bogatyrev, A. V., Bogatyrev, S. V., Polyakov, V. I.: Redundant service of request copies by a sequence of groups of reserved nodes. In: CEUR Workshop Proceedings: Workshop International Scientific and Technical Conference "Fundamental and practical aspects of computer technology", FPACT 2018, 5-7 Nov. 2018, Saint Petersburg, pp. 135-140 (2019).
6. Kohonen, T.: Self-Organizing Maps. New York (2001).
7.
Stacey, M., Salvatore, J., Jorgensen, A.: Visual Intelligence: Microsoft Tools and Techniques for Visualizing Data. New Jersey: John Wiley & Sons (2013).
8. Morville, P., Callender, J.: Search Patterns: Design for Discovery. O'Reilly (2010).
9. Poymanova, E. D., Tatarnikova, T. M.: Models and Methods for Studying Network Traffic. In: Wave Electronics and its Application in Information and Telecommunication Systems, WECONF 2018, pp. 1-5 (2019). DOI: 10.1109/WECONF.2018.8604470
10. Tatarnikova, T. M., Volskiy, A. V.: Estimation of probabilistic-temporal characteristics of network nodes with traffic differentiation. Informatsionno-Upravliaiushchie Sistemy, no. 3, 54-60 (2018). DOI: 10.31799/1684-8853-2019-2-68-75.