Research on methods for assessing the complexity of program code when generating input test data

Konstantin Serdyukov, Novosibirsk State Technical University, Novosibirsk, Russia, zores@live.ru
Tatyana Avdeenko, Novosibirsk State Technical University, Novosibirsk, Russia, tavdeenko@mail.ru

Abstract—This article compares methods for determining code complexity when generating data sets for software testing. It presents the results of a study on evaluating a single path of the program code; the work is not finished yet and will be further expanded to select data for testing many paths. To solve the problem of generating test data sets, it is proposed to use a genetic algorithm with various metrics for determining the complexity of program code. A new metric is proposed for determining code complexity based on changing the weights of nested operations. The article presents and compares the input test data generated for the passage along the critical path. For each metric considered in the article, conclusions are drawn that identify its specifics depending on the selected data.

Keywords—test data generation, genetic algorithm, code complexity metrics

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

I. INTRODUCTION

Software engineering is a comprehensive, systematic approach to the development and maintenance of software. When developing programs, the following stages are most often distinguished: analysis, design, programming and testing. At the analysis stage, software requirements are determined and documented. At the design stage, the appearance of the program is detailed, its internal functionality is determined, the product structure is developed, and requirements for subsequent testing are introduced. Writing the source code of the program in one of the programming languages is done at the programming stage.

One of the most important steps in developing software products is testing. Important goals of testing are checking the compliance of the developed program with the specified requirements, adherence to logic in the data processing, and obtaining correct final results. Therefore, for testing it is very important to generate input test data on the basis of which the program will be checked for errors and for compliance with the specified requirements. To estimate the quality of the input data, a code coverage indicator is used, that is, the percentage of the entire program that the test sets can "cover". It is determined by the ratio of the tested operations to the total number of operations in the code.

Some software testing processes are improving quite slowly. The development of most types of test scenarios is usually done manually, without any automation systems. Because of this, the testing process becomes incredibly complicated and costly both in time and in finances if approached seriously. Up to 50% of all time costs can be spent on testing some programs.

One of the main goals of testing is to create test sets that ensure a sufficient level of quality of the final product by checking most of the various paths of the program code, i.e. that provide maximum coverage. Nevertheless, the task of finding many paths itself consists of several sub-tasks whose solution is necessary to find high-quality test sets. One of the local problems that can be solved to find a test set is to determine one of the most complex code paths.

For the most part, validation and verification of software products is difficult to optimize. It is especially difficult to automate the generation of test data, which is largely done manually.

Nevertheless, there are many studies using non-standard algorithms to solve the automation problem. For example, in [1] it is proposed to use a Constraint-Based Algorithm for the Mort system, which uses error testing to find input test data. Test data is selected in such a way as to determine the presence or absence of certain errors.

Quite often, genetic algorithms are used in one way or another to solve this problem. The article [2] compares different methods for generating test data, including genetic algorithms, a random search method and other heuristic methods.

In [3], Constraint Logic Programming and Symbolic Execution are proposed to solve the problem. In [4], Constraint Handling Rules are used to help in the manual verification of problem areas in the program.

Some researchers use heuristic methods to automate the testing process using a data-flow diagram. Studies of automation methods using this diagram were done in [5, 6, 7, 8]. In [5] it is proposed to additionally use genetic algorithms to generate new input test data sets based on previously generated ones.

In [9, 10] hybrid methods for generating test data are proposed. In [9], an approach is used that combines the strategies of Random Strategy, Dynamic Symbolic Execution and Search-Based Strategies. The article [10] gives a theoretical description of the search method using the genetic algorithm; approaches to searching for local and global extrema on real programs are considered, and a hybrid approach for generating test data is proposed, the Memetic Algorithm.

The approach in [11] uses a hybrid intelligent search algorithm to generate test data. The proposed approach centers on the Branch and Bound and Hill Climbing methods to improve the intelligent search.

There are also studies using machine learning, for example [12], which proposes a method using a neural network and user-customizable clustering of input data for sequential learning.

Novelty Search can also be used to generate test data. In [13] this approach is proposed for evaluating large spaces of input data and is compared with approaches based on the genetic algorithm.

The possibilities of generating test data for testing web services are also being investigated, for example for the WSDL specification [14].

For the convenience of generating test data, UML diagrams are also used [15, 16]. These articles suggest using genetic algorithms to generate triggers for UML diagrams that allow finding a critical path in the program. The article [17] proposes an improved method based on a genetic algorithm for selecting test data for many parallel paths in UML diagrams.

In addition to UML diagrams, the program can be described with the Classification-Tree Method developed by Grochtmann and Grimm [18]. In [19] the problem of constructing such trees is considered and an integrated classification-tree algorithm is proposed, and in [20] the developed ADDICT prototype (short for AutomateD test Data generation using the Integrated Classification-Tree methodology) is investigated for an integrated approach.

This article compares different methods of evaluating code complexity for generating test data. The article is structured as follows. Section II introduces terminology and provides basic information on the genetic algorithm. Section III states the problem to be solved and introduces one of the methods for assessing code complexity. Section IV presents the results of the input data generation method using the introduced code estimation method. Section V compares different code evaluation methods.

II. GENETIC ALGORITHM

Formally, the genetic algorithm is not an optimization method, at least in the understanding of classical optimization methods. Its purpose is not to find the optimal and best solution, but one close enough to it. Therefore, this algorithm is not recommended if fast and well-developed optimization methods already exist. At the same time, the genetic algorithm shows itself perfectly in solving non-standardized tasks, tasks with incomplete data, or tasks for which optimization methods cannot be used because of the complexity of implementation or the duration of execution [21, 22].

A genetic algorithm is considered completed if a certain number of iterations has passed (it is desirable to limit the number of iterations, since the genetic algorithm works by trial and error, which is a rather lengthy process), or if a satisfactory value of the fitness function has been obtained. As usual, a genetic algorithm solves a maximization or minimization problem, and the adequacy of each solution (chromosome) is evaluated using the fitness function.

The genetic algorithm works according to the following principle:

- Initialization. A fitness function is introduced and an initial population is formed. In classical theory, the initial population is formed by randomly filling each gene in the chromosomes. But to increase the rate of convergence of the solution, the initial population can be specified in a certain way, or random values can be analyzed in advance to exclude definitely inappropriate ones.

- Population assessment. Each chromosome is evaluated by the fitness function. Based on the given requirements, chromosomes get an exact value of how well they correspond to the problem being solved.

- Selection. After each chromosome has its fitness value, the best chromosomes are selected. Selection can be done by different methods: for example, the first n chromosomes are taken from the sorted order, or only the most suitable ones but not fewer than n, etc.

- Crossing [23]. The first significant difference from standard optimization methods. After the chromosomes suitable for solving the problem have been selected, they cross: random chromosomes from all the "chosen ones" randomly generate new chromosomes. Crossing is based on choosing a certain position in two chromosomes and exchanging the parts on either side of it. After the required number of chromosomes has been generated to create a population, the algorithm proceeds to the next step.

- Mutation [24]. Also a step specific to the GA. In random order, a random gene can change its value to a random one. The main point of a mutation is the same as in biology: to bring genetic diversity into the population. The main goal of mutations is to obtain solutions that could not be obtained with the existing genes. This allows, firstly, avoiding local extrema, since a mutation can transfer the algorithm to a completely different branch, and secondly, "diluting" the population to avoid a situation where the whole population consists of identical chromosomes that do not move towards a solution at all.

After all the steps have been passed, it is determined whether the population has reached the desired accuracy of the solutions or has hit the limit on the number of populations; if so, the algorithm stops. Otherwise, the cycle is repeated with the new population until the conditions are met.
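The cycle described in this section (initialization, assessment, selection, crossing, mutation) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper; the population size, mutation rate and the toy fitness function are placeholder assumptions.

```python
import random

def evolve(fitness, genes=3, low=0, high=100,
           pop_size=100, generations=100, mutation_rate=0.1):
    """Minimal genetic algorithm returning the best chromosome found."""
    # Initialization: fill each gene of each chromosome randomly.
    population = [[random.randint(low, high) for _ in range(genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Population assessment: score every chromosome with the fitness function.
        population.sort(key=fitness, reverse=True)
        # Selection: keep the fittest half as parents.
        parents = population[:pop_size // 2]
        # Crossing: pick a position in two parents and swap the parts.
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            point = random.randint(1, genes - 1)
            children.append(a[:point] + b[point:])
        # Mutation: occasionally replace a random gene with a random value.
        for child in children:
            if random.random() < mutation_rate:
                child[random.randrange(genes)] = random.randint(low, high)
        population = parents + children
    return max(population, key=fitness)

# Toy fitness imitating the critical-path condition used later in the
# paper: reward sets whose 1st and 3rd values are large and exceed 50.
best = evolve(lambda c: (c[0] > 50) + (c[2] > 50) + c[0] + c[2])
print(best)
```

The parents are carried over unchanged (a simple form of elitism), so the best fitness value in the population never decreases between generations.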
III. PROBLEM DESCRIPTION

The use of genetic algorithms in the testing process makes it possible to find the most complex parts of the program, where the risks due to errors are greatest. Evaluation is carried out with the fitness function, whose parameters are the weights of each passable operation. The definition of the weights, i.e. of the complexity of the program code, is carried out by various metrics used depending on the requirements for the input sets.

The task of generating input test data consists of three subtasks:

1. Search for input data for passing along one of the most complex code paths; the difficulty is determined by the chosen metric for code evaluation.

2. Exclusion or reduction of the weights of the operations on the path for which the data were selected, based on the fitness function, for the other paths.

3. Generation of input test data for many paths of the program code.

The limit on the number of sets of input data is established after the development stage and allows concentrating on certain paths. The whole algorithm is performed cyclically: the procedure for searching for input data for one path is started, after which the operations in this path are excluded from further calculations and the search for data for one path is started again.

As one of the ways to determine the complexity of the code, a method is proposed that works as follows:

- The first operation is assigned a weight of, for example, 100 units.

- Each subsequent operation is also assigned a weight; if there are no conditions or cycles, the weight is taken in accordance with the previous operation.

- Conditions share the weight in accordance with the rule: if the condition contains only one branch (only "if ..."), then the weight of each nested operation is reduced to 80% of the current weight. If the condition is divided into several branches ("if ... else ..."), then the weight is divided into equal parts: 50%/50% for two branches, 33%/33%/33% for three, etc.

- The weights of operations in a cycle remain, but can also be multiplied by certain weights if necessary.

- All nested restrictions are multiplied; for example, for two nested conditions the weight of the operations will be 80% * 80% = 64%.

The assigned weights can then be used to develop test cases using the weight assigned to one or another branch for certain values of the input parameters.

For convenience, we introduce the following notation: X is a data set; F(X) is the value of the fitness function for each data set, depending on the calculated values of the weights. The challenge is to maximize the objective function, i.e.

F(X) → max.
IV. THE RESULTS OF THE METHOD

In accordance with the previously proposed option for assessing the complexity of program code, this method is being refined to better meet real requirements. The weights are considered in accordance with the operability of the program; in other words, the more iterations the program performs, the more weight the initial test version will have.

The first population is formed from random values. Each population contains 100 chromosomes, and the total number of populations is also 100. Due to this, a sufficient number of different options is formed and the best ones are selected. Table 1 presents the test results.

TABLE I. COMPARISON OF RESULTS

Population   | Test 1        | Test 2        | Test 3        | Test 4
0            | 1: 78, 23, 35 | 1: 97, 3, 6   | 1: 92, 97, 28 | 1: 15, 67, 26
             | 2: 62, 36, 95 | 2: 82, 77, 64 | 2: 38, 66, 52 | 2: 32, 27, 83
             | 3: 52, 35, 27 | 3: 24, 47, 57 | 3: 63, 76, 64 | 3: 37, 52, 64
             | 4: 17, 77, 73 | 4: 90, 13, 82 | 4: 7, 24, 56  | 4: 70, 49, 64
             | 5: 75, 9, 96  | 5: 81, 69, 24 | 5: 57, 48, 8  | 5: 67, 29, 94
20           | 1: 95, 64, 54 | 1: 97, 80, 4  | 1: 99, 13, 10 | 1: 99, 71, 45
             | 2: 95, 64, 29 | 2: 97, 80, 53 | 2: 99, 13, 11 | 2: 99, 71, 15
             | 3: 95, 64, 54 | 3: 97, 80, 28 | 3: 99, 13, 11 | 3: 99, 71, 3
50           | 1: 95, 64, 54 | 1: 97, 80, 29 | 1: 99, 13, 10 | 1: 99, 71, 60
             | 2: 95, 64, 29 | 2: 97, 80, 4  | 2: 99, 13, 11 | 2: 99, 71, 3
             | 3: 95, 64, 54 | 3: 97, 80, 53 | 3: 99, 13, 11 | 3: 99, 71, 3
Result (100) | 1: 95, 64, 54 | 1: 97, 80, 4  | 1: 99, 13, 10 | 1: 99, 71, 60
             | 2: 95, 64, 29 | 2: 97, 80, 29 | 2: 99, 13, 11 | 2: 99, 71, 45

In each of the tests, at least two different versions of the data were generated for which the considered program code will work the most times, which means that it will pass the greatest number of times along different paths. In addition, certain patterns can be seen in the results: the first value is always maximal (random values were limited to 100), and the second value is less than the first but greater than the third.

For testing, code with many different paths is used, where the critical path is the one with the largest number of operations. The selection of input data for this path is a solution to the subtask, and from this data it is possible to determine how accurately the data is selected. The critical path will be reached if the 1st and 3rd values of the selected data are greater than 50 and one value is less than 3.

The following genetic algorithm settings are used to generate the input test data:

- number of generations: 100;
- number of chromosomes in one generation: 100;
- range of generated data values: (0, 100).

V. COMPARISON OF METHODS FOR ASSESSING THE COMPLEXITY OF PROGRAM CODE

For this research, several tests of the algorithm with four different metrics were carried out: the modified metric, the logic of which was described in Section III, the SLOC metric that evaluates the number of lines of code, the ABC metric, and the Jilb metric.

The SLOC metric (abbr. Source Lines of Code) is determined by the number of lines of code. This metric takes into account only the total number of lines of code in the program, which makes it the easiest to understand. In this case, the number of lines refers to the number of commands, not the physical number of lines.

The ABC metric, or Fitzpatrick metric, is determined on the basis of three indicators. The first indicator, na (Assignment), counts the lines of code responsible for assigning a certain value to a variable, for example int number = 1. The indicator nb (Branch) counts the uses of functions or procedures, that is, operands that work out of sight of the current program code. The indicator nc (Condition) counts the logical operands, such as conditions and loops. The metric value is calculated as the square root of the sum of the squared values of na, nb and nc:

F = √(na² + nb² + nc²)   (1)

It is noteworthy that one line of code can be counted in several indicators; for example, assigning a variable the value of a certain function (double number = Math.Pow(2, 3)) is counted both in na and in nb. The disadvantages of this metric include the possible return of a zero value in some parts of the code.

The Jilb metric is aimed at determining the complexity of program code depending on its saturation with conditional operands. This metric is useful for determining the complexity of program code both for writing and for understanding it:

F = cl / n,   (2)

where cl is the number of conditional operands and n is the total number of lines of code.
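The three published metrics reduce to simple formulas and can be computed directly from the counts; the example counts below are illustrative and do not come from the tested program.

```python
import math

def sloc(commands):
    """SLOC: the number of commands (logical lines of code)."""
    return commands

def abc(na, nb, nc):
    """ABC (Fitzpatrick) metric, formula (1): sqrt(na^2 + nb^2 + nc^2)."""
    return math.sqrt(na ** 2 + nb ** 2 + nc ** 2)

def jilb(cl, n):
    """Jilb metric, formula (2): share of conditional operands."""
    return cl / n

# Illustrative counts for a small path (not taken from the paper).
print(sloc(130), abc(6, 2, 3), jilb(13, 130))   # 130 7.0 0.1
```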
A. Results using the metric proposed in the article

The algorithm with this metric selects data giving priority to operations of a higher level. As a result (99th generation), two data sets were obtained: (70, 9, 78) and (75, 67, 82). Both sets go along the longest code path, which is the solution to the subtask. Table 2 presents the first 10 variants in each of the generations.

TABLE II. RESULTS OF THE PROPOSED METRIC

Modified metric
Variant\Gen. | 0                      | 1                      | 99
1            | (70, 9, 78) = 164 100  | (75, 67, 82) = 164 100 | (70, 9, 78) = 164 100
2            | (75, 67, 82) = 164 100 | (61, 29, 94) = 164 100 | (70, 9, 78) = 164 100
3            | (61, 29, 94) = 164 100 | (63, 52, 87) = 164 100 | (75, 67, 82) = 164 100
4            | (63, 52, 87) = 164 100 | (63, 49, 83) = 164 100 | (75, 67, 82) = 164 100
5            | (63, 49, 83) = 164 100 | (70, 9, 78) = 164 100  | (75, 67, 82) = 164 100
6            | (5, 68, 90) = 96 382   | (63, 52, 87) = 164 100 | (75, 67, 82) = 164 100
7            | (60, 37, 3) = 32 500   | (70, 9, 78) = 164 100  | (75, 67, 82) = 164 100
8            | (12, 80, 49) = 16 000  | (70, 9, 78) = 164 100  | (70, 9, 78) = 164 100
9            | (47, 12, 17) = 16 000  | (70, 9, 78) = 164 100  | (70, 9, 78) = 164 100
10           | (53, 35, 76) = 16 000  | (61, 29, 94) = 164 100 | (75, 67, 82) = 164 100

It can be seen that the algorithm works quite efficiently: already in the first generation, data for the critical path was selected.

B. SLOC metric

This metric is the simplest from the point of view of implementation; it takes into account only the total number of lines of code. The results are presented in table 3. The algorithm with this metric picked three sets: (63, 72, 91), (68, 50, 94) and (80, 70, 88). All three satisfy the conditions for passing along the critical path. As with the previous metric, the algorithm picked suitable data already in the first generation.

TABLE III. RESULTS OF THE METHOD WITH THE SLOC METRIC

SLOC metric
Variant\Gen. | 0                    | 1                    | 99
1            | (64, 14, 96) = 6 411 | (68, 50, 94) = 6 411 | (63, 72, 91) = 6 411
2            | (68, 50, 94) = 6 411 | (80, 70, 88) = 6 411 | (68, 50, 94) = 6 411
3            | (80, 70, 88) = 6 411 | (65, 81, 89) = 6 411 | (68, 50, 94) = 6 411
4            | (65, 81, 89) = 6 411 | (63, 72, 91) = 6 411 | (63, 72, 91) = 6 411
5            | (63, 72, 91) = 6 411 | (74, 83, 76) = 6 411 | (80, 70, 88) = 6 411
6            | (74, 83, 76) = 6 411 | (64, 69, 91) = 6 411 | (68, 50, 94) = 6 411
7            | (64, 69, 91) = 6 411 | (69, 88, 85) = 6 411 | (68, 50, 94) = 6 411
8            | (69, 88, 85) = 6 411 | (64, 14, 96) = 6 411 | (63, 72, 91) = 6 411
9            | (5, 39, 72) = 3 618  | (63, 72, 91) = 6 411 | (63, 72, 91) = 6 411
10           | (2, 67, 73) = 3 618  | (68, 50, 94) = 6 411 | (80, 70, 88) = 6 411

C. ABC metric

This metric takes into account more varieties of operations, such as assigning values to variables, logical checks and function calls. The algorithm with the ABC metric picked two input data options that pass along the critical path: (69, 46, 78) and (77, 36, 98). The remaining results are presented in table 4.

TABLE IV. RESULTS OF THE METHOD WITH THE ABC METRIC

ABC metric
Variant\Gen. | 0                    | 1                    | 99
1            | (95, 27, 97) = 6 351 | (77, 36, 98) = 6 351 | (69, 46, 78) = 6 351
2            | (77, 36, 98) = 6 351 | (69, 46, 78) = 6 351 | (77, 36, 98) = 6 351
3            | (69, 46, 78) = 6 351 | (61, 65, 95) = 6 351 | (69, 46, 78) = 6 351
4            | (61, 65, 95) = 6 351 | (95, 27, 97) = 6 351 | (69, 46, 78) = 6 351
5            | (5, 67, 92) = 3 538  | (61, 65, 95) = 6 351 | (69, 46, 78) = 6 351
6            | (5, 87, 95) = 3 538  | (95, 27, 97) = 6 351 | (69, 46, 78) = 6 351
7            | (1, 35, 60) = 3 538  | (61, 65, 95) = 6 351 | (69, 46, 78) = 6 351
8            | (1, 70, 53) = 3 538  | (69, 46, 78) = 6 351 | (77, 36, 98) = 6 351
9            | (60, 30, 12) = 768   | (69, 46, 78) = 6 351 | (69, 46, 78) = 6 351
10           | (60, 49, 73) = 768   | (69, 46, 78) = 6 351 | (69, 46, 78) = 6 351

D. Jilb metric

Unlike the previous metrics, this one takes into account the absolute complexity of the program, which is calculated by dividing the number of cycles and conditions by the total number of operations on the path. The complexity of the program is determined in a completely different way, which led to the input data being selected for a different path. The results are presented in table 5.

TABLE V. RESULTS OF THE METHOD WITH THE JILB METRIC

Jilb metric
Variant\Gen. | 0                  | 1                  | 99
1            | (75, 51, 3) = 100  | (62, 25, 41) = 100 | (78, 45, 21) = 100
2            | (92, 33, 11) = 100 | (94, 22, 35) = 100 | (63, 36, 10) = 100
3            | (94, 22, 35) = 100 | (98, 51, 12) = 100 | (75, 51, 3) = 100
4            | (98, 51, 12) = 100 | (80, 42, 20) = 100 | (80, 42, 20) = 100
5            | (80, 42, 20) = 100 | (78, 45, 21) = 100 | (78, 45, 21) = 100
6            | (78, 45, 21) = 100 | (80, 59, 8) = 100  | (63, 36, 10) = 100
7            | (80, 59, 8) = 100  | (5, 40, 27) = 100  | (80, 42, 20) = 100
8            | (5, 40, 27) = 100  | (99, 38, 29) = 100 | (78, 45, 21) = 100
9            | (99, 38, 29) = 100 | (62, 25, 41) = 100 | (75, 51, 3) = 100
10           | (62, 25, 41) = 100 | (63, 36, 10) = 100 | (80, 42, 20) = 100

The data obtained differ greatly both from the other metrics and within the metric itself. This is due to the features of the tested code: it has one common loop, within which there is one common condition. If this condition is not met, then none of the operations, except the loop and the conditions, is taken into account when calculating the metric. A value of 100 indicates that among all operations on the path all are cycles or conditions, i.e. the formally selected input data are options where the first condition was not met and the other operations were not taken into account.
VI. CONCLUSION

Evolutionary methods are designed to find the best solutions to problems that are impossible or too costly to solve with standard optimization methods. They do not always work quickly or efficiently, but on problems requiring non-standard approaches they show superiority.

The introduction of various metrics for calculating the fitness function gave the method for generating input test data greater variability and the ability to introduce new data requirements. Each metric is focused on specific code parameters and can be used when data must be selected in accordance with certain requirements. In addition, when one metric does not select data efficiently, other metrics can be used that cover each other's shortcomings.

All analyzed metrics, with the exception of the Jilb metric, generated several data sets for the originally selected critical path. It is noticeable that, for a small code of 130 lines with several code paths, the metrics successfully select data in the first generation, which indicates a rather high convergence rate of the algorithm. In subsequent generations, various options are sequentially eliminated.

The conducted studies allow us to propose a new method for generating test data based on the genetic algorithm, in which the fitness function will be formed not on the basis of one of the known metrics for assessing code complexity (as in this paper), but on the basis of a hybrid metric that is a weighted sum of the indicators present in the metrics considered here. It also seems promising, in terms of increasing the degree of code coverage, to create an effective mechanism for regulating (increasing and decreasing) the weights of operations in the fitness function as the nesting level of the code section increases.

In the future, it is planned to expand the ways of determining code complexity. In addition to using metrics directly, it is planned to develop a method that takes into account the numbers of operations, functions, conditions and cycles with different weights. It is also possible to establish the degree of reduction or increase of the operation weights at different nesting levels. This will allow setting the priority for input generation when certain requirements arise.

ACKNOWLEDGMENT

The reported study was funded by RFBR, project number 19-37-90156. The research is supported by the Ministry of Science and Higher Education of the Russian Federation (project No. FSUN-2020-0009).

REFERENCES

[1] R.A. DeMillo and A.J. Offutt, "Constraint-Based Automatic Test Data Generation," IEEE Transactions on Software Engineering, vol. 17, no. 9, pp. 900-910, 1991.
[2] P. Maragathavalli, M. Anusha, P. Geethamalini and S. Priyadharsini, "Automatic Test-Data Generation for Modified Condition Decision Coverage Using Genetic Algorithm," International Journal of Engineering Science and Technology, vol. 3, no. 2, pp. 1311-1318, 2011.
[3] C. Meudec, "ATGen: Automatic Test Data Generation using Constraint Logic Programming and Symbolic Execution," Software Testing, Verification and Reliability, 2001.
[4] R. Gerlich, "Automatic Test Data Generation and Model Checking with CHR," 11th Workshop on Constraint Handling Rules, 2014.
[5] M.R. Girgis, "Automatic Test Data Generation for Data Flow Testing Using a Genetic Algorithm," Journal of Universal Computer Science, vol. 11, no. 6, pp. 898-915, 2005.
[6] E.J. Weyuker, "The complexity of data flow criteria for test data selection," Information Processing Letters, vol. 19, no. 2, pp. 103-109, 1984.
[7] A. Khamis, R. Bahgat and R. Abdelazi, "Automatic test data generation using data flow information," Dogus University Journal, vol. 2, pp. 140-153, 2011.
[8] S. Singla, D. Kumar, H.M. Rai and P. Singla, "A hybrid PSO approach to automate test data generation for data flow coverage with dominance concepts," Journal of Advanced Science and Technology, vol. 37, pp. 15-26, 2011.
[9] Z. Liu, Z. Chen, C. Fang and Q. Shi, "Hybrid Test Data Generation," ICSE Companion: Proceedings of the 36th International Conference on Software Engineering, pp. 630-631, 2014.
[10] M. Harman and P. McMinn, "A Theoretical and Empirical Study of Search-Based Testing: Local, Global, and Hybrid Search," IEEE Transactions on Software Engineering, vol. 36, no. 2, pp. 226-247, 2010.
[11] Y. Xing, Y. Gong, Y. Wang and X. Zhang, "Hybrid Intelligent Search Algorithm for Automatic Test Data Generation," Mathematical Problems in Engineering, 2015.
[12] C. Paduraru and M.C. Melemciuc, "An Automatic Test Data Generation Tool using Machine Learning," 13th International Conference on Software Technologies (ICSOFT), pp. 472-481, 2018.
[13] M. Boussaa, O. Barais, G. Sunyé and B. Baudry, "A Novelty Search Approach for Automatic Test Data Generation," 8th International Workshop on Search-Based Software Testing, 2015.
[14] M. Lopez, H. Ferreiro and L.M. Castro, "DSL for Web Services Automatic Test Data Generation," 25th International Symposium on Implementation and Application of Functional Languages, 2013.
[15] C. Doungsa-ard, K. Dahal, A.G. Hossain and T. Suwannasart, "An automatic test data generation from UML state diagram using genetic algorithm," IEEE Computer Society Press, pp. 47-52, 2007.
[16] S. Sabharwal, R. Sibal and C. Sharma, "Applying Genetic Algorithm for Prioritization of Test Case Scenarios Derived from UML Diagrams," IJCSI International Journal of Computer Science Issues, vol. 8, no. 3-2, 2011.
[17] C. Doungsa-ard, K. Dahal, A. Hossain and T. Suwannasart, "GA-based Automatic Test Data Generation for UML State Diagrams with Parallel Paths," in Advanced Design and Manufacture to Gain a Competitive Edge: New Manufacturing Techniques and Their Role in Improving Enterprise Performance, pp. 147-156, 2008.
[18] M. Grochtmann and K. Grimm, "Classification trees for partition testing," Software Testing, Verification and Reliability, vol. 3, no. 2, pp. 63-82, 1993.
[19] T.Y. Chen, P.L. Poon and T.H. Tse, "An integrated classification-tree methodology for test case generation," International Journal of Software Engineering and Knowledge Engineering, vol. 10, no. 6, pp. 647-679, 2000.
[20] A. Cain, T.Y. Chen, D. Grant, P.L. Poon, S.F. Tang and T.H. Tse, "An Automatic Test Data Generation System Based on the Integrated Classification-Tree Methodology," Software Engineering Research and Applications, Lecture Notes in Computer Science, vol. 3026, 2004.
[21] K. Serdyukov and T. Avdeenko, "Investigation of the genetic algorithm possibilities for retrieving relevant cases from big data in the decision support systems," CEUR Workshop Proceedings, vol. 1903, pp. 36-41, 2017.
[22] R.S. Praveen and K. Tai-hoon, "Application of Genetic Algorithm in Software Testing," International Journal of Software Engineering and Its Applications, vol. 3, no. 4, pp. 87-96, 2009.
[23] W.M. Spears, "Crossover or mutation?" Foundations of Genetic Algorithms, vol. 2, pp. 221-237, 1993.
[24] H. Mühlenbein, "How genetic algorithms really work: Mutation and hillclimbing," Parallel Problem Solving from Nature, vol. 2, 1992.

Data Science
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)