Structural Diagnosis Method for Computer Programs
                   Developed by Trainees

       Daniel Gaydachuk1, Olena Havrylenko1, Juan Pablo Martínez Bastida1,
                               Andrey Chukhray1
              1 National Aerospace University, KhAI, Kharkiv, Ukraine

    dvrch@mail.ru, {lm77191220, jpbastida, achukhray}@gmail.com


       Abstract. The individual professional skills training via intelligent tutoring sys-
       tems is one of the high-priority scientific and applied problems. This paper con-
       sidered the issue that arises while training professional skills of algorithmization
       and programming, specifically – how the sample program stored in ITS will be
       compared with the program developed by a trainee. Structural diagnosis method
       for computer programs developed by trainees is proposed. Its advantages are the
       speed increment in comparison with known methods and a better adjustment of
       tutoring purposes. The results have been verified by means of the module testing,
       the tutoring system prototype implementation and introduction to a studying
       process.

       Keywords: Intelligent Tutoring System, M-ary tree, edit distance, trace, trainee’s
       program, code diagnose.


1      Introduction

Today one of the most actual scientific and engineering problems is ensuring of the
effective professional skills individual learning. Thus, the limited human psychophys-
iological abilities is due to that one teacher cannot adaptively teach every student in a
group with twenty or thirty people. A promising solution of this situation may be the
development and implementation of intelligent tutoring systems (ITS). Such programs
may have potentially unlimited resources and high performance.
   One of the issues that could improve quality of computer tutoring in professional
skills of algorithmization and programming is considered in this paper. The great chal-
lenge is how sample program stored in ITS will be compared with the program devel-
oped by a trainee [3, 4]. Commonly, such tutoring systems check solution written by a
user on some programming language by means of input-output tests. If some test has
failed and the trainee is not able to fix it by himself, then the tutoring system need to
determine where the mistake occurred and help to resolve it.
   Problem statement. It is required to create a method for finding the minimum edit
distance and trace between two m-ary trees. It could be achieved by modernization of
the method published in [6] that finds only tree distance. This method should provide
trainee’s code diagnose on the structural level.
    Assume, that each node in the tree has signed by a label which is a serial number
beginning from the bottom to the top and from the left to the right. Then following edit
operations can be defined: insert operation of the node, delete operation of the node and
exchange of two node labels. The operations are measured by means of the weight fac-
tor, which is a metric described in [5]. In this case the weight of label exchanging with
the same label is equal to zero. The weight of label insertion is equal to the weight of
label deletion. The weight of the two different labels exchanging can’t be greater than
the weights sum of the equivalent deletions and insertions, which transform the source
tree the same way. In this paper, it is assumed that weight of an operation is equal to
zero (label exchanging with the same label) or one (in other cases).
    The example of parameters determining, such as l(T) – an array of the leftmost tree
T leaves and LR_keyroots(T) – a set of such k, for which l (T )[k ]  l (T )[ p(T )[k ]] ,
is represented on Fig. 1. Parameter p(T) is an array of the ancestors for all nodes in the
tree T in ascending order, index in [] specify the number of root node. After finding
trace between two trees one tree can be converted into another.


               Fig. 1. The example of determining l and LR_keyroots for tree T

The method implemented in this article implies that each tree has input data represented
as two arrays: the first – an array of ancestors of every node in the tree: p1 (if it is the
initial tree) or p2 (if it is an ending tree), the second – an array of labels for each of the
nodes (T1 and T2 respectively). Further indexation of array fields will start from zero
(enumeration of tree nodes remains the same).
   In order to solve the problem of finding the tree distance and trace the method covers,
the calculation of l and LR_keyroots for the initial tree (l1 and LR_keyroots1 respec-
tively) and for the tree to which you want to convert the initial one (l2 and LR_keyroots2
respectively), as follows. For each tree, all the nodes have been analyzed starting with
the number one in ascending order.
   For instance, a tree T is considered and nodes that at least once has been read or
modified, become marked. Originally, leftmost leaf array l filled with zero values and
LR_keyroots (as a list) is empty.
   Obviously,
        l (T )[1]  l (T )[ p(T )[1]]  l (T )[ p(T )[ p(T )[1]]]  ...  l (T )[| T |]  1   (1)

where |T| is a number (the cardinality) of tree nodes (hereafter we will use the notation
«nT1» (nT1 = |T1|) and «nT2» (nT2 = |T2|)).
   First, the leftmost leaf array l is filled by these values. Further, the next unmarked
node is determined in turn. If it is found, it will be a tree leaf. For it the relation l(T)[i]
= i is valid. At the same time, the field l(T)[p(T)[i]] is set to i, if the ancestor was not
marked, otherwise value i is added to an array LR_keyroots, and then the shift to the
next unmarked node is done. At the end, tree T root number is added to the array
LR_keyroots. In order to solve the problem-parallelized version of the method presented
in [6], two arrays are used:

1. treedist size nT1 * nT2 (treedist[i1, j1] contains the edit distance between two trees,
   which roots are node T1[i1] of tree T1 and node T2[j1] of tree T2, respectively).
2. dist sizing nT12 * nT22 (dist[i, j][i1, j1] contains the edit distance between two for-
   ests, which includes nodes numbered l1[i] to i1 and the tree T1 from l2[j] to j1 in
   the tree T2, respectively).

   By using an array of that size, allowed to work with trees in which the number of
nodes is not more than two hundred. In this regard, the optimization and extension were
performed for the base parallelized method proposed in [6], excluding the paralleliza-
tion itself.
   Below, there are the changes undergone by this method. In [6] it is proved that

                             dist[i, j ][l1[i ]  1, l 2[ j ]  1]  0
               
               dist[i, j ][i1, l 2[ j ]  1]  dist[i, j ][i1  1, l 2[ j ]  1]  1,
               
                dist[i, j ][l1[i ]  1, j1]  dist[i, j ][l1[i ]  1, j1  1]  1
                                                                                              (2)
               i1  [l1[i]; nT1], j1  [l 2[ j ]; nT 2]
In order to reduce the number of fields in dist array, as one of the options, it is necessary
to replace some of them to analytical expression. As following it is shown:

                              dist[i, j ][l1[i ]  1, l 2[ j ]  1]  0
                             
                              dist[i, j ][i1, l 2[ j ]  1]  i1  l1[i ]  1 ,
                             dist[i, j ][l1[i ]  1, j1]  j1  l 2[ j ]  1
                                                                                              (3)
                             i1  [l1[i]; nT1], j1  [l 2[ j ]; nT 2]
In addition, it should be noted that some of the fields dist[i, j][i1, j1] of array dist con-
taining the distance between two non-empty forests are not used due to the performance
of the following relationships:
                             i  LR _ keyroots1,
                              j  LR _ keyroots2,                                       (4)
                             i1  [l1[i ]; i ], j1  [l 2[ j ]; j ]

Let the source array dist which will be referred to as «dist0», be the so-called “original”,
dist array (different from the original) “image” that includes only the fields which con-
tain the distance between two non-empty forests. It is noticed that between the index of
the field list LR_keyroots and the field value one-to-one ratio exists. It can also be no-
ticed that field of indices that do not satisfy the conditions imposed on i1 and j1 will
never be refered.
   Thus, the dist array is ragged, namely: it can be represented as a two-dimensional
matrix of size |LR_keyroots1|*|LR_keyroots2|, the elements of which are two-dimen-
sional matrices; each has the size determined by the expression:
                           ( LR _ keyroots1[i 0] 
                            l1[ LR _ keyroots1[i 0]]  1) 
                                                                                        (5)
                            ( LR _ keyroots2[ j 0] 
                            l 2[ LR _ keyroots2[ j 0]]  1)

where i0, j0 are the row and column numbers of the first matrix, respectively. The match
of array dist0 fields to array dist may be written as:
                             dist 0[i , j ][i1, j1] 
                              dist[i 0, j 0][i1 _ i , j1 _ j ],
                             {i  LR _ keyroots1[i 0],
                                                                                        (6)
                             j  LR _ keyroots2[ j 0],
                             i1 _ i  i1  l1[i ],
                             j1 _ j  j1  l 2[ j ]}

Furthermore, the treedist array is converted and it can replace it with an array dist0 in
accordance with the following identity:
                        treedist[i1, j1]  dist 0[u , v][i1, j1],
                        {u  LR _ keyroots1, l1[u ]  l1[i1],                           (7)
                         v  LR _ keyroots2, l 2[v]  l 2[ j1]}

It should be noted that one-to-one relationship is observed between the specified value
u  LR _ keyroots(T ) and value l (T )[u ] .
    Let specify array LR_keyroots_inverse(T), for which the equality is as follows:
                 ( LR _ keyroots _ inverse(T )[l (T )[i ]  1]  i 0) 
                                                                                           (8)
                  (l (T )[i ]  l (T )[i 0]  i 0  LR _ keyroots(T ))

then,
                       treedist[i1, j1]  dist[
                       LR _ keyroots1 _ inverse[l1[i1]  1],
                       LR _ keyroots2 _ inverse[l 2[ j1]  1]][
                       i1  l1[i1], j1  l 2[ j1]],
                                                                                           (9)
                      {LR _ keyroots1 _ inverse 
                        LR _ keyroots _ inverse(T 1),
                       LR _ keyroots2 _ inverse 
                        LR _ keyroots _ inverse(T 2)}

Thus, the distance between trees T1 and T2 is the value equal to:
                              dist[| LR _ keyroots1 | 1,
                              | LR _ keyroots2 | 1][                                  (10)
                              nT 1  1, nT 2  1]

The search for trace between two trees is carried out based on the third and fifth lemmas
of the article [6] by using the result of the calculation from the array dist.
   The algorithm is implemented as code in C# programming language. Table 2 pre-
sents the information about the approximated time the algorithm took for selected sizes
of trees, and the sizes of both trees are the same but their depth is equal to two.

                      Table 1. Data-time comparison of the algorithm

                                  Approximate run time
  The number of
                           the distance                                    The exception
 nodes of each tree                           the trace search
                              search
 not more than 100        less than 0.1 s      less than 0.1 s
        250                    1.15 s               0.15 s
        500                     9s                  0.25 s                      –
       1000                  1 m 11 s                0.4 s
       2000                  9 m 21 s                1.4 s
       4000               1 h 15 m 21 s                –                  Stack Over Flow
  more than 6500                          –                               Out Of Memory
2      Conclusion

The proposed algorithm is a modification of the algorithm Shasha-Zhang. Nevertheless,
the latter does not allow determining the trace between two trees. In the modified algo-
rithm the used memory grows slower than in the Shasha-Zhang algorithm, when the
number of trees nodes increases. If count of tree nodes was 6500 or more, the algorithm
could be interrupted with a message about insufficient memory.
    The prototype of the tutoring system was developed to verify method’s reliability. It
analyzes trainee’s C++ code by means of a set of tests. In case a user could not solve a
task, the system starts lexical code analysis (comparison of token arrays), and then parse
it (comparison of abstract syntax trees), in order to identify errors and prompts hints to
assist the trainee in accomplishing the proper writing of the required algorithm.


References
 1. Chukhray, A.: On a method of verification of professional skills algorithmization (in Rus-
    sian). In: Journal Radio-electronic and computer systems. vol.4(38), pp.84–86. (2009)
 2. Chukhray, A.: Models and methods for adaptive computer systems support of the acquisition
    of knowledge and skills in solving algorithmic learning tasks (in Russian). In: Journal Radio-
    electronic and computer systems. vol. 5(64), pp. 390–402. (2013)
 3. Horwitz, S.: Identifying the semantic and textual differences between two versions of a pro-
    gram. In: Proceedings of the SIGPLAN 90 Conference on Programming Language Design
    and Implementation, ACM SIGPLAN Notices. vol. 25(6), pp. 234–245. White Plain, New
    York. (1990)
 4. Yang, W.: Identifying syntactic differences between two programs. In: Software-Practice
    and Experience. vol. 21(7), pp. 739–755. (1991)
 5. Tai, K.-C.: The tree-to-tree correction problem. In: Journal of the ACM. vol. 26(3), pp. 422–
    423. (1979)
 6. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and
    related problems. In: SIAM Journal of Computing. vol. 18(6), pp. 1245–1262. (1989)