=Paper= {{Paper |id=Vol-1348/maics2013_paper_14 |storemode=property |title=Using Graph Matching: Program Recognition of the Selection Sort Algorithm |pdfUrl=https://ceur-ws.org/Vol-1348/maics2013_paper_14.pdf |volume=Vol-1348 |dblpUrl=https://dblp.org/rec/conf/maics/Finkbine13 }} ==Using Graph Matching: Program Recognition of the Selection Sort Algorithm== https://ceur-ws.org/Vol-1348/maics2013_paper_14.pdf
                      Using Graph Matching: Program Recognition of the Selection Sort Algorithm


                                                      Ronald Finkbine, Ph.D.
                                                    Indiana University Southeast
                                                    New Albany, Indiana 47250
                                                          rfinkbin@ius.edu


                          Abstract                                   equations. Each of these problems increases the difficulty
The field of program understanding attempts to determine             in a software maintenance programming attempting to
the function of a code segment without programmer                    understand a software component. Graph matching is
intervention and for this to occur, it is necessary to have a        considered one of the most complex problems in
model (plan) against which to attempt to match the code              computing (Bienenstock 1987).
segment of interest. This paper traces in detail the pattern
recognition of the selection sort algorithm.                         The first problem, parameter identification, is the most
                                                                     simple. It involves searching the source code for variables
                       Introduction                                  that are assigned values within assignment statements (no
The purpose of this research is to develop a general-                reads) one time. Any usage, thereafter, is only on the right-
purpose algorithm recognition system, capable of                     hand side of assignment statements and is a reference to
recognizing any well-defined and well-written algorithm.             the variable, not a modification to the variable. Therefore,
This project uses plans (Wills 1992) to recognize common             these types of variables, or constants, can be identified by
forms (code segments) within existing software in an                 the parameter statement which indicates their true usage.
attempt to gain knowledge about a legacy system (Sartipi
2003) from its source code (Biggerstaff 1990) by matching            The second problem, plan recognition, is comprised of
against a large defined set of common algorithms, rather             identifying code segments that are replaceable by calls to
than attempting to deduce what a code segment performs               commercial libraries (such as IMSL). This will involve
from its specific dataflow (Rugaber et. al. 1990). This              detecting codes similar to those used within commercial
research uses an intermediate representation of an abstract          libraries.
syntax tree (AST), standard in compiler toolsets. This AST
representation is output as a flattened tree into a fact list,       The third problem, duplicate removal, consists of detecting
which is the operation underpinning of expert systems.               and removing duplicate code to the user’s library. This
                                                                     allows the user to designate a section of code as common
                   Targeted Problems                                 and to look through their remaining programs searching for
This research concentrates on design recovery from legacy            codes that are copies of the target.
software, written in older languages and with fewer
techniques applicable to modern software development.                The fourth problem, algorithm separation, involves
This is because that recently written software is often              detection/separation of overlapping algorithms within the
written in a more modern language, but this leaves a large           same section of code. In Figure 1 it can be seen that there
bulk of older, operational software, orphaned to endless             are two initializations of arrays occurring within the same
software maintenance until it is rewritten.                          do-loop. This is good for optimizing computer resources,
                                                                     but not for optimizing the programmers’ time for
Legacy software, in general, exhibits a number of the                understanding and maintaining a program.
following     problems: 1) parameter identification, 2)
identifying code segments that are replaceable by calls to           The fifth problem, algorithm aggregation, involves
commercial libraries (such as IMSL), 3) removing                     combining disparate codes into single equations. As
duplicate code to user library, 4) separation of intertwined         displayed in Figure 2, an equation 1) can be coded in
components, and 5) combining disparate codes into single             multiple ways. Though the computations are equivalent,

                                                    DO 10 I = 1, N
                                                      A(I) = 0
                                                      B(I) = 0
                                             10     CONTINUE

                                                  Figure 1: Intertwined Algorithms
                                          Table 1: Selection Sort Algorithm Rule Flow
                                   Firing set     Prefix            Rules
                                   First          Defines           00
                                                    General            00
                                                    Expressions        00, 02, 07, 09, 10, 11, 17
                                                    Structures         00
                                                    Evaluations        01, 02
                                   Second           Variables          00, 02, 04, 05, 06, 07, 08
                                                    swaps              00, 01
                                   Third            loop               00, 01, 02, 03
                                                    min                00, 01
                                   Complete         SSA                00, 01

the recognition of them must take these variations into                the expert system or the order in which they are listed
account.                                                               within the system. A rule set once started will continue to
                                                                       fire until all rules have attempted to fire one final time with
This paper describes a portion of the High-Level                       none successful. This procedure allows each rule to fire as
Algorithm Recognizer (HLAR) project (Finkbine1994),                    many times as possible, only halting when all have been
which recognizes three algorithms selection sort (SSA),                unsuccessful during the final pass.
quick sort and heap sort from four languages, C, Scheme,
Postscript and COBOL. This research is unique in that it                                 SSA Recognition Trace
recognizes algorithms of significant size (currently 50 lines          One of the algorithms currently recognized is the Selection
of code), and detects these algorithms directly from                   Sort Algorithm (SSA). Figure 2 is a depiction of the
multiple third-generation programming languages instead                component parts, also known as plans (or sub-plans),
of from one language or directly from an intermediate form             within the SSA. The minimization plan and the swap plan
(Ning 1989).                                                           are contained, respectively, with the ssort plan. As well as
                                                                       proper containment and ordering of the sub-plans in this
The first step in recognizing common algorithms is to                  Figure, it is necessary that the plans have identifiers
compile the input source program into an intermediate                  (variables) in common. For example, for this function to
representation (Seemann 1998). The bulk of the                         perform correctly, it is necessary that the indexing variable
recognition efforts will be made by CLIPS, a rule-based                of the containing for loop be one of the positions of the
forward-chaining expert system, therefore the source                   data structure with the swap plan. These additional
programs will be expressed as a facts list (a tree                     requirements are necessary for the proper execution of the
represented as a linked list). Once the CLIPS system is                algorithm and its subsequent recognition.
initialized and pattern recognition begins, the general flow
of the pattern recognition process is listed in Table 1.               This section details the recognition of the SSA within
                                                                       HLAR. This algorithm was chosen because it is a common
This first phase is the initial fact generation. In the                algorithm within computing literature and the computer
following discussion, rule names are listed in separate                programming community. It has a complex plan structure,
phases (or firing sets). This is necessary since in a rule-            providing enough challenge to be of interest within the
based expert system, rules can fire at the time their                  program understanding/re-engineering community. Figure
conditional elements are satisfied. Within each phase, the             2 depicts the general flow of the recognition process
rules can (and do) fire in an order determined within the              necessary for the SSA.
expert system itself, not the order they are introduced into

                                           For Loop through structure (less 1)
                                             with “i” index

                                                Minimization by position of
                                                 partial structure

                                                Swap in structure of
                                                  position “I” with
                                                  position “small”

                                                        Figure 2: SSA Plans
                        Initial Facts                                  satisfy the requirements for each of the rules within the
After a third generation source language program is                    HLAR system. This section describes the rules that are
translated into the intermediate form, a program will                  fired and the facts they modify by retracting, asserting or
traverse it and generate a list of facts that are input into the       leaving them alone. Each of the statements referenced in
HLAR system. The structure and purpose of these facts                  this section is from Figure 3, and the rule firing is
are further explained in the remainder of this section. In             summarized in Table 2. In general, rules within a firing set
general, a statement will become a series of facts, roughly            can fire in any order; however, some rules in this first
equivalent to tokens in traditional compiler technology.               firing set generate data that fire other rules. This rule firing
                                                                       set recognizes three categories of rules: expressions,
                        Initial Rules                                  structures and evaluation clauses (used to determine the
There are two rules that fire first due to their salience value        path of execution within an if- statement and the exit of a
regardless of the subject program being examined. Rule                 loop).
gen_00 fires and establishes the number of the maximum
generalNode used. Next, the def_00 rule fires, establishing            Figure 4 is included to display the intermediate code,
the last-general-node field of every defineRoutineNode,                produced by a C language parser, which is being
thus establishing the span of control of each routine. This            recognized. The general layout of a program is a sequence
is not possible in a one-pass translator, such as is used to           of global variable definitions and assignments followed by
generate the fact intermediate form from the standard                  sub-program definitions. In general, intermediate form
intermediate form, which is necessary for input into a rule-           statements use prefix notation and each statement is a
based expert system. In the case of multiple routines                  function call followed by the parameters passed to or
within a program, the control of each routine extends to the           returned from the function.
beginning of the next routine. The control of the last
routine extends to the last generalNode, the value
calculated in the gen_00 rule.

For the SSA, the annotated explanation of the recognition
process appears in this section. Figure 4 contains the
intermediate form code used for explanation of the
recognition. In general, processing takes place from the
lowest level (tokens and expressions) to higher levels
(loops and if statements). Table 1 displays the general rule
flow in the SSA recognition. To support the passing of
information form rule within HLAR, it is necessary to have
a set of abstract data types known as templates.

                   First Rule Firing Set
After the facts generated from the SSA program are input,
the CLIPS system reviews these facts and attempts to
                            [1] (define-routine sort
                            [2] (parameters (inout numbers) (in count)
                            [9] (assign i 0)
                            [10] (loop
                            [11] (eval (gt I (minus count 1)))
                            [12] (assign big I)
                            [13] (assign j (plus i 1))
                            [14] (loop
                            [15] (eval (gt j count))
                            [16] (if
                            [17] (eval (gt (select numbers (key j) (field))
                            [18] (select numbers (key big) (field)))
                            [19] (assign big j))
                            [20] (assign j (plus j 1)))
                            [21] (assign temp (select numbers (key big) field)))
                            [22] (assign (select numbers (key big) (field))
                            [23] (select numbers (key I) (field))
                            [24] (assign (select numbers (key I) (field)) temp)
                            [25] (assign I (plus I 1))))

                                                   Figure 4: SSA Intermediate Form
                                                         Table 2: SSA Firing Phase One
                                        Rule                 Rule name                  Statements
                                      exp_00       detect_exp_0                    9
                                      exp_02       detect_exp_plus_id_1            13, 20, 25
                                      exp_07       detect_exp_gt_id_id             15
                                      exp_09       detect_exp_gtlop_id_min_id_1 11
                                      exp_10       detect_exp_id                   12, 17-19, 21-24
                                      exp_11       detect_exp_gt_strucid_strucid   17, 18
                                      exp_17       detect_exp_minus_id_1           11
                                      struc_0      detect_strucref_id              17, 18, 21, 22, 23, 24
                                      0
                                      eval_01      detect_gt_strucid_strucid        17, 18
                                      eval_02      detect_gt_id_id                  15

                                                                               expression in statement 11. Rule eval_02 detects a
                                      Special-if                               comparison of simple identifiers in statement 15.
                                                                               Rule exp_10 detects an expression of a simple identifier in
                                                                               statements 12, 17, 18, 19, 21, 22, 23, and 24. This
     Potential swap             Pot-minimization                               identification of simple identifiers (variable references)
                                                                               produces output that is part of the conditional input of rule
                                                                               struc_00, which detects an array referenced by a simple
       Actual swap              Act-minimization                               identifier in statements 17, 18, 21, 22, 23 and 24. Also,
                                                                               rule eval_01 fires, recognizing the comparison of two
                                                                               positions of the same array with the greater-than operator.
                      Potential SSA                                                             Second Ruling Firing Set
                                                                               Due to the extensive rule firing that occurs in this set, the
                                                                               rule execution is displayed in Table 3. Rule var_00 detects
                       Actual SSA                                              an identifier occurring on both the left-hand- and right-
                                                                               hand-side of an incrementing assignment in statements 21
      Figure 3: SSA CLIPS Template Flow                                        and 25. Rule var_04 recognizes a somewhat similar x = y +
                                                                               1 statement in 13. Rule var_05 recognizes a simple save-
                                                                               value assignment statement in 12 and 19.
Expressions are recognized within the first firing set. They
are the least-common denominator of program                                    Rule var_06 recognizes the save-value of an array position
understanding and appear on the right-hand side of                             in statement 21. Rule var_07 recognizes a save between
assignment statements and as the single operand of                             two locations of the same array, and rule var_08
evaluation statements. Their variety (such as x = x + 1                        recognizes an assignment from a simple identifier to an
versus x = 1 + x) lead to the increased complexity of                          array position. These three assignment statement rules
pattern-matching algorithms.                                                   (var_06, var_07 and var_08) produce input to rule
                                                                               swap_00. Swap rule swap_00 fires on statements 21, 22,
The rule exp_00 detects an expression of zero in statement                     23 and 24. This rule sets a controlling condition that has
9. Rule exp_02 detects an expression of an identifier plus                     prevented the not- rules from firing. After the assertion, an
one in statements 13, 20, and 25. Rule exp_07 detects a                        interfering statement within the swap segment would be
Boolean expression of an identifier greater-than another                       detected, if one existed. And since there are no interfering
identifier in statement 15. Rule exp_17 fires on statement                     statements rule swap_01 fires successfully.
11 followed by rule exp_09, which detects a complex
                                            Table 4: Algorithm rule firings
                                            Algorithm             Rule Firings
                                       Selection Sort                        50
                                       Quick Sort                            75
                                       Heap Sort                            150


                                                                   The SSA recognition rule, ssort_01 then will fire due to the
                   Third Rule Firing Set                           correct statement ordering, proper containment and non-
Loop rule loop_00 fires on statements 9, 10, 11, 5,                interference.
recognizing the index variable initialization, increment,
and testing. This is followed by rule loop_01 firing on                                      Summary
these same statements since there is no interference with          The HLAR system currently recognizes the three
the index variable within the loop statement and all               algorithms (written in the C programming language) in the
statements are within the same routine. Loop rule loop_02          number of rule firings listed in Table 4. In addition, it has
fires on statements 13, 14, 15, 20, recognizing the index          recognized the SSA in the COBOL, Scheme and Postscript
variable initialization by an expression, increment and            programming languages. Currently, this project is being
testing. This is followed by loop rule loop_03 which fires         redesigned which will involve a platform change in order
on statements 13, 14, 15, 20 since there is no interference        to build a more appropriate GUI as well as to be able to
of the index variable within the loop statement and all            distribute the recognition tasks across a network.
statements are within the same routine.
                                                                   To limit the need for outside assistance from a programmer
Minimum rule min_00 fires on statements 13 through 20,             (Ning 1989), the HLAR system has been designed (and
recognizing the form of a degenerative minimization by             redesigned) to accept multiple forms of algorithms. Future
position. This is followed by rule min_01, which fires on          work includes development of a subsystem to construct the
statements 13 through 20, which verifies the statements            plans by compiling from source and to not have the
have the correct ordering, non-interference of variables and       recognition rules written expressly by a programmer.
proper containment.
                                                                                          References
The variable count was established as the variable that            Biggerstaff, Ted, 1990. Design Recovery for Maintenance
contains the initialized length or the number of structure         and Reuse, IEEE Computer: July.
positions with actual values that are not undefined. This
must be input by the user and not performed automatically          Bauer, D., S. L. Hakimi, and E. Schmeichel, 1990.
by HLAR. In the future, it will be part of the system, but         Recognizing Tough Graphs is NP-Hard: Discrete Applied
would involve recognition of additional algorithms that are        Mathematics:28, 191.195.
not currently part of this research.
                                                                   Bienenstock, E and von der Malsburg, C.1987. A neural
                  Completion of Rule Firing                        network for invariant pattern recognition, Europhysics
SSA recognition rule potential_ssort_00 recognizes the             Letters: 4 121-126.
form of a containing-loop, a contained-degenerative-
minimization-plan and a swap plan with all identifiers             Finkbine, Ronald 1994. B., Recognition of High-Level
matching appropriately.         This triggers a search for         Algorithms, Ph.D. Dissertation, Department of Computer
interfering statements that will hopefully find no reason to       Science, New Mexico Institute of Mining and Technology.
terminate the search. An example of which would be an
intervening statement that sets the loop-indexing variable         Ning, Jim Qun 1989. A Knowledge-Based Approach to
to zero (illegal in Pascal, legal in C).                           Program Analysis, Ph.D. Dissertation, Department of
                                                                   Computer Science, University of Illinois at Urbana-

                                                  Table 3: SSA Firing SetTwo
                                   Rule                   Rule name                   Stmt
                                 var_00      detect_assign_sca_inc_1_self          21, 25
                                 var_02      detect_assign_sca_array_base_c        9
                                 var_04      detect_assign_sca_inc_1_other         13
                                 var_05      detect_assign_sca_sca                 12, 19
                                 var_06      detect_assign_sca_struid              21
                                 var_07      detect_assign_strucid_strucid         22, 23
                                 var_08      detect_assign_struciddd_sca           24
                                 swap_00     detect_potential_id_swap              21-24
                                 swap_01     detect_actual_id_id_swap              21-24
Champaign.

Rugaber, Spencer, Stephen B. Ornburn, and Richard
LeBlanc, Jr., 1990. Recognizing Design Decision in
Programs, IEEE Computer, July.

Sartipi, Kamran and Kontongiannis, Kostas 2003. On
modeling software architechture recovery as graph
matching. Proceedings of International Conference on
Software Maintenance, September.

Seemann, Jochen and von Gudenberg, Jurgen 1998.
Pattern-Based Design Recovery of Java Software,
Proceedings of the 6th ACM SIGSOFT International
Symposium on Foundations of Software Engineering, 10-
16.

Wills, Linda 1992. Automated Program Recognition by
Graph Parsing, Ph.D. Dissertation, Artificial Intelligence
Laboratory MIT, July.