Using Graph Matching: Program Recognition of the Selection Sort Algorithm Ronald Finkbine, Ph.D. Indiana University Southeast New Albany, Indiana 47250 rfinkbin@ius.edu Abstract equations. Each of these problems increases the difficulty The field of program understanding attempts to determine in a software maintenance programming attempting to the function of a code segment without programmer understand a software component. Graph matching is intervention and for this to occur, it is necessary to have a considered one of the most complex problems in model (plan) against which to attempt to match the code computing (Bienenstock 1987). segment of interest. This paper traces in detail the pattern recognition of the selection sort algorithm. The first problem, parameter identification, is the most simple. It involves searching the source code for variables Introduction that are assigned values within assignment statements (no The purpose of this research is to develop a general- reads) one time. Any usage, thereafter, is only on the right- purpose algorithm recognition system, capable of hand side of assignment statements and is a reference to recognizing any well-defined and well-written algorithm. the variable, not a modification to the variable. Therefore, This project uses plans (Wills 1992) to recognize common these types of variables, or constants, can be identified by forms (code segments) within existing software in an the parameter statement which indicates their true usage. attempt to gain knowledge about a legacy system (Sartipi 2003) from its source code (Biggerstaff 1990) by matching The second problem, plan recognition, is comprised of against a large defined set of common algorithms, rather identifying code segments that are replaceable by calls to than attempting to deduce what a code segment performs commercial libraries (such as IMSL). This will involve from its specific dataflow (Rugaber et. al. 1990). This detecting codes similar to those used within commercial research uses an intermediate representation of an abstract libraries. syntax tree (AST), standard in compiler toolsets. This AST representation is output as a flattened tree into a fact list, The third problem, duplicate removal, consists of detecting which is the operation underpinning of expert systems. and removing duplicate code to the user’s library. This allows the user to designate a section of code as common Targeted Problems and to look through their remaining programs searching for This research concentrates on design recovery from legacy codes that are copies of the target. software, written in older languages and with fewer techniques applicable to modern software development. The fourth problem, algorithm separation, involves This is because that recently written software is often detection/separation of overlapping algorithms within the written in a more modern language, but this leaves a large same section of code. In Figure 1 it can be seen that there bulk of older, operational software, orphaned to endless are two initializations of arrays occurring within the same software maintenance until it is rewritten. do-loop. This is good for optimizing computer resources, but not for optimizing the programmers’ time for Legacy software, in general, exhibits a number of the understanding and maintaining a program. following problems: 1) parameter identification, 2) identifying code segments that are replaceable by calls to The fifth problem, algorithm aggregation, involves commercial libraries (such as IMSL), 3) removing combining disparate codes into single equations. As duplicate code to user library, 4) separation of intertwined displayed in Figure 2, an equation 1) can be coded in components, and 5) combining disparate codes into single multiple ways. Though the computations are equivalent, DO 10 I = 1, N A(I) = 0 B(I) = 0 10 CONTINUE Figure 1: Intertwined Algorithms Table 1: Selection Sort Algorithm Rule Flow Firing set Prefix Rules First Defines 00 General 00 Expressions 00, 02, 07, 09, 10, 11, 17 Structures 00 Evaluations 01, 02 Second Variables 00, 02, 04, 05, 06, 07, 08 swaps 00, 01 Third loop 00, 01, 02, 03 min 00, 01 Complete SSA 00, 01 the recognition of them must take these variations into the expert system or the order in which they are listed account. within the system. A rule set once started will continue to fire until all rules have attempted to fire one final time with This paper describes a portion of the High-Level none successful. This procedure allows each rule to fire as Algorithm Recognizer (HLAR) project (Finkbine1994), many times as possible, only halting when all have been which recognizes three algorithms selection sort (SSA), unsuccessful during the final pass. quick sort and heap sort from four languages, C, Scheme, Postscript and COBOL. This research is unique in that it SSA Recognition Trace recognizes algorithms of significant size (currently 50 lines One of the algorithms currently recognized is the Selection of code), and detects these algorithms directly from Sort Algorithm (SSA). Figure 2 is a depiction of the multiple third-generation programming languages instead component parts, also known as plans (or sub-plans), of from one language or directly from an intermediate form within the SSA. The minimization plan and the swap plan (Ning 1989). are contained, respectively, with the ssort plan. As well as proper containment and ordering of the sub-plans in this The first step in recognizing common algorithms is to Figure, it is necessary that the plans have identifiers compile the input source program into an intermediate (variables) in common. For example, for this function to representation (Seemann 1998). The bulk of the perform correctly, it is necessary that the indexing variable recognition efforts will be made by CLIPS, a rule-based of the containing for loop be one of the positions of the forward-chaining expert system, therefore the source data structure with the swap plan. These additional programs will be expressed as a facts list (a tree requirements are necessary for the proper execution of the represented as a linked list). Once the CLIPS system is algorithm and its subsequent recognition. initialized and pattern recognition begins, the general flow of the pattern recognition process is listed in Table 1. This section details the recognition of the SSA within HLAR. This algorithm was chosen because it is a common This first phase is the initial fact generation. In the algorithm within computing literature and the computer following discussion, rule names are listed in separate programming community. It has a complex plan structure, phases (or firing sets). This is necessary since in a rule- providing enough challenge to be of interest within the based expert system, rules can fire at the time their program understanding/re-engineering community. Figure conditional elements are satisfied. Within each phase, the 2 depicts the general flow of the recognition process rules can (and do) fire in an order determined within the necessary for the SSA. expert system itself, not the order they are introduced into For Loop through structure (less 1) with “i” index Minimization by position of partial structure Swap in structure of position “I” with position “small” Figure 2: SSA Plans Initial Facts satisfy the requirements for each of the rules within the After a third generation source language program is HLAR system. This section describes the rules that are translated into the intermediate form, a program will fired and the facts they modify by retracting, asserting or traverse it and generate a list of facts that are input into the leaving them alone. Each of the statements referenced in HLAR system. The structure and purpose of these facts this section is from Figure 3, and the rule firing is are further explained in the remainder of this section. In summarized in Table 2. In general, rules within a firing set general, a statement will become a series of facts, roughly can fire in any order; however, some rules in this first equivalent to tokens in traditional compiler technology. firing set generate data that fire other rules. This rule firing set recognizes three categories of rules: expressions, Initial Rules structures and evaluation clauses (used to determine the There are two rules that fire first due to their salience value path of execution within an if- statement and the exit of a regardless of the subject program being examined. Rule loop). gen_00 fires and establishes the number of the maximum generalNode used. Next, the def_00 rule fires, establishing Figure 4 is included to display the intermediate code, the last-general-node field of every defineRoutineNode, produced by a C language parser, which is being thus establishing the span of control of each routine. This recognized. The general layout of a program is a sequence is not possible in a one-pass translator, such as is used to of global variable definitions and assignments followed by generate the fact intermediate form from the standard sub-program definitions. In general, intermediate form intermediate form, which is necessary for input into a rule- statements use prefix notation and each statement is a based expert system. In the case of multiple routines function call followed by the parameters passed to or within a program, the control of each routine extends to the returned from the function. beginning of the next routine. The control of the last routine extends to the last generalNode, the value calculated in the gen_00 rule. For the SSA, the annotated explanation of the recognition process appears in this section. Figure 4 contains the intermediate form code used for explanation of the recognition. In general, processing takes place from the lowest level (tokens and expressions) to higher levels (loops and if statements). Table 1 displays the general rule flow in the SSA recognition. To support the passing of information form rule within HLAR, it is necessary to have a set of abstract data types known as templates. First Rule Firing Set After the facts generated from the SSA program are input, the CLIPS system reviews these facts and attempts to [1] (define-routine sort [2] (parameters (inout numbers) (in count) [9] (assign i 0) [10] (loop [11] (eval (gt I (minus count 1))) [12] (assign big I) [13] (assign j (plus i 1)) [14] (loop [15] (eval (gt j count)) [16] (if [17] (eval (gt (select numbers (key j) (field)) [18] (select numbers (key big) (field))) [19] (assign big j)) [20] (assign j (plus j 1))) [21] (assign temp (select numbers (key big) field))) [22] (assign (select numbers (key big) (field)) [23] (select numbers (key I) (field)) [24] (assign (select numbers (key I) (field)) temp) [25] (assign I (plus I 1)))) Figure 4: SSA Intermediate Form Table 2: SSA Firing Phase One Rule Rule name Statements exp_00 detect_exp_0 9 exp_02 detect_exp_plus_id_1 13, 20, 25 exp_07 detect_exp_gt_id_id 15 exp_09 detect_exp_gtlop_id_min_id_1 11 exp_10 detect_exp_id 12, 17-19, 21-24 exp_11 detect_exp_gt_strucid_strucid 17, 18 exp_17 detect_exp_minus_id_1 11 struc_0 detect_strucref_id 17, 18, 21, 22, 23, 24 0 eval_01 detect_gt_strucid_strucid 17, 18 eval_02 detect_gt_id_id 15 expression in statement 11. Rule eval_02 detects a Special-if comparison of simple identifiers in statement 15. Rule exp_10 detects an expression of a simple identifier in statements 12, 17, 18, 19, 21, 22, 23, and 24. This Potential swap Pot-minimization identification of simple identifiers (variable references) produces output that is part of the conditional input of rule struc_00, which detects an array referenced by a simple Actual swap Act-minimization identifier in statements 17, 18, 21, 22, 23 and 24. Also, rule eval_01 fires, recognizing the comparison of two positions of the same array with the greater-than operator. Potential SSA Second Ruling Firing Set Due to the extensive rule firing that occurs in this set, the rule execution is displayed in Table 3. Rule var_00 detects Actual SSA an identifier occurring on both the left-hand- and right- hand-side of an incrementing assignment in statements 21 Figure 3: SSA CLIPS Template Flow and 25. Rule var_04 recognizes a somewhat similar x = y + 1 statement in 13. Rule var_05 recognizes a simple save- value assignment statement in 12 and 19. Expressions are recognized within the first firing set. They are the least-common denominator of program Rule var_06 recognizes the save-value of an array position understanding and appear on the right-hand side of in statement 21. Rule var_07 recognizes a save between assignment statements and as the single operand of two locations of the same array, and rule var_08 evaluation statements. Their variety (such as x = x + 1 recognizes an assignment from a simple identifier to an versus x = 1 + x) lead to the increased complexity of array position. These three assignment statement rules pattern-matching algorithms. (var_06, var_07 and var_08) produce input to rule swap_00. Swap rule swap_00 fires on statements 21, 22, The rule exp_00 detects an expression of zero in statement 23 and 24. This rule sets a controlling condition that has 9. Rule exp_02 detects an expression of an identifier plus prevented the not- rules from firing. After the assertion, an one in statements 13, 20, and 25. Rule exp_07 detects a interfering statement within the swap segment would be Boolean expression of an identifier greater-than another detected, if one existed. And since there are no interfering identifier in statement 15. Rule exp_17 fires on statement statements rule swap_01 fires successfully. 11 followed by rule exp_09, which detects a complex Table 4: Algorithm rule firings Algorithm Rule Firings Selection Sort 50 Quick Sort 75 Heap Sort 150 The SSA recognition rule, ssort_01 then will fire due to the Third Rule Firing Set correct statement ordering, proper containment and non- Loop rule loop_00 fires on statements 9, 10, 11, 5, interference. recognizing the index variable initialization, increment, and testing. This is followed by rule loop_01 firing on Summary these same statements since there is no interference with The HLAR system currently recognizes the three the index variable within the loop statement and all algorithms (written in the C programming language) in the statements are within the same routine. Loop rule loop_02 number of rule firings listed in Table 4. In addition, it has fires on statements 13, 14, 15, 20, recognizing the index recognized the SSA in the COBOL, Scheme and Postscript variable initialization by an expression, increment and programming languages. Currently, this project is being testing. This is followed by loop rule loop_03 which fires redesigned which will involve a platform change in order on statements 13, 14, 15, 20 since there is no interference to build a more appropriate GUI as well as to be able to of the index variable within the loop statement and all distribute the recognition tasks across a network. statements are within the same routine. To limit the need for outside assistance from a programmer Minimum rule min_00 fires on statements 13 through 20, (Ning 1989), the HLAR system has been designed (and recognizing the form of a degenerative minimization by redesigned) to accept multiple forms of algorithms. Future position. This is followed by rule min_01, which fires on work includes development of a subsystem to construct the statements 13 through 20, which verifies the statements plans by compiling from source and to not have the have the correct ordering, non-interference of variables and recognition rules written expressly by a programmer. proper containment. References The variable count was established as the variable that Biggerstaff, Ted, 1990. Design Recovery for Maintenance contains the initialized length or the number of structure and Reuse, IEEE Computer: July. positions with actual values that are not undefined. This must be input by the user and not performed automatically Bauer, D., S. L. Hakimi, and E. Schmeichel, 1990. by HLAR. In the future, it will be part of the system, but Recognizing Tough Graphs is NP-Hard: Discrete Applied would involve recognition of additional algorithms that are Mathematics:28, 191.195. not currently part of this research. Bienenstock, E and von der Malsburg, C.1987. A neural Completion of Rule Firing network for invariant pattern recognition, Europhysics SSA recognition rule potential_ssort_00 recognizes the Letters: 4 121-126. form of a containing-loop, a contained-degenerative- minimization-plan and a swap plan with all identifiers Finkbine, Ronald 1994. B., Recognition of High-Level matching appropriately. This triggers a search for Algorithms, Ph.D. Dissertation, Department of Computer interfering statements that will hopefully find no reason to Science, New Mexico Institute of Mining and Technology. terminate the search. An example of which would be an intervening statement that sets the loop-indexing variable Ning, Jim Qun 1989. A Knowledge-Based Approach to to zero (illegal in Pascal, legal in C). Program Analysis, Ph.D. Dissertation, Department of Computer Science, University of Illinois at Urbana- Table 3: SSA Firing SetTwo Rule Rule name Stmt var_00 detect_assign_sca_inc_1_self 21, 25 var_02 detect_assign_sca_array_base_c 9 var_04 detect_assign_sca_inc_1_other 13 var_05 detect_assign_sca_sca 12, 19 var_06 detect_assign_sca_struid 21 var_07 detect_assign_strucid_strucid 22, 23 var_08 detect_assign_struciddd_sca 24 swap_00 detect_potential_id_swap 21-24 swap_01 detect_actual_id_id_swap 21-24 Champaign. Rugaber, Spencer, Stephen B. Ornburn, and Richard LeBlanc, Jr., 1990. Recognizing Design Decision in Programs, IEEE Computer, July. Sartipi, Kamran and Kontongiannis, Kostas 2003. On modeling software architechture recovery as graph matching. Proceedings of International Conference on Software Maintenance, September. Seemann, Jochen and von Gudenberg, Jurgen 1998. Pattern-Based Design Recovery of Java Software, Proceedings of the 6th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 10- 16. Wills, Linda 1992. Automated Program Recognition by Graph Parsing, Ph.D. Dissertation, Artificial Intelligence Laboratory MIT, July.