Design Pattern Detection using Software Metrics and Machine Learning

Satoru Uchiyama, Hironori Washizaki, Yoshiaki Fukazawa
Dept. Computer Science and Engineering, Waseda University, Tokyo, Japan
s.uchiyama1104@toki.waseda.jp, washizaki@waseda.jp, fukazawa@waseda.jp

Atsuto Kubo
Aoyama Media Laboratory, Tokyo, Japan
kubo@nii.ac.jp

Abstract—The understandability, maintainability, and reusability of object-oriented programs could be improved by automatically detecting well-known design patterns in programs. Many existing detection techniques are based on static analysis and use strict conditions composed of class structure data. Hence, it is difficult for them to detect design patterns in which the class structures are similar, and it is difficult for them to deal with diversity in design pattern applications. We propose a design pattern detection technique that uses metrics and machine learning. Our technique judges candidates for the roles that compose design patterns by using machine learning together with metric measurements, and it detects design patterns by analyzing the relations between candidates. It suppresses false negatives and distinguishes patterns in which the class structures are similar. We conducted experiments showing that our technique was more accurate than two previous techniques.

Keywords—Object-oriented software; Design pattern; Software metrics; Machine learning

I. INTRODUCTION

Design patterns (hereafter, patterns) are descriptions of communicating classes that form a common solution to a common design problem. The Gang of Four (GoF) patterns [1] are representative patterns for object-oriented software. Patterns are composed of classes that describe the roles and abilities of objects. For example, Figure 1 shows one GoF pattern, the State pattern, which is composed of roles named Context, State, and ConcreteState. The use of patterns enables software development with high maintainability, high reusability, and improved understandability, and it facilitates smooth communication between developers.

Programs implemented by third parties and open-source software may take a long time to understand, and patterns may have been applied without explicit class names, comments, or attached documents. Pattern detection therefore improves the understandability of such programs. However, manually detecting patterns in existing programs is inefficient, and patterns may be overlooked.

Many studies on automatic pattern detection address these problems with static analysis. However, static analysis has difficulty identifying patterns whose class structures are similar and patterns with few distinguishing features. In addition, developers might still overlook patterns when strict conditions such as class structure analysis are used and the applied patterns deviate from the intended conditions even slightly.

We propose a pattern detection technique that uses software metrics (hereafter, metrics) and machine learning. Although our technique can be classified as a type of static analysis, unlike previous detection techniques it detects patterns by using identifying elements derived by machine learning from metric measurements, without strict condition descriptions (class structure data, etc.). A metric is a quantitative measure of a software property that can be used to evaluate software development; for example, the number of methods (NOM) metric counts the methods in a class [2]. Moreover, by using machine learning we can in some cases obtain previously unknown identifying elements from combinations of metrics. Because the results of our technique may depend on the kind and number of learning data used during the machine learning process, our method uses a variety of learning data to cover a diverse range of pattern applications. Finally, we conducted experiments comparing our technique with two previous techniques and found that our approach was the most accurate of the three.

II. PREVIOUS DESIGN PATTERN DETECTION TECHNIQUES AND THEIR PROBLEMS

Most existing detection techniques use static analysis [3][4]. These techniques chiefly analyze information, such as class structures, that satisfies certain conditions.
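As a concrete reference for the State pattern named in the Introduction (Figure 1), the three roles can be sketched in a few lines of Java. The class names below are illustrative only; they are not taken from the paper's experimental data.

```java
// Minimal State pattern sketch: the Context role delegates to a State
// object, and ConcreteState implementations switch the Context's state.
interface State {                      // State role
    String handle(Context ctx);
}

class OffState implements State {      // ConcreteState role
    public String handle(Context ctx) {
        ctx.setState(new OnState());   // state transition
        return "off";
    }
}

class OnState implements State {       // ConcreteState role
    public String handle(Context ctx) {
        ctx.setState(new OffState());
        return "on";
    }
}

class Context {                        // Context role
    private State state = new OffState();
    void setState(State s) { state = s; }
    String request() { return state.handle(this); }
}
```

Note that the Strategy pattern (Figure 2) has the same class structure; the difference lies in how the Context uses and replaces the delegate, which is exactly why structure-only detection struggles to tell the two apart.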
If the applied patterns deviate from the intended strict conditions even slightly, or if two or more roles are assigned to one class, developers might overlook patterns.

One technique detects patterns based on the degree of similarity between graphs of the pattern structures and graphs of the programs to be analyzed [3]. However, distinguishing the State pattern from the Strategy pattern is difficult for this technique because their class structures are similar (see Figures 1 and 2). Unlike this method, we distinguish structurally similar patterns by identifying the roles from the quantities and ratios of metric measurements through machine learning. This technique [3] is available to the public as a web-based tool.

Another technique outputs pattern candidates based on features derived from metric measurements [5]. However, it requires manual confirmation: it can roughly identify pattern candidates, but the final choice depends on the developer's skill. Our technique requires no manual filtering; it detects patterns not only by metrics and machine learning but also by analyzing class structure information. Moreover, that technique uses general metrics concerning an object-oriented system without assigning metrics to each role, whereas our technique uses metrics specialized for each role.

Another existing technique improves precision by filtering detection results using machine learning, taking as input measurements of the classes and methods of each pattern [6]. However, it relies on an existing static analytical approach, whereas our technique uses machine learning throughout the entire process.

One current technique analyzes programs both before and after patterns are applied [7]. This method requires a revision history of the programs. Our technique detects patterns by analyzing only the current programs.

Yet another approach detects patterns from the class structure and behavior of a system after classifying its patterns [8][9]. It is difficult to use, however, when patterns are applied more than once and when pattern application is diverse. In contrast, our technique copes well with both of these challenges.

Other detection techniques use dynamic analysis, identifying patterns by referring to the execution-route information of programs [10][11]. However, it is difficult to analyze the entire execution route and to use fragmentary class sets in such an analysis. Additionally, the results of dynamic analysis depend on the representativeness of the executed sequences.

Some detection techniques use a multilayered (multiphase) approach [12][13]. Lucia et al. use a two-phase static analysis [12]. This method has difficulty, however, in detecting creational and behavioral patterns because it analyzes only pattern structures and source-code-level conditions. Guéhéneuc et al. use DeMIMA, an approach that consists of three layers: two layers to recover an abstract model of the source code, including binary class relationships, and a third layer to identify patterns in the abstract model [13]. However, DeMIMA has difficulty distinguishing the State pattern from the Strategy pattern because their structures are identical. Our technique can detect patterns in all categories and can distinguish the State pattern from the Strategy pattern by using metrics and machine learning.

Finally, one existing technique detects patterns using formal OWL (Web Ontology Language) definitions [14]. However, false negatives arise because this technique does not accommodate the diversity in pattern applications. The technique [14] is available to the public via the web as an Eclipse plug-in.

We suppress false negatives by using metrics and machine learning to accommodate diverse pattern applications and to distinguish patterns in which the class structures are similar. It should be noted that, of the techniques discussed above, only [3] and [14] have been released as publicly accessible tools.

Figure 1. State pattern

Figure 2. Strategy pattern

III. OUR TECHNIQUE

Our technique is composed of a learning phase and a detection phase. The learning phase consists of three processes and the detection phase of two processes, as shown in Figure 3. Each process is described below. Two parties are involved: pattern specialists, who have knowledge about patterns, and developers, who maintain the object-oriented software. Our technique currently targets Java programs.

[Learning Phase]

P1. Define Patterns: Pattern specialists determine the detectable patterns and define the structures and roles composing these patterns.

P2. Decide Metrics: Pattern specialists determine useful metrics for judging the roles defined in P1 by using the Goal Question Metric (GQM) decision technique.

P3. Machine Learning: Pattern specialists input programs to which patterns have been applied into the metrics measurement system and obtain measurements for each role. The specialists then input these measurements into the machine learning simulator for learning. After machine learning, they verify the judgment for each role; if the verification results are not good, they return to P2 and revise the metrics.

Figure 3. Process of our technique
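The metric measurements that feed the learning phase can be sketched with Java reflection. The counter below is our simplified illustration, not the paper's metrics measurement system; the abbreviations follow Table II, and the measured class mirrors the AbstractClass-role example of Figure 5.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Illustrative metric counters for a class under analysis:
// NOM = number of methods, NOAM = number of abstract methods,
// NSF = number of static fields (abbreviations as in Table II).
class MetricCounter {
    static int nom(Class<?> c) {
        return c.getDeclaredMethods().length;
    }
    static int noam(Class<?> c) {
        int n = 0;
        for (Method m : c.getDeclaredMethods())
            if (Modifier.isAbstract(m.getModifiers())) n++;
        return n;
    }
    static int nsf(Class<?> c) {
        int n = 0;
        for (Field f : c.getDeclaredFields())
            if (Modifier.isStatic(f.getModifiers())) n++;
        return n;
    }
}

// A tiny AbstractClass-role example to measure: two abstract
// primitive operations plus one template method that calls them.
abstract class Report {
    abstract String header();
    abstract String body();
    String print() { return header() + body(); }   // template method
}
```

Measuring Report yields NOM = 3 and NOAM = 2, the same metric profile as the role candidate judgment example shown later in Figure 8.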
[Detection Phase]

P4. Role Candidate Judgment: Developers input the programs to be analyzed into the metrics measurement system and obtain measurements for each class. The developers then input these measurements into the machine learning simulator, which identifies role candidates.

P5. Pattern Detection: Developers input the role candidates judged in P4 into the pattern detection system together with the pattern structure definitions from P1. This system detects patterns automatically. The structure definitions correspond to the letters P, R, and E of Section III-B.

A. Learning Phase

P1. Define Patterns

Currently, our technique considers five GoF patterns (Singleton, TemplateMethod, Adapter, State, and Strategy) comprising 12 roles. The GoF patterns are grouped into creational patterns, structural patterns, and behavioral patterns; the chosen patterns cover all three groups.

P2. Decide Metrics

Pattern specialists decide on useful metrics for judging roles by using the Goal Question Metric decision technique [15] (hereafter, GQM). GQM is a top-down approach used to clarify the relations between goals and metrics. We experimented with judging roles by using general metrics without GQM; however, the machine learning results were unsatisfactory because the measurements of some metrics were irregular. Consequently, we adopted GQM so that the machine learning could function appropriately with metrics that are stable for each role.

With our technique, the pattern specialists set the accurate judgment of each role as the goal. To achieve this goal, they define a set of questions to be evaluated and then decide on useful metrics that help answer those questions. The specialists choose metrics that identify roles by the quantity and ratio of measurements; therefore, they formulate the questions by paying attention to the attributes and operations of the roles as given in the description of the pattern definition. They first decide on simple metrics concerning static aspects, such as structure, to improve recall. Because GQM is qualitative, however, questions may be lacking; if the machine learning results are unsatisfactory owing to irregular metric measurements, the procedure loops back to P2 to reconsider the metrics, including behavioral ones. Moreover, GQM can be applied to the roles of new patterns in the future.

For example, Figure 4 illustrates the goal of judging the AbstractClass role in the TemplateMethod pattern. An AbstractClass role has abstract methods, or methods whose logic calls abstract methods, as shown in Figure 5. The AbstractClass role can be distinguished by the ratio of the number of methods to the number of abstract methods, because in this role the former exceeds the latter. Therefore, the number of abstract methods (NOAM) and the number of methods (NOM) are useful metrics for judging this role. Table I shows the results of applying GQM to all roles. The metrics are detailed in Table II of Section IV-A.

The previous technique [5] also uses GQM, with the goal set as "Recover design patterns"; it uses general metrics concerning an object-oriented system without deciding metrics for each role. In contrast, our technique uses metrics specialized for each role.

Figure 4. Example of GQM (AbstractClass role)

Figure 5. Example of source code (AbstractClass role)

TABLE I. RESULTS OF APPLYING GQM

Pattern        | Goal                                     | Question                                                | Metric
Singleton      | Identification of Singleton role         | Is a static field defined?                              | NSF
               |                                          | Is the constructor called from other classes?           | NOPC
               |                                          | Is there a method that returns the singleton instance?  | NSM
TemplateMethod | Identification of AbstractClass role     | Are abstract methods defined?                           | NOAM
               |                                          | Is the template method defined?                         | NOM
TemplateMethod | Identification of ConcreteClass role     | Are overriding methods defined?                         | NORM
Adapter        | Identification of Target role            | Are abstract methods defined?                           | NOAM
               |                                          | Is the class defined as an interface?                   | NOI
Adapter        | Identification of Adapter role           | Are overriding methods defined?                         | NORM
               |                                          | Is the Adaptee field defined?                           | NOF, NOOF
Adapter        | Identification of Adaptee role           | Are methods used by the Adapter role defined?           | NOM
               |                                          | Is the class referred to by other classes?              | NCOF
State          | Identification of Context role           | Are methods to set states defined?                      | NOM
               |                                          | Is the State field defined?                             | NOF, NOOF
State          | Identification of State role             | Are abstract methods defined?                           | NOAM
               |                                          | Is the class defined as an interface?                   | NOI
               |                                          | Is the class referred to by other classes?              | NCOF
State          | Identification of ConcreteState role     | Are overriding methods defined?                         | NORM
               |                                          | Is a method that changes the state defined?             | NOM, NMGI
Strategy       | Identification of Context role           | Are methods to set states defined?                      | NOM
               |                                          | Is the Strategy field defined?                          | NOF, NOOF
Strategy       | Identification of Strategy role          | Are abstract methods defined?                           | NOAM
               |                                          | Is the class defined as an interface?                   | NOI
               |                                          | Is the class referred to by other classes?              | NCOF
Strategy       | Identification of ConcreteStrategy role  | Are overriding methods defined?                         | NORM

P3. Machine Learning

Machine learning analyzes sample data by computer and acquires useful rules with which to make forecasts about unknown data. We use machine learning so as to be able to evaluate patterns with a variety of application forms; it suppresses false negatives and achieves extensive detection.

Our technique uses a neural network algorithm [16]. A support vector machine [16] could also be used to separate two groups by linear input elements; however, we chose a neural network because it outputs values for all roles while taking into consideration the dependencies among the different metrics. It can therefore deal with cases in which one class plays two or more roles.

A neural network is composed of an input layer, hidden layers, and an output layer, as shown in Figure 6, and each layer is composed of elements called units. Values are weighted as they move from unit to unit, and a judgment rule is acquired by adjusting the weights. A typical algorithm for adjusting the weights is back propagation, which calculates the error margin between the output result y and the correct answer T and sequentially adjusts the weights from the layer nearest the output toward the input layer, as shown in Figure 7. The weights are adjusted until the output error margin of the network falls below a certain value.
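The forward pass of the kind of three-layer network described above can be sketched as follows. The weights here are arbitrary placeholders chosen for illustration, not learned values; in the actual technique, back propagation would adjust them against labeled role data.

```java
// Minimal three-layer feed-forward pass with sigmoid units.
// Inputs are metric measurements; outputs are per-role scores.
class TinyNet {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // One layer: out[j] = sigmoid(sum_i in[i] * w[i][j])
    static double[] layer(double[] in, double[][] w) {
        double[] out = new double[w[0].length];
        for (int j = 0; j < out.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < in.length; i++) sum += in[i] * w[i][j];
            out[j] = sigmoid(sum);
        }
        return out;
    }

    // Input layer -> hidden layer -> output layer.
    static double[] forward(double[] metrics, double[][] wIn, double[][] wOut) {
        return layer(layer(metrics, wIn), wOut);
    }
}
```

With all-zero inputs, each hidden unit outputs sigmoid(0) = 0.5, so the network still emits a value for every role; this is the property the paper exploits to score all roles at once rather than making a single binary decision.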
Our technique uses a hierarchical neural network simulator [17] that implements back propagation. The number of layers in the network is set to three, the numbers of units in the input layer and the hidden layer are set to the number of decided metrics, and the number of units in the output layer is set to the number of roles to be judged. The input consists of the metric measurements of each role in a program to which patterns have already been applied, and the output is the expected role. Pattern specialists obtain measurements for each role by using the metrics measurement system and input these measurements into the machine learning simulator for learning. The learning repetitions cease when the error margin curve of the simulator converges; at present, the specialists verify the convergence manually. After machine learning, they verify the judgment for each role, and if the verification results are not good, they return to P2 and revise the metrics.

Figure 6. Neural network

Figure 7. Back propagation

B. Detection Phase

P4. Role Candidate Judgment

Developers input the programs to be analyzed into the metrics measurement system and obtain measurements for each class, and then input these measurements into the machine learning simulator. The simulator outputs a value between 0 and 1 for every role to be judged. The sum of the raw output values differs for each input to the neural network; therefore, to use a common threshold for all outputs, our technique normalizes the output values so that their sum becomes 1. These normalized output values are called role agreement values. A larger role agreement value means that the role candidate is more likely to be correct. The reciprocal of the number of roles to be detected is set as a threshold, and the roles whose agreement values are higher than the threshold are taken as role candidates. The threshold is 1/12 (i.e., about 0.0834) because we treat 12 roles at present. For example, Figure 8 shows the role candidate judgment results for a class with an NOM of 3, an NOAM of 2, and all other metrics 0; the output value of AbstractClass is the highest. After normalizing the values in Figure 8, the roles are judged to be AbstractClass and Target.

Figure 8. Example of machine learning output

P5. Pattern Detection

Developers input the role candidates judged in P4 into the pattern detection system together with the pattern structure definitions from P1. The system detects patterns by matching the directions of the relations between role candidates against the relations between the roles of the pattern. The matching proceeds sequentially from the role candidate with the highest agreement value to the one with the lowest. The pattern detection system searches all combinations of role candidates that accord with the pattern structures, and it detects a pattern when the directions of the relations between role candidates accord with the pattern structure and the role candidates accord with the roles at both ends of the relations.

Moreover, relation agreement values reflect the kind of relation. Currently, our method deals with inheritance, interface implementation, and aggregation relations; further kinds of relations will be added as more patterns are supported. The relation agreement value is 1.0 when the kind of relation agrees with that of the pattern, and 0.5 when it does not. If the relation agreement value were instead 0 when the kind of relation does not agree, the pattern agreement value might become 0 and the classes would not be detected as a pattern; a problem similar to that of the previous strict-condition techniques would then occur, because any difference in the kind of relation would rule the pattern out.

The pattern agreement value is calculated from the role agreement values and the relation agreement values. The pattern to be detected is denoted as P, the set of roles composing the pattern as R, and the set of relations as E. The program that is the target of detection is denoted as P', the set of classes that are role candidates as R', and the set of relations between elements of R' as E'. Role denotes the function that takes an element of R and an element of R' and returns the role agreement value, and Rel denotes the function that takes an element of E and an element of E' and returns the relation agreement value. Com denotes the product of the relation agreement value and the average of the role agreement values of the two roles at the ends of the relation, and Pat denotes the average of Com. The average of the two Role values is taken when Com is calculated, and the average of the Com values is taken when Pat is calculated, so that Com and Pat stay in the same 0-to-1 range as Role. If the directions of the relations do not agree, Rel is taken to be 0. Formally:

P = (R, E),  P' = (R', E')
R = {r1, r2, ..., ri},  R' = {r'1, r'2, ..., r'k}
E = {e1, e2, ..., ej} ⊆ R × R,  E' = {e'1, e'2, ..., e'l} ⊆ R' × R'
Role(rm, r'n) = the role agreement value  (rm ∈ R, r'n ∈ R')
Rel(ep, e'q) = the relation agreement value  (ep ∈ E, e'q ∈ E')
Com(ep, e'q) = ((Role(ra, r'b) + Role(rc, r'd)) / 2) × Rel(ep, e'q)
    where ra, rc ∈ R; r'b, r'd ∈ R'; ep = (ra, rc); e'q = (r'b, r'd)
Pat(P, P') = the average of Com(ep, e'q) over all pairs (ep, e'q) ∈ E × E' with Rel(ep, e'q) ≠ 0

Figure 9 shows an example of detecting the TemplateMethod pattern. In this example, it is assumed that class SampleA has the highest role agreement value for AbstractClass. The pattern agreement value between the program Samples and the TemplateMethod pattern is calculated as follows:

P = TemplateMethod = (R, E)
R = {AbstractClass, ConcreteClass}
E = {AbstractClass → ConcreteClass}

Samples = (R', E')
R' = {SampleA, SampleB, SampleC}
E' = {SampleA → SampleB, SampleA ◇ SampleC}
    (→ : inheritance, ◇ : aggregation)

Role(AbstractClass, SampleA) = 0.82
Role(ConcreteClass, SampleB) = 0.45
Role(ConcreteClass, SampleC) = 0.57
Rel(AbstractClass → ConcreteClass, SampleA → SampleB) = 1.0
Rel(AbstractClass → ConcreteClass, SampleA ◇ SampleC) = 0.5

Com(AbstractClass → ConcreteClass, SampleA → SampleB) = ((0.82 + 0.45) / 2) × 1.0 = 0.635
Com(AbstractClass → ConcreteClass, SampleA ◇ SampleC) = ((0.82 + 0.57) / 2) × 0.5 = 0.348
Pat(TemplateMethod, Samples) = (0.635 + 0.348) / 2 = 0.492

Figure 9. Example of pattern detection (TemplateMethod pattern)

In the program shown in Figure 9, the pattern agreement value of the TemplateMethod pattern is thus 0.492. Pattern agreement values range from 0 to 1, just like role agreement values. Our technique applies the same kind of threshold to pattern agreement values as to role agreement values, because many classes are otherwise detected for patterns composed of only one role, such as the Singleton pattern. The reciprocal of the number of roles to be detected is taken as the threshold, as in role candidate judgment, and the patterns whose agreement values are higher than the threshold are output as the detection result.

In Figure 9, SampleA, SampleB, and SampleC were detected as a TemplateMethod pattern. Moreover, the pairs SampleA and SampleB, and SampleA and SampleC, can also be considered to match the TemplateMethod pattern. In this case, the relation "SampleA → SampleB" is more similar to the TemplateMethod pattern than the relation "SampleA ◇ SampleC", because the agreement value of the former pair is 0.635 whereas that of the latter pair is only 0.348.
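The Com and Pat calculations for the Figure 9 example can be checked with a direct transcription into Java:

```java
// Pattern agreement computation: Com is the average of the two role
// agreement values at the ends of a relation, multiplied by the
// relation agreement value; Pat is the average of Com over the
// relations with a nonzero Rel.
class Agreement {
    static double com(double roleA, double roleB, double rel) {
        return (roleA + roleB) / 2.0 * rel;
    }
    static double pat(double[] coms) {
        double sum = 0.0;
        for (double c : coms) sum += c;
        return sum / coms.length;
    }
}
```

For the Figure 9 values this gives Com = 0.635 for SampleA → SampleB, Com = 0.3475 for SampleA ◇ SampleC, and Pat = 0.49125, which the paper rounds to 0.348 and 0.492.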
IV. EVALUATION AND DISCUSSION

We determined whether the machine learning simulator derived identifying elements of the roles after learning. Moreover, we compared our technique with two previous techniques to verify the precision and recall of our approach and to confirm whether it could detect patterns with similar structures and diverse applications.

A. Verification of Role Candidate Judgment

We used cross-validation to verify the role candidate judgment. In cross-validation, the data are divided into n groups, and a test of the role candidate judgment is executed with one group as testing data and the remaining n−1 groups as learning data. We executed the test five times by dividing the data into five groups. In this paper, programs such as those in reference [18] are called small-scale, whereas programs in practical use are called large-scale. As experimental data we used a set of programs to which patterns are applied: 60 small-scale programs in total (1) [18][19] and 158 large-scale programs in total, drawn from the Java library 1.6.0_13 [20], JUnit 4.5 [21], and Spring Framework 2.5 RC2 [22]. We judged manually and qualitatively whether the patterns were appropriately applied in this set of programs.

(1) All small-scale code: http://www.washi.cs.waseda.ac.jp/ja/paper/uchiyama/dp.html

Table II shows the metrics that were chosen for the small-scale and large-scale programs; we used different metrics depending on the scale of the programs. For instance, we chose the metric number of methods generating instances (NMGI) for the small-scale programs because, in small-scale programs, the state-transition method of the ConcreteState role in the State pattern generates other ConcreteState objects. We surmised that the difference between the State and Strategy patterns appears in the ratios of the metrics, so for the large-scale programs we used the same metrics without NMGI. The State pattern keeps the states in the State role and the actions on those states in the Context role, whereas the Strategy pattern encapsulates the processing of each algorithm in a Strategy role, so the processing of its Context is simpler than that of the State pattern.

TABLE II. CHOSEN METRICS

Abbreviation | Content
NOF          | Number of fields
NSF          | Number of static fields
NOM          | Number of methods
NSM          | Number of static methods
NOI          | Number of interfaces
NOAM         | Number of abstract methods
NORM         | Number of overridden methods
NOPC         | Number of private constructors
NOTC         | Number of constructors with argument of object type
NOOF         | Number of object fields
NCOF         | Number of other classes with field of own type
NMGI         | Number of methods to generate instances

We focused our attention on recall because the purpose of our technique is detection that covers diverse pattern applications. Recall indicates the degree to which the detection results are free of omissions, whereas precision shows how free of false matches the results are. The data in Table III were used to calculate recall; wr, xr, yr, and zr are numbers of roles, and wp, xp, yp, and zp are numbers of patterns. Recall of the role candidate judgment was calculated from the data in Table III by the following expression:

Recall of role candidate judgment: Re_r = wr / (wr + xr)

TABLE III. CONTINGENCY TABLE OF DETECTION RESULTS

          | detected                 | not detected
correct   | wr, wp (true positive)   | xr, xp (false negative)
incorrect | yr, yp (false positive)  | zr, zp (true negative)

Table IV shows the average recall for each role. Role candidates must be judged accurately because the State and Strategy patterns have the same class structure. Therefore, our technique regards the roles of patterns other than State and Strategy as role candidates when the role agreement value is above the threshold, whereas it regards the roles of the State and Strategy patterns as role candidates only when the role agreement value is above the threshold and the roles of the two patterns are distinguished from each other.

TABLE IV. RECALL OF ROLE CANDIDATE JUDGMENT (AVERAGE)

Pattern        | Role             | Small-scale (%) | Large-scale (%)
Singleton      | Singleton        | 100.0           | 84.7
TemplateMethod | AbstractClass    | 100.0           | 88.6
TemplateMethod | ConcreteClass    | 100.0           | 58.5
Adapter        | Target           | 90.0            | 75.2
Adapter        | Adapter          | 100.0           | 66.7
Adapter        | Adaptee          | 90.0            | 60.9
State          | Context          | 60.0            | 70.0
State          | State            | 60.0            | 46.7
State          | ConcreteState    | 82.0            | 46.6
Strategy       | Context          | 80.0            | 55.3
Strategy       | Strategy         | 100.0           | 76.7
Strategy       | ConcreteStrategy | 100.0           | 72.4

As shown in Table IV, the recalls for the large-scale programs were lower than those for the small-scale programs. Accurate judgment of large-scale programs is more difficult because these programs possess attributes and operations that are unnecessary for pattern composition. It will therefore be necessary to collect a significant amount of learning data to adequately cover the variety of large-scale programs.

The results in Table IV pertain to instances where the State and Strategy patterns could be distinguished. The Context role had high recall, but the State and ConcreteState roles had especially low recalls for the large-scale programs. However, the candidates for the State role were output with high recall when the threshold was exceeded. Therefore, the State pattern can be distinguished by initiating the search from the Context role in P5, and this improves recall.
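The recall expression above, and the pattern-level precision and recall used in Section IV-B, are simple ratios over the Table III counts. The counts in the check below are made-up examples for illustration, not the experiment's data.

```java
// Recall and precision over the Table III contingency counts:
// tp = detected and correct, fn = not detected but correct,
// fp = detected but incorrect.
class Measures {
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }
}
```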
The technique Our technique detects patterns using test data in both the proposed by Tsantails’s technique [3](hereafter, TSAN) has small-scale and large-scale programs, and this result is four patterns in common with ours (Singleton, evaluated. We used 40 sets of programs where patterns are TemplateMethod, Adapter and State/Strategy). applied in small-scale programs and 126 sets of programs Because this technique cannot distinguish the State pattern where patterns are applied in large-scale programs as from the Strategy pattern, these are detected as one learning data. We judged manually and qualitatively whether pattern. Dietrich’s technique [14] (hereafter, DIET) has three the patterns were appropriately applied in the detection patterns in common (Singleton, TemplateMethod, results. Table V shows precision and recall of the detected Adapter) with our own. TSAN detects patterns based on patterns. Precision and recall were calculated from the data the degree of similarity between the graphs of the pattern in Table III by the following expressions: structure and graphs of the programs to be detected, whereas DIET detects patterns by using formal OWL (Web Ontology wp Recall of pattern detection : Re p  Language) definitions. Patterns were detected and evaluated wp  x p with the small-scale and large-scale test data. Moreover, the wp test data and learning data were different. Precision of pattern detection : Pr p  wp  y p Figure 10 shows the recall and precision graphs for our Small-scale and large-scale programs shared a common technique and TSAN, and Figure 11 shows the point in that they both had recalls that were higher than corresponding graphs for our technique and DIET. We precisions. However, there were many non-agreements about ranked the detection results of our technique with the pattern the State patterns and Strategy patterns in the large- agreement values. Next, we calculated recall and precision scale programs. 
Recall and precision were calculated from the data in Table III by using the expressions of Section IV-B. For TSAN and DIET we plotted the results as output, because these previous techniques produce no value by which their results can be ranked. In the recall and precision graphs, higher values are better.

Figures 10 and 11 show particularly good results for all techniques when the small-scale programs were examined. This is because the small-scale programs do not include unnecessary attributes and operations in the composition of their patterns.

Tables VI and VII show the average F measure for each plot of Figures 10 and 11.

Figure 10. Recall-precision graph of detection results (vs. TSAN)
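The ranking, plotting, and averaging steps can be sketched as follows. The agreement values and correctness flags are hypothetical illustrations, not data from the experiment; F here denotes the harmonic mean of precision and recall, 2 * Pr * Re / (Pr + Re):

```python
# Detection results: (pattern agreement value, whether the detection is a correct application).
detections = [(0.95, True), (0.90, True), (0.80, False), (0.75, True), (0.60, False)]
total_correct = 3  # hypothetical number of pattern instances actually present

ranked = sorted(detections, key=lambda d: d[0], reverse=True)
curve = []
for cutoff in range(1, len(ranked) + 1):
    wp = sum(1 for _, ok in ranked[:cutoff] if ok)   # true positives among the top-ranked
    yp = cutoff - wp                                 # false positives
    xp = total_correct - wp                          # false negatives
    recall, precision = wp / (wp + xp), wp / (wp + yp)
    curve.append((recall, precision))                # one plotted point per rank cutoff

# One F measure per plotted point, then their average, as reported in Tables VI and VII.
f_values = [2 * p * r / (p + r) for r, p in curve]
print([round(f, 2) for f in f_values])
print(round(sum(f_values) / len(f_values), 2))
```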
The F measure is calculated from the recall and precision given by the above-mentioned expressions as follows:

F-measure = (2 x Pr_p x Re_p) / (Pr_p + Re_p)

A large F measure means higher accuracy, and these tables show that our technique had a larger F measure than the previous techniques. Our technique is therefore superior to the previous ones, as its curve lies above theirs in Figures 10 and 11.

TABLE VI. THE AVERAGE OF F MEASURE (VS. TSAN)

                          | Small-scale programs | Large-scale programs
Our technique             | 0.67 | 0.56
Previous technique (TSAN) | 0.39 | 0.36

TABLE VII. THE AVERAGE OF F MEASURE (VS. DIET)

                          | Small-scale programs | Large-scale programs
Our technique             | 0.69 | 0.55
Previous technique (DIET) | 0.50 | 0.35

TABLE VIII. MEASUREMENTS OF THE CONTEXT ROLE'S METRICS

Pattern - Role     | Number of fields | Number of methods
State - Context    | 12 |  58
                   | 45 | 204
                   | 11 |  72
Strategy - Context | 18 |  31
                   |  3 |  16
                   |  3 |   5

Figure 11. Recall-precision graph of detection results (vs. DIET)

Detection of Subspecies of Patterns

Figure 11 shows that the recall of DIET is low in the case of the large-scale programs because that technique does not accommodate the diversity in pattern applications. Additionally, large-scale programs contain not only many attributes and operations in the composition of their patterns but also subspecies of the patterns. Our technique detected such subspecies. For example, it detected source code of the Singleton pattern that used a boolean variable, as shown in Figure 12; this Singleton was not detected by TSAN or DIET. However, unlike the previous techniques, our technique is affected by false positives because it performs gradual detection using metrics and machine learning instead of strict conditions. False positives of the Singleton pattern stood out especially, because the Singleton pattern is composed of only one role. It will be necessary to use metrics specialized to one or a few roles to make judgments about patterns composed of a single role, such as the Singleton pattern (P4).
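The Singleton subspecies just described can be sketched as follows. This is an illustrative Python analogue, not the Java source of Figure 12; the class name `Config` is hypothetical. The point is that instance creation is guarded by a boolean flag rather than by the null check that strict structural detectors expect:

```python
class Config:
    """Singleton subspecies: creation guarded by a boolean flag."""
    _created = False   # boolean guard, as in the detected subspecies
    _instance = None   # a strict detector would expect only an `is None` check

    @classmethod
    def get_instance(cls):
        if not cls._created:
            cls._instance = Config()
            cls._created = True
        return cls._instance

a = Config.get_instance()
b = Config.get_instance()
print(a is b)  # True: only one instance is ever created
```

A condition-based detector matching the canonical `_instance is None` test misses this variant, while metric-based judgment of the Singleton role can still flag it.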
Distinction between State Pattern and Strategy Pattern

Our technique distinguished the State pattern from the Strategy pattern. Table VIII is an excerpt of the metrics measurements for the Context role in the State and Strategy patterns that were distinguished in the experiment on the large-scale programs. The State pattern keeps the states in the State role and handles the actions of the states in the Context role, whereas the Strategy pattern encapsulates the processing of each algorithm in a Strategy role, so its Context processing is simpler than that of the State pattern. Table VIII shows 45 fields and 204 methods as the largest values for the Context role in the State pattern (18 and 31, respectively, for the Context role of the Strategy pattern). The difference in complexity of the Context role thus appears in the number of fields and the number of methods, and these serve as distinguishing elements. Figure 10 shows that our technique is especially good here, because the State and Strategy patterns could not be distinguished with TSAN.

Figure 12. Example of diversity in pattern application (Singleton pattern)

V. CONCLUSION AND FUTURE WORK

We devised a pattern detection technique using metrics and machine learning. Role candidates are judged using machine learning that relies on measured metrics, and patterns are detected from the relations between classes. We worked on the problems associated with overlooking patterns and with distinguishing patterns in which the class structures are similar.

We demonstrated that our technique was superior to two previous detection techniques by experimentally distinguishing patterns in which the class structures are similar. Moreover, subspecies of patterns were detected, so we could deal with a very diverse set of pattern applications. However, our technique was more susceptible to false positives because it does not use strict conditions such as those used by the previous techniques.
We have several goals for our future work. First, we plan to add more patterns to be detected. Our technique can currently cope with five patterns; however, we predict that it will be possible to detect other patterns if we can decide upon metrics to identify them. It is also necessary to collect more learning data to cover the diversity in pattern applications. Moreover, we plan to adapt the metrics more narrowly to each role by returning to step P2, because the results might depend on the data. This process would lead to the enhancement of recall and precision.

Second, we currently judge qualitatively and manually whether to return to step P2 and apply GQM again; hence, in the future, we should find an appropriate automatic judgment method.

Third, we plan to prove the validity of the expressions and the parameters of the agreement values and thresholds.
We consider that it is possible to reduce the false positive rate by deciding on the optimum thresholds for the role agreement values and the pattern agreement values.

Finally, we plan to determine the number of learning iterations automatically and to examine the correlation between the number of learning iterations and precision.

REFERENCES

[1] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.
[2] M. Lorenz and J. Kidd. Object-Oriented Software Metrics. Prentice Hall, 1994.
[3] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S. Halkidis. Design Pattern Detection Using Similarity Scoring. IEEE Transactions on Software Engineering, Vol. 32, No. 11, pp. 896-909, 2006.
[4] A. Blewitt, A. Bundy, and L. Stark. Automatic Verification of Design Patterns in Java. In Proceedings of the 20th International Conference on Automated Software Engineering, pp. 224-232, 2005.
[5] H. Kim and C. Boldyreff. A Method to Recover Design Patterns Using Software Product Metrics. In Proceedings of the 6th International Conference on Software Reuse, pp. 318-335, 2000.
[6] R. Ferenc, A. Beszedes, L. Fulop, and J. Lele. Design Pattern Mining Enhanced by Machine Learning. In Proceedings of the 21st IEEE International Conference on Software Maintenance, pp. 295-304, 2005.
[7] H. Washizaki, K. Fukaya, A. Kubo, and Y. Fukazawa. Detecting Design Patterns Using Source Code of Before Applying Design Patterns. In Proceedings of the 8th IEEE/ACIS International Conference on Computer and Information Science, pp. 933-938, 2009.
[8] N. Shi and R. A. Olsson. Reverse Engineering of Design Patterns from Java Source Code. In Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering, pp. 123-134, 2006.
[9] H. Lee, H. Youn, and E. Lee. Automatic Detection of Design Pattern for Reverse Engineering. In Proceedings of the 5th ACIS International Conference on Software Engineering Research, Management and Applications, pp. 577-583, 2007.
[10] L. Wendehals and A. Orso. Recognizing Behavioral Patterns at Runtime Using Finite Automata. In Proceedings of the 4th ICSE Workshop on Dynamic Analysis, pp. 33-40, 2006.
[11] S. Hayashi, J. Katada, R. Sakamoto, T. Kobayashi, and M. Saeki. Design Pattern Detection by Using Meta Patterns. IEICE Transactions, Vol. 91-D, No. 4, pp. 933-944, 2008.
[12] A. Lucia, V. Deufemia, C. Gravino, and M. Risi. Design Pattern Recovery through Visual Language Parsing and Source Code Analysis. Journal of Systems and Software, Vol. 82, No. 7, pp. 1177-1193, 2009.
[13] Y. Guéhéneuc and G. Antoniol. DeMIMA: A Multilayered Approach for Design Pattern Identification. IEEE Transactions on Software Engineering, Vol. 34, No. 5, pp. 667-684, 2008.
[14] J. Dietrich and C. Elgar. Towards a Web of Patterns. In Proceedings of the First International Workshop on Semantic Web Enabled Software Engineering, pp. 117-132, 2005.
[15] V. R. Basili and D. M. Weiss. A Methodology for Collecting Valid Software Engineering Data. IEEE Transactions on Software Engineering, Vol. 10, No. 6, pp. 728-738, 1984.
[16] T. Segaran. Programming Collective Intelligence. O'Reilly, 2007.
[17] H. Hirano. Neural Network Implemented with C++ and Java. Personal Media, 2008.
[18] H. Yuki. An Introduction to Design Patterns to Study by Java. http://www.hyuki.com/dp/
[19] H. Tanaka. Hello World with Java! http://www.hellohiro.com/pattern/
[20] Oracle Technology Network for Java Developers. http://www.oracle.com/technetwork/java/index.html
[21] JUnit.org. Resources for Test Driven Development. http://www.junit.org/
[22] SpringSource.org. Spring Source. http://www.springsource.org/