4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) Improving Recall in Code Search by Indexing Similar Codes under Proper Terms Abdus Satter∗ and Kazi Sakib† Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh Email: ∗ bit0401@iit.du.ac.bd, † sakib@iit.du.ac.bd Abstract—The recall of a code search engine is reduced, difficult. The reason is that automatically perceiving the intent if feature-wise similar code fragments are not indexed under of a code block is still a research challenge [6]. Another chal- common terms. In this paper, a technique named Similarity Based lenge is to select proper terms that best represent similar code Method Finder (SBMF) is proposed to alleviate this problem. The technique extracts all the methods from a source code corpus and fragments. For example, assume that there are two methods converts these into reusable methods (i.e., program slice) through that perform bubble sort - “x” and “sort”. Here, between two resolving data dependency. Later, it finds similar methods by terms, “sort” is semantically more relevant name than “x”. It checking signature (i.e., input and output types) and executing is a challenging task to automatically determine that “sort” is methods for a randomly generated set of input values. Methods the better keyword to represent these methods. Again, a code are considered as feature-wise similar if these produce the same output set. In order to index these methods against common and fragment may contain terms, which are not useful to express proper terms, SBMF selects the terms that are found in most its intent (i.e., implemented feature) properly. Indexing based of the methods. Finally, query expansion is performed before on these keywords reduces matching probability between user searching the index to solve the vocabulary mismatch problem. In query and these keywords. It happens because, user query order to evaluate SBMF, fifty open source projects implementing defines functionality but the extracted keywords do not express nine different functionalities or features were used. The results were compared with two types of techniques - Keyword Based the feature properly. So, instead of using these keywords, more Code Search (KBCS) and Interface Driven Code Search (IDCS). meaningful terms need to be selected that best match the query. On an average, SBMF retrieves 38% and 58% more relevant Researchers have proposed various techniques to improve methods than KBCS and IDCS, respectively. Moreover, it is the performance of code search engines where recall is con- successful for all the features by retrieving at least one relevant sidered as one of the performance indicators. These techniques method representing each feature whereas IDCS and KBCS are successful for 3 and 7 features out of 9 respectively. can be broadly classified into four types like Keyword Based Index Terms—code search, code reuse, method search Code Search (KBCS), Interface Driven Code Search (IDCS), Test Driven Code Search (TDCS), and Semantic Based Code I. I NTRODUCTION Search (SBCS). In KBCS [2], [7], [8], [9], [10], source codes are indexed based on the terms generated from the code and The recall of a code search engine, indicated by the number searching is performed on the index. As this approach does of relevant codes that is retrieved from the code repository, not consider similarity between source codes having different usually depends on the indexing mechanism and query for- keywords, it cannot retrieve more relevant codes. In order to mulation techniques. Proper indexing and query understanding define required component interface as query, and find relevant help to retrieve relevant code snippets that satisfy user needs components, IDCS [11], [12], [13] was proposed. It is possible [1]. Most of the code search engines employ Information to have two or more code fragments that contain different Retrieval (IR) centric approaches for indexing source code interfaces but perform the same task. IDCS considers that [2]. The working principle behind these approaches is to these code fragments are different due to having different construct a term-based index, by extracting keywords from interfaces. Thus, it does not retrieve these all together. To source codes. A common problem of these approaches is automatically find and adapt reusable components, TDCS [14], that a pair of codes - having same functionality, but written [15] and SBCS [16], [17] were proposed. These are effective in using different keywords are indexed against different terms. terms of precision as test cases are employed on the retrieved A traditional code search engine misses some important code codes. In these approaches, most of the test cases fail not only fragments, because of this keyword matching policy. It results for functional requirements mismatch but also for syntactic in a low recall code search engine with poor performance on mismatch of the interface definition [15]. For this reason, benchmark datasets [3]. semantically relevant code fragments cannot be retrieved and To improve recall of a code search engine, similar code the recall is decreased. fragments should be indexed under the same terms. However, In this paper a technique named Similarity Based Method it is challenging to automatically and efficiently determine Finder (SBMF) is proposed to retrieve more relevant methods that two code fragments are identical or similar [4]. Although from code base. The technique first parses all the methods identical code fragments can be detected through keywords from the source code to construct a repository of methods. matching [5], detecting feature wise similar code blocks is It generates data dependency graph for each method and 35 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) converts the method into reusable method (i.e., program slice) A. Keyword based Code Search (KBCS) through resolving data dependency, and redefining parameters In KBCS, source code is considered as plain text document and return type. Later, all the methods are clustered into a where traditional IR centric approaches are employed to in- number of clusters where methods in the same cluster perform dex the code and query over the index [20]. Besides, other the same task. To detect feature-wise similarity among a set metadata such as comments, file name, commit message, etc. of methods’ signatures (i.e., parameters and return types) of are used to retrieve relevant code fragments from a repository these methods are checked. Methods having the same signature of source codes. One of the techniques related to KBCS are then executed against a set of randomly generated input is JSearch which indexes source code against the keywords values. Among these methods, those which produce the same extracted from the code [2]. However, it cannot retrieve all output are considered as feature-wise similar and a cluster is the code snippets that implement the same feature but contain constructed to store these methods. To identify proper terms different keywords. This is because, it does not check feature- for a cluster, keywords are obtained from the methods in the wise similarity to detect common terms for these fragments. cluster and method frequency is calculated for each term. Several techniques like Sourcerer [8], Codifier [9], krugle Such terms are considered as representative terms if these [10], etc. were proposed to provide infrastructure for large are found in most of the methods of the cluster. All the scale code search. These techniques use both structural and methods of the cluster are then indexed against the terms so semantic information of source code to construct index. Struc- that these are retrieved all together if a query term matches tural information comprises language, source file, related doc- one of these methods. At last, user query is expanded by uments, classes, methods, dependencies, and so on. Semantic adding synonyms of each query term to increase the matching information of a program is gathered by generating terms probability between the query terms and index terms [7]. from method name, class name, field name, comments, etc. In order to evaluate the proposed technique, a tool was Although these techniques adopt both types of information developed. Two types of code search techniques, KBCS and to fetch more relevant code fragments, these cannot retrieve IDCS, were compared with SBMF to show its efficiency. feature-wise similar code blocks simultaneously. The reason An existing system named Sourcerer [8] was used for the is that all these information are stored following IR based implementation of KBCS and IDCS. However, SBCS and indexing mechanism, and no checking is performed to index TDCS were not considered for comparison, because these similar code snippets under common proper terms. were proposed to improve precision rather than recall. For B. Semantic Based Code Search (SBCS) comparative result analysis, three metrics were used which are recall, number of methods retrieved and feature successful- As open source codes are increasing day by day, it is thought ness. Here, feature successfulness determines whether at least that a significant amount of code that is written today, has one relevant method is retrieved or not against user queries already been available in the internet. However, reusing these provided for a feature. In the context of this paper, A feature existing codes often does not directly meet user needs or can be considered as a requirement given to a developer to requires modifications. In order to find existing codes that implement. 50 open source projects were selected to carry out support user requirements, a technique in form of SBCS was the experiment. The result analysis shows that on an average proposed by Steven [16]. It takes keywords that represent user SBMF increases recall by 38% and 58% more than KBCS requirements, and retrieves relevant code fragments containing and IDCS, respectively against 170 queries. Besides, SBMF these keywords. Later, it runs user provided test cases on is successful for all the features whereas KBCS and IDCS are the fetched code snippets and passed codes are delivered as successful for 7 and 3 features out of 9 respectively. final search result. It performs well in terms of precision but recall is reduced since proper terms are not determined while indexing feature-wise similar codes. So, some semantically similar code fragments cannot be fetched due to indexing these II. R ELATED W ORKS under inappropriate terms. Sometimes, developers need to convert one type of ob- Reusing existing code fragments reduces development time ject to another. To get example code implementing such and effort [18]. For this reason, searching for reusable code conversion, Niyana proposed a technique named XSnippet snippets has become a common task among the developers [17]. It creates graph from source code by adopting code during software development [19]. Various techniques have mining algorithm. The graph represents data flow within the been proposed in the literature to improve the performance corresponding source code. Moreover, user query is defined of code search engine in terms of recall, precision, query by providing input type and output type. For a user query, successfulness, etc. These techniques can be broadly classified all the generated graphs are searched to find those code into four categories which are Keyword Based Code Search fragments that convert the input type into the output type. (KBCS), Interface Driven Code Search (IDCS), Semantic In this technique. developers need to provide exact input type Based Code Search (SBCS), and Test Driven Code Search and output type for getting example code blocks. Otherwise, (TDCS). Significant works related to each category are dis- it cannot retrieve code fragments that may satisfy user needs. cussed in the following subsections. However, according to the searching behavior, developers are 36 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) more interested in using keywords rather than concrete data III. P ROPOSED T ECHNIQUE types to define their query [21]. In this paper, a technique named Similarity based Method Finder (SBMF) has been proposed to improve recall in C. Test Driven Code Search (TDCS) code search. The technique comprises several steps such as TDCS is a special type of SBCS where test cases are Reusable Method Generation, Clustering Similar Methods, used to obtain program semantics. Lemos et al. proposed a Proper Term Selection, Handling Methods Having API/Library TDCS technique named CodeGenie to support method level Function call, Index Construction, and Query Expansion. Each searching [14]. The technique takes method signature as query of the steps is discussed as follows. from the test cases written by developers. It uses Sourcerer A. Reusable Method Generation infrastructure to retrieve relevant functions against the query. In this step, the proposed technique first parses the source Next, all the test cases are executed for each retrieved method. code to identify all the methods in the code. For each method, Resultant methods are ranked based on the number of test it checks whether the body of that method contains any API/- cases successfully passed. Although the technique increases function call statement or not. If no such statement is found, precision, it produces low recall. The reason is that it per- the technique takes the method to convert it into reusable forms keyword matching to fetch methods from index without method (i.e., program slice that can execute independently justifying the appropriateness of the keywords. without having any dependency on other methods). Later, a Usually retrieved methods may not pass corresponding data dependency graph is constructed for the corresponding test cases due to different order of the parameters, return method to determine its input and output types. Although the type or parameter type. To resolve these issues, Janjic et al. signature of the method expresses the input and output types, proposed a technique that refactors the code to adapt with the this is not sufficient enough to convert into reusable function program context [15]. It applies every possible adaptations like for several scenarios. For example, a method may have return reordering parameters, using super type or sub class type of type void but it may manipulate one or more variables that a given return or parameter type, converting primitive type to are declared outside the body of the method. A method may reference type, etc. Thus, it improves TDCS by finding more not have any parameter (i.e., void) but use variables that are relevant methods. However, it produces low recall because it defined outside the body of the method. Again, the signature does not index similar methods under common terms. of a method may explicitly state the input and output types but some variables may be used or manipulated by it and D. Inreface Driven Code Search (IDCS) these are declared outside the method body. Considering all IDCS helps the developers to define their queries in a more of these scenarios, the technique generates data dependency structured form rather than just a set of keywords joined by graph to redefine the signature and convert into reusable boolean expression. Signature matching was the first proposed method. Each node in the graph denotes the variable and an IDCS technique to find relevant functions within a software edge from a to b (a → b) denotes variable a depends on library [13]. The approach crawls all the methods in the library, variable b. After constructing the graph, nodes that have in and uses signature of each method for indexing. Other code degree zero and variables denoted by these nodes are declared search techniques such as Sourcecer, ParseWeb, and Strath- outside the method body, are considered as input parameters. cona also support IDCS to improve the performance in code Besides, nodes that have out degree zero are considered as search [7]. Although IDCS assists to formulate user query, output variables of the method. If multiple output nodes are it does not select appropriate terms during indexing similar found, a complex data type is created where each field of codes that perform the same functionality. Thus, functionally the type denotes each node. The reason is that a method related code fragments will not be retrieved all together since return type can be a single data type - either primitive or these are indexed against inappropriate terms. complex data type. The technique uses the variables found in In order to find reusable code fragments, four types of the nodes containing in degree zero to generate parameters of techniques have been proposed in the literature which are the method. If a single node is found which out degree is zero, KBCS, IDCS, TDCS, and SBCS. All these techniques extract the type of the variable denoted by the node is used as return keywords from source code to generate terms, and index cor- type of the method. Otherwise, generated composite data type responding code against the terms. However, none of the tech- as discussed earlier is used. The signature of the method is niques checks the appropriateness of the terms with respect to redefined by combining the return type, method name, and implemented feature. As a result, the number of relevant codes parameters. It is possible to have one or more variables that retrieved is reduced due to indexing against improper term. are declared outside the method body. In the data dependency Moreover, if two or more code snippets implement similar graph, nodes representing these variables may have at least feature but contain different terms, existing techniques cannot one in degree and one out degree. In this case, the technique retrieve all these code fragments simultaneously. The reason is parses the source code and checks the declaration statements that these are indexed against different terms. So, to improve of the variables to determine the types of the variables. Using recall in code search, feature-wise similar codes should be this information, it adds declaration statement for each of the indexed under common appropriate terms. variables at the beginning of the function body. Thus, the 37 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) technique makes the method self-executable without having Algorithm 1 Cluster Similar Methods any external data dependency. Require: A list of methods (M ) for which search index will be constructed B. Clustering Similar Methods 1: procedure C LUSTER S IMILAR M ETHODS (M ) To improve code search, it is required to check the similarity 2: C = ∅; among methods found in the code base. Two or more methods 3: for each m ∈ M do may perform the same task in different ways. So, feature-wise 4: if IsInAnyCluster(m, C) == true then similar methods needs to be detected to retrieve the similar 5: continue methods all together. In Algorithm 1, the procedure named 6: end if ClusterSimilarM ethods takes a list of reusable methods 7: cl = ∅ (M ) as input which is constructed following the previous step. 8: cl.add(m) A variable C is declared to store different clusters of similar 9: inputset = generate a set of input data randomly methods where each cluster contains the methods that perform for m the same functionality (Algorithm 1 Line 2). A for loop is 10: outputset = execute m and generate correspond- declared that iterates on M to construct cluster of similar ing output for inputset methods. The procedure IsInAnyCluster is invoked to check 11: for each m0 ∈ M do whether each method m (belongs to M ) is added to any cluster 12: if IsInAnyCluster(m0 , C) == true then or not previously (Algorithm 1 Lines 4-5). If m does not 13: continue belong to any cluster, a variable cl is declared to contain all the 14: end if methods similar to m. A set of input data is generated based 15: if m0 .paramtersT ypes == on the type of parameters found in the signature of m and m.parametersT ypes & m0 .returnT ype = corresponding output is generated by executing m (Algorithm m.returnT ype then 1 Lines 9-10). Here inputset and outputset determine the 16: outputset0 = execute m0 and generate cor- intent of m. Another for loop is declared to identify other responding output for inputset methods that are similar to m. In each iteration, the signature 17: if outputset == outputset0 then of each method m0 (in M ) is matched with the signature of 18: cl.add(m0 ) m to check whether the input data set can be fed into the 19: end if method and return type is identical to m (Algorithm 1 Line 20: end if 15). If the signatures of both methods are identical, the method 21: end for m0 is executed for inputset and generated output is stored to 22: C.add(cl) outputset0 . If outputset and outputset0 are found the same, 23: end for m0 is considered similar to m as both methods produce same 24: end procedure output for the same input data set (Algorithm 1 Lines 17-19). 25: procedure I S I NA NY C LUSTER (m, C) m0 is then added to cl to store all the methods similar to m. 26: f ound = f alse At last, cl is inserted to the list of all identified clusters (C). 27: for c ∈ C do 28: if m ∈ c then C. Proper Term Selection 29: f ound = true In order to retrieve more relevant methods, it is required to 30: break; identify proper terms for each method before indexing. When 31: end if two or more methods have different names or signatures, but 32: end for implement the same functionality, these methods should be 33: return f ound indexed under common appropriate terms. As a result, all these 34: end procedure methods will be obtained against user query. So, after getting all the clusters from the previous step, representative terms are selected for each cluster. For a cluster, terms are obtained of a method but matches with the API invocation statements, from the methods found in the cluster through extracting, the method is retrieved as API usage example code. tokenizing, and stemming keywords found in the methods. Terms that are found in most of the methods are considered E. Index Construction and Query Expansion as final representative terms for each of these methods. After generating appropriate terms for each method and merging similar ones, an index is built for searching desired D. Handling Methods Having API/Library Function call methods. A posting list is created to construct index, which As developers also search for example code to understand maps terms with corresponding methods. Later, user query is the usage of an API, in this step, methods that have API call expanded to retrieve more relevant methods against the query. statements are gathered. For each identified method, terms are Two procedures named ConstructIndex and Query, are generated from API call statements to index against the terms. presented in Algorithm 2 to build index of methods obtained As a result, if a query term does not match with the signature from the previous steps, and refine user query, respectively. To 38 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) construct the index, an empty posting list is declared, which Algorithm 2 Index Construction and Query Expansion maps each term to corresponding methods (Algorithm 2 Lines Require: A list of methods (M ) containing signature, body 2). A nested for loop is defined, where the outer loop iterates and terms of each method on a list of methods (M ) given as input to the procedure 1: procedure C ONSTRUCT I NDEX (M ) (Algorithm 2 Lines 3-4). The inner loop iterates to get all the 2: M ap < String, List < M ethod >> postingList terms of each method in M . In addition, each term is checked 3: for each m ∈ M do whether the posting list contains it or not to add a new term in 4: for each t ∈ m.terms do the list (Algorithm 2 Lines 5-7). Next, the method is added to 5: if !postingList.keys.contains(t) then the posting list against the term so that, when a query term will 6: postingList.keys.add(t) match with the term, the method will be retrieved (Algorithm 7: end if 2 Lines 8). After adding all the methods, the list is returned 8: postingList[t].add(m) by the procedure (Algorithm 2 Lines 11). 9: end for In procedure Query, a boolean query is given as an argu- 10: end for ment, from which terms are separated and stored in a variable 11: return postingList named queryT erms to expand the query (Algorithm 2, Lines 12: end procedure 13-14). A nested for loop is defined, where the outer loop 13: procedure Q UERY (booleanQueryStr) iterates on these terms (Algorithm 2, Lines 15-17). In each 14: queryT erms=get all terms from booleanQueryStr iteration, a temporary variable (expandedT erm) initialized 15: for each qt ∈ queryT erms do with the corresponding term, is used to store synonyms of 16: expandedT erm = qt the term. To expand each term, synonyms are appended 17: for each syn ∈ synonyms of qt do to expandedT erm in the inner loop (Algorithm 2, Lines 18: expandedT erms+ =” OR ”+syn 17-19). Later, each term in queryT erms is replaced with 19: end for corresponding expandedT erm for the expansion of the query 20: queryT erms.replace(qt, expandedT erm) (Algorithm 2, Lines 20). As a result of the expansion, the 21: end for probability of matching a query string with the terms defined 22: methods = obtain method from the index satisfying in the index increases. Finally, the query is executed in the queryT erms index to retrieve intended methods, which are returned by the 23: return methods procedure (Algorithm 2, Lines 22-23). 24: end procedure IV. I MPLEMENTATION AND R ESULT A NALYSIS In order to perform comparative result analysis, the pro- statistically sound and representatives of open source projects posed technique (SBMF) was implemented in form of a [22]. software tool. 50 open source projects were selected as data A set of features were selected from the existing works sources for the experimental analysis. To evaluate the proposed in code search [7], [16], [23], [24] as shown in Table I. On technique, 170 queries representing 9 different features were the other hand, to evaluate the proposed technique, a set of executed by the tool. For comparative analysis, same queries queries is selected from [7]. Here, each query is related to a were also run on Sourcerer that supports KBCS and IDCS. particular functionality shown in Table I and all the queries are created randomly. 15 subjects were employed to identify A. Environmental Setup relevant methods for the functionalities. Among 15 subjects, This section outlines the softwares and frameworks required 5 of them were senior Java developers and rest 10 were for the experimental analysis. SBMF was implemented using masters student. The reason of choosing students in this study C# programming language. Moreover, some other tools were is that they can play important role in software engineering also used, which are addressed as follows: experiments [25]. All the experimental datasets are available • JavaParser: An open source library used to parse Java in this link1 . source code (https://github.com/javaparser) • Apache Lucene: A popular search engine infrastructure C. Comparative Result Analysis used to index java methods and query over the index For comparative result analysis, SBMF was run on the (https://lucene.apache.org/) experimental datasets and the relevance of retrieved methods • Luke: Open source lucene client used to execute query were checked for each user query. Moreover, Sourcerer which on the lucene index and visualize the search results supports KBCS and IDCS, was also run on the same datasets (https://github.com/DmitryKey) and search results obtained by this were compared to SBMF. Three metrics were used to evaluate the performance of SBMF B. Dataset Selection in comparison with KBCS and IDCS. These were recall, num- In order to perform experimental analysis, 50 open source ber of retrieved methods, and feature successfulness. Detailed projects from sourceforge (https://sourceforge.net/) were se- lected. Fraser and Arcuri showed that these projects are 1 http://tinyurl.com/zdqmoqz 39 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) TABLE I S ELECTED F UNCTIONALITIES WITH F REQUENCY # Functionality # methods # queries 1 decoding String 13 20 2 encrypting password 3 27 3 decoding a URL 3 21 4 generating MD5 hash 3 16 5 rotating array 2 25 6 resizing image 3 25 7 scaling Image 3 19 8 encoding string to html 2 6 9 joining string 47 36 result analysis with respect to each of the metrics is discussed as follows. Fig. 1. Recall Analysis Recall Analysis: Recall is one of the most commonly used metrics to measure the performance of traditional IR system. As the intent of the paper is to improve recall in code search, it is considered as an important metric to evaluate the proposed technique. In this experiment, recall is defined as follows. number of retrieved relevant methods recall = number of relevant methods in the repository Fig. 1 depicts a comparative recall analysis among SBMF, KBCS, and IDCS where X axis denotes the feature no. as shown in TABLE I and Y axis represents the measured recall. For feature 1 (Decoding String), approximately 15% recall is shown in Fig. 1 for both KBCS and IDCS whereas Fig. 2. Number of Retrieved Methods 100% recall is found for SBMF. There are 13 methods in the repository that implement the feature. Among these, two methods are found which contain keywords decode and string There are 3 relevant methods in the experimental projects in method name and parameter respectively. As a result, these that implement feature no. 3 (Decoding a URL). According to methods are retrieved by both KBCS and IDCS. However, Fig. 1 only a single method is retrieved by SBMF that produces these techniques cannot retrieve other 11 methods because recall 33.33%. On the contrary, KBCS and IDCS cannot signatures of these methods do not contain any term related to retrieve any method related to the feature. This is because decode. While analyzing the source code of these methods, it no method contains decode and URL simultaneously in the is seen that the bodies of these methods use third party APIs signature. Although one of these methods named getPath does like URLDecoder.decode(String, String), Hex.decode(String), not provide any semantic information representing the fea- Base64.decode(base64), etc. to implement the feature. SMBF ture, it invokes a library method - URLDecoder.decode(path, takes terms from API call statements and indexes against the ”UTF-8”) which implements the feature. SBMF considers the terms to provide example codes regarding API usage. So, it invocation statement for getting more relevant terms and thus, retrieves all these 13 methods. retrieves this method. Two other methods cannot be retrieved For feature 2 (Encrypting Password), IDCS cannot find any by SBMF due to finding no structural similarity among these methods but 66.67% and 33.3% relevant methods are retrieved and no keywords representing the feature. by SBMF and KBCS respectively as shown in Fig. 1. To get According to Fig. 1, 100% recall is obtained for SBMF, the methods that implement this feature, the following query and 33.33% for KBCS and IDCS individually with respect to is provided to IDCS. feature no. 4 (Generating MD5 hash). It is clear that SBMF name:(encrypt) AND return:(String) AND parameter:(String) Although there is a single method found in the code base that has encrypt keyword in its name but does not have String in its parameter. So, IDCS cannot obtain this method but KBCS retrieves because query keyword matches with the method name. However, SBMF retrieves one more method having signature crypt(String strpw,String strsalt). The reason is that encrypt and crypt both express the same intent as detected by the query expansion part of SBMF (Algorithm 2). Fig. 3. Feature Successfulness Analysis 40 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) has higher recall than other two approaches. The reason is that (such as feature No, 2, 3, 5, 6, 7, 9). The reason is that most of the methods implementing this feature do not have user queries do not have proper parameter type or return proper names to represent their intent. There are 5 methods type. This scenario is common when developers have little or relevant to this feature and only one method is found having no knowledge about the implementation of a feature. KBCS name consistent with the feature. KBCS and IDCS fail to re- and SBMF mitigate the problem by retrieving more relevant trieve all these methods because both techniques extract terms methods adopting free text search. In order to determine from individual method and do not consider appropriateness whether a feature is successful or not, a metric named feature of the terms. However, SBMF finds that these methods are successfulness is introduced. A feature is said to be successful semantically similar. So, these methods are indexed under if at least one relevant method is retrieved that implements common terms. As a result, when user query matches with the feature. Fig. 3 presents the number of successful features one of these methods, other three methods are also retrieved among SBMF, KBCS, and IDCS. According to this figure, with this. SBMF is successful for all the 9 features whereas 7 and 3 For feature no. 5, 6 and 7, IDCS cannot retrieve any successful features are found for KBCS and IDCS respectively. method from the code base used in this experiment. The This measure provides a notion that having higher precision reason is that appropriate parameter type is not determined is not effective if number of successful feature is low. In in the user queries used for this feature. However, KBCS addition, improving recall increases the chances of having shows 50%, 33.33%, and 66.67% recall for feature no. 5, 6 higher number of successful feature. For this reason, SBMF and 7 respectively. On the other hand, SBMF shows 100% performs better than KBCS and IDCS. for features no. 5 and 7, and 33.33% for feature no. 6 as V. T HREATS TO VALIDITY illustrated in Fig. 1. For feature no. 5, two relevant methods are found which names are transpose and rotate correspondingly. In this section, limitations of the experimental study are These two methods are feature-wise similar which is detected discussed in terms of internal, external, and construct validity. a) Internal Validity: In the experiment, there was no by SBMF and indexed under common terms (i.e., rotate and control over the skills of the subjects. However, the risks of transpose). On the other hand, KBCS does not check similarity, this threat are reduced by applying repetitive measurement and analyzes each method individually during indexing. So, approach because same user created queries for KBCS, IDCS, only rotate method is retrieved by KBCS. For feature no. 7, and SBMF and evaluated the search results. SBMF retrieves one more method than KBCS because this b) External Validity: The set of features selected may not method does not contain any term related to image but it uses generalize to the population of software functions. However, a field of type Image. SBMF considers this usage since scaling these features are among the most common features used for operation is performed on this field by the method, and adds the evaluation in code search. Another possible threat is that additional term Image against the method. projects used in the experiment may not be sufficient enough. SBMF, KBCS, and IDCS show equal performance for However, these projects are statistically representative of open feature no. 8 (Encoding String to HTML) in terms of recall. source projects as highlighted in [7]. However, 50% relevant methods cannot be retrieved because c) Construct Validity: Existing code clone detection no HTML keyword is found in these method. technique can be used to improve recall in code search. Only SBMF is able to retrieve 21 relevant methods whereas However, SBMF differs from code clone detection in several other techniques cannot fetch a single method for feature no. 9 points. SBMF can detect similar methods written in different (Joining String). Here, SBMF outperforms because it identifies programming languages and only the execution of method is many structurally similar methods which have different names platform dependent. Another point is that code clone detection but all these perform string concatenation. Among these, technique may provide false positive results to feature-wise several methods are found which have proper keywords in clone detection (usually known as Type IV) if values of certain their body. These keywords are attached to the term list of parameters are not defined properly [26]. As a result search en- each similar method by SBMF. As a result, these are indexed gine may retrieve irrelevant methods. However, SBMF checks under common appropriate terms and all these are retrieved dynamic behavior through executing method and matches the simultaneously. However, other 26 relevant methods cannot be output for corresponding input to detect feature-wise similar retrieved since no signature matching is found among these. methods. Such mechanism ensures that methods providing the Number of Retrieved Methods (NRM) and Feature same output, are feature-wise similar and thus no irrelevant Successfulness Analysis: As NRM is an important measure to method is added to these methods. Besides recall, two other perceive recall of a search engine, a comparative result analysis metrics are used in the study to observe the effectiveness of with respect to NRM is performed here. A bar diagram is the technique. Although precision is not shown directly in the shown in Fig. 2 depicting feature-wise NRM by SBMF, KBCS, result analysis due to space limitation, it can be obtained by and IDCS. According to the diagram, SBMF retrieves more using data given in TABLE I and Fig. 2. methods than KBCS and IDCS because of adding common terms to each method. VI. C ONCLUSION Although IDCS produces better precision than KBCS and The recall of a code search engine reduces if similar code SBMF, it cannot retrieve a single method for some features fragments are indexed under common proper terms. So, a 41 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) technique named SBMF is proposed in this paper which [8] Sushil Bajracharya, Trung Ngo, Erik Linstead, Yimeng Dou, Paul Rigor, indexes both syntactically and semantically similar methods Pierre Baldi, and Cristina Lopes. Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st under common terms. The technique is implemented as a ACM SIGPLAN symposium on Object-oriented programming systems, complete software, that constructs index and retrieves relevant languages, and applications, pages 681–682. ACM, 2006. methods against the user query. [9] Andrew Begel. Codifier: a programmer-centric search user interface. In Proceedings of the workshop on human-computer interaction and SBMF first identifies all the methods in the code base by information retrieval, pages 23–24, 2007. parsing the source code. It converts the methods into reusable [10] Susan Elliott Sim and Rosalva E Gallardo-Valencia. Finding source methods by resolving data dependency and redefining method code on the web for remix and reuse. Springer, 2013. [11] Suresh Thummalapenta and Tao Xie. Parseweb: a programmer assistant signature. Feature-wise similar methods are detected through for reusing open source code on the web. In Proceedings of the twenty- checking signature and executing methods. Here, methods that second IEEE/ACM international conference on Automated software produce the same output set for a randomly generated set of engineering, pages 204–213. ACM, 2007. [12] Reid Holmes, Robert J Walker, and Gail C Murphy. Strathcona example input values, are considered as similar methods and these are recommendation tool. In ACM SIGSOFT Software Engineering Notes, kept under a cluster. Thus, all the methods are distributed volume 30, pages 237–240. ACM, 2005. into a set of clusters where each cluster contains feature-wise [13] Amy Moormann Zaremski and Jeannette M Wing. Signature matching: a tool for using software libraries. ACM Transactions on Software similar methods and any two clusters differ from one another Engineering and Methodology (TOSEM), 4(2):146–170, 1995. in implemented feature. All these methods are indexed against [14] Otávio Augusto Lazzarini Lemos, Sushil Krishna Bajracharya, and Joel the terms that are found in more than half of the methods in Ossher. Codegenie:: a tool for test-driven source code search. In Companion to the 22nd ACM SIGPLAN conference on Object-oriented the cluster. At last, query expansion is performed to increase programming systems and applications companion, pages 917–918. the probability of retrieving more methods. ACM, 2007. [15] Werner Janjic and Colin Atkinson. Leveraging software search and reuse For experimental analysis of the technique, 50 open source with automated software adaptation. In Search-Driven Development- projects were selected to build the code base and 9 features Users, Infrastructure, Tools and Evaluation (SUITE), 2012 ICSE Work- were chosen to generate queries. An existing technique named shop on, pages 23–26. IEEE, 2012. [16] Steven P Reiss. Semantics-based code search. In Proceedings of the Sourcerer was used to compare the results to SBMF. While 31st International Conference on Software Engineering, pages 243–253. analyzing the results it has been seen that SBMF shows 38% IEEE Computer Society, 2009. improvement in recall than KBCS and 58% than IDCS. It [17] Naiyana Sahavechaphan and Kajal Claypool. Xsnippet: mining for sample code. ACM Sigplan Notices, 41(10):413–430, 2006. also retrieves relevant methods for all the 9 features, whereas [18] Wayne C Lim. Effects of reuse on quality, productivity, and economics. KBCS and IDCS retrieves for 7 and 3 features, respectively. IEEE software, 11(5):23–30, 1994. In future, the experiment will be conducted on a large scale [19] Stefan Haefliger, Georg Von Krogh, and Sebastian Spaeth. Code reuse in open source software. Management Science, 54(1):180–193, 2008. dataset to observe the behavior of the technique. [20] William B Frakes and Brian A Nejmeh. Software reuse through information retrieval. In ACM SIGIR Forum, volume 21, pages 30–36. ACKNOWLEDGMENT ACM, 1986. [21] Susan Elliott Sim, Charles LA Clarke, and Richard C Holt. Archetypal source code searches: A survey of software developers and maintainers. This research is supported by ICT Division, Ministry In Program Comprehension, 1998. IWPC’98. Proceedings., 6th Interna- of Posts, Telecommunications and Information Technology, tional Workshop on, pages 180–187. IEEE, 1998. Bangladesh. 56.00.0000.028.33.065.16-747, 14-06-2016. [22] Gordon Fraser and Andrea Arcuri. Sound empirical evidence in software testing. In Proceedings of the 34th International Conference on Software Engineering, pages 178–188. IEEE Press, 2012. R EFERENCES [23] OtáVio Augusto Lazzarini Lemos, Sushil Bajracharya, Joel Ossher, Paulo Cesar Masiero, and Cristina Lopes. A test-driven approach to [1] Ruben Prieto-Diaz. Implementing faceted classification for software code search and its application to the reuse of auxiliary functionality. reuse. Communications of the ACM, 34(5):88–97, 1991. Information and Software Technology, 53(4):294–306, 2011. [2] Renuka Sindhgatta. Using an information retrieval system to retrieve [24] Otávio AL Lemos, Adriano C de Paula, Felipe C Zanichelli, and source code samples. In Proceedings of the 28th international confer- Cristina V Lopes. Thesaurus-based automatic query expansion for ence on Software engineering, pages 905–908. ACM, 2006. interface-driven code search. In Proceedings of the 11th Working [3] Hinrich Schütze. Introduction to information retrieval. In Proceedings of Conference on Mining Software Repositories, pages 212–221. ACM, the international communication of association for computing machinery 2014. conference, 2008. [25] Barbara A Kitchenham, Shari Lawrence Pfleeger, Lesley M Pickard, [4] Randy Smith and Susan Horwitz. Detecting and measuring similarity in Peter W Jones, David C. Hoaglin, Khaled El Emam, and Jarrett code clones. In Proceedings of the International workshop on Software Rosenberg. Preliminary guidelines for empirical research in software Clones (IWSC), 2009. engineering. IEEE Transactions on software engineering, 28(8):721– [5] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. Ccfinder: a 734, 2002. multilinguistic token-based code clone detection system for large scale [26] Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun source code. IEEE Transactions on Software Engineering, 28(7):654– Qiu, and Tao Xie. Xiao: tuning code clones at hands of engineers 670, 2002. in practice. In Proceedings of the 28th Annual Computer Security [6] Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Applications Conference, pages 369–378. ACM, 2012. Baldi. Mining concepts from code with probabilistic topic models. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, pages 461–464. ACM, 2007. [7] Otávio Augusto Lazzarini Lemos, Adriano Carvalho de Paula, Hitesh Sajnani, and Cristina V Lopes. Can the use of types and query expansion help improve large-scale code search? In Source Code Analysis and Manipulation (SCAM), 2015 IEEE 15th International Working Conference on, pages 41–50. IEEE, 2015. 42