4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016)

Improving Recall in Code Search by Indexing Similar Codes under Proper Terms

Abdus Satter∗ and Kazi Sakib†
Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh
Email: ∗ bit0401@iit.du.ac.bd, † sakib@iit.du.ac.bd

Abstract—The recall of a code search engine is reduced if feature-wise similar code fragments are not indexed under common terms. In this paper, a technique named Similarity Based Method Finder (SBMF) is proposed to alleviate this problem. The technique extracts all the methods from a source code corpus and converts them into reusable methods (i.e., program slices) by resolving data dependencies. Later, it finds similar methods by checking signatures (i.e., input and output types) and executing the methods on a randomly generated set of input values. Methods are considered feature-wise similar if they produce the same output set. In order to index these methods against common and proper terms, SBMF selects the terms that are found in most of the methods. Finally, query expansion is performed before searching the index to solve the vocabulary mismatch problem. In order to evaluate SBMF, fifty open source projects implementing nine different functionalities or features were used. The results were compared with two types of techniques - Keyword Based Code Search (KBCS) and Interface Driven Code Search (IDCS). On average, SBMF retrieves 38% and 58% more relevant methods than KBCS and IDCS, respectively. Moreover, it is successful for all the features by retrieving at least one relevant method representing each feature, whereas IDCS and KBCS are successful for 3 and 7 features out of 9, respectively.

Index Terms—code search, code reuse, method search

I. INTRODUCTION

The recall of a code search engine, indicated by the number of relevant codes retrieved from the code repository, usually depends on the indexing mechanism and the query formulation technique. Proper indexing and query understanding help to retrieve relevant code snippets that satisfy user needs [1]. Most code search engines employ Information Retrieval (IR) centric approaches for indexing source code [2]. The working principle behind these approaches is to construct a term-based index by extracting keywords from source codes. A common problem of these approaches is that a pair of codes having the same functionality but written using different keywords is indexed against different terms. A traditional code search engine misses some important code fragments because of this keyword matching policy. The result is a low recall code search engine with poor performance on benchmark datasets [3].

To improve the recall of a code search engine, similar code fragments should be indexed under the same terms. However, it is challenging to automatically and efficiently determine that two code fragments are identical or similar [4]. Although identical code fragments can be detected through keyword matching [5], detecting feature-wise similar code blocks is difficult. The reason is that automatically perceiving the intent of a code block is still a research challenge [6]. Another challenge is to select proper terms that best represent similar code fragments. For example, assume that there are two methods that perform bubble sort - "x" and "sort". Between the two terms, "sort" is the semantically more relevant name. It is a challenging task to automatically determine that "sort" is the better keyword to represent these methods. Again, a code fragment may contain terms which are not useful to express its intent (i.e., the implemented feature) properly. Indexing based on these keywords reduces the matching probability between user queries and the index. This happens because a user query describes functionality, while the extracted keywords do not express the feature properly. So, instead of using these keywords, more meaningful terms need to be selected that best match the query.

Researchers have proposed various techniques to improve the performance of code search engines, where recall is considered one of the performance indicators. These techniques can be broadly classified into four types: Keyword Based Code Search (KBCS), Interface Driven Code Search (IDCS), Test Driven Code Search (TDCS), and Semantic Based Code Search (SBCS). In KBCS [2], [7], [8], [9], [10], source codes are indexed based on the terms generated from the code and searching is performed on the index. As this approach does not consider similarity between source codes having different keywords, it cannot retrieve more relevant codes. In order to define a required component interface as a query and find relevant components, IDCS [11], [12], [13] was proposed. It is possible to have two or more code fragments that contain different interfaces but perform the same task. IDCS considers these code fragments different due to their different interfaces; thus, it does not retrieve them all together. To automatically find and adapt reusable components, TDCS [14], [15] and SBCS [16], [17] were proposed. These are effective in terms of precision, as test cases are executed on the retrieved codes. In these approaches, most of the test cases fail not only because of functional requirements mismatch but also because of syntactic mismatch of the interface definition [15]. For this reason, semantically relevant code fragments cannot be retrieved and the recall is decreased.

In this paper, a technique named Similarity Based Method Finder (SBMF) is proposed to retrieve more relevant methods from the code base. The technique first parses all the methods from the source code to construct a repository of methods.
It generates a data dependency graph for each method and converts the method into a reusable method (i.e., a program slice) by resolving data dependencies and redefining the parameters and return type. Later, all the methods are clustered into a number of clusters, where the methods in the same cluster perform the same task. To detect feature-wise similarity among a set of methods, the signatures (i.e., parameters and return types) of these methods are checked. Methods having the same signature are then executed against a set of randomly generated input values. Among these methods, those which produce the same output are considered feature-wise similar and a cluster is constructed to store them. To identify proper terms for a cluster, keywords are obtained from the methods in the cluster and the method frequency is calculated for each term. Terms are considered representative if they are found in most of the methods of the cluster. All the methods of the cluster are then indexed against these terms so that they are retrieved all together if a query term matches one of them. At last, the user query is expanded by adding synonyms of each query term to increase the matching probability between the query terms and index terms [7].

In order to evaluate the proposed technique, a tool was developed. Two types of code search techniques, KBCS and IDCS, were compared with SBMF to show its efficiency. An existing system named Sourcerer [8] was used for the implementation of KBCS and IDCS. However, SBCS and TDCS were not considered for comparison because these were proposed to improve precision rather than recall. For comparative result analysis, three metrics were used: recall, number of methods retrieved, and feature successfulness. Here, feature successfulness determines whether at least one relevant method is retrieved against the user queries provided for a feature. In the context of this paper, a feature can be considered as a requirement given to a developer to implement. 50 open source projects were selected to carry out the experiment. The result analysis shows that, on average, SBMF increases recall by 38% and 58% over KBCS and IDCS, respectively, against 170 queries. Besides, SBMF is successful for all the features, whereas KBCS and IDCS are successful for 7 and 3 features out of 9, respectively.

II. RELATED WORKS

Reusing existing code fragments reduces development time and effort [18]. For this reason, searching for reusable code snippets has become a common task among developers during software development [19]. Various techniques have been proposed in the literature to improve the performance of code search engines in terms of recall, precision, query successfulness, etc. These techniques can be broadly classified into four categories: Keyword Based Code Search (KBCS), Interface Driven Code Search (IDCS), Semantic Based Code Search (SBCS), and Test Driven Code Search (TDCS). Significant works related to each category are discussed in the following subsections.

A. Keyword Based Code Search (KBCS)

In KBCS, source code is treated as a plain text document and traditional IR centric approaches are employed to index the code and query over the index [20]. Besides, other metadata such as comments, file names, commit messages, etc. are used to retrieve relevant code fragments from a repository of source codes. One of the techniques related to KBCS is JSearch, which indexes source code against the keywords extracted from the code [2]. However, it cannot retrieve all the code snippets that implement the same feature but contain different keywords. This is because it does not check feature-wise similarity to detect common terms for these fragments. Several techniques like Sourcerer [8], Codifier [9], Krugle [10], etc. were proposed to provide infrastructure for large scale code search. These techniques use both structural and semantic information of source code to construct the index. Structural information comprises language, source file, related documents, classes, methods, dependencies, and so on. Semantic information of a program is gathered by generating terms from method names, class names, field names, comments, etc. Although these techniques adopt both types of information to fetch more relevant code fragments, they cannot retrieve feature-wise similar code blocks simultaneously. The reason is that all this information is stored following an IR based indexing mechanism, and no checking is performed to index similar code snippets under common proper terms.

B. Semantic Based Code Search (SBCS)

As open source codes are increasing day by day, it is thought that a significant amount of the code written today is already available on the internet. However, reusing existing code often does not directly meet user needs without modification. In order to find existing codes that support user requirements, a technique in the form of SBCS was proposed by Steven [16]. It takes keywords that represent the user requirements and retrieves relevant code fragments containing these keywords. Later, it runs user provided test cases on the fetched code snippets, and the codes that pass are delivered as the final search result. It performs well in terms of precision, but recall is reduced since proper terms are not determined while indexing feature-wise similar codes. So, some semantically similar code fragments cannot be fetched due to being indexed under inappropriate terms.

Sometimes, developers need to convert one type of object to another. To get example code implementing such a conversion, Niyana proposed a technique named XSnippet [17]. It creates a graph from source code by adopting a code mining algorithm. The graph represents the data flow within the corresponding source code. A user query is defined by providing an input type and an output type. For a user query, all the generated graphs are searched to find those code fragments that convert the input type into the output type. In this technique, developers need to provide the exact input type and output type to get example code blocks. Otherwise, it cannot retrieve code fragments that may satisfy user needs. However, studies of searching behavior show that developers are more interested in using keywords rather than concrete data types to define their queries [21].
For each method, Resultant methods are ranked based on the number of test it checks whether the body of that method contains any API/- cases successfully passed. Although the technique increases function call statement or not. If no such statement is found, precision, it produces low recall. The reason is that it per- the technique takes the method to convert it into reusable forms keyword matching to fetch methods from index without method (i.e., program slice that can execute independently justifying the appropriateness of the keywords. without having any dependency on other methods). Later, a Usually retrieved methods may not pass corresponding data dependency graph is constructed for the corresponding test cases due to different order of the parameters, return method to determine its input and output types. Although the type or parameter type. To resolve these issues, Janjic et al. signature of the method expresses the input and output types, proposed a technique that refactors the code to adapt with the this is not sufficient enough to convert into reusable function program context [15]. It applies every possible adaptations like for several scenarios. For example, a method may have return reordering parameters, using super type or sub class type of type void but it may manipulate one or more variables that a given return or parameter type, converting primitive type to are declared outside the body of the method. A method may reference type, etc. Thus, it improves TDCS by finding more not have any parameter (i.e., void) but use variables that are relevant methods. However, it produces low recall because it defined outside the body of the method. Again, the signature does not index similar methods under common terms. of a method may explicitly state the input and output types but some variables may be used or manipulated by it and D. Inreface Driven Code Search (IDCS) these are declared outside the method body. 
Considering all IDCS helps the developers to define their queries in a more of these scenarios, the technique generates data dependency structured form rather than just a set of keywords joined by graph to redefine the signature and convert into reusable boolean expression. Signature matching was the first proposed method. Each node in the graph denotes the variable and an IDCS technique to find relevant functions within a software edge from a to b (a → b) denotes variable a depends on library [13]. The approach crawls all the methods in the library, variable b. After constructing the graph, nodes that have in and uses signature of each method for indexing. Other code degree zero and variables denoted by these nodes are declared search techniques such as Sourcecer, ParseWeb, and Strath- outside the method body, are considered as input parameters. cona also support IDCS to improve the performance in code Besides, nodes that have out degree zero are considered as search [7]. Although IDCS assists to formulate user query, output variables of the method. If multiple output nodes are it does not select appropriate terms during indexing similar found, a complex data type is created where each field of codes that perform the same functionality. Thus, functionally the type denotes each node. The reason is that a method related code fragments will not be retrieved all together since return type can be a single data type - either primitive or these are indexed against inappropriate terms. complex data type. The technique uses the variables found in In order to find reusable code fragments, four types of the nodes containing in degree zero to generate parameters of techniques have been proposed in the literature which are the method. If a single node is found which out degree is zero, KBCS, IDCS, TDCS, and SBCS. 
All these techniques extract the type of the variable denoted by the node is used as return keywords from source code to generate terms, and index cor- type of the method. Otherwise, generated composite data type responding code against the terms. However, none of the tech- as discussed earlier is used. The signature of the method is niques checks the appropriateness of the terms with respect to redefined by combining the return type, method name, and implemented feature. As a result, the number of relevant codes parameters. It is possible to have one or more variables that retrieved is reduced due to indexing against improper term. are declared outside the method body. In the data dependency Moreover, if two or more code snippets implement similar graph, nodes representing these variables may have at least feature but contain different terms, existing techniques cannot one in degree and one out degree. In this case, the technique retrieve all these code fragments simultaneously. The reason is parses the source code and checks the declaration statements that these are indexed against different terms. So, to improve of the variables to determine the types of the variables. Using recall in code search, feature-wise similar codes should be this information, it adds declaration statement for each of the indexed under common appropriate terms. variables at the beginning of the function body. Thus, the 37 4th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2016) technique makes the method self-executable without having Algorithm 1 Cluster Similar Methods any external data dependency. Require: A list of methods (M ) for which search index will be constructed B. Clustering Similar Methods 1: procedure C LUSTER S IMILAR M ETHODS (M ) To improve code search, it is required to check the similarity 2: C = ∅; among methods found in the code base. Two or more methods 3: for each m ∈ M do may perform the same task in different ways. 
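The signature-redefinition rule above can be sketched as follows. This is an illustration, not the authors' implementation: the variable names, the `Composite` wrapper, and the graph encoding are hypothetical, and the edges are oriented so that, as in the paper's rule, input variables end up with in-degree zero and output variables with out-degree zero.

```python
from collections import defaultdict

def redefine_signature(edges, declared_outside, var_types, method_name):
    """Sketch of Section III-A: derive a reusable signature from a data
    dependency graph. Nodes with in-degree zero whose variables are declared
    outside the method body become parameters; nodes with out-degree zero
    become outputs. Multiple outputs are folded into one composite type."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    nodes = set()
    for a, b in edges:           # one directed edge (a, b) per dependency
        nodes |= {a, b}
        outdeg[a] += 1
        indeg[b] += 1
    params = sorted(v for v in nodes if indeg[v] == 0 and v in declared_outside)
    outputs = sorted(v for v in nodes if outdeg[v] == 0)
    if len(outputs) == 1:
        ret = var_types[outputs[0]]
    else:                        # several outputs: wrap them in a composite type
        ret = "Composite(" + ", ".join(f"{var_types[v]} {v}" for v in outputs) + ")"
    sig = f"{ret} {method_name}(" + ", ".join(f"{var_types[v]} {v}" for v in params) + ")"
    return sig, params, outputs
```

For a hypothetical method that reads the outside-declared variables n and rate and leaves its result in total, the redefined signature becomes double computeInterest(int n, double rate); with a second result variable count, the return type becomes the composite wrapper instead.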
B. Clustering Similar Methods

To improve code search, it is required to check the similarity among the methods found in the code base. Two or more methods may perform the same task in different ways. So, feature-wise similar methods need to be detected to retrieve the similar methods all together. In Algorithm 1, the procedure named ClusterSimilarMethods takes a list of reusable methods (M) as input, which is constructed following the previous step. A variable C is declared to store the different clusters of similar methods, where each cluster contains the methods that perform the same functionality (Algorithm 1 Line 2). A for loop is declared that iterates on M to construct clusters of similar methods. The procedure IsInAnyCluster is invoked to check whether each method m (belonging to M) has already been added to any cluster (Algorithm 1 Lines 4-5). If m does not belong to any cluster, a variable cl is declared to contain all the methods similar to m. A set of input data is generated based on the types of the parameters found in the signature of m, and the corresponding output is generated by executing m (Algorithm 1 Lines 9-10). Here inputset and outputset determine the intent of m. Another for loop is declared to identify other methods that are similar to m. In each iteration, the signature of each method m′ (in M) is matched with the signature of m to check whether the input data set can be fed into the method and the return type is identical to that of m (Algorithm 1 Line 15). If the signatures of both methods are identical, the method m′ is executed for inputset and the generated output is stored in outputset′. If outputset and outputset′ are found to be the same, m′ is considered similar to m, as both methods produce the same output for the same input data set (Algorithm 1 Lines 17-19). m′ is then added to cl to store all the methods similar to m. At last, cl is inserted into the list of all identified clusters (C).

Algorithm 1 Cluster Similar Methods
Require: A list of methods (M) for which the search index will be constructed
 1: procedure ClusterSimilarMethods(M)
 2:   C = ∅
 3:   for each m ∈ M do
 4:     if IsInAnyCluster(m, C) == true then
 5:       continue
 6:     end if
 7:     cl = ∅
 8:     cl.add(m)
 9:     inputset = generate a set of input data randomly for m
10:     outputset = execute m and generate the corresponding output for inputset
11:     for each m′ ∈ M, m′ ≠ m do
12:       if IsInAnyCluster(m′, C) == true then
13:         continue
14:       end if
15:       if m′.parameterTypes == m.parameterTypes and m′.returnType == m.returnType then
16:         outputset′ = execute m′ and generate the corresponding output for inputset
17:         if outputset == outputset′ then
18:           cl.add(m′)
19:         end if
20:       end if
21:     end for
22:     C.add(cl)
23:   end for
24: end procedure
25: procedure IsInAnyCluster(m, C)
26:   found = false
27:   for c ∈ C do
28:     if m ∈ c then
29:       found = true
30:       break
31:     end if
32:   end for
33:   return found
34: end procedure

C. Proper Term Selection

In order to retrieve more relevant methods, it is required to identify proper terms for each method before indexing. When two or more methods have different names or signatures but implement the same functionality, these methods should be indexed under common appropriate terms. As a result, all these methods will be obtained against a user query. So, after getting all the clusters from the previous step, representative terms are selected for each cluster. For a cluster, terms are obtained from the methods found in the cluster by extracting, tokenizing, and stemming the keywords found in the methods. Terms that are found in most of the methods are considered the final representative terms for each of these methods.
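The clustering and term-selection steps above can be sketched together as follows. This is an illustrative simplification, not the authors' tool: every method here takes a single int (SBMF generates inputs from each method's actual parameter types), and the "most of the methods" cutoff is taken as 50%, which the paper does not state exactly.

```python
import random
import re

def cluster_similar(methods, trials=20, seed=42):
    """Behavioral clustering in the spirit of Algorithm 1: methods whose
    signatures match and which produce identical outputs on the same
    randomly generated inputs fall into one cluster. All methods here take
    one int, so the signature check of Line 15 is implicit."""
    rng = random.Random(seed)
    inputs = [rng.randint(-1000, 1000) for _ in range(trials)]
    clusters, placed = [], set()
    for name, fn in methods:
        if name in placed:
            continue
        cluster = [name]
        placed.add(name)
        outputs = [fn(x) for x in inputs]     # the "outputset" of the method
        for other_name, other_fn in methods:
            if other_name in placed:
                continue
            if [other_fn(x) for x in inputs] == outputs:
                cluster.append(other_name)
                placed.add(other_name)
        clusters.append(cluster)
    return clusters

def representative_terms(cluster_sources, threshold=0.5):
    """Section III-C sketch: a term is representative if it occurs in at
    least `threshold` of the cluster's methods (stemming omitted)."""
    docs = [set(re.findall(r"[a-z]+", src.lower())) for src in cluster_sources]
    return sorted(t for t in set().union(*docs)
                  if sum(t in d for d in docs) / len(docs) >= threshold)
```

For example, two methods that double an integer are clustered together and apart from a squaring method, after which the terms shared by most cluster members (such as sort for a cluster of sorting methods) are kept as the index terms.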
D. Handling Methods Having API/Library Function Calls

As developers also search for example code to understand the usage of an API, in this step, methods that have API call statements are gathered. For each identified method, terms are generated from the API call statements and the method is indexed against these terms. As a result, if a query term does not match the signature of a method but matches its API invocation statements, the method is retrieved as API usage example code.

E. Index Construction and Query Expansion

After generating appropriate terms for each method and merging similar ones, an index is built for searching the desired methods. A posting list is created to construct the index, which maps terms to the corresponding methods. Later, the user query is expanded to retrieve more relevant methods against the query. Two procedures named ConstructIndex and Query are presented in Algorithm 2 to build the index of methods obtained from the previous steps and to refine the user query, respectively. To construct the index, an empty posting list is declared, which maps each term to the corresponding methods (Algorithm 2 Line 2). A nested for loop is defined, where the outer loop iterates on the list of methods (M) given as input to the procedure (Algorithm 2 Lines 3-4). The inner loop iterates to get all the terms of each method in M. Each term is checked against the posting list so that a new term can be added to the list (Algorithm 2 Lines 5-7). Next, the method is added to the posting list against the term so that, when a query term matches the term, the method will be retrieved (Algorithm 2 Line 8). After adding all the methods, the list is returned by the procedure (Algorithm 2 Line 11).

In procedure Query, a boolean query is given as an argument, from which terms are separated and stored in a variable named queryTerms to expand the query (Algorithm 2 Lines 13-14). A nested for loop is defined, where the outer loop iterates on these terms (Algorithm 2 Lines 15-21). In each iteration, a temporary variable (expandedTerm), initialized with the corresponding term, is used to store the synonyms of the term. To expand each term, the synonyms are appended to expandedTerm in the inner loop (Algorithm 2 Lines 17-19). Later, each term in queryTerms is replaced with the corresponding expandedTerm to expand the query (Algorithm 2 Line 20). As a result of the expansion, the probability of matching a query string with the terms defined in the index increases. Finally, the query is executed on the index to retrieve the intended methods, which are returned by the procedure (Algorithm 2 Lines 22-23).

Algorithm 2 Index Construction and Query Expansion
Require: A list of methods (M) containing the signature, body and terms of each method
 1: procedure ConstructIndex(M)
 2:   Map<String, List<Method>> postingList
 3:   for each m ∈ M do
 4:     for each t ∈ m.terms do
 5:       if !postingList.keys.contains(t) then
 6:         postingList.keys.add(t)
 7:       end if
 8:       postingList[t].add(m)
 9:     end for
10:   end for
11:   return postingList
12: end procedure
13: procedure Query(booleanQueryStr)
14:   queryTerms = get all terms from booleanQueryStr
15:   for each qt ∈ queryTerms do
16:     expandedTerm = qt
17:     for each syn ∈ synonyms of qt do
18:       expandedTerm += " OR " + syn
19:     end for
20:     queryTerms.replace(qt, expandedTerm)
21:   end for
22:   methods = obtain the methods from the index satisfying queryTerms
23:   return methods
24: end procedure
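A minimal sketch of the two procedures, under stated assumptions: the real tool builds its index on Apache Lucene, and the synonym table below stands in for whatever thesaurus the query expansion consults, which the paper does not name.

```python
from collections import defaultdict

def construct_index(methods):
    """ConstructIndex sketch: map each term to the methods indexed under it.
    `methods` is a list of (method_id, terms) pairs."""
    posting_list = defaultdict(list)
    for name, terms in methods:
        for t in terms:
            posting_list[t].append(name)
    return posting_list

def expand_query(query_terms, synonyms):
    """Query-expansion sketch: replace each term with an OR-group made of
    the term and its synonyms, as in Algorithm 2 Lines 15-21."""
    return [" OR ".join([qt] + synonyms.get(qt, [])) for qt in query_terms]

def search(posting_list, expanded_query):
    """A method matches when, for every OR-group in the expanded query,
    at least one alternative maps to it in the posting list."""
    result = None
    for group in expanded_query:
        hits = set()
        for alt in group.split(" OR "):
            hits.update(posting_list.get(alt, []))
        result = hits if result is None else result & hits
    return sorted(result or [])
```

Mirroring the encrypt/crypt example discussed in the result analysis, expanding the query term encrypt with its synonym crypt retrieves both a method indexed under encrypt and one indexed only under crypt.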
IV. IMPLEMENTATION AND RESULT ANALYSIS

In order to perform comparative result analysis, the proposed technique (SBMF) was implemented in the form of a software tool. 50 open source projects were selected as data sources for the experimental analysis. To evaluate the proposed technique, 170 queries representing 9 different features were executed by the tool. For comparative analysis, the same queries were also run on Sourcerer, which supports KBCS and IDCS.

A. Environmental Setup

This section outlines the software and frameworks required for the experimental analysis. SBMF was implemented using the C# programming language. Moreover, some other tools were also used, which are addressed as follows:
• JavaParser: An open source library used to parse Java source code (https://github.com/javaparser)
• Apache Lucene: A popular search engine infrastructure used to index Java methods and query over the index (https://lucene.apache.org/)
• Luke: An open source Lucene client used to execute queries on the Lucene index and visualize the search results (https://github.com/DmitryKey)

B. Dataset Selection

In order to perform the experimental analysis, 50 open source projects from SourceForge (https://sourceforge.net/) were selected. Fraser and Arcuri showed that these projects are statistically sound and representative of open source projects [22]. A set of features was selected from the existing works in code search [7], [16], [23], [24], as shown in Table I. On the other hand, to evaluate the proposed technique, a set of queries was selected from [7]. Here, each query is related to a particular functionality shown in Table I and all the queries were created randomly. 15 subjects were employed to identify relevant methods for the functionalities. Among the 15 subjects, 5 were senior Java developers and the remaining 10 were master's students. The reason for choosing students in this study is that they can play an important role in software engineering experiments [25]. All the experimental datasets are available at http://tinyurl.com/zdqmoqz.

C. Comparative Result Analysis

For comparative result analysis, SBMF was run on the experimental datasets and the relevance of the retrieved methods was checked for each user query. Moreover, Sourcerer, which supports KBCS and IDCS, was also run on the same datasets, and the search results obtained by it were compared to those of SBMF. Three metrics were used to evaluate the performance of SBMF in comparison with KBCS and IDCS: recall, number of retrieved methods, and feature successfulness. A detailed result analysis with respect to each of the metrics is discussed as follows.
TABLE I
SELECTED FUNCTIONALITIES WITH FREQUENCY

#  Functionality             # methods  # queries
1  decoding string                13        20
2  encrypting password             3        27
3  decoding a URL                  3        21
4  generating MD5 hash             3        16
5  rotating array                  2        25
6  resizing image                  3        25
7  scaling image                   3        19
8  encoding string to HTML         2         6
9  joining string                 47        36

Fig. 1. Recall Analysis

Recall Analysis: Recall is one of the most commonly used metrics to measure the performance of a traditional IR system. As the intent of this paper is to improve recall in code search, it is considered an important metric to evaluate the proposed technique. In this experiment, recall is defined as follows.

recall = (number of retrieved relevant methods) / (number of relevant methods in the repository)

Fig. 1 depicts a comparative recall analysis among SBMF, KBCS, and IDCS, where the X axis denotes the feature no. as shown in Table I and the Y axis represents the measured recall. For feature 1 (decoding string), approximately 15% recall is shown in Fig. 1 for both KBCS and IDCS, whereas 100% recall is found for SBMF. There are 13 methods in the repository that implement the feature. Among these, two methods are found which contain the keywords decode and string in the method name and parameter, respectively. As a result, these methods are retrieved by both KBCS and IDCS. However, these techniques cannot retrieve the other 11 methods because the signatures of these methods do not contain any term related to decode. While analyzing the source code of these methods, it is seen that their bodies use third party APIs like URLDecoder.decode(String, String), Hex.decode(String), Base64.decode(base64), etc. to implement the feature. SBMF takes terms from API call statements and indexes the methods against these terms to provide example codes regarding API usage. So, it retrieves all these 13 methods.

For feature 2 (encrypting password), IDCS cannot find any methods, but 66.67% and 33.33% of the relevant methods are retrieved by SBMF and KBCS, respectively, as shown in Fig. 1. To get the methods that implement this feature, the following query is provided to IDCS.

name:(encrypt) AND return:(String) AND parameter:(String)

Although there is a single method found in the code base that has the encrypt keyword in its name, it does not have String as its parameter. So, IDCS cannot obtain this method, but KBCS retrieves it because the query keyword matches the method name. However, SBMF retrieves one more method, having the signature crypt(String strpw, String strsalt). The reason is that encrypt and crypt both express the same intent, as detected by the query expansion part of SBMF (Algorithm 2).

Fig. 2. Number of Retrieved Methods

There are 3 relevant methods in the experimental projects that implement feature no. 3 (decoding a URL). According to Fig. 1, only a single method is retrieved by SBMF, which produces a recall of 33.33%. On the contrary, KBCS and IDCS cannot retrieve any method related to this feature. This is because no method contains decode and URL simultaneously in its signature. Although one of these methods, named getPath, does not provide any semantic information representing the feature, it invokes a library method - URLDecoder.decode(path, "UTF-8") - which implements the feature. SBMF considers the invocation statement for getting more relevant terms and thus retrieves this method. Two other methods cannot be retrieved by SBMF because no structural similarity is found among them and no keywords represent the feature.

According to Fig. 1, 100% recall is obtained for SBMF, and 33.33% for KBCS and IDCS individually, with respect to feature no. 4 (generating MD5 hash). It is clear that SBMF has higher recall than the other two approaches.

Fig. 3. Feature Successfulness Analysis
According to Fig. 1, 100% recall is obtained for SBMF, and 33.33% for KBCS and IDCS individually, with respect to feature no. 4 (Generating MD5 hash). It is clear that SBMF has higher recall than the other two approaches. The reason is that most of the methods implementing this feature do not have proper names to represent their intent. There are 5 methods relevant to this feature and only one method is found having a name consistent with the feature. KBCS and IDCS fail to retrieve all these methods because both techniques extract terms from individual methods and do not consider the appropriateness of the terms. However, SBMF finds that these methods are semantically similar, so they are indexed under common terms. As a result, when a user query matches one of these methods, the other three methods are also retrieved with it.

For features no. 5, 6 and 7, IDCS cannot retrieve any method from the code base used in this experiment. The reason is that the appropriate parameter type is not determined in the user queries used for these features. However, KBCS shows 50%, 33.33%, and 66.67% recall for features no. 5, 6 and 7 respectively. On the other hand, SBMF shows 100% for features no. 5 and 7, and 33.33% for feature no. 6, as illustrated in Fig. 1. For feature no. 5, two relevant methods are found whose names are transpose and rotate respectively. These two methods are feature-wise similar, which is detected by SBMF, and they are indexed under common terms (i.e., rotate and transpose). On the other hand, KBCS does not check similarity and analyzes each method individually during indexing. So, only the rotate method is retrieved by KBCS. For feature no. 7, SBMF retrieves one more method than KBCS because this method does not contain any term related to image but uses a field of type Image. SBMF considers this usage since the scaling operation is performed on this field by the method, and adds the additional term Image against the method.

SBMF, KBCS, and IDCS show equal performance for feature no. 8 (Encoding String to HTML) in terms of recall. However, 50% of the relevant methods cannot be retrieved because no HTML keyword is found in these methods.

Only SBMF is able to retrieve 21 relevant methods, whereas the other techniques cannot fetch a single method for feature no. 9 (Joining String). Here, SBMF outperforms the others because it identifies many structurally similar methods which have different names but all perform string concatenation. Among these, several methods are found which have proper keywords in their bodies. These keywords are attached to the term list of each similar method by SBMF. As a result, the methods are indexed under common appropriate terms and all of them are retrieved simultaneously. However, the other 26 relevant methods cannot be retrieved since no signature matching is found among them.

Fig. 3. Feature Successfulness Analysis

Number of Retrieved Methods (NRM) and Feature Successfulness Analysis: As NRM is an important measure to perceive the recall of a search engine, a comparative result analysis with respect to NRM is performed here. A bar diagram is shown in Fig. 2 depicting the feature-wise NRM for SBMF, KBCS, and IDCS. According to the diagram, SBMF retrieves more methods than KBCS and IDCS because of adding common terms to each method.

Although IDCS produces better precision than KBCS and SBMF, it cannot retrieve a single method for some features (such as features no. 2, 3, 5, 6, 7, 9). The reason is that the user queries do not have the proper parameter type or return type. This scenario is common when developers have little or no knowledge about the implementation of a feature. KBCS and SBMF mitigate the problem by retrieving more relevant methods through free text search. In order to determine whether a feature is successful or not, a metric named feature successfulness is introduced. A feature is said to be successful if at least one relevant method is retrieved that implements the feature. Fig. 3 presents the number of successful features for SBMF, KBCS, and IDCS. According to this figure, SBMF is successful for all 9 features, whereas 7 and 3 successful features are found for KBCS and IDCS respectively. This measure provides the notion that having higher precision is not effective if the number of successful features is low. In addition, improving recall increases the chance of having a higher number of successful features. For this reason, SBMF performs better than KBCS and IDCS.

V. THREATS TO VALIDITY

In this section, limitations of the experimental study are discussed in terms of internal, external, and construct validity.

a) Internal Validity: In the experiment, there was no control over the skills of the subjects. However, the risk of this threat is reduced by applying a repetitive measurement approach, because the same users created queries for KBCS, IDCS, and SBMF and evaluated the search results.

b) External Validity: The set of selected features may not generalize to the population of software functions. However, these features are among the most common features used for evaluation in code search. Another possible threat is that the projects used in the experiment may not be sufficient. However, these projects are statistically representative of open source projects, as highlighted in [7].

c) Construct Validity: Existing code clone detection techniques can be used to improve recall in code search. However, SBMF differs from code clone detection in several points. SBMF can detect similar methods written in different programming languages, and only the execution of a method is platform dependent. Another point is that code clone detection techniques may produce false positive results for feature-wise clone detection (usually known as Type IV) if the values of certain parameters are not defined properly [26]. As a result, a search engine may retrieve irrelevant methods. However, SBMF checks dynamic behavior by executing a method and matching the output for the corresponding input to detect feature-wise similar methods. Such a mechanism ensures that methods producing the same output are feature-wise similar, and thus no irrelevant method is grouped with these methods. Besides recall, two other metrics are used in the study to observe the effectiveness of the technique. Although precision is not shown directly in the result analysis due to space limitation, it can be obtained by using the data given in TABLE I and Fig. 2.
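The dynamic check described above, executing methods and comparing their outputs for the same inputs, can be sketched as follows. This is a simplified single-language stand-in (SBMF itself executes methods that may come from different languages), and the candidate functions and input generation strategy are invented for illustration.

```python
import random

def behaviorally_similar(f, g, trials=100, seed=42):
    """Treat two functions as feature-wise similar if they produce the
    same output for every input in a randomly generated input set."""
    rng = random.Random(seed)
    inputs = [[rng.randint(-1000, 1000) for _ in range(rng.randint(1, 10))]
              for _ in range(trials)]
    # Pass fresh copies so one function cannot affect the other's input.
    return all(f(list(xs)) == g(list(xs)) for xs in inputs)

# Two differently named, differently implemented sorting methods,
# stand-ins for the poorly named "x" vs "sort" example from the paper.
def x(values):              # hand-written bubble sort
    vs = list(values)
    for i in range(len(vs)):
        for j in range(len(vs) - 1 - i):
            if vs[j] > vs[j + 1]:
                vs[j], vs[j + 1] = vs[j + 1], vs[j]
    return vs

def sort(values):           # library sort
    return sorted(values)

print(behaviorally_similar(x, sort))   # True: same feature despite the names
```

Because the comparison is on observed outputs rather than on tokens or structure, methods with unrelated names and bodies can still land in the same cluster, which is what allows them to be indexed under common terms.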
VI. CONCLUSION

The recall of a code search engine reduces if similar code fragments are not indexed under common and proper terms. So, a technique named SBMF is proposed in this paper which indexes both syntactically and semantically similar methods under common terms. The technique is implemented as complete software that constructs the index and retrieves relevant methods against the user query.

SBMF first identifies all the methods in the code base by parsing the source code. It converts the methods into reusable methods by resolving data dependency and redefining the method signature. Feature-wise similar methods are detected by checking signatures and executing the methods. Here, methods that produce the same output set for a randomly generated set of input values are considered similar and are kept in one cluster. Thus, all the methods are distributed into a set of clusters where each cluster contains feature-wise similar methods and any two clusters differ from one another in the implemented feature. All these methods are indexed against the terms that are found in more than half of the methods in the cluster. At last, query expansion is performed to increase the probability of retrieving more methods.
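SBMF's term-selection rule, indexing a cluster's methods under the terms that occur in more than half of the cluster's members, can be sketched as follows. The cluster data and helper name are invented for illustration.

```python
from collections import Counter

def common_terms(cluster_term_lists):
    """Select the terms that occur in more than half of the methods in a
    cluster of feature-wise similar methods."""
    counts = Counter()
    for terms in cluster_term_lists:
        counts.update(set(terms))      # count each term once per method
    threshold = len(cluster_term_lists) / 2
    return {t for t, n in counts.items() if n > threshold}

# Hypothetical cluster: three methods detected as feature-wise similar
# (all perform sorting), each with terms mined from its name, signature,
# and body.
cluster = [
    ["x", "int", "array", "sort"],       # poorly named method 'x'
    ["sort", "array", "swap"],           # method 'sort'
    ["order", "sort", "array", "temp"],  # method 'order'
]
print(sorted(common_terms(cluster)))     # ['array', 'sort']
```

The majority vote filters out incidental identifiers such as "temp" or "x", so even the poorly named method becomes findable under the representative terms "sort" and "array".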
For experimental analysis of the technique, 50 open source projects were selected to build the code base and 9 features were chosen to generate queries. An existing technique named Sourcerer was used to compare the results with SBMF. While analyzing the results, it has been seen that SBMF shows a 38% improvement in recall over KBCS and a 58% improvement over IDCS. It also retrieves relevant methods for all 9 features, whereas KBCS and IDCS retrieve methods for 7 and 3 features, respectively. In future, the experiment will be conducted on a large-scale dataset to observe the behavior of the technique.

ACKNOWLEDGMENT

This research is supported by the ICT Division, Ministry of Posts, Telecommunications and Information Technology, Bangladesh. 56.00.0000.028.33.065.16-747, 14-06-2016.

REFERENCES

[1] Ruben Prieto-Diaz. Implementing faceted classification for software reuse. Communications of the ACM, 34(5):88–97, 1991.
[2] Renuka Sindhgatta. Using an information retrieval system to retrieve source code samples. In Proceedings of the 28th International Conference on Software Engineering, pages 905–908. ACM, 2006.
[3] Hinrich Schütze. Introduction to information retrieval. In Proceedings of the international communication of association for computing machinery conference, 2008.
[4] Randy Smith and Susan Horwitz. Detecting and measuring similarity in code clones. In Proceedings of the International Workshop on Software Clones (IWSC), 2009.
[5] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7):654–670, 2002.
[6] Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. Mining concepts from code with probabilistic topic models. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 461–464. ACM, 2007.
[7] Otávio Augusto Lazzarini Lemos, Adriano Carvalho de Paula, Hitesh Sajnani, and Cristina V Lopes. Can the use of types and query expansion help improve large-scale code search? In Source Code Analysis and Manipulation (SCAM), 2015 IEEE 15th International Working Conference on, pages 41–50. IEEE, 2015.
[8] Sushil Bajracharya, Trung Ngo, Erik Linstead, Yimeng Dou, Paul Rigor, Pierre Baldi, and Cristina Lopes. Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications, pages 681–682. ACM, 2006.
[9] Andrew Begel. Codifier: a programmer-centric search user interface. In Proceedings of the Workshop on Human-Computer Interaction and Information Retrieval, pages 23–24, 2007.
[10] Susan Elliott Sim and Rosalva E Gallardo-Valencia. Finding Source Code on the Web for Remix and Reuse. Springer, 2013.
[11] Suresh Thummalapenta and Tao Xie. PARSEWeb: a programmer assistant for reusing open source code on the web. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 204–213. ACM, 2007.
[12] Reid Holmes, Robert J Walker, and Gail C Murphy. Strathcona example recommendation tool. In ACM SIGSOFT Software Engineering Notes, volume 30, pages 237–240. ACM, 2005.
[13] Amy Moormann Zaremski and Jeannette M Wing. Signature matching: a tool for using software libraries. ACM Transactions on Software Engineering and Methodology (TOSEM), 4(2):146–170, 1995.
[14] Otávio Augusto Lazzarini Lemos, Sushil Krishna Bajracharya, and Joel Ossher. CodeGenie: a tool for test-driven source code search. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, pages 917–918. ACM, 2007.
[15] Werner Janjic and Colin Atkinson. Leveraging software search and reuse with automated software adaptation. In Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE), 2012 ICSE Workshop on, pages 23–26. IEEE, 2012.
[16] Steven P Reiss. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, pages 243–253. IEEE Computer Society, 2009.
[17] Naiyana Sahavechaphan and Kajal Claypool. XSnippet: mining for sample code. ACM SIGPLAN Notices, 41(10):413–430, 2006.
[18] Wayne C Lim. Effects of reuse on quality, productivity, and economics. IEEE Software, 11(5):23–30, 1994.
[19] Stefan Haefliger, Georg von Krogh, and Sebastian Spaeth. Code reuse in open source software. Management Science, 54(1):180–193, 2008.
[20] William B Frakes and Brian A Nejmeh. Software reuse through information retrieval. In ACM SIGIR Forum, volume 21, pages 30–36. ACM, 1986.
[21] Susan Elliott Sim, Charles LA Clarke, and Richard C Holt. Archetypal source code searches: a survey of software developers and maintainers. In Program Comprehension, 1998. IWPC'98. Proceedings., 6th International Workshop on, pages 180–187. IEEE, 1998.
[22] Gordon Fraser and Andrea Arcuri. Sound empirical evidence in software testing. In Proceedings of the 34th International Conference on Software Engineering, pages 178–188. IEEE Press, 2012.
[23] Otávio Augusto Lazzarini Lemos, Sushil Bajracharya, Joel Ossher, Paulo Cesar Masiero, and Cristina Lopes. A test-driven approach to code search and its application to the reuse of auxiliary functionality. Information and Software Technology, 53(4):294–306, 2011.
[24] Otávio AL Lemos, Adriano C de Paula, Felipe C Zanichelli, and Cristina V Lopes. Thesaurus-based automatic query expansion for interface-driven code search. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 212–221. ACM, 2014.
[25] Barbara A Kitchenham, Shari Lawrence Pfleeger, Lesley M Pickard, Peter W Jones, David C Hoaglin, Khaled El Emam, and Jarrett Rosenberg. Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8):721–734, 2002.
[26] Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: tuning code clones at hands of engineers in practice. In Proceedings of the 28th Annual Computer Security Applications Conference, pages 369–378. ACM, 2012.