=Paper=
{{Paper
|id=Vol-3331/paper10
|storemode=property
|title=A Semantic Code Search Method Based on Program Conversion
|pdfUrl=https://ceur-ws.org/Vol-3331/paper10.pdf
|volume=Vol-3331
|authors=Yan Tao,Le Wei,Hongping Shu
|dblpUrl=https://dblp.org/rec/conf/ahpcai/TaoWS22
}}
==A Semantic Code Search Method Based on Program Conversion==
Yan Tao 1, Le Wei 1,2, Hongping Shu 2
1 School of Software Engineering, Chengdu University of Information Technology, Chengdu, Sichuan 610225, China
2 Automatic Software Generation & Intelligence Service Key Laboratory of Sichuan Province, Chengdu 610225, China
Abstract
Code search tasks often fail to capture code fragments accurately and quickly because the semantic and structural information of the source code is ignored. To address this problem, a semantic code search method based on program conversion (Semantic Code Search based on Program Conversion, SCSPC) is proposed. SCSPC diversifies the acquired code fragments through data augmentation, transforming each program by renaming variables, exchanging two independent statements, exchanging loops, inserting exception-capture statements, and replacing switch statements with equivalent if statements. It then trains the CodeBERT model with mixed objective functions (masked language modeling and replaced token detection) to generate semantically rich sentence vectors for the code fragments and natural-language queries, and compares the similarity of the vectors to complete the code search. Experimental results show that, compared with the SWIM, QECK and CODEnn models, the mean reciprocal rank and hit rate (S@10) of the SCSPC method are increased by 2% and 0.041 respectively.
Keywords
Code Search, CodeBERT, Program Transformation, Semantic Code
1. Introduction
In the modern software development process, software reuse is applied across many fields. The same function may exist in different domains or different software systems, and the code implementing it is often similar. Developers therefore usually search large code repositories for existing code, saving development time and improving efficiency.
At present, code search is mainly based on information retrieval methods and deep learning methods. Methods based on information retrieval focus on how to generate correct keywords and calculate the matching degree between them. For example, CodeHow, proposed by Lv et al. [1], builds on natural-language similarity and the influence of APIs on code search: it understands a query by identifying the APIs the query may refer to, and after determining the query's potential APIs, merges the API information into the code search process. Lu et al. [2] proposed a query expansion model that generates synonyms via WordNet [3]. Keivanloo et al. [4] proposed a pattern-based search technique that uses a clone detection method based on the vector space model to support the discovery of working code examples and to search all types of statements (such as control flow, data flow and API calls).
Current code search methods still have problems. First, they ignore that code may be viewed from different dimensions, which makes it difficult to cover all perspectives with a single code representation. Second, a single natural-language query can hardly express the intentions of different users, which makes the search results inaccurate. To address these problems, this paper proposes a code search method based on program transformation. The method uses program transformation to enhance
data and diversify code fragments. The CodeBERT model is used to map natural language and code snippets into the same vector space; the similarity between the vectors is then calculated and sorted to return the corresponding code snippets to the user.

AHPCAI2022@2nd International Conference on Algorithms, High Performance Computing and Artificial Intelligence
EMAIL: Corresponding author's e-mail: 897123966@qq.com (Yan Tao)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
2. The model of CodeBERT
Pre-trained models are essentially a form of transfer learning. The most typical pre-trained model is BERT [5] (Bidirectional Encoder Representations from Transformers). BERT uses a bidirectional encoder from the Transformer [6]; it is geared towards textual language but is not suited to representing semantic relationships between bimodal (natural and programming) language pairs.
The CodeBERT [7] model captures the semantic association between natural language and code snippets, and can quickly complete tasks such as semantic similarity search. The CodeBERT model is based on a 12-layer Transformer network. Its pre-training completes two tasks: Masked Language Modeling (MLM) and Replaced Token Detection (RTD). The masked language model randomly masks tokens in a sentence and then predicts what the masked tokens were; this task is pre-trained on NL-PL pairs. As shown in Figure 1, in the pre-training phase, [CLS] is placed at the beginning of the natural-language sentence for binary classification, and flag tokens separate the natural language from the code snippet. The input has the form {[CLS], NL, [SEP], PL, [EOS]}. The natural language is treated as a sequence of words and the code snippet as a sequence of tokens. The output is the embedding value of each token together with the [CLS] value. The replaced token detection task uses two generators, combined with the context, to generate tokens at the [MASK] positions, and trains the discriminator to predict whether each token in the sentence has been replaced. This task is pre-trained on unimodal data; the training process is shown in Figure 2.
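As a minimal sketch of the bimodal input format and MLM masking described above (the token names follow the paper's notation; the real CodeBERT tokenizer and masking policy are more involved):

```python
def build_bimodal_input(nl_tokens, pl_tokens):
    """Assemble the {[CLS], NL, [SEP], PL, [EOS]} sequence used in pre-training."""
    return ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[EOS]"]

def mask_tokens(tokens, positions):
    """Masked language modeling: replace the tokens at the given positions with [MASK]."""
    masked_at = set(positions)
    return [("[MASK]" if i in masked_at else t) for i, t in enumerate(tokens)]

# Example NL-PL pair, mirroring the "sort a dictionary" example in Figure 1.
seq = build_bimodal_input(["sort", "a", "dictionary"], ["sorted", "(", "d", ")"])
masked = mask_tokens(seq, [1])  # the model must recover the token at position 1
```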
Figure 1. The model of masked language modeling (example input: [CLS] Sort a dictionary by [MASK] [SEP] [(key,dictionary[key]) for key in [MASK] (dictionary.key())]).
Figure 2. The task of replaced token detection.
3. Code search method based on program transformation
3.1. The framework of the method
The ultimate goal of code search is to find code snippets that match the user's expectations. Semantic Code Search based on Program Conversion (SCSPC) captures the semantic relationships between natural language and programming languages and generates a common representation. The work of this paper mainly includes semantic-preserving program transformation, model training, and code search result matching; an overview is shown in Figure 3. First, the SCSPC method uses data augmentation to transform the source code. The specific steps are to rename source-code variables, swap the positions of independent statements, exchange loops, add exception-capture statements, and equivalently replace switch statements with if statements. Each transformation affects the structure of a method differently, which improves the diversity of the training data. Then, natural language and code snippets are input into the CodeBERT model and mapped into sentence vectors and code-block vectors respectively. Finally, the cosine similarity between each code-block vector and the sentence vector is calculated, and the code search is completed by sorting by similarity.
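As an illustration of the first transformation, variable renaming can be sketched as a whole-identifier substitution (a toy sketch: a real implementation would operate on the parse tree so that string literals, comments, and field names are not touched):

```python
import re

def rename_variables(code, mapping):
    """Semantic-preserving rename: substitute whole identifiers only,
    so 'n' does not match inside 'n1' or 'name'."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)

# A Java snippet augmented by renaming its local variables.
java = "int count = 0; for (int i = 0; i < count; i++) { count += i; }"
augmented = rename_variables(java, {"count": "total", "i": "idx"})
```

Each transformed copy behaves identically to the original, so the NL description still matches it; this is what lets the augmented pairs be added to the training data.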
Figure 3. Overview diagram of the semantic code search method based on program transformation.
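The final matching step, cosine similarity between the query vector and each snippet vector followed by sorting, can be sketched as follows (the vectors here are illustrative stand-ins for CodeBERT outputs):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_snippets(query_vec, snippet_vecs):
    """Return snippet indices sorted from most to least similar to the query."""
    scores = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(snippet_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]

# The snippet most aligned with the query vector is ranked first.
ranking = rank_snippets([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]])
```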
3.2. The model training of the method
Model training focuses on capturing the semantic connections between queries and code snippets
and generating a common representation, as shown in Figure 4.
Figure 4. The process of model training.
The model mainly consists of an embedding layer and an encoding layer. In the embedding layer, input queries and code snippets are mapped into a high-dimensional vector space. The maximum length of an input query is set to 1024. A query is treated as a sequence of words, bracketed by the start and end tokens described in Section 2 and split into individual words; a code fragment is treated as a sequence of tokens. The encoding layer learns code snippets c and natural language d using a bidirectional encoder. The main calculation is shown in (1), where PE is the positional embedding and i denotes the dimension of the embedding. Linear stands for a linear layer; Q, K and V are the three vectors of the attention mechanism: the query, key and value vectors. LayerNorm denotes the normalization layer, and ReLU is the activation function; using ReLU speeds up the convergence of the network.
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
X = Embedding + PositionalEncoding
Q = Linear(X) = X W_Q
K = Linear(X) = X W_K                                   (1)
V = Linear(X) = X W_V
X_attention = Attention(Q, K, V)
X_attention = X_attention + X
X_attention = LayerNorm(X_attention)
X_hidden = Linear(ReLU(Linear(X_attention)))
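The sinusoidal positional encoding in (1) can be computed directly; a small sketch, with d the embedding dimension and dim the dimension index:

```python
import math

def positional_encoding(pos, dim, d):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
    Even dimensions use sin, odd dimensions use cos, pairing (2i, 2i+1)."""
    angle = pos / (10000 ** ((2 * (dim // 2)) / d))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

# At position 0 every even dimension is sin(0) = 0 and every odd one is cos(0) = 1.
row0 = [positional_encoding(0, dim, 8) for dim in range(8)]
```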
4. Experiment
4.1. Data of the experiment
To ensure that the collected source code is genuinely reliable, the source code used in this paper comes from CodeSearchNet [8], a dataset for code representation learning launched jointly by the most popular open-source hosting platform, GitHub, and the Deep Program Understanding group at Microsoft Research Cambridge. The CodeSearchNet dataset is composed of 2 million comment-code pairs from open-source libraries; the comments are top-level function or method comments, and the code is the entire function or method. In this experiment, the Java portion of the CodeSearchNet Challenge was selected as the dataset. It contains 500,754 pairs of functional Java code snippets and their descriptions: 450,439 pairs were used for training, 33,658 pairs as the validation set, and 28,910 pairs as the test set.
4.2. Evaluation criteria of the experiment
In this paper, SuccessRate@k and MRR (Mean Reciprocal Rank) are used to verify the effectiveness of the SCSPC method. The formula for the mean reciprocal rank is given in (2), where Q denotes the set of automatically evaluated queries and rank_i is the rank corresponding to the i-th query. The higher the value of MRR, the better the code search performance.

MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i                    (2)
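Equation (2) averages the reciprocal of the first-hit rank over all queries; a direct sketch:

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based rank of the first correct snippet for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries whose first correct result appears at ranks 1, 2 and 4.
score = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```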
SuccessRate@k denotes the percentage of queries for which at least one correct fragment exists among the top k code fragments returned by the search model, and is calculated as shown in (3), where Q is the query set in the automatic evaluation, rank_q is the highest rank of a hit code snippet in the search results, and σ is a function that returns 1 if the rank value of query q is no greater than k, and 0 otherwise. This metric is important because a good search engine should find the code snippets developers need within as few results as possible. In this experiment, the results were evaluated for k of 1, 5, and 10; a higher value indicates better code search performance.

SuccessRate@k = (1/|Q|) Σ_{q=1}^{|Q|} σ(rank_q ≤ k)      (3)
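Equation (3) can likewise be computed directly from the per-query hit ranks:

```python
def success_rate_at_k(ranks, k):
    """ranks[q] is the best (highest) rank of a hit for query q; counts rank <= k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Four queries with hits at ranks 1, 3, 7 and 12.
hits = [1, 3, 7, 12]
s_at_5 = success_rate_at_k(hits, 5)  # 2 of 4 queries hit within the top 5
```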
4.3. Results of the experiment
To verify the effectiveness of the SCSPC method, SWIM [9], QECK [10] and CODEnn are selected as the three benchmark methods for comparison, and the code repository in Section 4.1 is used to construct the training, validation and test sets.
In this experiment, the search-result cutoff k is set to 1, 5, and 10. A value of 1 represents the probability that the correct search result appears in the first position, while the values of 5 and 10 represent the search performance when the correct result does not appear in the first position and when there are multiple correct results. 1, 5, and 10 are also three parameters commonly used in search systems.
Figure 5 illustrates the comparison between the SCSPC method and the three benchmark methods on the evaluation metric MRR. SCSPC performs better on MRR than SWIM, QECK and CODEnn: SWIM and the other methods ignore the semantics between code fragments, whereas the SCSPC method diversifies the data and can therefore search code more accurately.
Figure 5. MRR comparison of SCSPC with the three benchmark methods.
Table 1 lists the comparison between the SCSPC method and the other three benchmark methods on
the evaluation metric SuccessRate@K. It can be seen that the SCSPC method is better than the other
three benchmark methods when k is 1, 5 and 10. The SCSPC method fully captures the semantic
relationship between code snippets and natural language, and has a greater probability of correct
answers in the search results.
Table 1. Performance comparison of the methods based on SuccessRate@k
Method S@1 S@5 S@10
SWIM 0.546 0.558 0.594
QECK 0.553 0.634 0.664
CODEnn 0.549 0.647 0.693
SCSPC 0.641 0.693 0.734
5. Conclusion
In this paper, we propose SCSPC, a semantic code search method based on program transformation. The method performs data augmentation on the Java code snippets in CodeSearchNet, applying five different program transformations while keeping the semantics unchanged to obtain a large, high-quality code corpus. The CodeBERT model is then fine-tuned to generate sentence vectors for natural language and code snippets to complete the code search. By diversifying the data, the method effectively improves the accuracy of the search results. In future research on code search, the accuracy of the results can be improved from further aspects of the code snippet.
The method in this paper searches within a specified programming language; future work can expand the scope of programming languages to further mine other semantic information of the code and improve search accuracy.
6. Acknowledgments
This paper is supported by the project "Research on software code reuse and automatic generation" (2020YFG0299).
7. References
[1] Lv F, Zhang H Y and Wang S W 2015 CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model. 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ACM, 2015: 260-270.
[2] Lu M L, Wang S W and Lo D 2015 Query expansion via WordNet for effective code search. 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), IEEE, 2015: 545-549.
[3] Lucr D 2004 A survey on software components search and retrieval. Proceedings 30th Euromicro Conference, 2004: 152-159.
[4] Keivanloo I, Rilling J and Zou Y 2014 Spotting working code examples.
[5] Devlin J, Chang M W and Lee K 2019 BERT: Pre-training of deep bidirectional transformers for language understanding. Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186.
[6] Vaswani A, Shazeer N and Parmar N 2017 Attention Is All You Need. arXiv, 2017: 5998-6008.
[7] Feng Z, Guo D and Tang D 2020 CodeBERT: A Pre-Trained Model for Programming and Natural Languages. 2020: 1536-1547.
[8] Husain H, Wu H and Gazit T 2019 CodeSearchNet Challenge: Evaluating the State of Semantic Code Search [EB/OL]. (2019-9-24) [2022-2-21]. http://arxiv.org/abs/1909.09436.
[9] Raghothaman M, Wei Y and Hamadi Y 2017 SWIM: Synthesizing What I Mean [J]. IEEE, 2017: 357-367.
[10] Nie L, He J and Ren Z 2017 Query Expansion Based on Crowd Knowledge for Code Search [J]. IEEE Transactions on Services Computing, 9(5): 771-783.