A Semantic Code Search Method Based on Program Conversion

Yan Tao 1, Le Wei 1,2, Hongping Shu 2
1 School of Software Engineering, Chengdu University of Information Technology, Chengdu, Sichuan 610225, China
2 Automatic Software Generation & Intelligence Service Key Laboratory of Sichuan Province, Chengdu 610225, China

Abstract
In code search, relevant code fragments often cannot be captured accurately and quickly because the semantic and structural information of the source code is ignored. To address this problem, a semantic code search method based on program conversion (Semantic Code Search based on Program Conversion, SCSPC) is proposed. SCSPC diversifies the collected code fragments through data augmentation, transforming each program by renaming variables, swapping two independent statements, exchanging loops, inserting exception capture, and equivalently replacing switch statements with if statements. The CodeBERT model is then trained with mixed objective functions (masked language modeling and replaced token detection) to generate semantically rich sentence vectors for code fragments and natural language queries, and the similarity of these vectors is compared to complete the code search. Experimental results show that, compared with the SWIM, QECK, and CODEnn models, the mean reciprocal rank and hit rate (S@10) of SCSPC increase by 2% and 0.041 respectively.

Keywords
Code Search, CodeBERT, Program Transformation, Semantic Code

1. Introduction
In modern software development, software reuse has been applied in many fields. The same functionality may appear in different domains or different software systems, and the code implementing it is often similar. Developers therefore search large code repositories for existing code, which saves development time and improves efficiency. At present, code search is mainly based on information retrieval methods and deep learning methods.
Methods based on information retrieval focus on generating correct keywords and calculating the matching degree between them. For example, CodeHow, proposed by Lv et al. [1], builds on natural-language similarity and the influence of APIs on code search, understanding a query by identifying the APIs it may refer to; after determining the potential APIs of the query, their information is merged into the code search process. Lu et al. [2] proposed a query expansion model based on synonyms generated with WordNet [3]. Iman et al. [4] proposed a pattern-based search technique that uses clone detection based on the vector space model to support the discovery of working code examples and to retrieve all types of statements (such as control flow, data flow, and API calls).

Current code search methods still have problems. First, they ignore that code may come from different dimensions, which makes it difficult to cover all perspectives with a single code representation. Second, a single natural language query can hardly express the intentions of different users, which makes the search results inaccurate. To address these problems, this paper proposes a code search method based on program transformation. The method uses program transformation to augment the data and diversify the code fragments. The CodeBERT model maps natural language and code snippets into the same vector space, and the similarity between the vectors is calculated and sorted to return the corresponding code snippets to the user.

AHPCAI2022 @ 2nd International Conference on Algorithms, High Performance Computing and Artificial Intelligence
EMAIL: 897123966@qq.com (Yan Tao, corresponding author)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. The model of CodeBERT
Pre-trained models are essentially a form of transfer learning. The most typical pre-trained model is BERT [5] (Bidirectional Encoder Representations from Transformers). BERT uses a bidirectional encoder from the Transformer [6]; it is geared towards natural language texts and is not suitable for representing semantic relationships between bimodal inputs. The CodeBERT [7] model captures the semantic association between natural language and code snippets and can quickly complete tasks such as semantic similarity search. CodeBERT is based on a 12-layer Transformer network. Its pre-training completes two tasks: masked language modeling (MLM) and replaced token detection (RTD). Masked language modeling randomly masks words in a sentence and then predicts what the masked words are; this task is pre-trained using NL-PL pairs. As shown in Figure 1, in the pre-training phase, [CLS] is placed at the beginning of the natural language sentence for binary classification, and natural language and code snippets are separated with special tokens. The input has the form {[CLS], NL, [SEP], PL, [EOS]}. The natural language is treated as a sequence of words and the code snippet as a sequence of tokens. The output is the embedding of each token together with the [CLS] representation. The replaced token detection task uses two generators, combined with the context, to generate tokens at the [MASK] positions, and trains a discriminator to predict whether each token in the sentence has been replaced. This task is pre-trained using unimodal data; the training process is shown in Figure 2.
Figure 1. The masked language model. Figure 2. The token detection task.

3. Code search method based on program transformation
3.1. The framework of the method
The ultimate goal of code search is to find code snippets that match the user's expectations. Semantic code search based on program transformation (SCSPC) captures the semantic relationships between natural language and programming language and generates a common representation. The work of this paper mainly includes semantics-preserving program transformation, model training, and matching of code search results. An overview is shown in Figure 3. First, the SCSPC method uses data augmentation to transform the source code. The specific steps are renaming source-code variables, swapping the positions of independent statements, exchanging loops, inserting exception-capture statements, and equivalently replacing switch statements with if statements. Each transformation affects the structure of the method differently, which improves the diversity of the training data. Then, natural language and code snippets are input into the CodeBERT model and mapped into sentence vectors and code-block vectors respectively. Finally, the cosine distance between each code-block vector and the sentence vector is calculated, and the code search is completed by sorting the candidates by similarity.
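Of the five transformations listed above, variable renaming is the simplest to illustrate. The sketch below shows the idea on Python source via the standard `ast` module; this is an illustrative assumption made for brevity, since the paper applies such transformations to Java code, and the `RenameVariable` class and sample function are invented for this example.

```python
import ast

class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one local variable. The surface form of
    the program changes while its semantics are preserved."""
    def __init__(self, old, new):
        self.old, self.new = old, new
    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

source = (
    "def top_k(scores, k):\n"
    "    ordered = sorted(scores, reverse=True)\n"
    "    return ordered[:k]\n"
)

tree = ast.parse(source)
tree = RenameVariable("ordered", "ranked").visit(tree)
transformed = ast.unparse(tree)  # requires Python >= 3.9
```

The other four transformations (statement swapping, loop exchange, exception insertion, switch-to-if rewriting) can be implemented in the same transformer style, each producing an additional semantically equivalent variant for training.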
Figure 3. Overview of the semantic code search method based on program transformation.

3.2. The model training of the method
Model training focuses on capturing the semantic connections between queries and code snippets and generating a common representation, as shown in Figure 4.

Figure 4. The process of model training.

The model mainly consists of an embedding layer and an encoding layer. In the embedding layer, input queries and code snippets are mapped into a high-dimensional vector space. The maximum length of an input query is set to 1024. A query is treated as a sequence of words, where [CLS] and [EOS] represent the beginning and end of the query and [SEP] separates the query from the code; a code fragment is treated as a sequence of tokens. The encoding layer learns code snippets c and natural language d using a bidirectional encoder. The main calculation is shown in (1), where PE is the position embedding and i denotes the dimension of the embedding. Linear stands for a linear layer; Q, K, and V are the query, key, and value vectors of the attention mechanism. LayerNorm denotes the normalization layer, and ReLU is the activation function, which speeds up the convergence of the network.
PE(pos, 2i) = \sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})
X = \mathrm{Embedding} + \mathrm{PositionalEncoding}
Q = \mathrm{Linear}(X) = X W_Q
K = \mathrm{Linear}(X) = X W_K
V = \mathrm{Linear}(X) = X W_V
X_{attention} = \mathrm{SelfAttention}(Q, K, V)
X_{attention} = X_{attention} + X
X_{attention} = \mathrm{LayerNorm}(X_{attention})
X_{hidden} = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(X_{attention})))    (1)

4. Experiment
4.1. Data of the experiment
To ensure that the collected source code is reliable, the source code of this paper comes from CodeSearchNet [8], a dataset for code representation learning jointly launched by GitHub, the most popular open-source hosting platform, and the Deep Program group of Microsoft Research Cambridge. The CodeSearchNet dataset consists of 2 million comment-code pairs from open-source libraries; the comments are top-level function or method comments, and the code is the entire function or method. In this experiment, the Java language portion of the CodeSearchNet Challenge was selected as the dataset. It contains 500,754 pairs of functional Java code snippets and their descriptions; 450,439 pairs were used for training, 33,658 pairs as the validation set, and 28,910 pairs as the test set.

4.2. Evaluation criteria of the experiment
In this paper, SuccessRate@k and MRR (Mean Reciprocal Rank) are used to verify the effectiveness of the SCSPC method. The mean reciprocal rank is calculated as shown in (2), where Q denotes the set of automatically evaluated queries and rank_i is the rank corresponding to the i-th query. The higher the value of MRR, the better the performance of the code search.

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}    (2)

SuccessRate@k denotes the percentage of queries for which at least one correct fragment exists among the top k code fragments returned by the search model, and is calculated as shown in (3).
SuccessRate@k = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \sigma(rank_q \le k)    (3)

Here Q is the query set in the automatic evaluation, rank_q is the highest rank of a hit code snippet in the search results, and σ is a function that returns 1 if the rank of query q is at most k and 0 otherwise. This evaluation metric is important because a good search engine should find the code snippets developers need within only a few returned results. In this experiment, the results were evaluated for k values of 1, 5, and 10; the higher the value, the better the code search performance of the model.

4.3. Results of the experiment
To verify the effectiveness of the SCSPC method, SWIM [9], QECK [10], and CODEnn are selected as three benchmark methods for comparison, and the code repository in Section 4.1 is used to construct the training, validation, and test sets. In this experiment, the number of returned results k is set to 1, 5, and 10. A value of 1 represents the probability that the correct result appears in the first position, while values of 5 and 10 represent the search performance when the correct result does not appear in the first position and when there are multiple correct results; 1, 5, and 10 are also three parameters commonly used in search systems.

Figure 5 illustrates the comparison between the SCSPC method and the three benchmark methods on the evaluation metric MRR. SCSPC performs better on MRR than SWIM, QECK, and CODEnn. SWIM and the other methods ignore the semantics of code fragments, whereas the diversified data processing of SCSPC allows it to search code more accurately.

Figure 5. Comparison of the MRR of the four methods.

Table 1 lists the comparison between the SCSPC method and the other three benchmark methods on the evaluation metric SuccessRate@k.
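The retrieval and evaluation pipeline of formulas (2) and (3) can be sketched directly in Python. This is a minimal illustration under invented toy vectors; the helper names and the two-dimensional embeddings are assumptions, not the paper's actual model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_of_relevant(query_vec, snippet_vecs, relevant_idx):
    """1-based rank of the relevant snippet after sorting by similarity."""
    sims = [cosine(query_vec, v) for v in snippet_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order.index(relevant_idx) + 1

def mrr(ranks):
    # Formula (2): mean of reciprocal ranks over the query set.
    return sum(1.0 / r for r in ranks) / len(ranks)

def success_rate_at_k(ranks, k):
    # Formula (3): fraction of queries whose hit rank is at most k.
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy example: the query vector is closest to snippet 0.
query = [1.0, 0.0]
snippets = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.8]]
r = rank_of_relevant(query, snippets, relevant_idx=0)
ranks = [r, 2, 1]  # hypothetical ranks over a three-query evaluation set
```

Sorting all candidate snippets by cosine similarity, as in `rank_of_relevant`, is exactly the final retrieval step of Section 3.1, and the resulting ranks feed both metrics.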
It can be seen that the SCSPC method is better than the other three benchmark methods when k is 1, 5, and 10. The SCSPC method fully captures the semantic relationship between code snippets and natural language, and the correct answers have a greater probability of appearing in the search results.

Table 1. Performance comparison of the four methods based on SuccessRate@k

Method    S@1      S@5      S@10
SWIM      0.546    0.558    0.594
QECK      0.553    0.634    0.664
CODEnn    0.549    0.647    0.693
SCSPC     0.641    0.693    0.734

5. Conclusion
In this paper, we propose SCSPC, a semantic code search method based on program transformation. The method performs data augmentation on the Java code snippets in CodeSearchNet, applying program transformations in five different aspects while keeping the semantics unchanged, so as to obtain a large code corpus of high quality. The CodeBERT model is then fine-tuned to generate sentence vectors for natural language and code snippets to complete the code search. This diversified data processing effectively improves the accuracy of the search results. In future research on code search, the accuracy of the results can be improved from further aspects of the code snippet. The method in this paper searches within a specified programming language; future work can expand the range of programming languages to mine further semantic information from code and improve search accuracy.

6. Acknowledgments
This paper is supported by the project "Research on software code reuse and automatic generation" (2020YFG0299).

7. References
[1] Lv F, Zhang H Y and Wang S W 2015 CodeHow: effective code search based on API understanding and extended Boolean model. 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ACM, 2015: 260-270.
[2] Lu M L, Wang S W and Lo D 2015 Query expansion via WordNet for effective code search. 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), IEEE, 2015: 545-549.
[3] Lucrédio D 2004 A survey on software components search and retrieval. Proceedings of the 30th Euromicro Conference, 2004: 152-159.
[4] Keivanloo I, Rilling J and Zou Y 2014 Spotting working code examples.
[5] Devlin J, Chang M W and Lee K 2019 BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186.
[6] Vaswani A, Shazeer N and Parmar N 2017 Attention is all you need. Advances in Neural Information Processing Systems, 2017: 5998-6008.
[7] Feng Z, Guo D and Tang D 2020 CodeBERT: a pre-trained model for programming and natural languages. Findings of EMNLP, 2020: 1536-1547.
[8] Husain H, Wu H and Gazit T 2019 CodeSearchNet Challenge: evaluating the state of semantic code search [EB/OL]. (2019-09-24) [2022-02-21]. http://arxiv.org/abs/1909.09436.
[9] Raghothaman M, Wei Y and Hamadi Y 2017 SWIM: synthesizing what I mean. IEEE, 2017: 357-367.
[10] Nie L, He J and Ren Z 2017 Query expansion based on crowd knowledge for code search. IEEE Transactions on Services Computing, 9(5): 771-783.