<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Decompositional Semantic Analysis for LLM-based Code Quality Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fangzhou Xu</string-name>
          <email>xu_fangzhou@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yahong Han</string-name>
          <email>yahong@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin, 300350</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Code quality evaluation involves scoring the quality of generated code against a reference code. Extensive research has demonstrated that current evaluations do not truly reflect code quality. We propose Decompositional Semantic Analysis for Code Quality Evaluation (DSA-CQE). We employ a decompositional approach that enables LLMs to analyze portions of code semantics independently each time, obtaining the full code semantics through multiple interactions with LLMs. We designed a Semantic Storage unit to make independent analysis feasible by retrieving related semantic descriptions. Experimental results indicate that our approach surpasses existing state-of-the-art methods in correlation with code execution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Code quality evaluation involves scoring generated code quality based on a reference code for
a specific problem statement. Existing methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] rely on superficial code matching as
an evaluation metric, which fails to capture code semantics accurately. Moreover, extensive
research has demonstrated that existing methods do not truly reflect code quality [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        With the development of large language models (LLMs) in recent years, studies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have
proven the feasibility of using LLMs as evaluators for generative tasks. However, due to issues
like hallucinations and uncertainty in LLMs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], their correlation with code execution remains
at a lower level [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], making the direct use of LLMs for code quality evaluation challenging.
To address this, we propose Decompositional Semantic Analysis for Code Quality Evaluation
(DSA-CQE). We employ a decompositional approach to enable LLMs to
comprehend portions of code semantics independently each time, obtaining the code semantics
through multiple interactions with LLMs. We designed a Semantic Storage unit to make
independent analysis feasible, allowing LLMs to achieve more accurate semantics by breaking
down complex problems. Finally, the generated code is scored based on a semantic comparison
between the reference code and itself. Experimental results indicate that DSA-CQE surpasses
existing state-of-the-art methods in terms of correlation with code execution.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Approach</title>
      <p>
        Fig 1 illustrates the overall framework of DSA-CQE. DSA-CQE inputs the generated code and
the reference code, the output is the score of the generated code. First, the semantic of both codes
is obtained through a Decompositional Code Semantic Analysis unit. Subsequently, the code
semantic comparison unit determines the differences in semantics. Finally, the generated code’s
score is derived by analyzing these semantic differences through an LLM. In Decompositional
Code Semantic Analysis, we considered eight types of nodes of Abstract Syntax Tree (AST) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
as our predefined nodes: “For”, “While”, “Assign”, “If”, “ClassDef”, “FunctionDef”, “Switch”,
and “Call”. We perform a depth-first traversal of the code’s AST, extracting the “subtrees” under
these predefined nodes as sub-codes. This approach can decompose the originally complex code
into simpler sub-codes, allowing the LLM to perform semantic analysis on each part separately,
thereby reducing the hallucination phenomenon [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
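      <p>As an illustration, the decomposition step can be sketched with Python’s standard ast module. This is a minimal sketch, not the authors’ implementation; note that Python’s grammar has no “Switch” statement, so that node type has no counterpart here.</p>

```python
import ast

# Predefined node types from the approach; Python's AST has no "Switch"
# node, so that type is omitted in this sketch.
PREDEFINED = (ast.For, ast.While, ast.Assign, ast.If,
              ast.ClassDef, ast.FunctionDef, ast.Call)

def decompose(code: str) -> list:
    """Depth-first traversal of the AST, extracting the subtree under
    each predefined node as an independent sub-code."""
    sub_codes = []

    def dfs(node):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, PREDEFINED):
                sub_codes.append(ast.unparse(child))
            dfs(child)

    dfs(ast.parse(code))
    return sub_codes
```

      <p>For a small function, this yields the function definition itself plus each assignment, loop, and call as separate, simpler sub-codes that can be analyzed one at a time.</p>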
      <p>After decomposing the code into several sub-codes, it is not feasible to analyze them
individually, as most code segments are interrelated through references and dependencies. Analyzing
them in isolation could lead to missing external references, such as variables and function
definitions. We designed a Semantic Storage unit that stores textual descriptions of semantics
during the analysis process, which may be required for subsequent code semantic analysis. As
shown in Fig 2, a search is conducted within the Semantic Storage unit to retrieve relevant
semantic descriptions. These descriptions are concatenated with the original sub-code and,
together with a pre-designed prompt template, are input into the LLM to obtain the semantic
description of the sub-code. For example, variables such as ‘n’, ‘cap’, and ‘wei’, which appeared
previously in other sub-codes, can be easily misunderstood by the LLM without additional
semantic information. Without context, the LLM might misinterpret n as any generic integer or
cap as an abbreviation unrelated to the problem domain. However, after conducting semantic
analysis on the earlier sub-codes, the semantics of these variables have already been stored in
the Semantic Storage unit. We only need to retrieve these stored semantics and incorporate
them into the prompt template to provide the LLM with the necessary semantic context for
these external variables.</p>
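      <p>The retrieval-plus-prompt step might be sketched as follows. The names SemanticStorage, retrieve, and build_prompt are illustrative, and the prompt wording is a placeholder, not the paper’s actual template.</p>

```python
import ast

class SemanticStorage:
    """Sketch of the Semantic Storage unit: maps identifiers (variables,
    functions) to their textual semantic descriptions."""
    def __init__(self):
        self.descriptions = {}

    def store(self, name: str, description: str) -> None:
        self.descriptions[name] = description

    def retrieve(self, sub_code: str) -> dict:
        """Return stored descriptions for identifiers referenced in sub_code."""
        names = {n.id for n in ast.walk(ast.parse(sub_code))
                 if isinstance(n, ast.Name)}
        return {k: v for k, v in self.descriptions.items() if k in names}

def build_prompt(sub_code: str, storage: SemanticStorage) -> str:
    # Concatenate retrieved descriptions with the sub-code before
    # handing both to the LLM (placeholder template wording).
    context = "\n".join(f"- {k}: {v}"
                        for k, v in storage.retrieve(sub_code).items())
    return (f"Known context:\n{context}\n\n"
            f"Describe the semantics of this code:\n{sub_code}")
```

      <p>With the stored meanings of ‘cap’ and ‘wei’ retrieved into the prompt, the LLM no longer has to guess what these external variables denote.</p>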
      <p>The semantics of the code stored in the Semantic Storage are not static. Each time a semantic
description of a sub-code is obtained, the LLM is prompted to update the semantic descriptions
of each external variable based on the new description. These updated semantic descriptions
are then re-stored in the Semantic Storage unit for further analysis. As shown in Fig 2, the
variable ‘dp’, initially described as “a dynamic programming array initialized to 0,” is updated to
“stores the maximum value for each possible weight” after semantic analysis. Fig 2 also
depicts this process of updating the Semantic Storage unit’s internal semantic descriptions.</p>
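      <p>The update step can be sketched as a small helper; here refresh stands in for the LLM call that rewrites a stored description, and the function name is hypothetical.</p>

```python
def update_descriptions(storage: dict, new_description: str, refresh) -> None:
    """After a sub-code is analyzed, refresh the stored description of
    every external variable in light of the new description. `refresh`
    stands in for the LLM prompt that performs the actual rewriting."""
    for name, old in storage.items():
        storage[name] = refresh(name, old, new_description)
```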
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Kendall-Tau (τ) and Pearson (ρ) correlations with code execution. The best performance is in bold.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Method</th><th>τ</th><th>ρ</th></tr>
          </thead>
          <tbody>
            <tr><td>CodeBleu</td><td>.241</td><td>.295</td></tr>
            <tr><td>CodeBertScore</td><td>.352</td><td>.430</td></tr>
            <tr><td>1-shot</td><td>.105</td><td>.106</td></tr>
            <tr><td>Simplified DSA-CQE</td><td>.470</td><td>.512</td></tr>
            <tr><td>DSA-CQE</td><td><bold>.553</bold></td><td><bold>.594</bold></td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>3. Experiments</title>
      <p>
        We conducted our experiments (following previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) on the HumanEval dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
exclusively, as most of the code samples in the CoNaLa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] subset of the dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used for
evaluation are single-line codes lacking complex semantics. While the Card2Code Hearthstone [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
subset contains semantically more complex structures, such as “classes”, these “classes” follow a
uniform structure with minimal variation. In practice, a significant portion of code demonstrates
both complexity and semantic diversity. In contrast, the HumanEval dataset contains a rich and
diverse range of code samples, making it the ideal choice for our experiments and evaluation.
Cassano et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] ran test cases on the HumanEval dataset and provided the functional
correctness of each piece of code. We use the Pearson [12] and Kendall [13] correlation coefficients
between the functional correctness scores and the scores given by different methods for
comparison. To ensure fairness, we uniformly used GPT-3.5 Turbo [14] as the backbone model and
set the LLM temperature to 0.2. We used state-of-the-art evaluation methods based on n-gram
matching and deep learning, namely CodeBleu [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and CodeBertScore [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as baselines. The
prompt for 1-shot utilized Zhuo’s prompt template [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Simplified DSA-CQE is our framework,
which replaces decompositional analysis with single-step analysis using LLMs.
      </p>
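      <p>The two correlation measures used for comparison can be computed without external libraries; the following is a pure-Python sketch of the cited Pearson [12] and Kendall [13] coefficients (the Kendall version here is tau-a, i.e. without tie corrections).</p>

```python
from math import sqrt
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)
```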
      <p>The experimental results are shown in Table 1. As can be seen, DSA-CQE performed
significantly better on the HumanEval dataset compared to traditional code evaluation methods,
with a Pearson correlation coefficient of 0.594. The single-step prompt and Simplified DSA-CQE
methods achieved Pearson correlation coefficients of 0.106 and 0.512, respectively. This indicates
that DSA-CQE, through decompositional semantic analysis, enhances the LLM’s comprehension
of code semantics and improves overall performance in code evaluation.</p>
      <p>Our current experiment focuses solely on evaluating the quality of Python code. However,
since the method relies on the Abstract Syntax Tree, adapting it to other programming languages
involves merely substituting the relevant parser. For instance, Java code can be parsed using
JavaParser [15], while pycparser [16] can be used for C code.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>In this poster, we propose Decompositional Semantic Analysis for LLM-based Code Quality
Evaluation. We employ a decompositional approach to enable LLMs to analyze portions
of code semantics independently each time, obtaining the code semantics through multiple
interactions with LLMs. We designed a Semantic Storage unit to make independent analysis
feasible, by retrieving related semantic descriptions. The generated code is scored based on a
semantic comparison between the reference code and itself. The experimental results show that
DSA-CQE surpasses all existing methods in correlation with code execution.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sundaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blanco</surname>
          </string-name>
          , S. Ma,
          <article-title>Codebleu: a method for automatic evaluation of code synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2009.10297</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>CodeBertScore: Evaluating code generation with pretrained models of code</article-title>
          ,
          <source>arXiv preprint arXiv:2302.05527</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Evtikhiev</surname>
          </string-name>
          , E. Bogomolov,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sokolov</surname>
          </string-name>
          , T. Bryksin,
          <article-title>Out of the bleu: how should we assess quality of the code generation models?</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>203</volume>
          (
          <year>2023</year>
          )
          <fpage>111741</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. Y.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <article-title>Large language models are state-of-the-art evaluators of code generation</article-title>
          ,
          <source>arXiv preprint arXiv:2304.14317</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiao</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          , J. Han,
          <article-title>Towards a unified multi-dimensional evaluator for text generation</article-title>
          ,
          <source>arXiv preprint arXiv:2210.07197</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Neamtiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <article-title>Understanding source code evolution using abstract syntax tree matching</article-title>
          ,
          <source>in: Proceedings of the 2005 international workshop on Mining software repositories</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Evaluating large language models trained on code</article-title>
          ,
          <source>arXiv preprint arXiv:2107.03374</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vasilescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>Learning to mine aligned code and natural language pairs from stack overflow</article-title>
          ,
          <source>in: Proceedings of the 15th international conference on mining software repositories</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>476</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Hermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kočiský</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <article-title>Latent predictor networks for code generation</article-title>
          ,
          <source>arXiv preprint arXiv:1603.06744</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cassano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gouwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Phipps-Costin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinckney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Yee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Q.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jangda</surname>
          </string-name>
          ,
          <article-title>MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>49</volume>
          (
          <year>2023</year>
          )
          <fpage>3675</fpage>
          -
          <lpage>3691</lpage>
          . doi:10.1109/TSE.2023.3267446.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] I. Cohen, Y. Huang, J. Chen, J. Benesty, Pearson correlation coefficient, Noise Reduction in Speech Processing (2009) 1-4.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. G. Kendall, A new measure of rank correlation, Biometrika 30 (1938) 81-93.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] OpenAI, OpenAI GPT-3.5 Turbo, https://platform.openai.com/docs/guides/text-generation/chat-completions-api, 2022.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] JavaParser, https://github.com/javaparser/javaparser, n.d.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] pycparser, https://github.com/eliben/pycparser, n.d.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>