<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Decompositional Semantic Analysis for LLM-based Code Quality Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fangzhou Xu</string-name>
          <email>xu_fangzhou@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yahong Han</string-name>
          <email>yahong@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin, 300350</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Code quality evaluation involves scoring the quality of generated code against a reference code. Extensive research has demonstrated that current evaluations do not truly reflect code quality. We propose Decompositional Semantic Analysis for Code Quality Evaluation (DSA-CQE). We employ a decompositional approach that enables LLMs to analyze portions of code semantics independently each time, obtaining the full code semantics through multiple interactions with LLMs. We designed a Semantic Storage unit to make independent analysis feasible by retrieving related semantic descriptions. Experimental results indicate that our approach surpasses existing state-of-the-art methods in correlation with code execution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Code quality evaluation involves scoring generated code quality based on a reference code for
a specific problem statement. Existing methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] rely on superficial code matching as
an evaluation metric, which fails to capture code semantics accurately. Moreover, extensive
research has demonstrated that existing methods do not truly reflect code quality [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        With the development of large language models (LLMs) in recent years, studies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have
proven the feasibility of using LLMs as evaluators for generative tasks. However, due to issues
like hallucinations and uncertainty in LLMs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], their correlation with code execution remains
at a lower level [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], making the direct use of LLMs for code quality evaluation challenging.
To address this, we propose Decompositional Semantic Analysis for Code Quality Evaluation
(DSA-CQE). We employ a decompositional approach to enable LLMs to
comprehend portions of code semantics independently each time, obtaining the code semantics
through multiple interactions with LLMs. We designed a Semantic Storage unit to make
independent analysis feasible, allowing LLMs to achieve more accurate semantics by breaking
down complex problems. Finally, the generated code is scored based on a semantic comparison
between the reference code and itself. Experimental results indicate that DSA-CQE surpasses
existing state-of-the-art methods in terms of correlation with code execution.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Approach</title>
      <p>
        Fig 1 illustrates the overall framework of DSA-CQE. DSA-CQE inputs the generated code and
the reference code, the output is the score of the generated code. First, the semantic of both codes
is obtained through a Decompositional Code Semantic Analysis unit. Subsequently, the code
semantic comparison unit determines the differences in semantics. Finally, the generated code’s
score is derived by analyzing these semantic differences through an LLM. In Decompositional
Code Semantic Analysis, we considered eight types of nodes of Abstract Syntax Tree (AST) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
as our predefined nodes: “For”, “While”, “Assign”, “If”, “ClassDef”, “FunctionDef”, “Switch”,
and “Call”. We perform a depth-first traversal of the code’s AST, extracting the “subtrees” under
these predefined nodes as sub-codes. This approach can decompose the originally complex code
into simpler sub-codes, allowing the LLM to perform semantic analysis on each part separately,
thereby reducing the hallucination phenomenon [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
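      <p>As an illustration, the decomposition step can be sketched with Python’s standard ast module. This is a minimal sketch, not the authors’ implementation; note that Python’s grammar has no “Switch” statement, so that node type has no counterpart here.</p>

```python
import ast

# Predefined node types from the approach; Python's AST has no "Switch"
# node, so that type is omitted in this sketch.
PREDEFINED = (ast.For, ast.While, ast.Assign, ast.If,
              ast.ClassDef, ast.FunctionDef, ast.Call)

def decompose(code: str) -> list:
    """Depth-first traversal of the AST, extracting the subtree under
    each predefined node as an independent sub-code."""
    sub_codes = []

    def dfs(node):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, PREDEFINED):
                sub_codes.append(ast.unparse(child))
            dfs(child)

    dfs(ast.parse(code))
    return sub_codes
```

      <p>For a small function, this yields the function definition itself plus each assignment, loop, and call as separate, simpler sub-codes that can be analyzed one at a time.</p>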
      <p>After decomposing the code into several sub-codes, it is not feasible to analyze them
individually, as most code segments are interrelated through references and dependencies. Analyzing
them in isolation could lead to missing external references, such as variables and function
definitions. We designed a Semantic Storage unit that stores textual descriptions of semantics
during the analysis process, which may be required for subsequent code semantic analysis. As
shown in Fig 2, a search is conducted within the Semantic Storage unit to retrieve relevant
semantic descriptions. These descriptions are concatenated with the original sub-code and,
together with a pre-designed prompt template, are input into the LLM to obtain the semantic
description of the sub-code. For example, variables such as ‘n’, ‘cap’, and ‘wei’, which appeared
previously in other sub-codes, can be easily misunderstood by the LLM without additional
semantic information. Without context, the LLM might misinterpret n as any generic integer or
cap as an abbreviation unrelated to the problem domain. However, after conducting semantic
analysis on the earlier sub-codes, the semantics of these variables have already been stored in
the Semantic Storage unit. We only need to retrieve these stored semantics and incorporate
them into the prompt template to provide the LLM with the necessary semantic context for
these external variables.</p>
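      <p>The retrieval-plus-prompt step might be sketched as follows. The names SemanticStorage, retrieve, and build_prompt are illustrative, and the prompt wording is a placeholder, not the paper’s actual template.</p>

```python
import ast

class SemanticStorage:
    """Sketch of the Semantic Storage unit: maps identifiers (variables,
    functions) to their textual semantic descriptions."""
    def __init__(self):
        self.descriptions = {}

    def store(self, name: str, description: str) -> None:
        self.descriptions[name] = description

    def retrieve(self, sub_code: str) -> dict:
        """Return stored descriptions for identifiers referenced in sub_code."""
        names = {n.id for n in ast.walk(ast.parse(sub_code))
                 if isinstance(n, ast.Name)}
        return {k: v for k, v in self.descriptions.items() if k in names}

def build_prompt(sub_code: str, storage: SemanticStorage) -> str:
    # Concatenate retrieved descriptions with the sub-code before
    # handing both to the LLM (placeholder template wording).
    context = "\n".join(f"- {k}: {v}"
                        for k, v in storage.retrieve(sub_code).items())
    return (f"Known context:\n{context}\n\n"
            f"Describe the semantics of this code:\n{sub_code}")
```

      <p>With the stored meanings of ‘cap’ and ‘wei’ retrieved into the prompt, the LLM no longer has to guess what these external variables denote.</p>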
      <p>The semantics of the code stored in the Semantic Storage are not static. Each time a semantic
description of a sub-code is obtained, the LLM is prompted to update the semantic descriptions
of each external variable based on the new description. These updated semantic descriptions
are then re-stored in the Semantic Storage unit for further analysis. As shown in Fig 2, the
variable ‘dp’, initially described as “a dynamic programming array initialized to 0,” is updated to
“stores the maximum value for each possible weight” after semantic analysis. Fig 2 also
depicts this process of updating the Semantic Storage unit’s internal semantic descriptions.</p>
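      <p>The update step can be sketched as a small helper; here refresh stands in for the LLM call that rewrites a stored description, and the function name is hypothetical.</p>

```python
def update_descriptions(storage: dict, new_description: str, refresh) -> None:
    """After a sub-code is analyzed, refresh the stored description of
    every external variable in light of the new description. `refresh`
    stands in for the LLM prompt that performs the actual rewriting."""
    for name, old in storage.items():
        storage[name] = refresh(name, old, new_description)
```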
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Kendall-Tau (τ) and Pearson (ρ) correlations with code execution. The best performance is in bold.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Method</th><th>τ</th><th>ρ</th></tr>
          </thead>
          <tbody>
            <tr><td>CodeBleu</td><td>.241</td><td>.295</td></tr>
            <tr><td>CodeBertScore</td><td>.352</td><td>.430</td></tr>
            <tr><td>1-shot</td><td>.105</td><td>.106</td></tr>
            <tr><td>Simplified DSA-CQE</td><td>.470</td><td>.512</td></tr>
            <tr><td>DSA-CQE</td><td><bold>.553</bold></td><td><bold>.594</bold></td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>3. Experiments</title>
      <p>
        We conducted our experiments (following previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) on the HumanEval dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
exclusively, as most of the code samples in the CoNaLa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] subset of the dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used for
evaluation are single-line codes lacking complex semantics. While the Card2Code Hearthstone [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
subset contains semantically more complex structures, such as “classes”, these “classes” follow a
uniform structure with minimal variation. In practice, a significant portion of code demonstrates
both complexity and semantic diversity. In contrast, the HumanEval dataset contains a rich and
diverse range of code samples, making it the ideal choice for our experiments and evaluation.
Cassano et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] ran test cases on the HumanEval dataset and provided the functional
correctness of each piece of code. We use the Pearson [12] and Kendall [13] correlation coefficients
between the functional correctness scores and the scores given by different methods for
comparison. To ensure fairness, we uniformly used GPT-3.5 Turbo [14] as the backbone model and
set the LLM temperature to 0.2. We used state-of-the-art evaluation methods based on n-gram
matching and deep learning, namely CodeBleu [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and CodeBertScore [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as baselines. The
prompt for 1-shot utilized Zhuo’s prompt template [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Simplified DSA-CQE is our framework,
which replaces decompositional analysis with single-step analysis using LLMs.
      </p>
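      <p>The two correlation measures used for comparison can be computed without external libraries; the following is a pure-Python sketch of the cited Pearson [12] and Kendall [13] coefficients (the Kendall version here is tau-a, i.e. without tie corrections).</p>

```python
from math import sqrt
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)
```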
      <p>The experimental results are shown in Table 1. As can be seen, DSA-CQE performed
significantly better on the HumanEval dataset compared to traditional code evaluation methods,
with a Pearson correlation coefficient of 0.594. The single-step prompt and Simplified DSA-CQE
methods achieved Pearson correlation coefficients of 0.106 and 0.512, respectively. This indicates
that DSA-CQE, through decompositional semantic analysis, enhances the LLM’s comprehension
of code semantics and improves overall performance in code evaluation.</p>
      <p>Our current experiment focuses solely on evaluating the quality of Python code. However,
since the method relies on the Abstract Syntax Tree, adapting it to other programming languages
involves merely substituting the relevant parser. For instance, Java code can be parsed using
JavaParser [15], while pycparser [16] can be used for C code.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>In this poster, we propose Decompositional Semantic Analysis for LLM-based Code Quality
Evaluation. We employ a decompositional approach to enable LLMs to analyze portions
of code semantics independently each time, obtaining the code semantics through multiple
interactions with LLMs. We designed a Semantic Storage unit to make independent analysis
feasible, by retrieving related semantic descriptions. The generated code is scored based on a
semantic comparison between the reference code and itself. The experimental results show that
DSA-CQE surpasses all existing methods in correlation with code execution.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sundaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blanco</surname>
          </string-name>
          , S. Ma,
          <article-title>Codebleu: a method for automatic evaluation of code synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2009.10297</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>CodeBertScore: Evaluating code generation with pretrained models of code</article-title>
          ,
          <source>arXiv preprint arXiv:2302.05527</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Evtikhiev</surname>
          </string-name>
          , E. Bogomolov,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sokolov</surname>
          </string-name>
          , T. Bryksin,
          <article-title>Out of the bleu: how should we assess quality of the code generation models?</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>203</volume>
          (
          <year>2023</year>
          )
          <fpage>111741</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. Y.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <article-title>Large language models are state-of-the-art evaluators of code generation</article-title>
          ,
          <source>arXiv preprint arXiv:2304.14317</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiao</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          , J. Han,
          <article-title>Towards a unified multi-dimensional evaluator for text generation</article-title>
          ,
          <source>arXiv preprint arXiv:2210.07197</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Neamtiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <article-title>Understanding source code evolution using abstract syntax tree matching</article-title>
          ,
          <source>in: Proceedings of the 2005 international workshop on Mining software repositories</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Evaluating large language models trained on code</article-title>
          ,
          <source>arXiv preprint arXiv:2107.03374</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vasilescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>Learning to mine aligned code and natural language pairs from stack overflow</article-title>
          ,
          <source>in: Proceedings of the 15th international conference on mining software repositories</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>476</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Hermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kočiský</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <article-title>Latent predictor networks for code generation</article-title>
          ,
          <source>arXiv preprint arXiv:1603.06744</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cassano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gouwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Phipps-Costin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinckney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Yee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Q.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jangda</surname>
          </string-name>
          ,
          <article-title>MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>49</volume>
          (
          <year>2023</year>
          )
          <fpage>3675</fpage>
          -
          <lpage>3691</lpage>
          . doi:10.1109/TSE.2023.3267446.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] I. Cohen, Y. Huang, J. Chen, J. Benesty, Pearson correlation coefficient, Noise Reduction in Speech Processing (2009) 1-4.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. G. Kendall, A new measure of rank correlation, Biometrika 30 (1938) 81-93.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] OpenAI, OpenAI GPT-3.5 Turbo, https://platform.openai.com/docs/guides/text-generation/chat-completions-api, 2022.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] JavaParser, https://github.com/javaparser/javaparser, n.d.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] pycparser, https://github.com/eliben/pycparser, n.d.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>