<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Semantic Error Detection in Code Translation Using Knowledge-Driven Static Analysis with AI Chain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lei Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Zhang</string-name>
          <email>TU_sai@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fangzhou Xu</string-name>
          <email>TU_fangzhou@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang Wan</string-name>
          <email>lwan@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Semantic Mistakes, Knowledge Base, Code Translation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin 300350</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the task of code translation, neural network-based models frequently generate semantically incorrect code that deviates from the original logic of the source code. This problem persists even with advanced large models. While a recent approach suggests using test cases to identify these semantic errors, its effectiveness is highly dependent on the quality of the test cases, making it unsuitable for code snippets that lack test cases in real-world scenarios. To automatically locate semantic errors in code translation without valid test cases, we propose the Knowledge-guided Semantic Analysis Framework (KSAF). KSAF decomposes the source and translated code synchronously and performs static analysis to detect semantic errors. This is achieved by leveraging fine-grained knowledge in conjunction with an AI chain-driven Large Language Model (LLM). In a previously studied benchmark of Python programs, our framework based on the GPT-3.5-turbo model achieved a correctness rate of 47.8% under a static evaluation method, a 37.2% improvement over the baseline using the same base model and a 13.4% improvement in correctness over the baseline based on GPT-4-turbo.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Semantic Mistakes</kwd>
        <kwd>Knowledge Base</kwd>
        <kwd>Code Translation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Code translation involves converting a program written in one programming language into
another, ensuring that the original functionality remains intact. Neural network models have
achieved significant success in this task, but recent studies [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] have found that these models
often introduce subtle errors. These errors can be grouped into syntactic and semantic errors.
Syntax errors violate the syntax rules of the target language and can often be identified by a
grammar checker. In contrast, semantic errors are more subtle: the translated code may fail to
execute without violating the target language’s syntax, or it may produce
outputs that are inconsistent with the original code [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For example, as shown in the replace
function in Figure 1, s.replace(‘-’, ‘ ’) in Python replaces all occurrences of ‘-’ with ‘ ’, while in
JavaScript, it only replaces the first occurrence by default.
      </p>
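      <p>The gap behind this example can be reproduced directly; the snippet below is only an illustration of the two languages’ default behavior and is not part of KSAF:</p>
      <p># Python: str.replace substitutes every occurrence of the pattern.
s = "a-b-c"
print(s.replace('-', ' '))   # prints "a b c"
# JavaScript: "a-b-c".replace('-', ' ') replaces only the first match by
# default, yielding "a b-c", so a literal translation changes the output.</p>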
      <p>
        Based on this, Wang et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] rely on test cases that can expose semantic errors to analyze
code and locate these errors dynamically. However, their method is highly dependent on the
quality of the test cases, requiring them to reveal semantic errors effectively, and it cannot
handle code snippets lacking valid test cases. Additionally, in the code translation domain,
relying on test cases to execute code is not only costly but also poses potential security risks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>To automatically locate semantic errors in code translation in the absence of valid test
cases, we propose a framework KSAF, which decomposes the source code and translated code
synchronously and statically analyzes the code to locate semantic errors with fine-grained
knowledge combined with an AI chain-driven LLM. Experiments show that our approach
outperforms prompt-based baselines. KSAF is the first method to locate semantic errors in code translation
without test cases. It only requires API documentation, does not need model training, and is
adaptable to low-resource languages.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Approach</title>
      <p>[Figure 1: Overview of KSAF on a running example. Panels: (A) Neural Source Map Generator, (B) Code AST Decomposition, the API Knowledge Base, and (C) the Checking, Comparing, and Locating steps, which output suspicious lines such as s = s.replace('-', ' ').toUpperCase();. The running example's source and translated code are shown below.]</p>
      <p>Source Code
def f_gold(s: str, k: int) -&gt; str:
    s = s.replace('-', ' ').upper()
    res = []
    cnt = (len(s) % k) or k
    t = 0
    for i, c in enumerate(s):
        res.append(c)
        t += 1
        if t == cnt:
            t = 0
            cnt = k
            if i != len(s) - 1:
                res.append('-')
    return ' '.join(res)</p>
      <p>Translated Code
function f_gold(s, k) {
    s = s.replace('-', ' ').toUpperCase();
    let res = [];
    let cnt = (s.length % k) || k;
    let t = 0;
    for (let i = 0; i &lt; s.length; i++) {
        res.push(s[i]);
        t += 1;
        if (t == cnt) {
            t = 0;
            cnt = k;
            if (i != s.length - 1) {
                res.push('-');
            }
        }
    }
    return res.join(' ');
}</p>
      <p>KSAF consists of three modules, as illustrated in Figure 1. In the Neural Source Map Generator
module, shown in Figure 1 A, the source code, the translated code, and a fixed prompt are input into the LLM. This module produces a mapping
between atomic fragments in the source code and their corresponding parts in the translated
code, providing an ordered list of these atomic fragments.</p>
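      <p>The paper does not prescribe a concrete serialization for this mapping; a minimal sketch of what the ordered list could look like, with illustrative field names drawn from the running example in Figure 1, is:</p>
      <p># Hypothetical shape of the Neural Source Map Generator output: an ordered
# list pairing each atomic source fragment with the translated span that
# implements it (field names here are assumptions, not KSAF's actual format).
source_map = [
    {"source": "s = s.replace('-', ' ').upper()",
     "translated": "s = s.replace('-', ' ').toUpperCase();"},
    {"source": "cnt = (len(s) % k) or k",
     "translated": "let cnt = (s.length % k) || k;"},
    {"source": "res.append(c)",
     "translated": "res.push(s[i]);"},
]</p>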
      <p>In the Code AST Decomposition module, as shown in Figure 1 B, the abstract syntax tree (AST)
of the source code is traversed to extract “subtrees” from eight types of nodes as sub-code [7].
Using the mapping list from Module A, the corresponding translated code for each sub-code is
obtained. Each sub-code pair is then passed to the next module.</p>
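      <p>As a rough illustration of this step, the following Python sketch cuts sub-code out of an AST. The particular node types listed are assumptions for demonstration; KSAF follows the eight block-based node types of [7]:</p>
      <p>import ast

# Illustrative AST decomposition: collect the source text of subtrees rooted
# at a handful of statement-level node types. The selection below is only an
# assumption; the exact eight node types used by KSAF follow [7].
SPLIT_NODES = (ast.FunctionDef, ast.For, ast.While, ast.If, ast.Assign, ast.Return)

def decompose(source):
    tree = ast.parse(source)
    sub_codes = []
    for node in ast.walk(tree):
        if isinstance(node, SPLIT_NODES):
            sub_codes.append(ast.get_source_segment(source, node))
    return sub_codes</p>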
      <p>After obtaining the sub-code and its corresponding translated code, KSAF uses LLM for
static analysis to identify semantic inconsistencies between the source and translated code. We
designed a knowledge-driven LLM AI Chain workflow, as shown in Figure 1 C, which includes
three steps: Checking, Comparing, and Locating, all using the same LLM. In the Checking step,
KSAF inputs the source code, translated code, and a fixed prompt into the LLM to extract the fully
qualified names (FQN) of operators and APIs, then passes the results to the Comparing step. In
the Comparing step, the FQNs are linked with an offline-built API knowledge base to obtain the
corresponding API function descriptions. These descriptions and the results from the Checking
step form a prompt fed into the LLM to precisely summarize the differences in operators and
APIs between the source and translated code. In the Locating step, the Comparing step results,
source code, and translated code are input into the LLM as a prompt to identify suspicious code
lines that might cause semantic inconsistencies between the source and translated code.</p>
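      <p>A minimal sketch of this Checking, Comparing, and Locating chain over one sub-code pair is given below; the call_llm wrapper, the prompt wording, and the kb dictionary (FQN to documentation) are assumptions for illustration rather than KSAF's exact prompts:</p>
      <p># Sketch of the three-step AI chain over one sub-code pair.
# call_llm(prompt) is a hypothetical wrapper around the underlying LLM, and kb
# is a dict mapping fully qualified API names to their documentation.
def locate_suspicious_lines(src, dst, kb, call_llm):
    # (1) Checking: extract fully qualified names (FQNs) of operators and APIs.
    fqns = call_llm("List the fully qualified names of all operators and APIs "
                    "used in the following code.\nSource:\n" + src +
                    "\nTranslated:\n" + dst)
    # (2) Comparing: look up descriptions in the offline API knowledge base and
    # ask the LLM to summarize behavioral differences between the two snippets.
    docs = "\n".join(kb.get(name.strip(), "") for name in fqns.splitlines())
    diffs = call_llm("API documentation:\n" + docs + "\nExtracted names:\n" + fqns +
                     "\nSummarize the semantic differences between source and translation.")
    # (3) Locating: point at translated lines that may cause the inconsistency.
    return call_llm("Known differences:\n" + diffs + "\nSource:\n" + src +
                    "\nTranslated:\n" + dst +
                    "\nReport the suspicious lines in the translated code.")</p>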
    </sec>
    <sec id="sec-4">
      <title>3. Experiments</title>
      <p>In this section, our objective is to compare the effectiveness of KSAF with other methods.
To ensure fairness in the experiments, we selected methods that, like KSAF, do not require
test cases and analyze code statically. Specifically, we chose the widely recognized prompt-based
methods that aim to fully leverage the potential of foundational models. LLMs with Few-Shot
Learning [8]: a few examples are provided as demonstrations in the prompt to guide
the LLM toward better performance on the task. LLMs with Chain of Thought (CoT) [9]:
by appending “Let’s think step by step” at the end of the prompt, the LLM is prompted to explain
the reasoning or steps before providing the final answer.</p>
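      <p>As an illustration of these baselines, a zero-shot CoT prompt for the task could be assembled as in the sketch below; the wording and helper name are hypothetical and do not reproduce the exact prompts used in the experiments:</p>
      <p>def cot_prompt(source_code, translated_code):
    # Hypothetical zero-shot CoT baseline prompt in the style of [9]; appending
    # "Let's think step by step" elicits intermediate reasoning before the answer.
    return ("Does the translated code preserve the semantics of the source code? "
            "If not, list the suspicious lines.\n"
            "Source:\n" + source_code + "\n"
            "Translated:\n" + translated_code + "\n"
            "Let's think step by step.")</p>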
      <p>
        We used the dataset (excluding test cases) and metrics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] of Wang et al. to evaluate our
method and the baselines. The three reported metrics denote the ratios of successfully identified
errors to the total number of semantic errors, hidden errors, and errors leading to results
that differ from the source code output, respectively. Semantic errors are when the code is
syntactically correct but logically flawed, causing the program to behave in a way that is not
expected. Hidden errors are a special kind of semantic error, which usually cannot be immediately
localized to a specific fix, even when running test cases. Errors leading to results that differ
from the source code output are also a type of semantic error, which does not cause a runtime
error but causes the output of the translated code in unit tests to be inconsistent with the source
code [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        As shown in Table 1, our method outperforms all baseline approaches. Additionally, the
method proposed by Wang et al. is unable to handle code without test cases, resulting in zero
values for all metrics. Following the experimental setup of previous work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we found
that our framework KSAF detected an average of 3.0 suspicious lines, which represents 16.5% of
the total lines of code. This indicates that users typically need to review only 1 to 3 lines to
understand and fix semantic errors.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>This paper proposes a method based on code AST decomposition and fine-grained knowledge
combined with an AI chain-driven LLM to locate semantic inconsistencies between source and
translated code. This method effectively handles code without test cases. We plan to extend
our approach to multi-language datasets and conduct comprehensive experiments to further
validate KSAF’s effectiveness in the future.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Project of Science and Technology Research and Development
Plan of China Railway Corporation (N2023J044).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Ibrahimzada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Wassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Merler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sobolev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pavuluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jabbarvand</surname>
          </string-name>
          ,
          <article-title>Lost in translation: A study of bugs introduced by large language models while translating code</article-title>
          ,
          <source>in: 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE)</source>
          ,
          <source>IEEE Computer Society</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>866</fpage>
          -
          <lpage>866</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaidyanath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , Jigsaw:
          <article-title>Large language models meet program synthesis</article-title>
          ,
          <source>in: Proceedings of the 44th International Conference on Software Engineering</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1219</fpage>
          -
          <lpage>1231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <article-title>Transmap: Pinpointing mistakes in neural code translation</article-title>
          ,
          <source>in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>999</fpage>
          -
          <lpage>1011</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Codetransocean: A comprehensive multilingual benchmark for code translation</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>5067</fpage>
          -
          <lpage>5089</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Mozilla, JavaScript reference, https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference, 2024. Last accessed: March 11, 2024.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Beautiful Soup 4, Beautiful Soup 4 documentation, 2024. https://beautiful-soup-4.readthedocs.io/en/latest/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] T. Hu, Z. Xu, Y. Fang, Y. Wu, B. Yuan, D. Zou, H. Jin, Fine-grained code clone detection with block-based splitting of abstract syntax tree, in: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023, pp. 89-100.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877-1901.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems 35 (2022) 22199-22213.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>