Exploring Large Language Models for Code Explanation

Paheli Bhattacharya1,*,†, Manojit Chakraborty1,†, Kartheek N S N Palepu1, Vikas Pandey1, Ishan Dindorkar2, Rakesh Rajpurohit2 and Rishabh Gupta1
1 Bosch Research and Technology Centre, Bangalore, India
2 Bosch Global Software Technologies, Bangalore, India

Abstract
Automating code documentation through explanatory text can prove highly beneficial for code understanding. Large Language Models (LLMs) have made remarkable strides in Natural Language Processing, especially within software engineering tasks such as code generation and code summarization. This study specifically delves into the task of generating natural-language summaries for code snippets using various LLMs. The findings indicate that Code LLMs outperform their generic counterparts, and that zero-shot methods yield superior results when the training and testing sets follow dissimilar distributions.

Keywords
Code Comment Generation, Code Summarization, Large Language Models, AI for Software Engineering

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
* Corresponding author.
† Equal Contribution
paheli.bhattacharya@bosch.com (P. Bhattacharya)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Understanding legacy code in large code repositories is a major challenge in software engineering. Liang et al. [1] showed that only 15.4% of Java code on GitHub is documented. This makes it difficult and time-consuming for developers to comprehend the underlying functionality [2, 3]. Automating the task of code documentation through explanations can therefore prove beneficial.

Large Language Models (LLMs) have brought a progressive breakthrough in Natural Language Processing, especially in the field of Generative AI. LLMs have been applied to many software engineering tasks [4], most popularly code generation [5], code summarization [6] and unit test case generation [7]. In this paper we focus on the task of code explanation: generating the intent or summary, in natural language, for a given code snippet. We benchmark a suite of LLMs, both generic LLMs [8] and Code LLMs [9], using zero-shot, few-shot and instruction fine-tuning approaches. Extensive experiments on the IRSE dataset [10] lead to the following insights: (i) Code LLMs perform better than generic LLMs for the task. (ii) Zero-shot approaches achieve better results than few-shot prompting and fine-tuning when the train and test sets follow dissimilar distributions.

Table 1
Example data points from the IRSE and conala-train datasets along with their sizes and average lengths (#words).

IRSE (size: 100; avg. code length: 21.18; avg. comment length: 84.28)
  Example code snippet:
    pattern = re.compile('\\s+')
    sentence = re.sub(pattern, "", sentence)
  Example code explanation: This code snippet uses the re (regular expression) module in Python to define a pattern that matches one or more whitespace characters. It then uses the re.sub() function to remove any occurrences of the pattern from the string variable 'sentence'. The result is a modified version of 'sentence' with all whitespace characters removed.

conala-train (size: 1666; avg. code length: 13.92; avg. comment length: 14.68)
  Example code snippet: re.sub('[^A-Z]', '', s)
  Example code explanation: remove uppercased characters in string `s`
2. Related Work

Code explanation [11], also termed code summarization [3, 12] or comment generation [13, 2], is an important problem in software engineering. Traditional approaches [14, 15, 16] as well as deep learning methods [13, 2] have been applied to this task. Large Language Models have been successfully employed in a wide variety of natural language generation tasks [17]. The zero-shot and few-shot capabilities of these systems make them highly adaptable to a broad range of NLP tasks. There are several general-domain, open-source LLMs such as Llama-2 [8], Alpaca [18] and Falcon [19]. There are also Code LLMs that have been trained or finetuned on code-specific data (usually source code files, covering 80+ programming languages). The most widely used proprietary code models are OpenAI Codex and GitHub Copilot; among the open-source models, we have StarCoder [9], CodeUp [5], CodeLlama [20] and Llama-2-Coder [21]. Large Language Models have been used for code explanation in a few-shot setting [3, 22]. Ahmed et al. [3] found that providing few-shot examples from the same project gives better results than examples from a different project. Geng et al. [22] show that selecting relevant examples in a few-shot setting is an important design criterion.

3. Dataset

In this work, we consider a dataset of 100 samples released at the Information Retrieval in Software Engineering (IRSE) track at the Forum for Information Retrieval Evaluation (FIRE) 2023 [10]. Each sample in the dataset is a (code snippet, code explanation) pair, where the explanation is a natural-language description of the task the code snippet performs. We refer to this dataset as "IRSE" in the rest of the paper. Additionally, we use the publicly available conala-train [23] dataset as a secondary data source for few-shot prompting and instruction finetuning. This dataset consists of 1666 unique (code snippet, code explanation) pairs. Table 1 shows examples from both datasets. While the code snippets in the two datasets are comparable in length (on average 21 and 14 words, respectively), the code explanations in the IRSE dataset are much longer (mean length = 84 words) than those in the conala-train set (mean length = 15 words).

4. Evaluation

The model-generated textual descriptions are evaluated against the ground-truth explanations using the following measures:

(i) Token-based: The BLEU [24] score combines n-gram precision scores (typically up to 4-grams) using a weighted geometric mean. BLEU-1, BLEU-2 and BLEU-N (for any integer N) evaluate unigrams, bigrams and n-grams of length N, respectively.

(ii) Semantics-based: This measure assesses the semantic similarity between the model-generated explanation $m$ and the ground-truth explanation $g$. We project both $m$ and $g$ into a continuous embedding space, obtaining $\vec{e}_m$ and $\vec{e}_g$ respectively, using the pretrained CodeBERT [25] model, and take the cosine similarity $\mathrm{cosine}(\vec{e}_m, \vec{e}_g)$ between the two embeddings as the score.
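To make the evaluation concrete, the following is a minimal sketch of how both measures can be computed for a single (generated, ground-truth) explanation pair. It assumes the nltk, torch and transformers libraries and the microsoft/codebert-base checkpoint; whitespace tokenization, BLEU smoothing and mean pooling over the last hidden layer are illustrative choices and may differ from the exact setup used to produce the reported scores.

```python
# Sketch of the two evaluation measures. Assumes nltk, torch and transformers;
# tokenization, smoothing and pooling choices here are illustrative.
import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from transformers import AutoModel, AutoTokenizer

def bleu_scores(reference: str, candidate: str) -> dict:
    """Token-based measure: BLEU-1, BLEU-2 and (as an instance of BLEU-N) BLEU-4."""
    refs, cand = [reference.split()], candidate.split()
    smooth = SmoothingFunction().method1
    return {
        "BLEU-1": sentence_bleu(refs, cand, weights=(1.0, 0, 0, 0), smoothing_function=smooth),
        "BLEU-2": sentence_bleu(refs, cand, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth),
        "BLEU-4": sentence_bleu(refs, cand, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth),
    }

# Semantics-based measure: cosine similarity between CodeBERT embeddings of the
# generated explanation (e_m) and the ground-truth explanation (e_g).
_tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
_model = AutoModel.from_pretrained("microsoft/codebert-base")

def codebert_similarity(generated: str, ground_truth: str) -> float:
    def embed(text: str) -> torch.Tensor:
        inputs = _tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = _model(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
        return hidden.mean(dim=1).squeeze(0)              # mean-pooled sentence embedding
    e_m, e_g = embed(generated), embed(ground_truth)
    return torch.nn.functional.cosine_similarity(e_m, e_g, dim=0).item()
```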
Table 2
Zero-shot prompt templates used for Code Explanation. {code} denotes the query code snippet whose explanation needs to be generated.

P1 (used with Llama-2-70B-Chat, CodeLlama-13B-Instruct, CodeUp-13B-Chat):
[INST] <<SYS>> You are an expert in Programming. Below is a line of python code that describes a task. Return only one line of summary that appropriately describes the task that the code is performing. You must write only summary without any prefix or suffix explanations. Note: The summary should have minimum 1 words and can have on an average 10 words. <</SYS>> {code} [/INST]

P2 (used with StarCoder (15.5B), Llama-2-Coder-7B):
#Human: You are a helpful code summarizer. Please describe in simple english the purpose of the following Python code snippet: {code}
#Assistant:

5. Methodology

We experiment with 5 LLMs: (i) a generic LLM, Llama-2-70B-Chat [8], the largest open-source model in our study, and (ii) Code LLMs, namely Llama-2-Coder-7B [21], CodeLlama-13B-Instruct [20], CodeUp-13B-Chat [5] and StarCoder-15.5B [9], using the zero-shot, few-shot and instruction fine-tuning strategies described below.

(i) Zero-shot: In this setting, we directly prompt the LLM to generate an explanation for a given input code snippet. We experiment with several prompts, two of which are listed in Table 2 as P1 and P2. Based on the model cards, we provide prompt template P1 to the Llama-2-70B-Chat, CodeLlama-13B-Instruct and CodeUp-13B-Chat models, and template P2 to the StarCoder and Llama-2-Coder-7B models.

(ii) Few-shot: In few-shot prompting, we provide a few examples that demonstrate the nature of the task. For code explanation, Ahmed and Devanbu [3] suggest using 10 examples in a few-shot setup. We therefore provide 10 randomly selected (code snippet, natural-language description) pairs from the conala-train set (see Section 3).

(iii) Instruction Finetuning: For instruction finetuning we use the CodeUp-13B-Chat model [5]. Each sample from the conala-train dataset is converted into an instruction-style training instance using the following format:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction: Below is a line of python code that describes a task. Write one line of summary that appropriately describes the task that the code is performing.
### Input: sorted(l, key=lambda x: (-int(x[1]), x[0]))
### Output: Sort a nested list by two elements

We load the CodeUp-13B-Chat model with 4-bit quantization using QLoRA [26] and bitsandbytes [27], and then perform parameter-efficient finetuning (PEFT) [28] of the model on the prepared dataset.
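The 4-bit loading and adapter-based finetuning described above can be sketched as follows, assuming the transformers, bitsandbytes and peft libraries. The Hugging Face checkpoint identifier, the LoRA hyperparameters and target modules are illustrative assumptions rather than the exact values used in our experiments, and the training loop over the rendered instruction texts is omitted.

```python
# Sketch of instruction finetuning CodeUp-13B-Chat with 4-bit quantization (QLoRA) and PEFT.
# Checkpoint name, LoRA hyperparameters and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "deepse/CodeUp-Llama-2-13b-chat-hf"  # assumed Hugging Face identifier

INSTRUCTION_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction: Below is a line of python code that describes a task. Write one line "
    "of summary that appropriately describes the task that the code is performing.\n"
    "### Input: {code}\n"
    "### Output: {explanation}"
)

def to_training_text(code: str, explanation: str) -> str:
    """Render one conala-train (code snippet, explanation) pair into the instruction format."""
    return INSTRUCTION_TEMPLATE.format(code=code, explanation=explanation)

# Load the base model in 4-bit NF4 precision (QLoRA-style) via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Attach low-rank adapters so that only a small fraction of parameters is trained.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,        # illustrative values
    target_modules=["q_proj", "v_proj"],           # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The rendered instruction texts are then tokenized and passed to a standard
# causal-LM trainer (omitted here) to finetune the adapters.
```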
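A minimal sketch of the zero-shot setting (i) is shown below: prompt template P1 from Table 2 is instantiated for a query snippet and sent to an instruction-tuned model through the Hugging Face transformers text-generation pipeline. The checkpoint name and generation parameters are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Sketch of the zero-shot setting: fill prompt template P1 (Table 2) and query a model.
# The checkpoint and generation parameters below are illustrative assumptions.
from transformers import pipeline

P1_TEMPLATE = (
    "[INST] <<SYS>>\n"
    "You are an expert in Programming. Below is a line of python code that describes a task. "
    "Return only one line of summary that appropriately describes the task that the code is "
    "performing. You must write only summary without any prefix or suffix explanations.\n"
    "Note: The summary should have minimum 1 words and can have on an average 10 words.\n"
    "<</SYS>>\n\n"
    "{code} [/INST]"
)

# Assumed checkpoint; any of the instruction-tuned models listed for P1 could be used.
generator = pipeline(
    "text-generation",
    model="codellama/CodeLlama-13b-Instruct-hf",
    device_map="auto",
)

def explain_code(code_snippet: str) -> str:
    """Return a one-line natural-language summary for the given code snippet."""
    prompt = P1_TEMPLATE.format(code=code_snippet)
    outputs = generator(prompt, max_new_tokens=64, do_sample=False, return_full_text=False)
    return outputs[0]["generated_text"].strip()

print(explain_code("re.sub('[^A-Z]', '', s)"))
```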
Table 3
Performance evaluation of the different LLMs and three approaches for the code explanation task on the IRSE dataset. We report the token-based (BLEU) and semantics-based (CodeBERT) metrics.

Approach | LLM | BLEU-1 | BLEU-2 | BLEU-N | CodeBERT
Zero Shot | Llama2-70B-Chat | 0.019 | 0.008 | 0.004 | 0.338
Zero Shot | CodeLlama-13B-Instruct | 0.189 | 0.073 | 0.036 | 0.498
Zero Shot | CodeUp-13B | 0.010 | 0.003 | 0.001 | 0.310
Zero Shot | StarCoder-15.5B | 0.069 | 0.024 | 0.005 | 0.336
Zero Shot | Llama-2-Coder-7B | 0.189 | 0.075 | 0.023 | 0.475
Few Shot | Llama2-70B-Chat | 0.064 | 0.024 | 0.012 | 0.424
Few Shot | CodeLlama-13B-Instruct | 0.164 | 0.073 | 0.044 | 0.483
Few Shot | CodeUp-13B | 0.061 | 0.023 | 0.011 | 0.416
Few Shot | StarCoder-15.5B | 0.020 | 0.006 | 0.002 | 0.347
Few Shot | Llama-2-Coder-7B | 0.023 | 0.008 | 0.003 | 0.342
Instruction Finetuning + Zero Shot | CodeUp-13B | 0.047 | 0.011 | 0.005 | 0.429

6. Results

Table 3 shows the performance of the 5 LLMs under the three approaches: zero-shot, few-shot, and zero-shot over the instruction-finetuned model. CodeLlama-13B-Instruct and Llama-2-Coder-7B achieve the best zero-shot performance among the evaluated LLMs. Note that although the generic Llama-2 model is the largest in size (70B), it performs poorly compared to the much smaller Code LLMs (13B, 7B). This shows that domain-specific models perform better than generic ones for this task.

While the few-shot strategy is expected to give better performance than zero-shot, in this study we find the opposite. This is mainly because the few-shot examples were selected from the conala-train set. As discussed in Section 3 and Table 1, the code-explanation lengths in the IRSE and conala-train datasets differ substantially. Since the LLMs see few-shot examples from conala-train, they generate much shorter explanations for inputs from the IRSE dataset. This train-test distribution mismatch causes the models to perform worse in the few-shot setting than in the zero-shot one. A similar argument applies to the Instruction Finetuning + Zero Shot approach, since the training data also comes from conala-train, whose distribution differs from that of the IRSE dataset.

7. Conclusion

In this work we explore the performance of 5 LLMs, both generic and code-specific, for the task of code explanation. We assess zero-shot, few-shot and instruction-finetuning approaches over these LLMs. We find that Code LLMs perform better than larger generic LLMs. Zero-shot prompting also works well when we do not have suitable examples, drawn from the same distribution as the test data, for prompting or finetuning the model.

References

[1] Y. Liang, K. Zhu, Automatic generation of text descriptive comments for code blocks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[2] R. Sharma, F. Chen, F. Fard, LAMNER: Code comment generation using character language model and named entity recognition, in: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, 2022, pp. 48–59.
[3] T. Ahmed, P. Devanbu, Few-shot training LLMs for project-specific code-summarization, in: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–5.
[4] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (2023) 4–8.
[5] J. Jiang, S. Kim, CodeUp: A multilingual code generation Llama2 model with parameter-efficient instruction-tuning, https://huggingface.co/deepse, 2023.
[6] M.-F. Wong, S. Guo, C.-N. Hang, S.-W. Ho, C.-W. Tan, Natural language generation and understanding of big code for AI-assisted programming: A review, Entropy 25 (2023) 888. URL: https://doi.org/10.3390/e25060888. doi:10.3390/e25060888.
[7] M. Schäfer, S. Nadi, A. Eghbali, F. Tip, An empirical evaluation of using large language models for automated unit test generation, 2023. arXiv:2302.06527.
[8] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023). URL: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf.
[9] R. Li, et al., StarCoder: may the source be with you!, arXiv preprint arXiv:2305.06161 (2023). URL: https://huggingface.co/bigcode/starcoder.
[10] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative AI for software metadata: Overview of the Information Retrieval in Software Engineering track at FIRE 2023, in: Forum for Information Retrieval Evaluation, ACM, 2023.
[11] S. MacNeil, A. Tran, A. Hellas, J. Kim, S. Sarsa, P. Denny, S. Bernstein, J. Leinonen, Experiences from using code explanations generated by large language models in a web software development e-book, in: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 931–937.
[12] S. Iyer, I. Konstas, A. Cheung, L. Zettlemoyer, Summarizing source code using a neural attention model, in: 54th Annual Meeting of the Association for Computational Linguistics 2016, Association for Computational Linguistics, 2016, pp. 2073–2083.
[13] X. Hu, G. Li, X. Xia, D. Lo, Z. Jin, Deep code comment generation, in: Proceedings of the 26th Conference on Program Comprehension, Association for Computing Machinery, 2018, pp. 200–210.
[14] S. Haiduc, J. Aponte, L. Moreno, A. Marcus, On the use of automated text summarization techniques for summarizing source code, in: 2010 17th Working Conference on Reverse Engineering, IEEE, 2010, pp. 35–44.
[15] B. P. Eddy, J. A. Robinson, N. A. Kraft, J. C. Carver, Evaluating source code summarization techniques: Replication and expansion, in: 2013 21st International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 13–22.
[16] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, K. Vijay-Shanker, Automatic generation of natural language summaries for Java classes, in: 2013 21st International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 23–32.
[17] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, B. Yin, X. Hu, Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond, arXiv preprint arXiv:2304.13712 (2023).
[18] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Alpaca: A strong, replicable instruction-following model, Stanford Center for Research on Foundation Models, 2023. URL: https://crfm.stanford.edu/2023/03/13/alpaca.html.
[19] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, J. Launay, The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only, arXiv preprint arXiv:2306.01116 (2023).
[20] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al., Code Llama: Open foundation models for code, arXiv preprint arXiv:2308.12950 (2023). URL: https://huggingface.co/codellama.
[21] M. Romero, llama-2-coder-7b (revision d30d193), 2023. URL: https://huggingface.co/mrm8488/llama-2-coder-7b. doi:10.57967/hf/0931.
[22] M. Geng, S. Wang, D. Dong, H. Wang, G. Li, Z. Jin, X. Mao, X. Liao, Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning (2024).
[23] P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig, Learning to mine aligned code and natural language pairs from Stack Overflow, in: International Conference on Mining Software Repositories, ACM, 2018, pp. 476–486. URL: https://conala-corpus.github.io/.
[24] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, Association for Computational Linguistics, USA, 2002, pp. 311–318.
[25] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., CodeBERT: A pre-trained model for programming and natural languages, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1536–1547.
[26] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, 2023. arXiv:2305.14314.
[27] T. Dettmers, M. Lewis, S. Shleifer, L. Zettlemoyer, 8-bit optimizers via block-wise quantization, 9th International Conference on Learning Representations, ICLR (2022).
[28] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, PEFT: State-of-the-art parameter-efficient fine-tuning methods, https://github.com/huggingface/peft, 2022.