<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Information Retrieval in Software Engineering utilizing a pre-trained BERT model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Koyel Ghosh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Apurbalal Senapati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Central Institute of Technology</institution>
          ,
          <addr-line>Kokrajhar, Assam, India</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The task is to detect whether a source code comment is useful or not: a comment and its surrounding code are paired together as input. The IRSE (Information Retrieval in Software Engineering) shared task, organized at FIRE 2022 (Forum for Information Retrieval Evaluation), poses a binary classification task in which a system classifies Comment and Surrounding Code Context pairs into two classes: (a) USEFUL or (b) NOT_USEFUL. For this task, we experimented with the roberta-base model and obtained a macro F1 score of 0.9047. Our submission placed second among all submissions.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval in Software Engineering</kwd>
        <kwd>Binary classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>The authors of [6] classify comments as useful, partially useful, and not useful; they report precision and
recall scores of 86.27% and 86.42%, respectively. As per [7], annotating programs with natural
language comments is a standard programming practice to increase the readability of code.
They manually annotate concepts for 5600 comments extracted from 672 C/C++ files/projects
crawled from code repositories like GitHub. Comment-Mine extracts 38,992 concepts, out of
which 79.8% are correct, validated using manual annotation.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>3.1. Dataset</title>
        <p>IRSE, a shared task organized by FIRE (Forum for Information Retrieval Evaluation), published
a training set containing 8047 Comment and Surrounding Code Context pairs, each labeled with a
Class, i.e. USEFUL or NOT_USEFUL. A total of 1001 Comment and Surrounding Code Context
pairs are given in the test set. Table 1 shows the detailed dataset statistics.</p>
        <p>Label encoding: here, we simply convert NOT_USEFUL to “0” and USEFUL to “1” in the
Class column.</p>
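        <p>The label encoding described above can be sketched in a few lines of Python (a minimal sketch; the helper name is ours, not from the shared task):</p>

```python
# Map the Class column labels to integers, as described above:
# NOT_USEFUL -> 0, USEFUL -> 1.
LABEL_MAP = {"NOT_USEFUL": 0, "USEFUL": 1}

def encode_labels(classes):
    """Encode a list of Class values into 0/1 integers."""
    return [LABEL_MAP[c] for c in classes]

print(encode_labels(["USEFUL", "NOT_USEFUL", "USEFUL"]))  # [1, 0, 1]
```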
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Dataset statistics</p>
          </caption>
          <table>
            <thead>
              <tr><th>IRSE</th><th>NOT_USEFUL</th><th>USEFUL</th><th>Total</th></tr>
            </thead>
            <tbody>
              <tr><td>Training set</td><td>3710</td><td>4337</td><td>8047</td></tr>
              <tr><td>Test set</td><td>719</td><td>282</td><td>1001</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Pretrained BERT models</title>
        <p>BERT models are trained on a large raw-text corpus (without human labeling) in a self-supervised
way. Figure 1 shows the representation of the approach. We ran several experiments and found
the best hyperparameter combination, reported in Table 3. In the evaluation, P_NOT_USEFUL and
P_USEFUL denote the precision of the NOT_USEFUL and USEFUL classes, R_NOT_USEFUL the recall of
the NOT_USEFUL class, F1_NOT_USEFUL and F1_USEFUL the F1 scores of the NOT_USEFUL and USEFUL
classes, and N_NOT_USEFUL and N_USEFUL the total number of NOT_USEFUL and USEFUL class texts
present in the test set.</p>
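        <p>From these per-class quantities, the macro F1 reported in the abstract is the unweighted mean of the two per-class F1 scores. A minimal sketch (the precision/recall values below are illustrative, not the shared-task numbers):</p>

```python
def f1(precision, recall):
    """Per-class F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class_f1s):
    """Macro F1: unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1s) / len(per_class_f1s)

# Illustrative per-class precision/recall values only.
f1_not_useful = f1(0.90, 0.92)
f1_useful = f1(0.91, 0.89)
print(round(macro_f1([f1_not_useful, f1_useful]), 4))
```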
        <p>We run our code for up to 10 epochs and take the best result across all epochs. Here, we
observe overfitting while fine-tuning the pre-trained BERT models: after epoch 4, the validation
loss increases while the training loss keeps decreasing. We did not try adding a dropout layer here.</p>
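        <p>The overfitting pattern described above, with validation loss rising after epoch 4 while training loss keeps falling, is the usual signal for model selection by validation loss; a minimal sketch of that selection logic (the loss values below are made up for illustration):</p>

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Hypothetical validation losses over 10 epochs: decreasing until
# epoch 4, then increasing as the model starts to overfit.
val_losses = [0.52, 0.41, 0.35, 0.31, 0.33, 0.36, 0.40, 0.44, 0.49, 0.55]
print(best_epoch(val_losses))  # 4
```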
        <p>On the IRSE test set, the roberta-base model achieved a macro F1 score of 0.9047.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, our task is to classify a Comment and Surrounding Code Context pair as USEFUL
or NOT_USEFUL. We used a pre-trained BERT model. During the work, we observed that the
maximum length of a comment over the entire set is six, while for the Surrounding Code Context it is
821. As BERT's maximum input length is 512, one could experiment with Longformer
(<ext-link ext-link-type="uri" xlink:href="https://huggingface.co/docs/transformers/model_doc/longformer">https://huggingface.co/docs/transformers/model_doc/longformer</ext-link>),
but it needs a well-configured machine and may otherwise run into memory issues. Later, a dual BERT
(<ext-link ext-link-type="uri" xlink:href="https://towardsdatascience.com/siamese-and-dual-bert-for-multi-text-classification-c6552d435533">https://towardsdatascience.com/siamese-and-dual-bert-for-multi-text-classification-c6552d435533</ext-link>)
can be used in place of a single BERT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. D. Clough, S. Chattopadhyay, P. Majumder, Overview of the IRSE subtrack at FIRE 2022: Information Retrieval in Software Engineering, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, ACM, 2022.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] A. Sherstinsky, Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network, CoRR abs/1808.03314 (2018). URL: http://arxiv.org/abs/1808.03314. arXiv:1808.03314.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] S. Majumdar, A. Bansal, P. Das, P. Clough, K. Datta, S. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022). doi:10.1002/smr.2463.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] S. Majumdar, S. Papdeja, P. Das, S. Ghosh, Comment-Mine—A Semantic Search Approach to Program Comprehension from Code Comments, 2020, pp. 29–42. doi:10.1007/978-981-15-2930-6_3.</mixed-citation></ref>
    </ref-list>
  </back>
</article>