<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Workshop on Quantitative Approaches to Software Quality</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Software Defect Prediction based on JavaBERT and CNN-BiLSTM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kun Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shingo Takada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Grad. School of Science and Technology, Keio University, Yokohama</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>04</volume>
      <issue>2023</issue>
      <fpage>51</fpage>
      <lpage>59</lpage>
      <abstract>
        <p>Software defects can lead to severe issues in software systems, such as software errors, security vulnerabilities, and decreased software performance. Early prediction of software defects can prevent these problems, reduce development costs, and enhance system reliability. However, existing methods often focus on manually crafted code features and overlook the rich semantic and contextual information in program code. In this paper, we propose a novel approach that integrates JavaBERT-based embeddings with a CNN-BiLSTM model for software defect prediction. Our model considers code context and captures code patterns and dependencies throughout the code, thereby improving prediction performance. We incorporate Optuna to find optimal hyperparameters. We conducted experiments on the PROMISE dataset, which demonstrated that our approach outperforms baseline models, particularly in leveraging code semantics to enhance defect prediction performance.</p>
      </abstract>
      <kwd-group>
<kwd>Software defect prediction</kwd>
        <kwd>JavaBERT</kwd>
        <kwd>CNN</kwd>
        <kwd>BiLSTM</kwd>
        <kwd>Optuna</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>Optuna automatically executes the above combination of</title>
        <p>
          JavaBERT and CNN-BiLSTM multiple times, and outputs
Researchers have explored various models for feature ex- the best hyperparameter values through these executions.
traction in software defect prediction, from traditional ma- Then we retrain the model in another version of the code
chine learning to deep learning. Initially, Support Vector based on the obtained hyperparameters and test the model
Machines (SVM), as employed by Elish et al.[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], gained performance.
prominence for identifying defective modules using static
code metrics. However, it struggled to uncover deep
semantics within the source code. Deep Belief Networks 3.1. Embedding with JavaBERT
(DBN), introduced by Wang et al.[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], aimed to extract BERT (Bidirectional Encoder Representations from
more complex features from code through unsupervised Transformers)[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is a language model widely employed
learning. Yet, its limited depth posed challenges in reveal- in natural language processing (NLP) tasks. Unlike
coning intricate relationships within the source code. Con- ventional embeddings, BERT excels at capturing
intrivolutional Neural Networks (CNNs) were used by Li et cate contextual associations. Traditional methods like
al.[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to predict software defects by analyzing structural Word2Vec[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and GloVe[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] generate static contextual
correlations between code tokens. While proficient in representations, whereas BERT, utilizing multi-layer
bidicapturing local patterns, CNNs faced challenges in captur- rectional transformers, enables tokens to gather
informaing longer-range connections. Wang et al.[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] introduced tion from both preceding and succeeding tokens.
an RNN (Recurrent Neural Network)-based model for In our approach, we leverage a pretrained BERT model,
predicting software reliability. Deng et al.[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Liang JavaBERT[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], fine-tuned for Java code. JavaBERT has
et al.[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] expanded Long Short-Term Memory (LSTM) been trained on a dataset of 2,998,345 Java files from
models in software defect prediction, capturing temporal GitHub open source projects. JavaBERT’s transformer
arpatterns in code sequences. However, a single LSTM can chitecture dynamically adapts token embeddings based on
only capture one direction temporal pattern in the code the entire input sequence, enhancing representation depth
sequence. Bidirectional LSTM (BiLSTM) models with and capturing code token interdependencies. The
Javattention mechanisms emerged. Wang et al.[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] introduced aBERT embeddings, denoted as JavaBERT, are computed
a gated hierarchical BiLSTM model. Uddin et al.[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] by applying the model’s encoder to tokenized Java code.
combined BiLSTM with attention and BERT-based em- For a sequence of code tokens  = {1, 2, . . . , },
beddings. JavaBERT embeddings are computed as:
        </p>
        <p>In short, SVM has difficulty discovering the deep
semantics of the source code, DBN has limited depth so it
is difficult to understand the complex relationships in the JavaBERT = EncoderJavaBERT(1, 2, . . . , )
source code, CNN has difficulty capturing long-distance
correlations, and RNN and LSTM can only capture a sin- Models typically cannot process code text sequences
gle temporal pattern. BiLSTM may have challenges in directly. Through JavaBERT, we embed code text into a
capturing local patterns. continuous vector space, using these vectors as inputs to</p>
        <p>To solve these problems, we combine the advantages the model, making it easier for the model to compute and
of CNN in detecting local patterns with the advantages understand the code.
of BiLSTM in processing sequences, allowing for
comprehensive code inspection. We further incorporate Jav- 3.2. Feature Extraction using
aBERT to dynamically adjust token embeddings based CNN-BiLSTM
on the entire input sequence, thereby deepening the
representation and capturing interdependencies among code
tokens.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <sec id="sec-3-1">
        <title>We combine Convolutional Neural Networks (CNN) and</title>
        <p>Bidirectional Long Short-Term Memory networks
(BiLSTM) to extract features. This is the key part of our
approach, where after extracting features with CNN, it is
refined with the sequential capabilities of BiLSTM.</p>
      </sec>
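        <p>As an illustration, the following is a minimal sketch of this embedding step using the Hugging Face transformers library. The checkpoint name "CAUKiel/JavaBERT" and the truncation settings are our assumptions, since the paper does not specify them.</p>
        <preformat>
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name for the JavaBERT model [14]; not specified in the paper.
tokenizer = AutoTokenizer.from_pretrained("CAUKiel/JavaBERT")
encoder = AutoModel.from_pretrained("CAUKiel/JavaBERT")

code = "public int add(int a, int b) { return a + b; }"

# Tokenize the Java source and run the encoder; the last hidden state holds
# one contextual vector per token, i.e. E = Encoder_JavaBERT(t_1, ..., t_n).
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = encoder(**inputs)
embeddings = outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)
        </preformat>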
      <sec id="sec-3-2">
        <title>Our software defect prediction method consists of several</title>
        <p>
          key steps, all aimed at improving prediction performance. 3.2.1. Feature Extraction with CNN
As shown in Figure 1, we first use JavaBERT to convert Utilizing Convolutional Neural Networks (CNN)[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] for
the code into vector representations. Next, we employ feature extraction involves sliding a small window, known
the CNN-BiLSTM model for feature extraction, focusing as a filter, over various parts of the code. This filter
examon local patterns and context. We also incorporate sta- ines a small segment of the code at a time, calculating a
tistical features to fully utilize all available information. value at each sliding position to create a "feature map."
∑︁ ∑︁ [ + ,  + ] · [, ] +
        </p>
        <p>)︃
where [, ] is the input at position (, ), [, ]
represents the kernel at position (, ),  is the bias, and 
signifies the activation function.
3.2.2. Refinement of Features with BiLSTM
The</p>
        <p>Bidirectional</p>
        <p>Long</p>
        <p>Short-Term</p>
      </sec>
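          <p>A minimal PyTorch sketch of this convolution over the JavaBERT embeddings follows; the layer sizes are illustrative assumptions, not the paper's final configuration.</p>
          <preformat>
import torch
import torch.nn as nn

embeddings = torch.randn(1, 512, 768)   # (batch, seq_len, hidden_size) from JavaBERT
conv = nn.Conv1d(in_channels=768, out_channels=128, kernel_size=5, padding=2)

# Conv1d expects (batch, channels, seq_len); each filter slides along the
# token sequence and emits one value per position, producing a feature map.
feature_map = torch.relu(conv(embeddings.transpose(1, 2)))  # (1, 128, 512)
          </preformat>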
      <sec id="sec-3-3">
        <title>Memory</title>
        <p>
          (BiLSTM)[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] layer enhances the features extracted by
the Convolutional Neural Networks (CNN). What sets
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>BiLSTM apart is its capability to capture both short-term and long-term dependencies within the code, which perfectly complements the local feature extraction carried out by CNN.</title>
      </sec>
      <sec id="sec-3-5">
        <title>The forward and backward computations in BiLSTM can be unified into a single mathematical representation:</title>
        <p>ℎ = BiLSTM(, ℎ−1 , ℎ+1)</p>
      </sec>
      <sec id="sec-3-6">
        <title>In this equation, ℎ represents the hidden state at time</title>
        <p>STM) model. It is computed based on the input  at the
current time step, the previous hidden state ℎ−1 , and
terns and connections over time, amplifying the feature
representation. In summary, we refine the feature maps
obtained from CNN using BiLSTM to achieve a
comprehensive code representation. This fusion of capturing
local patterns and accounting for temporal dependencies
improves software defect prediction performance.</p>
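          <p>Continuing the sketch above, a bidirectional LSTM consumes the CNN feature maps as a sequence; again, the sizes are illustrative assumptions.</p>
          <preformat>
import torch.nn as nn

bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 batch_first=True, bidirectional=True)

# For every position t, the BiLSTM emits a hidden state h_t that combines
# the forward direction (via h_{t-1}) and the backward direction (via h_{t+1}).
seq = feature_map.transpose(1, 2)    # back to (batch, seq_len, channels)
hidden_states, _ = bilstm(seq)       # (1, 512, 2 * 256)
          </preformat>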
        <sec id="sec-3-6-1">
          <title>3.3. Integration with Statistical Features</title>
          <p>Our methodology integrates the refined BiLSTM outputs
with statistical features (such as shown in Table 2)
extracted from dataset. This step concatenates the vectors
obtained from the BiLSTM and the vectors of
statistical features obtained from the dataset into longer vectors,
making full use of the description information of the code.</p>
        </sec>
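        <p>A sketch of this fusion step, continuing the example above; the mean pooling and the 20-dimensional statistical vector standing in for the Table 2 metrics are assumptions.</p>
        <preformat>
import torch

code_vector = hidden_states.mean(dim=1)   # (1, 512): pooled BiLSTM output
stat_features = torch.randn(1, 20)        # per-file metrics from Table 2
fused = torch.cat([code_vector, stat_features], dim=1)  # (1, 532)
        </preformat>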
        <sec id="sec-3-6-2">
          <title>3.4. Hyperparameter Optimization by</title>
        </sec>
        <sec id="sec-3-6-3">
          <title>Optuna</title>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>Optuna, a powerful hyperparameter optimization frame</title>
        <p>
          work developed by Akiba et al.[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], plays a vital role in
our approach by automating hyperparameter tuning for the
CNN-BiLSTM model. There are similar frameworks such
as Ray Tune, etc., but Optuna is more lightweight and
mator (TPE) algorithm to efficiently explore and exploit
the hyperparameter space, enhancing the performance of
our Software Defect Prediction task.
step  in the Bidirectional Long Short-Term Memory (BiL- easier to use. It employs the Tree-structured Parzen
Esti
        </p>
        <sec id="sec-3-7-1">
          <title>4.2. Dataset and Data Preprocessing</title>
          <p>In this section, we will discuss a crucial step in our
methodology: determining optimal hyperparameters by
leveraging shared features among different versions of the
same project. Usually, code with similar version numbers
exhibits a high degree of similarity. By harnessing these
inherent similarities, we attempt to find hyperparameters
that can generalize across various versions, ultimately
enhancing model performance.</p>
          <p>Using the Ant project as an example, our aim is to
demonstrate the transferability of hyperparameters
obtained from training on one version (e.g., 1.5) to another
(e.g., 1.6). This transferability is valid as both versions
originate from the same project, sharing similar code
structures and functionalities. This enables the
hyperparameters obtained from one version to serve as a foundation for
other versions within the same project, thereby solidifying
our model configuration.</p>
          <p>We start by selecting version pairs, using the Ant
project as an illustration. Here, we designate version
1.5 for training and version 1.6 for testing. Next, we
deifne the performance metric to optimize, such as the F1
score. Subsequently, Optuna conducts multiple
experiments, traversing various hyperparameter combinations
and evaluating their performance on the designated
testing dataset. Through these iterative experimentation and
evaluation stages, Optuna determines the hyperparameter
set that maximizes the chosen performance metric.</p>
          <p>This process can be represented as:</p>
        </sec>
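        <p>A hedged sketch of this search with the Optuna API follows; the parameter names, value ranges, and the train_and_eval_f1 helper are illustrative assumptions rather than the paper's exact Table 3 search space.</p>
        <preformat>
import optuna

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
        "cnn_filters": trial.suggest_categorical("cnn_filters", [64, 128, 256]),
        "lstm_hidden": trial.suggest_categorical("lstm_hidden", [128, 256, 512]),
    }
    # Hypothetical helper: trains on Ant 1.5 and returns the F1 score on Ant 1.6.
    return train_and_eval_f1(train="ant-1.5", test="ant-1.6", **params)

study = optuna.create_study(direction="maximize")  # TPE sampler is Optuna's default
study.optimize(objective, n_trials=30)
best_params = study.best_params  # θ*, reused for later version pairs
        </preformat>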
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Research Questions</title>
        <sec id="sec-4-1-1">
          <title>Our experiment addresses the following research questions (RQ) : RQ1: How does the performance of our CNN-BiLSTM model compare against baseline models?</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Settings</title>
        <p>For each project listed in Table 1, we selected the two smallest version numbers to serve as versions Y and Y+1 for Optuna's hyperparameter optimization process. The search space for the hyperparameters was specified as shown in Table 3. The number of trials for each project was set to 30. After completing these experiments, each project produced a set of hyperparameters that allow the model to output the highest F1 score, along with a model trained on these parameters using version Y. These hyperparameters were then applied to train new models on version Y+1 for each project. Then the model trained on version Y and the model trained on version Y+1 were evaluated against the code of version Y+2. We conducted each evaluation test three times and calculated the mean to obtain the experimental result.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Baseline Models</title>
        <p>We compare our proposed approach against the following baseline models:</p>
        <list list-type="bullet">
          <list-item>
            <p>Support Vector Machine (SVM): a classic and widely adopted machine learning algorithm that excels in both linear and non-linear classification tasks and is known for its effectiveness in handling high-dimensional data.</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>In this section, we present the results of our study and discuss their implications, addressing the research questions (RQ) that guide our investigation.</title>
        <sec id="sec-5-1-1">
          <title>5.1. Impact of JavaBERT-based</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Embeddings with CNN-BiLSTM</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>Model</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>To address RQ1, we assessed the performance of our</title>
        <p>model in comparison to baseline models. Table 4 presents
a detailed performance comparison between our
CNNBiLSTM model and the baseline models concerning
precision, recall, and F1-score. For instance, "ant_1.5_1.6"
represents the experimental results obtained by using
version 1.5 of Ant as the training dataset and version 1.6
as the test dataset. The results demonstrate a consistent
outperformance of our model across all metrics. Figure
2 complements the table by providing a visual
representation of the F1 scores, where the x-axis represents pairs
of software versions used for training and testing (e.g.,
ant_1.5_1.6), and the y-axis represents the corresponding
F1 values obtained during testing. This figure shows that
the F1 of our model is higher than the base model most of
the time.</p>
        <sec id="sec-5-2-1">
          <title>5.2. Model Performance Variability</title>
        </sec>
        <sec id="sec-5-2-2">
          <title>Across PROMISE Projects and</title>
        </sec>
        <sec id="sec-5-2-3">
          <title>Versions</title>
        </sec>
        <sec id="sec-5-2-4">
          <title>4.4. Baseline Models</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>We compare our proposed approach against the following</title>
        <p>baseline models:
To address RQ2, Figure 3 presents the F1 scores of our
model across different projects and their respective
versions in the PROMISE dataset. In this figure, the x-axis
• Support Vector Machine (SVM): SVM, a classic represents pairs of software versions used for training and
and widely adopted machine learning algorithm, testing (e.g., ant_1.5_1.6), while the y-axis represents the
excels in both linear and non-linear classification corresponding F1 values obtained during testing. When
tasks and is known for its effectiveness in handling we examined the model’s performance across different
high-dimensional data. projects and its various versions, we observed certain
noteworthy patterns. Specifically, within the same project,
suggests that the predictive performance of our model
tends to vary when applied to certain projects. Although
we cannot pinpoint the exact reasons behind these changes
at this time, we speculate that they may have been
influenced by a variety of factors, including project-specific
characteristics, code complexity, and domain-related
differences.
such as Lucene, POI, and Xalan, our models show a high
degree of performance consistency across different
versions. This shows that our model is able to predict
results consistently when dealing with different versions
of certain projects. This consistency can be partially
attributed to the higher code similarity found between ver- Figure 3: F1 Score Across PROMISE Projects
sions within the same project, making it easier for models
to capture shared features and patterns.</p>
        <p>There are some differences between versions of Ant
and Synapse, these differences are relatively minor. In
contrast, projects such as Camel and JEdit show more
performance fluctuations, even within the same project. This</p>
        <sec id="sec-5-3-1">
          <title>5.3. The impact of hyperparameters on the performance of CNN-BiLSTM model</title>
        <p>To address RQ3, we study in this section the impact of hyperparameters on the performance of the CNN-BiLSTM model for code defect prediction. Initially, we set the hyperparameters to the following values: the number of epochs is 10, the batch size is 64, the learning rate is 1e-4, the number of CNN filters is 128, the number of BiLSTM hidden units is 256, and the CNN filter size is 5. After that, we fixed the other hyperparameters and gradually adjusted one parameter manually, either the number of CNN filters or the number of BiLSTM hidden units, to observe changes in model performance.</p>
        <p>Figure 4 and Figure 5 show our experimental results: Figure 4 plots the effect of the number of CNN filters on the F1 score, and Figure 5 the effect of the number of BiLSTM hidden units; in both, the x-axis is the varied hyperparameter value and the y-axis shows the F1 score. We can see that the model performance fluctuates greatly when a single parameter changes. For example, the smaller the number of CNN filters, the better the performance of the model. In Figure 5, the F1 score drops after the number of BiLSTM hidden units reaches 16, but the model performs better and tends to be stable after 256. Exploring the impact of each hyperparameter individually would be a time-consuming task, and it is difficult to predict how the model will behave when these hyperparameters are combined. So we used Optuna, which continually searches for hyperparameters that make the model perform better, based on its search algorithm.</p>
        <p>Figures 6 and 7 show the F1 score (y-axis) for a certain number of trials (x-axis). Specifically, Figure 6 is a scatter plot representing the F1 score obtained in each trial; e.g., when the trial number is 5, the F1 score shown is the value for the fifth trial. Figure 7 represents the best model performance achieved by the search up to the current trial; so, in Figure 7, when the trial number is 5, the F1 score is the best F1 score from the first to the fifth trial. We can observe that through continuous repetition and search, Optuna gradually finds better results. The entire process is automated, which greatly simplifies our hyperparameter tuning process.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Threats to Validity</title>
        <p>In our research, we have identified and addressed several potential threats to the validity of our findings.</p>
        <p>The implementation of our Python experimental code for processing source code text and building models poses a potential threat due to the possibility of bugs. To mitigate this, we leveraged mature third-party libraries (such as javalang and PyTorch) and conducted thorough code inspections. Additionally, we applied random oversampling during data preprocessing, which could introduce bias; future work will explore alternative methods to handle class imbalance and assess their impact on results. Moreover, the use of Optuna for hyperparameter optimization introduces potential variability in results due to different search spaces and numbers of trials. To reduce these threats, we plan to conduct more extensive searches and explore larger search spaces.</p>
          <p>Our choice of a subset of projects from the PROMISE
dataset due to time constraints may impact the
generalizability of our findings, as the results may not generalize
well to other projects. To address this, we intend to include
a broader range of projects in future research.</p>
          <p>We evaluated our models using a limited set of
performance metrics, specifically precision, recall, and F1
measure. To reduce these threats, we will consider
incorporating additional metrics such as AUC-ROC and
MCC, among others, to provide a more comprehensive
assessment of model performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <sec id="sec-6-1">
        <title>In this research, we have introduced a novel approach</title>
        <p>that leverages JavaBERT-based embeddings with a
CNNBiLSTM model for software defect prediction. Our
approach harnesses semantic and contextual information in
program code to enhance prediction accuracy. Through
comprehensive experiments on the PROMISE dataset,
we have demonstrated the superiority of our model over
baseline models based on precision, recall, and F1-score
metrics.</p>
      <p>Although our study improves the performance of software defect prediction compared to baseline models, much future work remains. In addition to what we discussed in the "Threats to Validity" section, we can also train BERT models on different languages to adapt our method to other programming languages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Omri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sinz</surname>
          </string-name>
          ,
          <article-title>Deep learning for software defect prediction: A survey</article-title>
          ,
          <source>in: Proceedings of the IEEE/ACM 42nd international conference on software engineering workshops</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A survey of software defects research based on deep learning</article-title>
          ,
          <source>in: 2023 6th International Conference on Information Systems and Computer Networks (ISCON)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. O.</given-names>
            <surname>Elish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Elish</surname>
          </string-name>
          ,
          <article-title>Predicting defect-prone software modules using support vector machines</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>81</volume>
          (
          <year>2008</year>
          )
          <fpage>649</fpage>
          -
          <lpage>660</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Automatically learning semantic features for defect prediction</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Software Engineering</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <article-title>Software defect prediction via convolutional neural network</article-title>
          ,
          <source>in: 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>318</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Software reliability prediction using a deep learning model based on the RNN encoder-decoder</article-title>
          ,
          <source>Reliability Engineering &amp; System Safety</source>
          <volume>170</volume>
          (
          <year>2018</year>
          )
          <fpage>73</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>Software defect prediction via LSTM</article-title>
          ,
          <source>IET software 14</source>
          (
          <year>2020</year>
          )
          <fpage>443</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Seml: A semantic LSTM model for software defect prediction</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>83812</fpage>
          -
          <lpage>83824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Software defect prediction based on gated hierarchical LSTMs</article-title>
          ,
          <source>IEEE Transactions on Reliability</source>
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <fpage>711</fpage>
          -
          <lpage>727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Uddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kefalas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zada</surname>
          </string-name>
          ,
          <article-title>Software defect prediction employing BiLSTM and BERT-based semantic feature</article-title>
          ,
          <source>Soft Computing</source>
          <volume>26</volume>
          (
          <year>2022</year>
          )
          <fpage>7877</fpage>
          -
          <lpage>7891</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>GloVe: Global vectors for word representation</article-title>
          ,
          <source>in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N. T.</given-names>
            <surname>De Sousa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hasselbring</surname>
          </string-name>
          ,
          <article-title>JavaBERT: Training a transformer-based model for the Java programming language</article-title>
          ,
          <source>in: 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukushima</surname>
          </string-name>
          ,
          <article-title>Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position</article-title>
          ,
          <source>Biological cybernetics 36</source>
          (
          <year>1980</year>
          )
          <fpage>193</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Paliwal</surname>
          </string-name>
          ,
          <article-title>Bidirectional recurrent neural networks</article-title>
          ,
          <source>IEEE transactions on Signal Processing</source>
          <volume>45</volume>
          (
          <year>1997</year>
          )
          <fpage>2673</fpage>
          -
          <lpage>2681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Akiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yanase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koyama</surname>
          </string-name>
          ,
          <article-title>Optuna: A next-generation hyperparameter optimization framework</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2623</fpage>
          -
          <lpage>2631</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sayyad Shirabad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Menzies</surname>
          </string-name>
          ,
          <source>The PROMISE Repository of Software Engineering Databases., School of Information Technology and Engineering</source>
          , University of Ottawa, Canada,
          <year>2005</year>
          . URL: http://promise.site.uottawa.ca/SERepository.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Thunes</surname>
          </string-name>
          ,
          <source>javalang: pure Python Java parser and tools</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>