Software Defect Prediction based on JavaBERT and CNN-BiLSTM

Kun Cheng, Shingo Takada
Grad. School of Science and Technology, Keio University, Yokohama, Japan
chengkun@keio.jp (K. Cheng); michigan@ics.keio.ac.jp (S. Takada)

QuASoQ 2023: 11th International Workshop on Quantitative Approaches to Software Quality, December 04, 2023, Seoul, South Korea

Abstract

Software defects can lead to severe issues in software systems, such as software errors, security vulnerabilities, and decreased software performance. Early prediction of software defects can prevent these problems, reduce development costs, and enhance system reliability. However, existing methods often focus on manually crafted code features and overlook the rich semantic and contextual information in program code. In this paper, we propose a novel approach that integrates JavaBERT-based embeddings with a CNN-BiLSTM model for software defect prediction. Our model considers code context and captures code patterns and dependencies throughout the code, thereby improving prediction performance. We incorporate Optuna to find optimal hyperparameters. We conducted experiments on the PROMISE dataset, which demonstrated that our approach outperforms baseline models, particularly in leveraging code semantics to enhance defect prediction performance.

Keywords

Software defect prediction, JavaBERT, CNN, BiLSTM, Optuna

1. Introduction

Software defects present significant challenges to the reliability and performance of software systems, often leading to critical issues such as slow software operation, frequent security vulnerabilities, and software crashes. To address these challenges, researchers have turned their attention to software defect prediction (SDP), a key research area aimed at identifying potentially problematic code early in the development process.

Software Defect Prediction (SDP) is a structured process involving data preprocessing, feature extraction, model building, and evaluation [1]. Feature extraction plays a pivotal role in SDP, as it determines the model's data representation. SDP methods have traditionally relied on manual feature engineering, a time-consuming and laborious design process. This approach also faces challenges in capturing the complex semantics and contextual information embedded in software code as systems become more complex. As a result, there is a growing demand for advanced techniques that can effectively exploit the intrinsic semantic and structural meaning of code, along with its statistical properties.

Recent advances in SDP have shifted towards leveraging structural and semantic features directly from source code or through parsing into an abstract syntax tree (AST) [2]. These modern methods employ such features in combination with various classification methods, encompassing both traditional algorithms and deep learning techniques.

SDP encompasses two primary domains: Cross-Project Defect Prediction (CPDP) and Within-Project Defect Prediction (WPDP). CPDP involves training a model on one project and applying it to another, addressing the challenge of generalization across different software environments. In contrast, WPDP focuses on building models within the same project, enhancing defect prediction performance by considering unique project characteristics and evolution patterns. In this study, our primary focus lies on WPDP, aiming to improve defect prediction performance within a single project.

In this paper, we introduce an innovative approach to SDP that combines Java Bidirectional Encoder Representations from Transformers (JavaBERT) and Convolutional Neural Networks with Bidirectional Long Short-Term Memory (CNN-BiLSTM). By harnessing JavaBERT's contextual understanding of text data and CNN-BiLSTM's capacity to capture structural features, we improve defect prediction performance.
Furthermore, we optimize the model's hyperparameters by introducing Optuna, further refining our predictive model.

The remainder of this paper is organized as follows: Section 2 discusses related work. Section 3 presents the design of our proposed approach. Section 4 covers the implementation details based on the design, and Section 5 offers the evaluation results along with a discussion of potential threats to validity. Finally, Section 6 concludes the paper and discusses future work.

2. Related Work

Researchers have explored various models for feature extraction in software defect prediction, from traditional machine learning to deep learning. Initially, Support Vector Machines (SVM), as employed by Elish et al. [3], gained prominence for identifying defective modules using static code metrics. However, SVM struggled to uncover deep semantics within the source code. Deep Belief Networks (DBN), introduced by Wang et al. [4], aimed to extract more complex features from code through unsupervised learning; yet their limited depth posed challenges in revealing intricate relationships within the source code. Convolutional Neural Networks (CNNs) were used by Li et al. [5] to predict software defects by analyzing structural correlations between code tokens. While proficient at capturing local patterns, CNNs faced challenges in capturing longer-range connections. Wang et al. [6] introduced an RNN (Recurrent Neural Network)-based model for predicting software reliability. Deng et al. [7] and Liang et al. [8] applied Long Short-Term Memory (LSTM) models to software defect prediction, capturing temporal patterns in code sequences.
However, a single LSTM can capture temporal patterns in only one direction of the code sequence. In response, Bidirectional LSTM (BiLSTM) models with attention mechanisms emerged: Wang et al. [9] introduced a gated hierarchical BiLSTM model, and Uddin et al. [10] combined BiLSTM with attention and BERT-based embeddings.

In short, SVM has difficulty discovering the deep semantics of the source code; DBN's limited depth makes it difficult to understand the complex relationships in the source code; CNN has difficulty capturing long-distance correlations; and RNN and LSTM can capture only a single temporal direction. BiLSTM, in turn, may have challenges in capturing local patterns.

To solve these problems, we combine the strengths of CNN in detecting local patterns with the strengths of BiLSTM in processing sequences, allowing for comprehensive code inspection. We further incorporate JavaBERT to dynamically adjust token embeddings based on the entire input sequence, thereby deepening the representation and capturing interdependencies among code tokens.

3. Proposed Methodology

Our software defect prediction method consists of several key steps, all aimed at improving prediction performance. As shown in Figure 1, we first use JavaBERT to convert the code into vector representations. Next, we employ the CNN-BiLSTM model for feature extraction, focusing on local patterns and context. We also incorporate statistical features to fully utilize all available information. Optuna automatically executes the above combination of JavaBERT and CNN-BiLSTM multiple times and outputs the best hyperparameter values found across these executions. We then retrain the model on another version of the code with the obtained hyperparameters and test the model's performance.

Figure 1: Overview of Methodology

3.1. Embedding with JavaBERT

BERT (Bidirectional Encoder Representations from Transformers) [11] is a language model widely employed in natural language processing (NLP) tasks. Unlike conventional embeddings, BERT excels at capturing intricate contextual associations. Traditional methods like Word2Vec [12] and GloVe [13] generate static representations, whereas BERT, utilizing multi-layer bidirectional transformers, enables tokens to gather information from both preceding and succeeding tokens.

In our approach, we leverage a pretrained BERT model, JavaBERT [14], fine-tuned for Java code. JavaBERT has been trained on a dataset of 2,998,345 Java files from GitHub open source projects. JavaBERT's transformer architecture dynamically adapts token embeddings based on the entire input sequence, enhancing representation depth and capturing code token interdependencies. The JavaBERT embeddings, denoted as E_JavaBERT, are computed by applying the model's encoder to tokenized Java code. For a sequence of code tokens C = {c_1, c_2, ..., c_n}, the embeddings are computed as:

E_{JavaBERT} = \mathrm{Encoder}_{JavaBERT}(c_1, c_2, \ldots, c_n)

Models typically cannot process code text sequences directly. Through JavaBERT, we embed code text into a continuous vector space and use these vectors as inputs to the model, making it easier for the model to compute over and understand the code.
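To make this step concrete, the following minimal sketch shows how token-level embeddings can be obtained from a pretrained JavaBERT checkpoint with the HuggingFace transformers library. The checkpoint name CAUKiel/JavaBERT and the use of the encoder's last hidden state as E_JavaBERT are our assumptions; the paper does not name the exact checkpoint or layer.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint for JavaBERT [14]; not stated in the paper.
tokenizer = AutoTokenizer.from_pretrained("CAUKiel/JavaBERT")
encoder = AutoModel.from_pretrained("CAUKiel/JavaBERT")

code = "public int add(int a, int b) { return a + b; }"

# Tokenize the Java source, truncating to BERT's 512-token limit.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = encoder(**inputs)

# E_JavaBERT: one contextual vector per code token, shape (1, seq_len, 768).
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```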
3.2. Feature Extraction using CNN-BiLSTM

We combine Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory networks (BiLSTM) to extract features. This is the key part of our approach: after features are extracted with the CNN, they are refined with the sequential capabilities of the BiLSTM.

3.2.1. Feature Extraction with CNN

Utilizing Convolutional Neural Networks (CNN) [15] for feature extraction involves sliding a small window, known as a filter, over various parts of the code. This filter examines a small segment of the code at a time, calculating a value at each sliding position to create a "feature map." The positions in the code correspond to positions in the feature map. The code segment observed within the filter's scope is termed the "input sequence slice." As the filter traverses the entire code, it analyzes these input sequence slices, effectively capturing distinct features that characterize the code's structural and syntactical elements. The process of feature extraction using CNN is mathematically expressed as:

y[i, j] = \sigma\Big( \sum_{m} \sum_{n} x[i+m,\, j+n] \cdot w[m, n] + b \Big)

where x[i, j] is the input at position (i, j), w[m, n] represents the kernel weight at position (m, n), b is the bias, and \sigma signifies the activation function.

3.2.2. Refinement of Features with BiLSTM

The Bidirectional Long Short-Term Memory (BiLSTM) [16] layer enhances the features extracted by the CNN. What sets BiLSTM apart is its capability to capture both short-term and long-term dependencies within the code, which complements the local feature extraction carried out by the CNN.

The forward and backward computations in BiLSTM can be unified into a single mathematical representation:

h_t = \mathrm{BiLSTM}(x_t, h_{t-1}, h_{t+1})

In this equation, h_t represents the hidden state at time step t in the BiLSTM model. It is computed from the input x_t at the current time step, the previous hidden state h_{t-1}, and the hidden state h_{t+1} of the next time step t + 1. The BiLSTM model effectively captures sequential patterns and dependencies in data by considering information from both directions: it analyzes the sequence of tokens, capturing dependencies extending both backward and forward within the code. This dynamic construction of code features considers token order, revealing evolving patterns and connections over time and amplifying the feature representation. In summary, we refine the feature maps obtained from the CNN using the BiLSTM to achieve a comprehensive code representation. This fusion of capturing local patterns and accounting for temporal dependencies improves software defect prediction performance; a sketch of the combined extractor follows.
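The following PyTorch sketch puts Sections 3.2.1 and 3.2.2, together with the feature concatenation of Section 3.3, into one module. It is a minimal illustration under assumptions: the layer sizes are placeholder defaults rather than the tuned values reported later in Table 5, taking the last BiLSTM time step is one simple way to summarize the sequence (the paper does not specify the pooling), and the 18 statistical features correspond to Table 2.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, embed_dim=768, num_filters=128, filter_size=5,
                 hidden_units=256, num_stat_features=18):
        super().__init__()
        # Section 3.2.1: a filter slides over the token sequence,
        # producing one feature map per filter.
        self.conv = nn.Conv1d(embed_dim, num_filters,
                              kernel_size=filter_size, padding=filter_size // 2)
        self.relu = nn.ReLU()
        # Section 3.2.2: a bidirectional LSTM refines the feature maps,
        # reading the sequence in both directions.
        self.bilstm = nn.LSTM(num_filters, hidden_units,
                              batch_first=True, bidirectional=True)
        # Section 3.3: BiLSTM output concatenated with the statistical
        # features (Table 2) before the buggy/clean decision.
        self.classifier = nn.Linear(2 * hidden_units + num_stat_features, 2)

    def forward(self, embeddings, stats):
        # embeddings: (batch, seq_len, embed_dim) from JavaBERT
        x = self.conv(embeddings.transpose(1, 2))    # (batch, filters, seq_len)
        x = self.relu(x).transpose(1, 2)             # (batch, seq_len, filters)
        out, _ = self.bilstm(x)                      # (batch, seq_len, 2*hidden)
        code_vec = out[:, -1, :]                     # last-step summary (one simple choice)
        fused = torch.cat([code_vec, stats], dim=1)  # Section 3.3 concatenation
        return self.classifier(fused)
```

A forward pass takes the JavaBERT embeddings from the sketch above plus one statistical feature vector per file and returns two logits (clean vs. buggy), e.g. `CNNBiLSTM()(embeddings, torch.randn(1, 18))`.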
3.3. Integration with Statistical Features

Our methodology integrates the refined BiLSTM outputs with statistical features (such as those shown in Table 2) extracted from the dataset. This step concatenates the vectors obtained from the BiLSTM with the vectors of statistical features obtained from the dataset, forming longer vectors that make full use of the descriptive information of the code.

3.4. Hyperparameter Optimization by Optuna

Optuna, a powerful hyperparameter optimization framework developed by Akiba et al. [17], plays a vital role in our approach by automating hyperparameter tuning for the CNN-BiLSTM model. Similar frameworks exist, such as Ray Tune, but Optuna is more lightweight and easier to use. It employs the Tree-structured Parzen Estimator (TPE) algorithm to efficiently explore and exploit the hyperparameter space, enhancing the performance of our software defect prediction task.

A crucial step in our methodology is determining optimal hyperparameters by leveraging shared features among different versions of the same project. Usually, code in versions with similar version numbers exhibits a high degree of similarity. By harnessing these inherent similarities, we attempt to find hyperparameters that generalize across versions, ultimately enhancing model performance.

Using the Ant project as an example, our aim is to demonstrate the transferability of hyperparameters obtained from training on one version (e.g., 1.5) to another (e.g., 1.6). This transferability is plausible because both versions originate from the same project and share similar code structures and functionalities. This enables the hyperparameters obtained from one version to serve as a foundation for other versions within the same project, thereby solidifying our model configuration.

We start by selecting version pairs; here, we designate Ant version 1.5 for training and version 1.6 for testing. Next, we define the performance metric to optimize, such as the F1 score. Subsequently, Optuna conducts multiple experiments, traversing various hyperparameter combinations and evaluating their performance on the designated testing dataset. Through these iterative experimentation and evaluation stages, Optuna determines the hyperparameter set that maximizes the chosen performance metric. This process can be represented as:

H_x = \mathrm{Optuna}_{f}(\text{Ant 1.5}, \text{Ant 1.6})

Here, f(Ant 1.5, Ant 1.6) denotes the objective function maximized during hyperparameter optimization, with Ant 1.5 as the training dataset and Ant 1.6 as the testing dataset. After obtaining the optimal hyperparameters H_x through the Optuna process, we transfer them across different project versions: H_x is applied to reconfigured training and testing sets. For instance, in the Ant project, H_x is then used on other version pairs, such as training on Ant 1.6 with H_x and testing on Ant 1.7. This operation carries optimized hyperparameters across version pairs, contributing to enhanced model adaptability and performance across project iterations. A sketch of the search loop follows.
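The following is a hedged sketch of the Optuna search described above, using the search space later listed in Table 3. The helper train_and_eval_f1 is hypothetical shorthand for the full training-and-evaluation loop, which is omitted here; TPE is Optuna's default sampler.

```python
import optuna

def train_and_eval_f1(params, train, test):
    """Hypothetical helper: train the CNN-BiLSTM on `train` with `params`
    and return the F1 score on `test` (full loop omitted in this sketch)."""
    ...

def objective(trial):
    # Search space from Table 3.
    params = {
        "num_epochs": trial.suggest_int("num_epochs", 3, 10),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64, 128]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "filter_size": trial.suggest_categorical("filter_size", [3, 5, 7, 9, 11]),
        "num_filters": trial.suggest_int("num_filters", 32, 512),
        "hidden_units": trial.suggest_categorical(
            "hidden_units", [16, 32, 64, 128, 256, 512, 1024]),
    }
    return train_and_eval_f1(params, train="ant-1.5", test="ant-1.6")

# Optuna's default sampler is TPE; 30 trials per project (Section 4.3).
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)  # H_x, reused for later version pairs
```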
4. Experimental Setup

4.1. Research Questions

Our experiment addresses the following research questions (RQ):

RQ1: How does the performance of our CNN-BiLSTM model compare against baseline models?

RQ2: How does the performance of the proposed model vary across different software projects and across the different versions of each project in the PROMISE dataset?

RQ3: How do different hyperparameter settings impact the performance of the combined CNN-BiLSTM model in code defect prediction?

4.2. Dataset and Data Preprocessing

Our study uses the PROMISE [18] dataset, exclusively comprised of Java projects. This dataset spans various domains and project scales, providing project details such as name, description, version, and bug rate. Table 1 shows an overview of the projects we use from the PROMISE Java dataset. Since Optuna's hyperparameter search takes a lot of time, we selected only a subset of the projects in the PROMISE dataset. Statistical features also play a vital role in code analysis, offering insights into code composition and behavior. To enhance our study, we carefully selected a subset of these features, as shown in Table 2.

Table 1
Selected Projects in the PROMISE Java Dataset

Project   | Versions      | Buggy Rate
Ant       | 1.5, 1.6, 1.7 | 0.109, 0.263, 0.224
Camel     | 1.2, 1.4, 1.6 | 0.360, 0.171, 0.201
JEdit     | 3.2, 4.0, 4.1 | 0.346, 0.256, 0.263
Lucene    | 2.0, 2.2, 2.4 | 0.489, 0.611, 0.615
Poi       | 2.0, 2.5, 3.0 | 0.120, 0.654, 0.641
Synapse   | 1.0, 1.1, 1.2 | 0.102, 0.270, 0.336
Xalan     | 2.4, 2.5, 2.6 | 0.163, 0.509, 0.468

Table 2
Selected Statistical Features

Measure of Functional Abstraction (MFA)
Coupling Between Methods (CBM)
Data Access Metric (DAM)
Coupling Between Object classes (CBO)
Lines Of Code (LOC)
Afferent Couplings (CA)
Number Of Children (NOC)
Lack of COhesion in Methods (LCOM)
Average Method Complexity (AMC)
Inheritance Coupling (IC)
Response For a Class (RFC)
Efferent Couplings (CE)
Measure Of Aggregation (MOA)
Weighted Methods per Class (WMC)
Depth of Inheritance Tree (DIT)
Lack of COhesion in Methods (LCOM3)
Cohesion Among Methods of class (CAM)
Number of Public Methods (NPM)

To prepare the data for analysis, we conducted thorough data preprocessing. Using the "javalang" [19] Python library, we removed redundant code elements such as comments, white space, and other unnecessary details. This process allowed us to extract the essential token sequences, capturing the code's semantics. To address class imbalance in software defect prediction, we applied random oversampling exclusively to the "Bug" class files. This deliberate strategy generated additional data instances, improving the class distribution and mitigating potential bias towards the majority class. A sketch of this preprocessing step follows.
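Below is a minimal sketch of this preprocessing, assuming javalang's tokenizer (which does not emit comment or whitespace tokens, so those elements drop out of the token sequence automatically) and plain duplication for the random oversampling; whether files were duplicated verbatim or perturbed is not stated in the paper, and the function names are ours.

```python
import random
import javalang

def tokenize_java(source):
    # Comments and whitespace are not emitted as tokens by javalang.
    return [tok.value for tok in javalang.tokenizer.tokenize(source)]

def oversample_buggy(files, labels, seed=42):
    # Randomly duplicate "Bug"-class files (label 1) until the two
    # classes are balanced; a no-op if buggy files already dominate.
    rng = random.Random(seed)
    buggy = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(buggy) for _ in range(max(0, len(clean) - len(buggy)))]
    idx = list(range(len(files))) + extra
    return [files[i] for i in idx], [labels[i] for i in idx]
```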
4.3. Experimental Settings

For each project listed in Table 1, we selected the two smallest version numbers to serve as versions Y and Y+1 for Optuna's hyperparameter optimization process. The search space for the hyperparameters is specified in Table 3. The number of trials for each project was set to 30. After completing these experiments, each project yields the set of hyperparameters that produced the highest F1 score, together with a model trained with these hyperparameters on version Y. The same hyperparameters were then used to train a new model on version Y+1 for each project. Both the model trained on version Y and the model trained on version Y+1 were then evaluated against the code of version Y+2. We ran each evaluation three times and report the mean as the experimental result.

Table 3
Search Space for Hyperparameters

Hyperparameter    | Search Range
Number of Epochs  | 3 to 10
Batch Size        | [16, 32, 64, 128]
Learning Rate     | 1e-5 to 1e-2 (log-uniform)
Filter Sizes      | [3, 5, 7, 9, 11]
Number of Filters | 32 to 512
Hidden Units      | [16, 32, 64, 128, 256, 512, 1024]

4.4. Baseline Models

We compare our proposed approach against the following baseline models:

- Support Vector Machine (SVM): SVM, a classic and widely adopted machine learning algorithm, excels in both linear and non-linear classification tasks and is known for its effectiveness in handling high-dimensional data.
- Convolutional Neural Network (CNN): CNNs excel at extracting hierarchical features from structured data, making them suitable for capturing local patterns in software defect prediction.
- Bidirectional Long Short-Term Memory (BiLSTM): BiLSTM enhances LSTM by considering bidirectional information flow, enabling it to capture both past and future contexts.

In assessing predictive performance, this paper uses three widely accepted metrics: precision, recall, and the F1-score, computed as shown below.
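For reference, the three metrics can be computed with scikit-learn; the label vectors here are toy placeholders, not data from our experiments.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1]   # 1 = buggy, 0 = clean (toy example)
y_pred = [1, 0, 1, 0, 0, 1, 1]

print(f"P  = {precision_score(y_true, y_pred):.4f}")
print(f"R  = {recall_score(y_true, y_pred):.4f}")
print(f"F1 = {f1_score(y_true, y_pred):.4f}")
```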
5. Results and Discussion

In this section, we present the results of our study and discuss their implications, addressing the research questions (RQ) that guide our investigation.

5.1. Impact of JavaBERT-based Embeddings with CNN-BiLSTM Model

To address RQ1, we assessed the performance of our model in comparison to the baseline models. Table 4 presents a detailed performance comparison between our CNN-BiLSTM model and the baseline models with respect to precision (P), recall (R), and F1-score. For instance, "ant_1.5_1.6" denotes the experimental results obtained by using version 1.5 of Ant as the training dataset and version 1.6 as the test dataset. The results demonstrate a consistent outperformance of our model across all metrics. Figure 2 complements the table by providing a visual representation of the F1 scores, where the x-axis represents pairs of software versions used for training and testing (e.g., ant_1.5_1.6) and the y-axis represents the corresponding F1 values obtained during testing. The figure shows that the F1 score of our model is higher than those of the baseline models most of the time.

Table 4
Comparison of Experimental Results with Baseline Models

project          SVM                     CNN                     BiLSTM                  CNN-BiLSTM
                 P      R      F1        P      R      F1        P      R      F1        P      R      F1
ant_1.5_1.6      0.5133 0.6304 0.5659    0.5526 0.4565 0.5000    0.5120 0.6957 0.5899    0.6364 0.6848 0.6597
ant_1.6_1.7      0.5849 0.5602 0.5723    0.5230 0.5482 0.5353    0.2486 0.5241 0.3372    0.5868 0.5904 0.5886
ant_1.5_1.7      0.4297 0.6807 0.5268    0.3653 0.7349 0.4880    0.4194 0.7048 0.5258    0.5531 0.5964 0.5739
camel_1.2_1.4    0.3645 0.5103 0.4253    0.5625 0.4966 0.5275    0.3186 0.4966 0.3881    0.4647 0.7724 0.5803
camel_1.4_1.6    0.2687 0.0957 0.1412    0.5392 0.2926 0.3793    0.2908 0.3883 0.3326    0.4957 0.3032 0.3762
camel_1.2_1.6    0.5000 0.1915 0.2769    0.3571 0.1862 0.2448    0.2908 0.3883 0.3326    0.3976 0.5266 0.4531
jedit_3.2_4.0    0.4333 0.1733 0.2476    0.4715 0.7733 0.5859    0.5208 0.6667 0.5848    0.4741 0.8533 0.6095
jedit_4.0_4.1    0.3835 0.7727 0.5126    0.5889 0.6709 0.6272    0.5517 0.7273 0.6275    0.7838 0.3671 0.5000
jedit_3.2_4.1    0.4783 0.1667 0.2472    0.5039 0.8101 0.6214    0.5455 0.7273 0.6234    0.4803 0.7722 0.5922
lucene_2.0_2.2   0.7681 0.4454 0.5638    0.6918 0.7692 0.7285    0.7571 0.4454 0.5608    0.6371 0.9875 0.7745
lucene_2.2_2.4   0.6923 0.7310 0.7111    0.6329 0.6650 0.6485    0.7739 0.4518 0.5705    0.6204 0.9806 0.7600
lucene_2.0_2.4   0.6120 0.9848 0.7549    0.6339 0.7208 0.6746    0.7768 0.4416 0.5631    0.6204 0.9806 0.7600
poi_2.0_2.5      0.6996 0.6573 0.6778    0.6781 0.3992 0.5025    0.8053 0.7339 0.7679    0.7240 0.9839 0.8342
poi_2.5_3.0      0.8560 0.7429 0.7954    0.7038 0.7214 0.7125    0.8547 0.7143 0.7782    0.6943 0.9571 0.8048
poi_2.0_3.0      0.7436 0.7250 0.7342    0.7333 0.5893 0.6535    0.8559 0.7214 0.7829    0.7034 0.9571 0.8109
synapse_1.0_1.1  0.4815 0.2281 0.3095    0.5946 0.3860 0.4681    0.5000 0.3158 0.3871    0.5077 0.5789 0.5410
synapse_1.1_1.2  0.5152 0.3953 0.4474    0.5417 0.4535 0.4937    0.5439 0.3605 0.4336    0.5190 0.4767 0.4970
synapse_1.0_1.2  0.5455 0.2791 0.3692    0.5634 0.4651 0.5096    0.5273 0.3372 0.4113    0.4483 0.4535 0.4509
xalan_2.4_2.5    0.6609 0.4258 0.5179    0.5922 0.4059 0.4817    0.5721 0.3221 0.4122    0.5957 0.7176 0.6510
xalan_2.5_2.6    0.6494 0.5221 0.5788    0.6274 0.6397 0.6335    0.5344 0.3431 0.4179    0.5804 0.8848 0.7010
xalan_2.4_2.6    0.6333 0.5123 0.5664    0.6506 0.2647 0.3763    0.5344 0.3431 0.4179    0.5983 0.8431 0.6999
Average          0.5626 0.4967 0.5020    0.5766 0.5452 0.5425    0.5588 0.5166 0.5164    0.5772 0.7270 0.6295

Figure 2: F1 Score Comparison Visualization

5.2. Model Performance Variability Across PROMISE Projects and Versions

To address RQ2, Figure 3 presents the F1 scores of our model across different projects and their respective versions in the PROMISE dataset. In this figure, the x-axis represents pairs of software versions used for training and testing (e.g., ant_1.5_1.6), while the y-axis represents the corresponding F1 values obtained during testing.

Figure 3: F1 Score Across PROMISE Projects

When we examined the model's performance across different projects and their various versions, we observed certain noteworthy patterns. Within some projects, such as Lucene, POI, and Xalan, our model shows a high degree of performance consistency across different versions, indicating that it predicts consistently when dealing with different versions of certain projects. This consistency can be partially attributed to the higher code similarity found between versions within the same project, which makes it easier for the model to capture shared features and patterns. While there are some differences between versions of Ant and Synapse, these differences are relatively minor. In contrast, projects such as Camel and JEdit show larger performance fluctuations, even within the same project. This suggests that the predictive performance of our model varies when applied to certain projects. Although we cannot pinpoint the exact reasons behind these variations at this time, we speculate that they may be influenced by a variety of factors, including project-specific characteristics, code complexity, and domain-related differences.

5.3. The Impact of Hyperparameters on the Performance of the CNN-BiLSTM Model

To address RQ3, we study the impact of hyperparameters on the performance of the CNN-BiLSTM model for code defect prediction. Initially, we set the hyperparameters to the following values: number of epochs 10, batch size 64, learning rate 1e-4, number of CNN filters 128, number of BiLSTM hidden units 256, and CNN filter size 5. We then fixed all other hyperparameters and manually varied a single parameter, either the number of CNN filters or the number of BiLSTM hidden units, to observe changes in model performance (a sketch of this probe follows below).

Figures 4 and 5 show our experimental results: the x-axis is the number of CNN filters or BiLSTM hidden units, and the y-axis shows the F1 score. We can see that model performance fluctuates considerably when a single parameter changes. For example, the smaller the number of CNN filters, the better the performance of the model. In Figure 5, the F1 score drops after 16 BiLSTM hidden units, but performs better and tends to stabilize after 256.

Figure 4: Effect of CNN Filter Length on F1 Score

Figure 5: Effect of BiLSTM Hidden Units on F1 Score
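The probe can be sketched as the following one-factor-at-a-time loop, reusing the hypothetical train_and_eval_f1 helper from the Optuna sketch in Section 3.4. The fixed defaults are those stated above; the sweep points and the version pair are our assumptions, as the paper does not list them.

```python
# Fixed defaults from Section 5.3; vary one hyperparameter at a time.
base = {"num_epochs": 10, "batch_size": 64, "learning_rate": 1e-4,
        "filter_size": 5, "num_filters": 128, "hidden_units": 256}

for num_filters in (32, 64, 128, 256, 512):   # assumed sweep points
    f1 = train_and_eval_f1({**base, "num_filters": num_filters},
                           train="ant-1.5", test="ant-1.6")  # assumed pair
    print(num_filters, f1)
```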
Exploring the impact of each hyperparameter individually would be a time-consuming task, and it is difficult to predict how the model will behave when these hyperparameters are combined. We therefore used Optuna, which repeatedly searches for hyperparameters that make the model perform better, guided by its search algorithm. Figures 6 and 7 show the F1 score (y-axis) over the number of trials (x-axis). Specifically, Figure 6 is a scatter plot of the F1 score obtained in each trial; for example, when the trial number is 5, the plotted F1 score is the value for the fifth trial. Figure 7 shows the best model performance achieved up to the current trial; so in Figure 7, when the trial number is 5, the F1 score is the best F1 score among the first five trials. We can observe that, through continuous repetition and search, Optuna gradually finds better results. The entire process is automated, which greatly simplifies our hyperparameter tuning process.

Figure 6: Scatter Plot of F1 Scores Across Optuna Trials

Figure 7: Progressive Improvement of Best Model Performance

Table 5 summarizes the hyperparameter combinations obtained through Optuna. These combinations were identified as yielding better performance for our code defect prediction model.

Table 5
Hyperparameter Combinations Obtained Through Optuna

proj            | Optuna Time | num_epochs | batch_size | learning_rate | filter_size | num_filters | rnn_hidden
ant_1.5_1.6     | 8.18 h      | 8          | 64         | 0.000185      | 3           | 186         | 256
camel_1.2_1.4   | 30.92 h     | 7          | 32         | 0.000148      | 7           | 120         | 1024
jedit_3.2_4.0   | 4.32 h      | 6          | 32         | 0.000251      | 11          | 157         | 64
lucene_2.0_2.2  | 1.21 h      | 3          | 128        | 0.008864      | 9           | 178         | 128
poi_2.0_2.5     | 3.93 h      | 6          | 128        | 0.000015      | 3           | 240         | 256
synapse_1.0_1.1 | 3.09 h      | 7          | 64         | 0.000458      | 5           | 346         | 64
xalan_2.4_2.5   | 35.23 h     | 5          | 128        | 0.000232      | 5           | 194         | 16

5.4. Threats to Validity

In our research, we have identified and addressed several potential threats to the validity of our findings.

The implementation of our Python experimental code for processing source code text and building models poses a potential threat due to the possibility of bugs. To mitigate this, we leveraged mature third-party libraries (such as javalang and PyTorch) and conducted thorough code inspections. Additionally, we applied random oversampling during data preprocessing, which could introduce bias. Future work will explore alternative methods to handle class imbalance and assess their impact on results. Moreover, the use of Optuna for hyperparameter optimization introduces potential variability in results due to different search spaces and numbers of trials. To reduce these threats, we plan to conduct more extensive searches and explore larger search spaces.

Our choice of a subset of projects from the PROMISE dataset due to time constraints may impact the generalizability of our findings, as the results may not generalize well to other projects. To address this, we intend to include a broader range of projects in future research.

We evaluated our models using a limited set of performance metrics, specifically precision, recall, and the F1 measure. To reduce this threat, we will consider incorporating additional metrics, such as AUC-ROC and MCC, to provide a more comprehensive assessment of model performance.

6. Conclusion and Future Work

In this research, we have introduced a novel approach that leverages JavaBERT-based embeddings with a CNN-BiLSTM model for software defect prediction. Our approach harnesses semantic and contextual information in program code to enhance prediction accuracy. Through comprehensive experiments on the PROMISE dataset, we have demonstrated the superiority of our model over baseline models in terms of precision, recall, and F1-score.

Although our study improves software defect prediction performance compared to baseline models, much future work remains. In addition to the directions discussed in the "Threats to Validity" section, we can also train BERT models for different languages to adapt our method to other programming languages.

References

[1] S. Omri, C. Sinz, Deep learning for software defect prediction: A survey, in: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 2020, pp. 209–214.
[2] F. Meng, R. Huang, J. Wang, A survey of software defects research based on deep learning, in: 2023 6th International Conference on Information Systems and Computer Networks (ISCON), IEEE, 2023, pp. 1–5.
[3] K. O. Elish, M. O. Elish, Predicting defect-prone software modules using support vector machines, Journal of Systems and Software 81 (2008) 649–660.
[4] S. Wang, T. Liu, L. Tan, Automatically learning semantic features for defect prediction, in: Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 297–308.
[5] J. Li, P. He, J. Zhu, M. R. Lyu, Software defect prediction via convolutional neural network, in: 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2017, pp. 318–328.
[6] J. Wang, C. Zhang, Software reliability prediction using a deep learning model based on the RNN encoder–decoder, Reliability Engineering & System Safety 170 (2018) 73–82.
[7] J. Deng, L. Lu, S. Qiu, Software defect prediction via LSTM, IET Software 14 (2020) 443–450.
[8] H. Liang, Y. Yu, L. Jiang, Z. Xie, Seml: A semantic LSTM model for software defect prediction, IEEE Access 7 (2019) 83812–83824.
[9] H. Wang, W. Zhuang, X. Zhang, Software defect prediction based on gated hierarchical LSTMs, IEEE Transactions on Reliability 70 (2021) 711–727.
[10] M. N. Uddin, B. Li, Z. Ali, P. Kefalas, I. Khan, I. Zada, Software defect prediction employing BiLSTM and BERT-based semantic feature, Soft Computing 26 (2022) 7877–7891.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[13] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[14] N. T. De Sousa, W. Hasselbring, JavaBERT: Training a transformer-based model for the Java programming language, in: 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), IEEE, 2021, pp. 90–95.
[15] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36 (1980) 193–202.
[16] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673–2681.
[17] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2623–2631.
[18] J. Sayyad Shirabad, T. Menzies, The PROMISE Repository of Software Engineering Databases, School of Information Technology and Engineering, University of Ottawa, Canada, 2005. URL: http://promise.site.uottawa.ca/SERepository.
[19] C. Thunes, javalang: pure Python Java parser and tools, 2020.