<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Duplicate Bug Report Detection by Using Sentence Embedding and Faiss</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lee Sunho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lee Seonah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gyeongsang National University</institution>
          ,
          <addr-line>Gyeongsangnam-do, Jinju City</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Duplicate issue reports in an issue management system cause unnecessary work for users and developers and disrupt the progress of their tasks. Researchers have therefore developed duplicate report detection techniques. However, most of those techniques require training on pairs of duplicate reports as well as pairs of non-duplicate bug reports. To speed up the processing required by such machine-learning-oriented methods, we propose implementing duplicate report detection using two recent technologies, Sentence BERT and Faiss. By using the Faiss library, we can quickly detect duplicate issue reports. We also evaluate our proposed approach with two different experiments and discuss our future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Bug report</kwd>
        <kwd>Duplicate report detection</kwd>
        <kwd>Sentence BERT</kwd>
        <kwd>SBERT</kwd>
        <kwd>Faiss</kwd>
        <kwd>Machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Many developers collaborate on a single software project, working with their team
members to address various tasks. In open-source projects, anyone interested in a project
can easily participate in development, in addition to the designated team members. When many
people are involved in a single project, they create bug reports to share problems of the software
system, information about their situations, the current state of development, and so on. In larger
projects, tens of thousands of such bug reports can be created daily.</p>
<p>However, duplicate bug reports are also created in this process. These duplicate bug
reports increase unnecessary workload and cause confusion about the progress of tasks. To
resolve these issues, it is necessary to proactively detect and eliminate duplicate bug reports. One
way to identify duplicate bug reports is to check the document similarity among those reports.</p>
      <p>
Researchers have proposed various methods to detect such duplicate bug reports. Those methods
can be broadly categorized into ML (machine learning) and IR (information retrieval) approaches. In the IR group,
some methods use TF-IDF and cosine similarity [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
]. Others employed BM25F to measure the
similarity between a query bug report and existing bug reports when detecting duplicates [
        <xref ref-type="bibr" rid="ref3">3, 19</xref>
]. However, those
approaches often fall short in understanding context. In the ML group, methods have used DCCNN
(Dual-Channel Convolutional Neural Networks) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], GloVe[
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], MLP (Multi Layer Perceptron) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
and self-attention [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. Those methods aimed to comprehend the context of duplicate bug reports.
However, those methods have a drawback: they require creating pairs of bug reports
for training and evaluation. Additionally, training deep learning models can be time-consuming.
      </p>
<p>We propose duplicate report detection using Sentence BERT (SBERT) and Faiss (Facebook AI
Similarity Search). Faiss is a library that quickly finds similar items among a large number of documents. With it,
our approach can detect duplicate bug reports without extensively training on a dataset of
duplicate bug report pairs.</p>
<p>In our experiments, we used only unstructured data, specifically the 'title' and 'body' of bug reports,
without any pairs of bug reports. We employed Sentence BERT to generate
embeddings for these data. Subsequently, we created an IndexIDMap in Faiss, where the resulting
embeddings are associated with their respective issue IDs. When a new issue is given, our proposed
approach uses Faiss's search function to find duplicate bug reports in the IndexIDMap.</p>
<p>In this paper, we ask three research questions. The first question concerns the accuracy of detecting
duplicate bug reports. The second concerns predicting “duplicate” labels for bug reports.
The third concerns the speed of searching for and detecting duplicate bug reports. Our experimental
results show that, with regard to duplicate bug report detection, our proposed approach achieves
reasonable accuracy but does not outperform state-of-the-art approaches. However, as shown by the
second question, our approach can be applied to a new problem, predicting “duplicate” labels for
bug reports, and shows higher prediction accuracy than state-of-the-art approaches. Finally, our
approach is fast at detecting duplicate bug reports and predicting their labels.</p>
<p>Our contributions are as follows. First, we propose a method for quickly identifying duplicate bug
reports using the recently developed Faiss library. Second, we collected real-world open-source data
and report the performance of our proposed method. We found that our system can rapidly detect
duplicate reports. Our research results could help users and developers identify duplicate reports
and thereby avoid unnecessary tasks in advance.</p>
<p>The structure of this paper is as follows. Section 2 explains the basic concepts used in this paper.
Section 3 describes the proposed method and the structure of the model.
Section 4 explains the experimental set-up. Section 5 presents the analysis of
the experimental results. Section 6 discusses our findings and limitations, Section 7 presents
threats to validity, and Section 8 concludes the paper and outlines future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
<p>We provide a summary of the terms and concepts used in this paper.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Duplicate bug reports</title>
<p>Software developers often collaborate with many individuals to enhance project performance and
expedite completion via bug management systems. During software development, various bugs
arise, and bug reports are documented to facilitate communication among developers. Most
bug reports are described in text form. Because many people express bugs in text, even the
same bug can be reported in different bug reports with different textual expressions. Such
reports are referred to as "duplicate bug reports." Duplicate bug reports increase
unnecessary work for developers, which makes it challenging to monitor the progress of tasks. For
this reason, it is necessary to identify duplicate bug reports proactively. However, manually
detecting duplicate bug reports requires a significant amount of time and effort.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2. Sentence Embedding</title>
          <p>
To process text on a computer, it is necessary to convert text into numerical representations. One
way to do this is to assign a unique number to each word. The simplest such scheme is one-hot encoding,
which assigns a unique index to each word. However, such an approach does not capture the meaning of
words in the context in which they are used. To create numerical representations that capture
contextual meaning, BERT is used. BERT, introduced in a paper by Jacob Devlin in 2018 [
            <xref ref-type="bibr" rid="ref8">8</xref>
], is a
language model that leverages a self-attention mechanism to transform text into vectors that
encapsulate contextual meaning.
          </p>
          <p>
            In 2019, Nils Reimers introduced "Sentence BERT (SBERT)" in a paper [
            <xref ref-type="bibr" rid="ref7">7</xref>
], which is a modified
language model derived from BERT and specifically designed for tasks involving the calculation of
similarity between sentences. SBERT produces better sentence-level embeddings by using
techniques such as the Siamese architecture.
          </p>
<p>There are three main methods for representing sentences as vectors using SBERT:</p>
          <list list-type="bullet">
            <list-item><p>Mean pooling: the sentence is input into BERT, and the average of the resulting token vectors is used as the representation of the sentence. This approach captures the influence of all words in the sentence.</p></list-item>
            <list-item><p>Max pooling: the element-wise maximum of the token vectors is used instead of the average. This approach gives more weight to the most significant words in the sentence.</p></list-item>
            <list-item><p>CLS token: the BERT model's output contains a [CLS] token as its first value, which encapsulates a summary embedding of the sentence. Using this [CLS] token as the vector representation summarizes the sentence's meaning.</p></list-item>
          </list>
<p>These different methods allow SBERT to create meaningful sentence embeddings, depending on the
specific needs of the task at hand. In our approach, we adopted the mean pooling method to consider
the influence of all words in a sentence.</p>
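<p>As an illustration, the three pooling strategies can be sketched with NumPy. This is a sketch of the arithmetic only, not the sentence-transformers implementation; the token matrix below is random data standing in for BERT's output.</p>

```python
import numpy as np

# Toy stand-in for BERT's output: 5 tokens, each a 768-dimensional vector.
# By convention, row 0 plays the role of the [CLS] token.
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(5, 768))

# Mean pooling: average over all token vectors (the method adopted in this paper).
mean_embedding = token_vectors.mean(axis=0)

# Max pooling: element-wise maximum over the token vectors.
max_embedding = token_vectors.max(axis=0)

# CLS token: take the first output vector as the sentence summary.
cls_embedding = token_vectors[0]

# Each strategy yields a single 768-dimensional sentence vector.
assert mean_embedding.shape == max_embedding.shape == cls_embedding.shape == (768,)
```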
        </sec>
        <sec id="sec-2-1-3">
          <title>2.3. Faiss Library</title>
<p>Faiss (Facebook AI Similarity Search)2 is a library developed by the Facebook AI research team. Faiss
quickly finds similar vectors in large amounts of vectorized data and can also cluster vectorized data. To this end,
Faiss uses vector quantization and indexing technology. Vector quantization maps
high-dimensional vectors into low-dimensional representations, which reduces memory usage and
improves search speed. Indexing makes it possible to find the most similar cluster among the
clusters and to compare a vector only with the vectors in the matched cluster, rather than with
all vectors.</p>
<p>Faiss makes it possible to quickly calculate the similarity between bug reports. It is particularly
useful for detecting duplicate bug reports, where two bug reports share similar textual content, in
a speedy way. Therefore, we used the Faiss library to detect duplicate bug reports, taking
advantage of its capability to quickly find similar documents.</p>
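<p>The two-step search that this indexing enables, first match a cluster and then scan only its members, can be mimicked in a few lines of NumPy. This is a conceptual sketch of the idea, not Faiss's actual implementation; the function name and data layout are our own.</p>

```python
import numpy as np

def clustered_search(query, centroids, clusters):
    """Pick the centroid closest to the query, then scan only that
    cluster's member vectors instead of the whole collection."""
    query = np.asarray(query, dtype=float)
    # Step 1: nearest centroid (a handful of comparisons).
    c = int(np.argmin(np.linalg.norm(np.asarray(centroids) - query, axis=1)))
    # Step 2: nearest vector within the matched cluster only.
    members = np.asarray(clusters[c])
    best = int(np.argmin(np.linalg.norm(members - query, axis=1)))
    return c, members[best]
```

<p>For example, with clusters centred near (0, 0) and (10, 10), a query near (9, 9) is routed to the second cluster and compared against only its members.</p>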
        </sec>
        <sec id="sec-2-1-4">
          <title>2.4. Euclidean Distance</title>
<p>There are various metrics for measuring how similar two documents are, such as cosine
similarity, inner product (IP) similarity, and more. Euclidean distance is a metric that calculates the
distance between two vectors. If the vectors of two documents are denoted as X and Y, the
similarity between the two documents can be calculated with the following formula (1).</p>
<p>d(X, Y) = √( ∑_{i=1}^{k} (x_i − y_i)² )   (where x_i ∈ X, y_i ∈ Y)   (1)</p>
          <p>2 https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/</p>
<p>If the metric yields a small value, the distance between the two vectors is short, which indicates
that the two documents are similar. Conversely, if the metric yields a large value, the distance
between the two vectors is long, which means that the two documents are dissimilar. With the
Euclidean distance, we determined that two documents were similar if the value was below a
specific threshold (14.5).</p>
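<p>Formula (1) and the 14.5 cut-off translate directly into code; a minimal sketch (the function names are ours):</p>

```python
import numpy as np

THRESHOLD = 14.5  # the similarity cut-off used in this paper

def euclidean_distance(x, y):
    # d(X, Y) = sqrt(sum_i (x_i - y_i)^2), as in formula (1).
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

def is_duplicate(x, y):
    # The two documents are judged similar when the distance is at most 14.5.
    return THRESHOLD >= euclidean_distance(x, y)
```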
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
<p>Figure 2 presents an overview of our proposed method. The method consists of the following steps:
data preprocessing, sentence embedding, Faiss database creation, and the search process.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing</title>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sentence Embedding</title>
<p>The preprocessed bug reports are embedded into vectors so that the similarity of the bug reports can
be computed as a value. When embedding bug reports, semantically similar
documents should be mapped close together. For that, we used SBERT, specifically employing the mean pooling method.</p>
        <p>
SBERT uses a Siamese network and a triplet network on top of a fine-tuned BERT model,
making it possible to map documents more accurately according to their meaning and to compute
similarities faster.[
          <xref ref-type="bibr" rid="ref7">7</xref>
] A preprocessed document is provided as input to the fine-tuned BERT. The output consists
of tensors in which each word is represented as a vector. Averaging these output
vectors yields a single tensor that represents the entire sentence.
        </p>
<p>For example, Figure 3 presents the sentence "Specifying thumbnail and metadata location.". This
sentence is tokenized, and each token is converted to a vector. The token vectors are then
averaged, and the averaged vector becomes the embedding vector for the sentence "Specifying
thumbnail and metadata location". If the sentence-transformers library is used, the output tensor
will typically be a 768-dimensional vector. In this way, all collected bug reports are converted to
vectors.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Faiss IndexIDMap</title>
<p>In large open-source projects in which many contributors participate, tens of thousands of bug reports
are generated daily. Therefore, to detect duplicates of a single bug report, the tens of
thousands of existing bug reports must be compared with that report. As a project
accumulates more bug reports, the complexity of this process continues to increase. To address this
complexity, we used the Faiss library. The Faiss library creates an index map by clustering documents
so as to speed up the search for similar documents. The Faiss library also reduces memory usage by
compressing large amounts of data. To calculate the similarity of documents using Faiss, users
need to create an IndexIDMap. As shown in Figure 2, this involves creating an IndexIDMap that pairs
the embedded bug reports with their respective IDs.</p>
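<p>With the Faiss library itself, this step is roughly index = faiss.IndexIDMap(faiss.IndexFlatL2(768)) followed by index.add_with_ids(embeddings, issue_ids). The dependency-free sketch below only illustrates the pairing of embeddings with issue IDs that an IndexIDMap maintains; the class name is ours.</p>

```python
import numpy as np

class TinyIndexIDMap:
    """Minimal stand-in for a Faiss IndexIDMap: stores each embedded
    bug report together with its issue ID."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add_with_ids(self, embeddings, issue_ids):
        # Pair every embedding vector with the corresponding issue ID.
        for vec, issue_id in zip(embeddings, issue_ids):
            self.ids.append(int(issue_id))
            self.vectors.append(np.asarray(vec, dtype=np.float32))
```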
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Faiss Search</title>
        <p>When a new bug report is created, the bug report is embedded by SBERT. The embedded bug report
is then compared to other bug reports in the IndexIDMap.</p>
<p>The K most similar bug reports can be extracted by the Faiss library's search function. This function
returns a list containing the IDs of the bug reports and the distances between the new bug report and
each existing bug report. If the distance is less than a certain threshold (14.5), the existing bug report
and the new bug report are determined to be duplicates.</p>
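<p>The search step can be sketched as a brute-force NumPy equivalent of the search function described above: it returns the K nearest bug reports with their distances and keeps those within the 14.5 threshold. This is an illustrative stand-in, not the Faiss implementation.</p>

```python
import numpy as np

def search(vectors, ids, query, k, threshold=14.5):
    """Return the k nearest bug reports as (distance, issue_id) pairs,
    plus the subset judged to be duplicates of the query."""
    d = np.linalg.norm(np.asarray(vectors, dtype=float) - np.asarray(query, dtype=float), axis=1)
    order = np.argsort(d)[:k]
    nearest = [(float(d[i]), int(ids[i])) for i in order]
    # A neighbour counts as a duplicate when its distance does not exceed the threshold.
    duplicates = [pair for pair in nearest if not pair[0] > threshold]
    return nearest, duplicates
```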
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Set-up</title>
      <sec id="sec-4-1">
        <title>4.1. Research Questions</title>
        <sec id="sec-4-1-1">
<p>In this paper, we ask three research questions.</p>
<p>RQ1. When focusing on duplicate bug report detection with the duplicate-pair dataset, what is the
accuracy of the proposed method?
RQ2. When considering all bug reports with “duplicate” labels, what is the accuracy of the proposed
method for the duplicate bug report detection task?
RQ3. How fast is the search process using the Faiss library?</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Target Data</title>
<p>To conduct our experiments, we initially identified and sorted projects in the GitHub dataset that
had a significant number of “duplicate” labels. We then manually investigated these projects to
determine whether they were practical open-source projects. As a result of this process, we selected the
YouTube-dl project on GitHub for our evaluation.</p>
<p>We then collected bug reports from the YouTube-dl project by web crawling. From the various
attributes of GitHub bug reports, we collected only the title, body, label, and ID attributes. Figure 4
presents an analysis of the bug report data from the YouTube-dl project; the statistics are
visualized as a bar graph. In total, there were 19,855 issues in the dataset.
The issues were categorized into three main groups based on the following criteria:
1. Non-duplicate: bug reports without “duplicate” labels.
2. Duplicate: bug reports with “duplicate” labels assigned by authorized
developers.
3. Duplicate pair: bug reports within the Duplicate set for which the pairs of
duplicate reports can be identified.</p>
<p>In practice, authorized developers manually assign labels to bug reports. In projects with a high
volume of developer participation, thousands of new bug reports can be generated daily, and it becomes
practically impossible for authorized developers to individually verify and label all of them. As a
result, there are often bug reports without “duplicate” labels that are in fact duplicates.
Conversely, there are situations where bug reports receive “duplicate” labels even though they are
not actually duplicates.</p>
<p>A bug report with a “duplicate” label must have an originating source bug report with the same
content. When authorized developers assign a “duplicate” label, they may or may not include the ID
or link of the source bug report in the report's comments. Most duplicate issues do not include
the ID or link of the source issue, so mistakes made by authorized developers in such cases are
difficult to rule out. Therefore, to minimize the impact of such errors, we collected only duplicate bug reports
for which the source bug report could be identified, creating a dataset named ‘Duplicate pair’.</p>
<p>To avoid bias in the test dataset and ensure that the model does not perform exceptionally well due
to such bias, we constructed the test dataset with an equal number of duplicate and non-duplicate
issues. Table 1 shows the number of issues in the two different datasets.</p>
<p>The first test dataset, named ‘Duplicate’, was created by randomly selecting the same number of
non-duplicate issues as there are Duplicate issues and combining the two sets. The second test dataset,
named ‘Duplicate pair’, was generated by randomly extracting as many issues from the non-duplicate
issues as there are in the Duplicate pair set and then combining them with the Duplicate pair
set.</p>
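<p>The construction of both balanced test sets follows the same pattern and can be sketched as follows (a hypothetical helper; the names are ours):</p>

```python
import random

def build_balanced_test_set(duplicate_issues, non_duplicate_issues, seed=0):
    """Combine a duplicate set (e.g. Duplicate or Duplicate pair) with an
    equally sized random sample of non-duplicate issues."""
    rng = random.Random(seed)
    # Sample exactly as many non-duplicates as there are duplicates.
    sampled = rng.sample(non_duplicate_issues, len(duplicate_issues))
    test_set = list(duplicate_issues) + sampled
    rng.shuffle(test_set)
    return test_set
```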
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results &amp; Analysis</title>
<p>In this section, we summarize and answer each of the research
questions one by one. Figure 5 presents four result metrics – F1 score, recall,
precision, and accuracy – for the two datasets. The evaluation results answer RQ1 and
RQ2.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Results for RQ1</title>
        <p>
          The answer to RQ1 can be understood by examining Figure 5. In Figure 5, the green bars represent
the results for the Duplicate pair dataset. In the case of the Duplicate pair dataset, the F1 score is
0.6583. The recall is 0.7128, while the precision is 0.6116. Finally, the accuracy is 0.6301. If we
consider the previous duplicate bug report detection techniques [
          <xref ref-type="bibr" rid="ref4">4</xref>
], the performance of our
proposed method is not high, but it is reasonable given that our method does not
require the cost of training a model.
        </p>
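<p>As a consistency check, the F1 score is the harmonic mean of precision and recall, and the values reported above agree:</p>

```python
# F1 = 2 * P * R / (P + R), using the Duplicate pair values reported above.
precision, recall = 0.6116, 0.7128
f1 = 2 * precision * recall / (precision + recall)
assert round(f1, 4) == 0.6583  # matches the reported F1 score
```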
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Results for RQ2</title>
        <p>
The answer to RQ2 can also be understood by examining Figure 5. In Figure 5, the blue bars
represent the results for the Duplicate dataset, for which the F1 score is 0.6495. These experimental results can
be compared to a previous study [
          <xref ref-type="bibr" rid="ref9">9</xref>
], which achieved an F1 score of 0.329 for “duplicate” labels. This
is the most interesting point, because our approach improves the F1 score on the Duplicate dataset by
approximately 0.3205; that is, the current study achieved F1
scores roughly twice as high as those from the previous research [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
<p>When comparing the results for the Duplicate pair dataset and the Duplicate dataset, we can see
that the Duplicate pair dataset outperforms the Duplicate dataset in every aspect. Specifically, the
F1 score for the Duplicate pair dataset is 0.0088 higher, recall is 0.0159 higher, precision is 0.0034
higher, and accuracy is 0.0062 higher.</p>
<p>These improvements indicate that the Duplicate pair dataset has undergone more stringent data
refinement than the Duplicate dataset. Notably, the recall value exhibits the largest
difference. Since the Duplicate pair dataset eliminates cases where authorized developers wrongly
assigned “duplicate” labels to non-duplicate issues, it is expected that recall would be
significantly higher on the Duplicate pair dataset than on the Duplicate dataset.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Experimental Results for RQ3</title>
<p>Figure 6 presents the time taken to compute the similarity between the bug reports in the test
dataset and the entire dataset and to find the K most similar bug reports. The graph on the left
pertains to the Duplicate dataset, while the one on the right is for the Duplicate pair dataset.</p>
        <p>Table 2 presents the maximum, minimum, average, and variance of the search times. In response
to RQ3, it can be observed from Table 2 that when using the Faiss library, it takes approximately 0.5
seconds to search the entire dataset of 19,855 bug reports. Even in cases where the search time is
longer, it remains within 1.5 seconds. The small variance, approximately 0.02, indicates that the
search time for most data points is around 0.5 seconds.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
<p>We have proposed a new approach for determining whether a new issue report is a duplicate of an
existing one and evaluated its performance. However, we had not qualitatively
analyzed samples from the evaluation. Therefore, we selected one sampled issue report from our
evaluation datasets and investigated how the approach determines whether the issue report is a duplicate.
The selected sample was an issue report of the YouTube-dl project, #21639 "TED support is
broken". Once the issue report is given, our approach takes the title and content of the report as
input and returns the IDs and distance values of the five bug reports that are most similar to the issue.
In our experiments, the model returned [#22317, 0.907], [#21947, 1.022], [#22374, 1.063], [#21390,
1.065], and [#23378, 1.082]. The distance values were calculated as the Euclidean distance between
the two bug reports. When the distance value of a bug report is 14.5 or less, the approach
determines that the new issue is a duplicate bug report. Because the bug report "TED support is
broken" has a set of bug reports with distances not greater than 14.5, it is determined
to be a duplicate. The most similar bug report is #22317 "younow extractor is
broken". With this mechanism, the proposed approach successfully predicts whether a new bug report
is a duplicate.</p>
      <p>
The reason we have proposed a new approach in this paper is that existing approaches that
determine whether two bug reports are duplicates have several limitations. The first group of
approaches, relying on information retrieval (IR) methods [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3, 19, 20</xref>
], were based on the word
similarity of bug reports. However, the paper "Towards Understanding the Impacts of
Textual Dissimilarity on Duplicate Bug Report Detection" [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], published by Sigma Jahan and
Mohammad Masudur Rahman in 2023, claims that there are many duplicate bug reports with word
dissimilarity. The second group of approaches, relying on Machine Learning (ML) methods [
        <xref ref-type="bibr" rid="ref10 ref11 ref4 ref5 ref6">4, 10, 11,
5, 6</xref>
        ], were based on the training data for creating machine learning models. However, the paper
"Duplicate Bug Report Detection: How Far Are We?"[
        <xref ref-type="bibr" rid="ref13">13</xref>
] published by Ting Zhang in 2023 argued that
many methods use biased data (e.g., testing using only old bug reports, collecting data from only
certain bug tracking systems, etc.). Moreover, machine-learning-based approaches take a significant
amount of time to train a model on the data.
      </p>
      <p>
In this paper, we considered these arguments: we used SBERT to capture the semantic context
of issue reports and used data from real GitHub projects to reflect realism. In addition, we
implemented fast duplicate bug report detection using the Faiss library. However, we also observe
the limitations of this paper. First, we have not provided the performance of baseline approaches for
comparison with our proposed approach. Although we certainly observed that our approach
yields a fast search speed based on the clusters and indices of Faiss, we acknowledge that we
have provided no experimental comparison of how fast its similarity search is
relative to other ML methods. Second, the paper "Duplicate Bug Report Detection: How Far Are
We?"[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] reported that the IR-based REP method presented in the paper "Towards More Accurate
Retrieval of Duplicate Bug Reports"[
        <xref ref-type="bibr" rid="ref3">3</xref>
] by Chengnian Sun in 2011 performed the best. However, we
have not compared the performance of our approach with that existing approach, so further
experiments are needed. Third, we assume that duplicate bug reports have a negative impact and
should be detected and marked in advance. However, in 2008, Nicolas Bettenburg published
"Duplicate Bug Reports Considered Harmful . . . Really?" [17], arguing that duplicate bug reports have
a positive impact because they provide additional information, allowing us to learn more about the
bug. From their point of view, it would be better to preserve the additional information by
merging duplicate bug reports rather than disregarding them. Therefore, we need to deliberate
deeply on how to utilize the information in bug reports to facilitate software evolution.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Threats to Validity</title>
      <sec id="sec-7-1">
        <title>7.1. Threats to Internal Validity</title>
<p>Three main internal factors threaten the validity of the experiments in this paper. First,
we arbitrarily set a threshold of 14.5 as the constant that determines how far apart two issues must be
to not be considered duplicates. In other words, when calculating the similarity between a new
issue and existing issues, if the Euclidean distance between the two issues is less than or equal to
14.5, they are considered duplicates. This setting could lead to entirely different results when applied
to different datasets.</p>
        <p>Second, we imposed a maximum limit of 1024 characters on the text of all data. Without this limit,
not only would a significant amount of data be lost in the SBERT process, but also the embedding
speed would slow down considerably.</p>
<p>Third, there is the issue of inaccurate data labeling. Labeling is based solely on the judgment
of the authorized developers who attach “duplicate” labels. As a result, cases in which
authorized developers either mistakenly attach “duplicate” labels to issues that should not have them
or omit them from issues that should be labeled as duplicates are reflected in the dataset.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Threats to External Validity</title>
        <p>In this paper, we conducted a performance analysis on a single open-source project. However, it's
important to note that performance metrics can vary depending on the open-source project and the
dataset used. Therefore, in future work, we plan to increase the number of projects to measure the
metrics for each project and calculate their average values.</p>
        <p>Additionally, we did not include a comparison experiment with existing methods in this study. In
future research, we intend to include such comparisons to provide a more comprehensive analysis.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion &amp; Future Work</title>
      <p>
In this paper, we introduced a novel method for duplicate bug report detection using SBERT and
Faiss. Training time and search-speed delays on large datasets were a significant
concern, which we addressed effectively by leveraging Faiss, achieving search times of approximately
0.5 seconds. Additionally, our proposed method demonstrated a notable performance improvement,
roughly twofold, compared to previous research [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in detecting “duplicate” labels.
      </p>
      <p>However, there are certain limitations and areas for future work. Firstly, since we employed
realworld project bug report data, we couldn't completely eliminate the impact of data noise. Therefore,
further experiments with denoised data are necessary. Secondly, our experiments were conducted
using bug report data from a single project, Youtube-dl. Future research should assess the
applicability of this methodology to various issue datasets. Particularly, it would be essential to use
datasets with a more extensive range of data to effectively evaluate Faiss's performance.
Additionally, exploring alternative similarity metrics beyond Euclidean Distance or implementing
more rigorous data pre- processing steps are directions for future research and improvement.
</p>
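      <p>Regarding alternative similarity metrics: on L2-normalized embeddings, inner-product search (what faiss.IndexFlatIP provides after faiss.normalize_L2) ranks candidates by cosine similarity, and that ranking coincides with ascending Euclidean distance, since |q - x|^2 = 2 - 2 cos(q, x) on unit vectors. The following is a minimal NumPy sketch with random stand-in embeddings, not our implementation.</p>

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows, as faiss.normalize_L2 does in place."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(1)
reports = normalize(rng.standard_normal((5, 4)).astype(np.float32))
query = normalize(reports[2:3] + 0.01)  # near-duplicate of report 2

cos = query @ reports.T                      # what IndexFlatIP would score
l2 = ((query - reports) ** 2).sum(axis=1)    # what IndexFlatL2 would score

best_by_cos = int(np.argmax(cos))
best_by_l2 = int(np.argmin(l2))
print(best_by_cos, best_by_l2)  # both pick report 2, the near-duplicate
```

      <p>Switching metrics is therefore a small change in this design: normalize the embeddings and swap the index type, rather than retraining or re-embedding the reports.</p>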
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hindle</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Onuczko</surname>
          </string-name>
          , “
          <article-title>Preventing duplicate bug reports by continuously querying bug reports”</article-title>
          ,
          <source>Empirical Software Engineering</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Nicholas</given-names>
            <surname>Jalbert</surname>
          </string-name>
          and
          <string-name>
            <given-names>Westley</given-names>
            <surname>Weimer</surname>
          </string-name>
          , “
          <article-title>Automated Duplicate Detection for Bug Tracking Systems”</article-title>
          ,
          <source>2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Chengnian</given-names>
            <surname>Sun</surname>
          </string-name>
          , David Lo, Siau-Cheng Khoo and Jing Jiang, “
          <article-title>Towards More Accurate Retrieval of Duplicate Bug Reports”</article-title>
          ,
          <source>2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jianjun</given-names>
            <surname>He</surname>
          </string-name>
          , Ling Xu, Meng Yan, Xin Xia and Yan Lei, “
          <article-title>Duplicate Bug Report Detection Using Dual-Channel Convolutional Neural Networks”</article-title>
          ,
          <source>ICPC '20: Proceedings of the 28th International Conference on Program Comprehension</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Montassar Ben</given-names>
            <surname>Messaoud</surname>
          </string-name>
          , Asma Miladi, Ilyes Jenhani and Mohamed Wiem Mkaouer, “
          <article-title>Duplicate Bug Report Detection Using an Attention-Based Neural Language Model”</article-title>
          ,
          <source>IEEE Transactions on Reliability (Volume 72, Issue 2, June 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Lahari</given-names>
            <surname>Poddar</surname>
          </string-name>
          , Leonardo Neves,
          <string-name>
            <given-names>William</given-names>
            <surname>Brendel</surname>
          </string-name>
          , Luis Marujo, Sergey Tulyakov and Pradeep Karuturi, “
          <article-title>Train One Get One Free: Partially Supervised Neural Network for Bug Report Duplicate Detection and Clustering”</article-title>
          ,
          <source>arXiv preprint arXiv:1903.12431</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and Iryna Gurevych, “
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”</article-title>
          ,
          <source>arXiv preprint arXiv:1908.10084</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , “
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”</article-title>
          ,
          <source>arXiv preprint arXiv …</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jueun</given-names>
            <surname>Heo</surname>
          </string-name>
          and Seonah Lee, “
          <article-title>An Empirical Study on the Performance of Individual Issue Label Prediction”</article-title>
          ,
          <source>2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Irving Muller</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          , Daniel Aloise, Eraldo Rezende Fernandes and Michel Dagenais, “
          <article-title>A Soft Alignment Model for Bug Deduplication”</article-title>
          ,
          <source>2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jayati</given-names>
            <surname>Deshmukh</surname>
          </string-name>
          , Annervaz K M,
          <string-name>
            <given-names>Sanjay</given-names>
            <surname>Podder</surname>
          </string-name>
          , Shubhashis Sengupta and Neville Dubash, “
          <article-title>Towards Accurate Duplicate Bug Retrieval using Deep Learning Techniques”</article-title>
          ,
          <source>2017 IEEE International Conference on Software Maintenance and Evolution (ICSME)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Guanping</given-names>
            <surname>Xiao</surname>
          </string-name>
          , Xiaoting Du, Yulei Sui and Tao Yue, “
          <article-title>HINDBR: Heterogeneous Information Network Based Duplicate Bug Report Prediction”</article-title>
          ,
          <source>2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ting</given-names>
            <surname>Zhang</surname>
          </string-name>
          , DongGyun Han, Venkatesh Vinayakarao, Ivana Clairine Irsan, Bowen Xu, Ferdian Thung, David Lo and Lingxiao Jiang, “
          <article-title>Duplicate Bug Report Detection: How Far Are We?”</article-title>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Avinash</given-names>
            <surname>Patil</surname>
          </string-name>
          , Kihwan Han and Sabyasachi Mukhopadhyay, “
          <article-title>A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports”</article-title>
          ,
          <source>arXiv preprint arXiv:2308.09193</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Sigma</given-names>
            <surname>Jahan</surname>
          </string-name>
          and Mohammad Masudur Rahman, “
          <article-title>Towards Understanding the Impacts of Textual Dissimilarity on Duplicate Bug Report Detection”</article-title>
          ,
          <source>2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Xiaoxue</surname>
            <given-names>Wu</given-names>
          </string-name>
          , Wenjing Shan, Wei Zheng, Zhiguo Chen,
          <source>Tao Ren and Xiaobing Sun, “An Intelligent</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>