<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Extending Monolingual Asymmetric Semantic Search Models for Multilingual Query Processing using Knowledge Distillation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iryna Yurchuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danylo Boiko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Volodymyrska Street, 60, Kyiv, 01033</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn fn-type="equal">
          <p>These authors contributed equally.</p>
        </fn>
        <corresp>Corresponding author. i.a.yurchuk@gmail.com (I. Yurchuk); danylo.boiko@knu.ua (D. Boiko)</corresp>
      </author-notes>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Semantic search is a key task in today's world, where the amount of data is growing rapidly. This work focuses on cases where long answers must be found for a short query (known as asymmetric search). A teacher model with a vocabulary of 30,522 tokens and a student model with a vocabulary of 119,547 tokens serve as the basis for training a multilingual asymmetric semantic search model using multilingual knowledge distillation. The authors used the reciprocal rank (RR), the mean average precision (MAP), and the normalized discounted cumulative gain (NDCG) to evaluate the obtained model.</p>
      </abstract>
      <kwd-group>
        <kwd>asymmetric semantic search</kwd>
        <kwd>multilingual embedding</kwd>
        <kwd>knowledge distillation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Semantic search is a set of search algorithms that work based on understanding the meaning of text.
This approach effectively handles synonyms, abbreviations, and spelling errors, unlike keyword
search engines that rely on exact lexical matches to find documents. It is useful for grading and
assessing academic work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], integrating search functionalities on e-commerce [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], information
retrieval in the petrochemical sector using a fusion of video transcript data with other data sources
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], helping Human Resources employees to target relevant people for their events and trainings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
performing social search [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], IoT systems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and other fields [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Ontologies and knowledge graphs are classical approaches to the design and implementation of
semantic search systems, but recently these approaches have been enriched or replaced by statistical
algorithms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and AI techniques [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for query expansion, generating query-specific training data, semi-supervised learning, etc.
      </p>
      <p>This work aims to study the possibility of using the bi-encoder architecture to implement asymmetric semantic search when storage contains data indexed in English and information must be searched in Ukrainian.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Algorithms</title>
      <p>The intention behind semantic search is to map passages into a multidimensional vector space. During search, the query is embedded into the same vector space, which allows the system to identify the desired number of most relevant matches. This approach ensures that even when the wording differs but the meaning remains the same, the system can provide accurate retrievals (Figure 1).</p>
      <p>There are two main types of semantic search that work with different types of data. For symmetric
search, queries and passages in the corpus have approximately the same length and content. In turn,
asymmetric search typically uses short queries (e.g., a question or a few keywords) and longer
passages answering those queries.</p>
      <p>
        Nowadays, a lot of trained asymmetric semantic search models based on both bi-encoder and
cross-encoder architectures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are available for English. Bi-encoder models are efficient for large-scale retrieval due to their ability to use precomputed passage embeddings from storage, making them ideal for speed-critical tasks, while cross-encoder models achieve higher accuracy by directly modeling the similarity between a query and a passage, which suits accuracy-sensitive scenarios such as reranking, where computational cost is less of a concern.
      </p>
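      <p>As an illustrative sketch (not part of the paper's implementation), the bi-encoder flow with precomputed passage embeddings can look as follows; the toy 3-dimensional vectors and passage identifiers are hypothetical stand-ins for real model outputs:</p>

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity: compares direction only, magnitude is normalized away.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Passage embeddings are computed once and kept in storage
# (toy 3-dimensional vectors standing in for 768-dimensional outputs).
passage_index = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.2],
    "p3": [0.4, 0.4, 0.8],
}

def search(query_embedding, index, top_k=2):
    # At query time only the query is encoded; passage vectors are reused.
    scored = [(pid, cosine(query_embedding, vec)) for pid, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

print(search([0.85, 0.15, 0.05], passage_index))
```

      <p>Because the passage vectors are precomputed, query latency depends only on encoding one query and scoring it against storage, which is what makes bi-encoders suitable for speed-critical retrieval.</p>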
      <p>
        To use asymmetric semantic search for less common languages or even a mix of them, we need
to train new models, which will require a lot of data and computational power. Fortunately, there is
a way to facilitate this using multilingual knowledge distillation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (Figure 2).
      </p>
      <p>This approach requires a teacher model for the source language and a set of pairs (each pair
includes a sentence in the source language and its translation). A new student model attempts to
approximate the output of the teacher model for both the source and target sentences using the mean squared error (MSE) loss. The student model could have the structure and weights of the teacher model, or it could be a different network architecture, since the student model learns the representations of the teacher model. This allows the student model to achieve robust generalization across languages.</p>
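      <p>A minimal sketch of this training objective, with hypothetical toy vectors standing in for the teacher's and student's embeddings (the numbers are illustrative only):</p>

```python
def mse(u, v):
    # Mean squared error between two embedding vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

# Hypothetical embeddings for one training pair (all values are toy numbers):
teacher_source = [0.2, 0.7, 0.1]     # teacher("Hello world")
student_source = [0.25, 0.65, 0.12]  # student("Hello world")
student_target = [0.3, 0.6, 0.15]    # student("Привіт, світе"), the Ukrainian translation

# The distillation loss pulls BOTH student outputs toward the
# teacher's embedding of the source-language sentence.
loss = mse(teacher_source, student_source) + mse(teacher_source, student_target)
print(round(loss, 4))
```

      <p>Minimizing this loss over many translation pairs forces identical sentences in both languages to land near the teacher's vector, which is exactly the property the search index relies on.</p>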
      <p>It is critical that identical sentences in different languages have similar vector representations.
That is why the vector space properties of the source language, obtained from the teacher model,
must be applied to other languages.</p>
      <p>From the information above, it is evident that the input parameters of semantic search models are queries and passages. In the field of multilingual search, there are three main paradigms:</p>
      <list list-type="bullet">
        <list-item>
          <p>Multilingual-to-monolingual: this approach accepts queries in multiple languages and compares them with passages in a single language.</p>
        </list-item>
        <list-item>
          <p>Monolingual-to-multilingual: conversely, the idea behind this paradigm is to accept queries in a single language and return passages in multiple languages.</p>
        </list-item>
        <list-item>
          <p>Multilingual-to-multilingual: provides the highest search adaptability and allows users to formulate queries and peruse passages across multiple languages. However, this flexibility may reduce the average accuracy due to the large number of possible language pairs.</p>
        </list-item>
      </list>
      <p>This paper explores the scenario where a storage contains data indexed in English and it becomes necessary to search for information in other languages, particularly Ukrainian. We will take the original model (data in the index must be produced by this model), trained for asymmetric semantic search using the bi-encoder architecture, and add capabilities to handle multilingual queries (in our case, bilingual, i.e., English and Ukrainian).</p>
      <p>Using knowledge distillation will significantly reduce the training time since we will mimic the
teacher model on translated queries. To avoid unnecessary training on passages and to ensure
consistency between previously indexed and new documents, we will continue to use the teacher
model to create embeddings for new passages. The output of both models can be evaluated using a
similarity metric to find the most relevant passages (Figure 3).</p>
      <p>To determine how similar two vectors are, we can use different metrics: cosine similarity, Euclidean distance, and dot product. Measuring Euclidean distance for high-dimensional vectors becomes impractical, as vectors tend to be very far apart simply because of the vastness of the space they inhabit. It is therefore preferable to use cosine similarity, which measures the angle between two vectors by paying attention to their direction and ignoring magnitude, or dot product, which measures the overall congruence of direction and magnitude.</p>
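      <p>These metrics can be written down directly; below is an illustrative sketch (not from the paper) using small toy vectors in place of 768-dimensional embeddings:</p>

```python
from math import sqrt

def dot(u, v):
    # Overall congruence: both direction and magnitude matter.
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    # Straight-line distance; becomes less informative in high dimensions.
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # Angle between the vectors: magnitude is normalized away.
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))

u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]  # same direction as u, twice the length
print(cosine(u, v))     # 1.0 (identical direction)
print(dot(u, v))        # 18.0 (grows with magnitude)
print(euclidean(u, v))  # 3.0
```

      <p>The example makes the difference visible: scaling a vector leaves cosine similarity unchanged but doubles the dot product, which is why the two SBERT model families behave differently with respect to passage length.</p>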
      <p>In practice, we can generate sentence embeddings using specially designed modifications of the Bidirectional Encoder Representations from Transformers (BERT) model, known as Sentence Transformers (SBERT). SBERT provides two types of state-of-the-art asymmetric semantic search models: one tuned for cosine similarity and the other for dot product. Cosine-similarity-tuned models prefer to retrieve shorter passages, while dot-product-tuned models prefer to retrieve longer passages.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <sec id="sec-3-1">
        <title>3.1. MS MARCO</title>
        <p>
          Microsoft Machine Reading Comprehension (MS MARCO) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is a large-scale dataset created by Microsoft for training and evaluating information retrieval systems. It is widely used to benchmark the performance of various models on tasks such as reading comprehension, question answering, and passage ranking.
        </p>
        <p>The dataset comprises over a million anonymized questions: a collection of short, real-world, natural language queries split into the train (808,731 queries), development (101,093 queries), and evaluation (101,092 queries) subsets. We will use the train and development subsets to train and evaluate the student model, respectively, so it makes sense to look at their length and word patterns (Figure 4).</p>
        <p>Furthermore, the dataset contains 8,841,823 passages that are required to provide natural language answers (a question may have multiple answers or no answers at all). Although passages are not involved in the knowledge distillation for multilingual-to-monolingual models, it still seems reasonable to analyze them to figure out what the teacher model was trained on (Figure 5).</p>
        <p>The length distribution of the combined queries from the train and development subsets is markedly skewed: the majority are quite brief, typically under 100 characters, and predominantly concentrated between 15-25 characters. The frequency drops significantly for longer queries, with very few extending beyond 100 characters. The most common bigrams (two-word sequences) indicate that the majority of queries are fact-oriented. These patterns reveal that a significant number of queries are short factual questions.</p>
        <p>The length distribution of the passages is right-skewed, with a noticeable peak around 200-300 characters. The character type distribution shows that letters dominate, with a median of about 250 characters per passage and a wide interquartile range reflecting considerable variation. Digits and punctuation marks are used less frequently, but there are outliers indicating some passages with a large number of such characters.</p>
        <p>
          The scale and real-world nature of the dataset make it attractive for training and evaluating machine learning models, but the original MS MARCO contains only English search queries. To train and evaluate models for other languages, the dataset must be translated in one of the available ways. For Ukrainian, the OPUS-MT English to East Slavic neural machine translation model [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] demonstrated high-quality results in a relatively short time.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. TREC 19</title>
        <p>
          In 2019, the National Institute of Standards and Technology (NIST), in collaboration with Microsoft, organized the Text Retrieval Conference (TREC) deep learning track [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] benchmark competition. This event aimed to foster research in the information retrieval direction using deep learning techniques. As the official evaluation dataset, the organizers provided a list of 200 queries and a pool of documents and passages, labeled by NIST assessors using multi-graded judgments.
        </p>
        <p>Although the deliberate selection of the queries intuitively implies high-quality data and a lack of outliers, it still seems reasonable to check the length and word patterns directly (Figure 6).</p>
        <p>The length distribution peaks at 20-25 characters and is right-skewed, indicating that most queries are concise, with fewer longer queries beyond 60 characters. The most frequent bigrams begin with common phrases such as "what," "definition," and "how," suggesting that queries are structured as questions, often seeking definitions or factual information.</p>
        <p>Looking at the queries from both datasets, it is clear that TREC 19 queries are generally shorter and more focused on concrete evidence, while MS MARCO queries can be significantly longer and more complex. It is also important to note that TREC 19 does not have outliers, unlike MS MARCO, where some queries can exceed 400 characters.</p>
        <p>These days, TREC 19 has become a recognized benchmark for various information retrieval and deep learning tasks such as document and passage retrieval. In the full search task, we can evaluate up to 1,000 passages for each query based on their estimated likelihood of containing the answer.</p>
        <p>To evaluate the trained model, all 200 queries were manually translated into Ukrainian by a native speaker, which guarantees the veracity of the results.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Training and Evaluation</title>
      <p>
        We will use the DistilBERT base multilingual (cased) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] as a student model. This model was trained on the concatenation of Wikipedia in 104 different languages (including English and Ukrainian); it has 6 layers, 768 dimensions, and 12 heads, totaling 134 million parameters (compared to 177 million parameters for the BERT base multilingual).
      </p>
      <p>
        The MS MARCO DistilBERT base v4 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] was chosen as a teacher model. It embeds text into a
768-dimensional vector space and can be used for clustering and semantic search. This model was
fine-tuned on the original MS MARCO passage ranking dataset and optimized to generate
embeddings for queries and passages.
      </p>
      <p>We trained the student model for 5 epochs with a batch size of 24, 10,000 warm-up steps, and a learning rate of 2e-5. The entire training process on the train and development (used for intermediate evaluation) subsets took about 8 hours using the Apple M3 Max chip (16-core CPU, 40-core GPU). To measure the difference between computed and target query embeddings, we used the MSE loss (Figure 7).</p>
      <p>The teacher model has a vocabulary of 30,522 tokens, the same as the original BERT base. This vocabulary includes common words, sub-words, and special tokens to deal with English. The student model, on the other hand, is a distilled version of BERT multilingual. Its extended vocabulary of 119,547 tokens covers symbols from many different languages, allowing it to efficiently process and understand text in different linguistic contexts.</p>
      <p>Any dataset for semantic search or information retrieval systems has selection bias. It can be
related to the data source, the date of publication, and the personal preferences of the publisher.
Achieving high performance on cross-lingual tasks depends on the ability to seamlessly map
sentences from different languages into a single vector space. The similarity between alphabets also
significantly affects accuracy. Languages with similar alphabets, such as English and German,
typically produce more accurate results than languages with dissimilar alphabets, such as English
and Ukrainian.</p>
      <p>Using the original English and machine-translated Ukrainian search queries, we achieved a
reasonable level of accuracy (about 93%) for English to Ukrainian and Ukrainian to English
translation tasks, indicating a high level of proficiency in both directions (Figure 8).</p>
      <p>It is very important to monitor performance on the subset that did not participate in training to
avoid overfitting. MSE on the evaluation subset steadily decreases, indicating that the student model
learns to more accurately mimic the results of the teacher model (Figure 9).</p>
      <p>Evaluating the performance of information retrieval systems is a critical step in improving their
efficiency. NIST assessors labeled the TREC 19 using multi-graded judgments, making it easy to
measure all the necessary and widely used state-of-the-art metrics.</p>
      <p>Reciprocal rank (RR) calculates the score for the first relevant passage in the ranked list. This metric is very important when we need to evaluate the position of the most relevant passage:</p>
      <p>RR = 1 / rel, (1)</p>
      <p>where rel is the rank of the first relevant passage in the list.</p>
      <p>Mean average precision (MAP) evaluates both the relevance of the suggested passages and the position of the most relevant passages at the top. For each query, the average precision (AP) is determined by calculating the arithmetic mean of the precision scores at every position where a relevant passage was found. This metric focuses on the ability to distinguish relevant and irrelevant items:</p>
      <p>MAP = (1 / |Q|) ∑ AP(q), summing over all queries q, (2)</p>
      <p>where |Q| is the total number of queries.</p>
      <p>Normalized discounted cumulative gain (NDCG) measures the ability of machine learning algorithms to sort passages by relevance. NDCG is determined by dividing the discounted cumulative gain (DCG) by the ideal DCG (IDCG), representing the best possible ranking:</p>
      <p>NDCG@k = DCG@k / IDCG@k, (3)</p>
      <p>where k is the number of items considered in the calculation.</p>
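      <p>For reference, the three metrics can be implemented in a few lines; this is an illustrative sketch that assumes binary relevance labels rather than the multi-graded TREC judgments (MAP is then the mean of the per-query AP values):</p>

```python
from math import log2

def reciprocal_rank(relevances):
    # relevances: 1 = relevant, 0 = not relevant, in ranked order.
    for i, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def average_precision(relevances):
    # Mean of precision@i over the positions i that hold a relevant item.
    hits, precisions = 0, []
    for i, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def ndcg(relevances, k):
    # DCG@k normalized by the DCG of the ideal (sorted) ranking.
    def dcg(rels):
        return sum(r / log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

ranking = [0, 1, 1, 0, 1]  # toy ranked list: relevant items at ranks 2, 3, 5
print(reciprocal_rank(ranking))  # 0.5
print(round(average_precision(ranking), 3))
print(round(ndcg(ranking, 5), 3))
```

      <p>Note how the toy ranking is punished differently by each metric: RR looks only at rank 2, AP averages the precision at ranks 2, 3, and 5, and NDCG discounts the late hit at rank 5 logarithmically.</p>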
      <p>Both MAP and NDCG reflect ranking quality, but account for rank reduction in different ways.
MAP gives more weight to relevant passages at the top of the list because this metric is based on
precision, making it more sensitive to changes in early positions. DCG assigns decreasing weights to passages further down the ranking, but the weights decay logarithmically and therefore reduce the contribution of lower-ranked passages relatively slowly.</p>
      <p>In the case of models based on the bi-encoder architecture, it makes sense to evaluate up to 100 passages using the above metrics (Table 1). If the obtained performance is not sufficient, a reranking model based on the cross-encoder architecture can be used to improve the results.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Performance of the original and trained model on TREC 19</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>RR</th></tr>
          </thead>
          <tbody>
            <tr><td>msmarco-distilbert-base-v4</td><td>0.96</td></tr>
            <tr><td>msmarco-distilbert-multilingual-en-uk</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The bilingual model trained using knowledge distillation inherited the capabilities of the monolingual teacher model and shows reasonable results for both languages. The obtained 768-dimensional vectors are optimized to work with cosine similarity, as expected.</p>
      <p>We can notice a slight decrease in performance for English, which is not critical since we gained the ability to search passages using multiple languages. A higher score does not necessarily mean higher performance in production: at some point, models can become too specialized on MS MARCO and its selection bias.</p>
      <p>It is impossible to create a perfect dataset for semantic search. Manual creation of a subset even for evaluation is quite expensive and unfortunately always has some selection bias. This is a long-recognized problem, but there is no good solution for it, especially at the scale of millions of queries and passages.</p>
      <p>The number of languages may not be limited to two, but it is better to add only justified languages and keep a balance between performance and multilingual search capabilities. Otherwise, adding new languages degrades performance because the capacity of the model remains the same.</p>
      <p>The performance of the trained model was also affected by the quality of the data, which in our case is a combination of original and machine-translated queries. Cloud solutions such as Google Translate or DeepL could marginally improve the results but would not compare to manual translation by native speakers. The time taken to create pairs using the neural machine translation model was about 50% longer than the time required to train the student model. Intuitively, this makes the knowledge distillation less of a training task and more of a translation task.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>For multilingual processing of asymmetric semantic search queries, the DistilBERT base multilingual (cased) model as a student and the MS MARCO DistilBERT base v4 model as a teacher can be used as components in the scenario where storage contains data indexed in English and information must be searched in Ukrainian; the resulting model was evaluated using the reciprocal rank (RR), the mean average precision (MAP), and the normalized discounted cumulative gain (NDCG).</p>
      <p>The authors achieved a reasonable level of accuracy (about 93%) for English to Ukrainian and Ukrainian to English translation tasks, indicating a high level of proficiency in both directions. In the future, this result can be improved through higher-quality data, such as a combination of original queries and manual translations by native speakers. The resulting model is useful in industries such as finance, healthcare, and e-commerce, where huge data sets are prevalent and asymmetric semantic search plays a key role in quickly retrieving relevant information.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kgosietsile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. U.</given-names>
            <surname>Okike</surname>
          </string-name>
          ,
          <article-title>An Intelligent Semantic Vector Search Model for Grading and Assessing Students</article-title>
          , in: 2023
          <source>International Conference on Sustainable Technology and Engineering (i-COSTE)</source>
          , Nadi, Fiji,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/i-COSTE60462.2023.10500811.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shirol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>Semantic Search for Sustainable Platforms Using Transformers</article-title>
          , in: 2023
          <source>International Conference on Emerging Techniques in Computational Intelligence (ICETCI)</source>
          , Hyderabad, India,
          <year>2023</year>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>118</lpage>
          . doi:10.1109/ICETCI58599.2023.10331079.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Aamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sherafgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Arbab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jamil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. N.</given-names>
            <surname>Bhatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <article-title>Deep Learning-based Semantic Search Techniques for Enhancing Product Matching in E-commerce</article-title>
          ,
          <source>in: 2024 IEEE 3rd International Conference on Computing and Machine Intelligence</source>
          (ICMI), Mt Pleasant, MI, USA,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . doi:10.1109/ICMI60790.2024.10586148.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Saikia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahapatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nandy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Unveiling Deeper Petrochemical Insights: Navigating Contextual Question Answering with the Power of Semantic Search and LLM Fine-Tuning</article-title>
          , in: 2023
          <source>International Conference on Computing, Communication, and Intelligent Systems (ICCCIS)</source>
          ,
          <source>Greater Noida, India</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>881</fpage>
          -
          <lpage>886</lpage>
          . doi:10.1109/ICCCIS60361.2023.10425564.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , <article-title>Employee Data</article-title>, in: 2021
          <source>International Conference on Computational Intelligence and Computing Applications (ICCICA)</source>
          , Nagpur, India,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi:10.1109/ICCICA52458.2021.9697114.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sindhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shamsi</surname>
          </string-name>
          ,
          <article-title>Semantic Social Searching-An Ontology Based Approach</article-title>
          , in: 2023
          <source>International Multi-disciplinary Conference in Emerging Research Trends (IMCERT)</source>
          , Karachi, Pakistan,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi:10.1109/IMCERT57083.2023.10075145.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Beliatis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Presser</surname>
          </string-name>
          ,
          <article-title>Semantic Search System For Real Time Occupancy</article-title>
          , in: 2021
          <source>IEEE International Conference on Internet of Things and Intelligence Systems (IoTaIS)</source>
          , Bandung, Indonesia,
          <year>2021</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>55</lpage>
          . doi:10.1109/IoTaIS53735.2021.9628719.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>An Analysis of the Technical Trend of Semantic Search in Natural Language Processing</article-title>
          ,
          <source>in: 2023 9th Annual International Conference on Network and Information Systems for Computers (ICNISC)</source>
          , Wuhan, China,
          <year>2023</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>53</lpage>
          . doi:10.1109/ICNISC60562.2023.00033.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Md</surname>
          </string-name>
          ,
          <article-title>Blending Weighted TF-IDF &amp; BERT for Improving Semantic Search</article-title>
          ,
          <source>in: 2022 2nd International Conference on Advanced Research in Computing (ICARC)</source>
          , Belihuloya, Sri Lanka,
          <year>2022</year>
          , pp.
          <fpage>154</fpage>
          -
          <lpage>159</lpage>
          . doi:10.1109/ICARC54489.2022.9753875.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>V.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dhabliya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <article-title>Ameliorating Semantic Search Through Advanced AI Techniques</article-title>
          ,
          <source>in: 2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON)</source>
          , Bangalore, India,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/SMARTGENCON60755.2023.10442780.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>FBC: Fusing Bi-Encoder and Cross-Encoder for Long-Form Text Matching</article-title>
          ,
          <source>in: Frontiers in Artificial Intelligence and Applications</source>
          , Krakow, Poland,
          <year>2023</year>
          , pp.
          <fpage>1473</fpage>
          -
          <lpage>1480</lpage>
          . doi:10.3233/faia230426.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4512</fpage>
          -
          <lpage>4525</lpage>
          . doi:10.18653/v1/2020.emnlp-main.365.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <source>MS MARCO</source>
          . URL: https://microsoft.github.io/msmarco.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thottingal</surname>
          </string-name>
          ,
          <article-title>OPUS-MT – Building open translation services for the World</article-title>
          ,
          <source>in: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation</source>
          , Lisboa, Portugal,
          <year>2020</year>
          , pp.
          <fpage>479</fpage>
          -
          <lpage>480</lpage>
          . URL: https://aclanthology.org/2020.eamt-1.61.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime</article-title>
          ,
          <source>in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2369</fpage>
          -
          <lpage>2375</lpage>
          . doi:10.1145/3404835.3463249.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <year>2019</year>
          . doi:10.48550/arXiv.1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3973</fpage>
          -
          <lpage>3983</lpage>
          . doi:10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>