<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>STQ-UA: A dataset of synthetic and translated search queries for the Ukrainian language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Danylo Boiko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazar Kohut</string-name>
          <email>nazar.kohut.mknssh.2024@lpnu.ua</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viktoriia Mishkurova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleh Basystiuk</string-name>
          <email>oleh.a.basystiuk@lpnu.ua</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bogomolets National Medical University</institution>
          ,
          <addr-line>Beresteiskyi Avenue, 34, Kyiv, 03057</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Innoloft Inc.</institution>
          ,
          <addr-line>701 Brazos Street, Austin, TX 78701</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepana Bandery Street, 12, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>WDA'26: International Workshop on Data Analytics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This paper introduces a novel dataset of 100,000 search queries specifically compiled for the Ukrainian language. Given the scarcity of such resources, the dataset was created using a dual approach: synthetic generation and machine translation. To generate authentic-sounding queries, we used zero-shot and three-shot prompting techniques with eight distinct state-of-the-art closed-source large language models (LLMs) from five leading providers: OpenAI, Google, Cohere, Anthropic, and Mistral AI. These providers have headquarters in the USA, Canada, and France, which are located on two continents, thereby adding a layer of geographical and potentially cultural diversity to the dataset. To accurately reflect realistic search intent and phrasing, we also used the same suite of models to translate a substantial set of anonymized real-world English search queries taken from two major search engines: Google and Bing. The resulting dataset provides a high-quality resource essential for training, evaluating, and fine-tuning models in a wide range of tasks, including information retrieval, query understanding, relevance ranking, and related search challenges within the Ukrainian context.</p>
      </abstract>
      <kwd-group>
        <kwd>Search queries</kwd>
        <kwd>synthetic generation</kwd>
        <kwd>machine translation</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The efficiency of modern search engines and information retrieval systems heavily depends on their
ability to accurately understand and process user queries. To achieve this, advanced algorithms analyze
linguistic patterns and semantic structures to capture the essence of each query. Training, evaluating, and fine-tuning
the underlying models require extensive, high-quality datasets that reflect real-world search behaviors.
These resources allow models to learn the correlation between query intent and relevant content,
ensuring that results meet user expectations.</p>
      <p>
        Nowadays, models for widely spoken languages like English demonstrate the best performance and
dominate on the global stage, while many other languages face a significant data gap, which leads
to a spread of low-quality models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular, the Ukrainian language has long faced a
scarcity of resources, including educational materials, linguistic research, digital tools,
and cultural initiatives.
      </p>
      <p>
        Unfortunately, all generative models, including LLMs, have their biases due to the data they are
trained on, which can lead to outputs that are systematically prejudiced, unfair, or skewed against
certain groups or viewpoints [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A strategy that emphasizes complementary diversity of models can
address this fundamental problem and help us achieve more balanced and less biased outcomes in the
dataset.
      </p>
      <p>
        We carefully selected a suite of eight state-of-the-art models from five leading providers, including
OpenAI’s GPT-4o and GPT-4o Mini, Google’s Gemini 1.5 Flash and Gemini 2.0 Flash, Cohere’s Command
A and Command R+, Anthropic’s Claude 3.5 Haiku, and Mistral AI’s Mistral Large. These providers
have headquarters located in three countries (USA, Canada, and France) spread across two continents
(North America and Europe), adding a layer of geographical and potentially cultural diversity to the
dataset. In addition to improving robustness, this enables a thorough investigation of how various models
interpret and react to different prompts, which eventually promotes a deeper comprehension of behavior
in different cultural contexts [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Historically, large-scale search query datasets have relied either on logs released by major search
engine providers or on data collected through specialized academic or commercial efforts. Here are the
most notable and widely used English datasets, representing a wealth of real-world queries and search
interactions collected from well-known search engines:
• MS MARCO [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a large-scale dataset designed by Microsoft for machine reading comprehension
and information retrieval tasks. It comprises 1,010,916 anonymized questions extracted directly
from Bing’s search logs, offering a valuable collection of concise, real-world, natural language
queries.
• Natural Questions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a question-answering dataset developed by Google Research, consists of
real, anonymized, and aggregated queries issued to the Google search engine and corresponding
answers. The public release includes 307,373 training samples with single annotations, 7,830
samples with 5-way annotations for development data, and a further 7,842 samples with 5-way
annotations as test data.
• MIMICS [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a collection of search clarification datasets created by Microsoft for research on
conversational information seeking systems. It was built from real search queries sampled from
Bing’s query logs, where each data sample includes a clarifying question and up to five candidate
answers intended to refine the original query. The total collection includes 3 datasets, comprising
more than 450,000 unique queries.
      </p>
      <p>
        To meet diverse needs, there are a few datasets available for a range of Ukrainian natural language
processing (NLP) tasks. For example, large corpora such as UberText 2.0 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and CC-100 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] derived
from web crawls serve as the basis for pre-training LLMs. The BRUK corpus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] offers genre-balanced
samples that can be used for more structured linguistic analysis or model training on different text styles.
Furthermore, there are enough datasets for less common tasks: Djinni Recruitment [10] focuses on IT
recruitment, UA-GEC [11] provides annotated text for grammatical error correction, ParaRook||DE-UK
[12] serves as a parallel German-Ukrainian corpus for machine translation, etc.
      </p>
      <p>In turn, the landscape of publicly available search query datasets for the Ukrainian language is
significantly more limited than for other Ukrainian NLP tasks and pales in comparison to the millions
of real-world queries available for English. As a partial workaround, one can use questions from the
UA-SQuAD dataset [13], a translation of part of the original SQuAD 2.0 [14] that consists of
13,859 samples.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Synthetic generation</title>
      <p>Synthetic data generation [15, 16] is a widely used approach for creating artificial data that mimics the
statistical properties and patterns of real-world resources. This technique is especially valuable in our
case because getting real data is impossible without access to search providers.</p>
      <p>To control the randomness and diversity of the content produced by LLMs, it is crucial to use the
temperature and top-p parameters [17]. The temperature parameter affects the probability distribution
of the model’s predictions. It essentially controls how “creative” or “conservative” the model’s outputs
will be. Top-p sampling, also known as nucleus sampling, limits the selection to a subset of words
whose cumulative probability exceeds a given threshold instead of selecting from the entire vocabulary.
Balancing these settings effectively allows for a tailored interaction with models, whether for creative
writing or providing informative content.</p>
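      <p>The interaction of these two parameters can be illustrated with a minimal stand-alone sketch (a simplified illustration, not the providers’ actual samplers): temperature scaling is applied first, and then sampling is restricted to the nucleus of most probable tokens.</p>
      <preformat>
```python
import math

def nucleus_distribution(token_probs, temperature=1.0, top_p=0.8):
    """Apply temperature scaling, then keep only the top-p nucleus.

    token_probs maps each candidate token to its probability.
    A simplified illustration of the sampling parameters described above.
    """
    # Temperature scaling: divide log-probabilities by T and renormalize.
    # T below 1 sharpens the distribution (conservative); above 1 flattens it (creative).
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in token_probs.items()}
    total = sum(scaled.values())
    scaled = {t: p / total for t, p in scaled.items()}

    # Nucleus (top-p) sampling: take the most probable tokens until their
    # cumulative probability first reaches top_p, then renormalize.
    nucleus, cumulative = {}, 0.0
    for token, p in sorted(scaled.items(), key=lambda kv: kv[1], reverse=True):
        nucleus[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(nucleus.values())
    return {t: p / total for t, p in nucleus.items()}
```
      </preformat>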
      <p>To generate a subset of synthetic search queries, we used a temperature of 0.85 to balance creativity
and variance, and a top-p (nucleus sampling) value of 0.8 to maintain relevance and consistency. Beyond
these direct parameter adjustments, we also indirectly influenced the models using both zero-shot and
three-shot techniques. This allowed us to explore different prompting strategies to control
the characteristics of the generated queries.</p>
      <p>We used zero-shot prompting, which involves giving the model common generation instruction
without providing any examples, to directly create 25,000 search queries. This approach allowed us to
consistently guide the generation process based solely on the knowledge embedded in the parameters
of the models.</p>
      <p>By providing a few proper examples, models can better understand the desired style of content,
resulting in more accurate and contextually relevant synthetic data. To generate another batch of 25,000
queries using the three-shot prompting, we combined the common generation instruction with three
examples in the Ukrainian language.</p>
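      <p>The mechanical difference between the two regimes can be sketched as follows; the instruction and examples are hypothetical placeholders, since the paper’s actual prompts are not reproduced here.</p>
      <preformat>
```python
# Hypothetical instruction and examples for illustration only.
INSTRUCTION = "Generate 10 realistic Ukrainian search queries, one per line."

EXAMPLES = (
    "як приготувати борщ",
    "погода у львові на вихідні",
    "дешеві авіаквитки київ варшава",
)

def build_prompt(instruction, examples=()):
    """Zero-shot: instruction only. Three-shot: instruction plus three examples."""
    lines = [instruction]
    for example in examples:
        lines.append("Приклад: " + example)
    return "\n".join(lines)

zero_shot = build_prompt(INSTRUCTION)
three_shot = build_prompt(INSTRUCTION, EXAMPLES)
```
      </preformat>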
    </sec>
    <sec id="sec-4">
      <title>4. Neural machine translation</title>
      <p>A valuable alternative to the synthetic generation described earlier is machine translation using LLMs
[18], which, being trained on massive corpora, can produce remarkably fluent and contextually
appropriate outcomes across a wide range of topics and query styles.</p>
      <p>To create a subset with machine-translated queries, we used anonymized real-world samples from two
major search engines (Google and Bing). The previously described English datasets, Natural Questions
and MIMICS, served as the data source.</p>
      <p>For machine translation, it is appropriate to use low values for the parameters responsible for
randomness. A temperature of 0 reduces variability, enforcing near-deterministic decoding
by sequentially selecting the most probable tokens. Meanwhile, a top-p value of 0.05 restricts the
set of candidate tokens to the most confident predictions, balancing accuracy and fluency.</p>
      <p>We provided original queries to the models in batches of size 10 and overrode the system prompt.
For some samples, the models produced incorrectly formatted outputs [19]. Comparing the number of
failed batches to the total number determines the failure rate for each model (Table 1).</p>
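      <p>A minimal sketch of this bookkeeping (an illustration, not the paper’s actual pipeline code):</p>
      <preformat>
```python
def batched(queries, size=10):
    """Split queries into fixed-size batches; batches of 10 were used."""
    return [queries[i:i + size] for i in range(0, len(queries), size)]

def failure_rate(failed_batches, total_batches):
    """Share of batches whose output could not be parsed, as a percentage."""
    return 100.0 * failed_batches / total_batches
```
      </preformat>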
      <p>These failure rates highlight the importance of selecting appropriate models for the specific
task to unleash their potential. We translated 25,000 queries each from Google and Bing logs, distributing
them evenly among the models despite the difficulties some of them encountered.</p>
      <p>During machine translation, some abbreviations, names, digits, etc., may retain their original spelling
in English. To compare the content similarity between the translated and source queries, we used the
Ratcliff-Obershelp algorithm and computed a score ranging from 0.0 to 1.0 for each pair. For simplicity,
we report the average value and standard deviation of these scores (Table 2).</p>
      <p>The Ratcliff-Obershelp algorithm compares two strings by finding the largest common substrings
between them. It recursively identifies the largest common fragment in two strings and then repeats
this process for the remaining strings to the left and right. The similarity score reflects how alike the
two strings are in terms of content and overall structure.</p>
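      <p>Conveniently, Python’s standard library exposes this measure: <monospace>difflib.SequenceMatcher</monospace> implements gestalt pattern matching in the spirit of Ratcliff-Obershelp, so the per-pair scores can be reproduced as shown below. The query pairs are hypothetical examples, not samples from the dataset.</p>
      <preformat>
```python
from difflib import SequenceMatcher
from statistics import mean, stdev

def similarity(source, translated):
    # SequenceMatcher finds the longest matching block, then recurses on the
    # pieces to its left and right; ratio() returns 2 * M / (len(a) + len(b)),
    # where M is the total number of matched characters.
    return SequenceMatcher(None, source, translated).ratio()

# Hypothetical query pairs for illustration; low scores mean the translation
# shares little surface form with the English source.
pairs = [
    ("how to update windows 11", "як оновити windows 11"),
    ("best coffee machines 2024", "найкращі кавомашини 2024"),
]
scores = [similarity(s, t) for s, t in pairs]
report = (mean(scores), stdev(scores))
```
      </preformat>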
      <p>All models demonstrated relatively low average similarity scores between the translated and original
queries. Given the differences in language structures, it is not surprising that the Ukrainian queries
significantly differ from the English ones. The standard deviations are quite small, which indicates the
consistency of the dataset.</p>
      <p>At first glance, it may seem that datasets based on search engine logs have no disadvantages, but
the picture is not so clear-cut. Bias exists everywhere, and search engines are no exception. If the
claim that the query “download .net 8” is more likely to appear in Bing than in Google logs may be
open to debate, the fact that the query “how to upload images on google drive” is more expected to be
found in Google than in Bing logs is impossible to dispute.</p>
      <p>It is important to note that Bing queries have slightly higher average similarity scores and standard
deviations compared to Google. This bias can be explained by the specific nature of the queries that
users enter in different search engines.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Data overview</title>
      <p>The final dataset comprises 100,000 real-world-like queries, evenly split between machine-translated and
synthetically generated samples. The translated queries are divided equally between Bing and Google
subgroups. Similarly, the generated queries are split based on zero-shot and three-shot techniques.
Each of these four subgroups is further divided into eight parts based on utilized models, resulting in 32
subsegments of 3,125 queries each.</p>
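      <p>The composition described above can be verified with simple arithmetic:</p>
      <preformat>
```python
total = 100_000
per_approach = total // 2          # translated vs. synthetically generated
per_subgroup = per_approach // 2   # Bing/Google, or zero-shot/three-shot
per_model = per_subgroup // 8      # eight models per subgroup
# 2 approaches x 2 subgroups x 8 models = 32 subsegments of 3,125 queries each
subsegments = 2 * 2 * 8
```
      </preformat>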
      <p>This well-organized composition enables in-depth analysis according to source, generation method,
or particular model performance. To support this, each query is accompanied by relevant metadata and
follows a consistent schema (Table 3), which outlines the fields provided for each sample.</p>
      <p>Considering the prevalence of semantic text processing in today’s digital world [20], we focused on
queries that provide enough context. That is why STQ-UA doesn’t include queries containing fewer than
three words, as they are too short and lack the necessary semantic information (Figure 1).</p>
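      <p>This filter reduces to a simple predicate (a sketch; the exact preprocessing script is not reproduced here, and the sample queries are illustrative):</p>
      <preformat>
```python
def has_enough_context(query, min_words=3):
    """Keep only queries long enough to carry semantic information."""
    return len(query.split()) >= min_words

# Illustrative queries: one word, two words, and five words.
queries = ["погода", "купити ноутбук", "як вивчити англійську мову швидко"]
kept = [q for q in queries if has_enough_context(q)]
```
      </preformat>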
      <p>The overall trend shows a positive correlation between the number of words and characters. The
graph’s lower part concentrates the majority of points, suggesting that most queries range from 3 to
14 words and 7 to 80 characters. However, there are several points in the upper right corner, which
indicate the presence of extreme cases with long queries.</p>
      <p>To reflect the semantic diversity of the dataset, it is appropriate to use clustering. We used the LaBSE
model [21] to construct high-dimensional vector representations of search queries. The HDBSCAN
algorithm [22] clustered these embeddings, identifying groups of varying density and separating noise.</p>
      <p>Each sample provides the following fields:
• query: the unique search query obtained from the model_name.
• model_provider: the provider responsible for the model_name (openai, google, cohere, anthropic or mistral).
• model_name: the model used to obtain the query (gpt-4o, gpt-4o-mini, gemini-1.5-flash, gemini-2.0-flash, command-a, command-r-plus, claude-3.5-haiku or mistral-large).
• approach: the approach used to obtain the query (zero-shot, three-shot or translation).
• search_engine: the search engine from which the source_query was taken. If the approach is translation, this field contains either google or bing; in all other cases, it is empty.
• source_query: the real-world user query taken from search_engine logs.</p>
      <p>For visualization, we used the UMAP method [23], which projected vectors into a two-dimensional
space while preserving their structure (Figure 2).</p>
      <p>The two-dimensional projection demonstrates a high fragmentation of the feature space with hundreds
of compact clusters located unevenly and with varying densities. In the center of the space, there are
areas with an increased concentration of points corresponding to the most frequent types of queries,
while the peripheral areas are represented by small groups and isolated points, potentially anomalous or
rare. Elongated and curved structures reveal gradual transitions between semantically related groups.</p>
      <p>Since we retain both source and translated queries, the dataset could be valuable for NLP tasks
that utilize data in multiple languages. For example, it can be applied to knowledge distillation [24],
where knowledge from a more complex model is transferred to a smaller one [25], as well as adapting
monolingual models for multilingual capabilities [26, 27].</p>
    </sec>
    <sec id="sec-6">
      <title>6. Practical application</title>
      <p>One of the useful applications of the final dataset is training a model for autocorrection in search queries,
focusing on the most common types of errors, such as typos. In the early days of this field, systems
relied mostly on rule-based approaches. Later, they were gradually replaced by statistical methods,
which analyzed large corpora of texts to learn the probabilities of word sequences (for example, using
n-gram models). The modern approaches use machine learning, in particular deep learning established
on the sequence-to-sequence architecture [28].</p>
      <p>In April 2024, Grammarly introduced spivavtor-large [29], a model for the Ukrainian language
based on the mt0-large multilingual transformer [30] with approximately 1.2 billion parameters, designed
for efficient text editing and solving complex linguistic tasks. However, despite its advantages, the model
demonstrates limited performance for typo correction in search queries, particularly when handling
short, highly informal user inputs, which emphasizes the importance of fine-tuning on the STQ-UA to
better capture domain-specific patterns.</p>
      <p>Using a script for typo generation, we created a dataset with search queries containing one of three
predefined errors: adding an extra letter, replacing one with another, or omitting one. Based on the
original search queries and their versions with synthetically generated typos, spivavtor-large was
fine-tuned on an NVIDIA A100 GPU (Figure 3). The main configuration parameters of the pipeline
were a learning rate of 5e-5, a batch size of 8, and 5 training epochs, with the sequence length limited to
128 tokens.</p>
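      <p>A typo generator matching the three predefined error types can be sketched as follows (an illustration; the actual script is not reproduced here):</p>
      <preformat>
```python
import random

UKRAINIAN_LETTERS = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"

def add_typo(query, rng):
    """Inject one error: an extra letter, a substitution, or an omission."""
    pos = rng.randrange(len(query))
    kind = rng.choice(("insert", "replace", "delete"))
    if kind == "insert":
        return query[:pos] + rng.choice(UKRAINIAN_LETTERS) + query[pos:]
    if kind == "replace":
        # Exclude the original character so the query always changes.
        candidates = UKRAINIAN_LETTERS.replace(query[pos], "")
        return query[:pos] + rng.choice(candidates) + query[pos + 1:]
    return query[:pos] + query[pos + 1:]
```
      </preformat>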
      <p>The training loss indicates that the model is learning and improving its fit to the data. However, the
validation loss reveals overfitting after 3 epochs, which may impair the performance on new, unseen
data. For models like spivavtor-large, effective evaluation involves using metrics such as precision,
recall, and the combined F-score. In the context of typo correction, it is reasonable to use the F-score
with β = 0.5 (F0.5), as this weighting favors precision while still accounting for recall, balancing the
identification of relevant corrections against avoiding irrelevant ones. We compared the evaluation metrics of the baseline model with variations after 3 and 5
epochs of fine-tuning (Table 4).</p>
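      <p>The F-score with β = 0.5, which weights precision more heavily than recall, reduces to a one-line formula:</p>
      <preformat>
```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    beta below 1 favors precision; F0.5 suits typo correction, where
    proposing a wrong correction is costlier than missing one.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```
      </preformat>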
      <p>Comparison of the base and fine-tuned models reveals a significant improvement in efficiency. In
particular, the best results were achieved after 3 epochs, which was expected based on the analysis
of the validation loss. This suggests that the fine-tuning process effectively captured the task-specific
patterns.</p>
      <p>Using high-quality datasets is the key factor that enables faster adaptation and reduces overall
computational requirements. Such optimization is especially important because modern models require
significant computing resources, including sufficient memory and powerful GPUs, not only for training
but also for inference.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>This paper presents STQ-UA, a new large-scale dataset comprising 100,000 search queries for the
Ukrainian language. Given the significant lack of such publicly available resources, we applied a
dual strategy combining synthetic generation and machine translation to maintain linguistic diversity
and consistency.</p>
      <p>To ensure less biased outcomes, we used a diverse set of eight state-of-the-art LLMs incorporating
varying architectures from five leading providers (OpenAI, Google, Cohere, Anthropic, and Mistral AI)
headquartered in three countries (USA, Canada, and France) and spread across two continents (North
America and Europe).</p>
      <p>We applied both zero-shot and three-shot prompting techniques for synthetic generation, producing
50,000 queries that were intended to mimic real-world user search intent. To ensure the inclusion of
authentic search patterns, we translated 50,000 real-world English search queries taken from Google
and Bing logs.</p>
      <p>The analysis involved evaluating the performance of the models during translation, revealing varying
failure rates. We also assessed the content similarity between translated and source queries using the
Ratcliff-Obershelp algorithm, finding generally low average scores, indicating a significant
transformation while retaining some original elements such as abbreviations and digits.</p>
      <p>The resulting dataset was manually verified and offers a previously scarce resource to the Ukrainian
NLP community, making another step toward bridging the global data gap for under-resourced languages.
It can be used for training, evaluating, and fine-tuning models for various search-related tasks, including
information retrieval, query autocompletion, and relevance ranking. Future work will involve
building a dataset with a larger number of models and search engines, as well as attempting to
find real-world Ukrainian search queries in common crawls.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>
[10] N. Drushchak, M. Romanyshyn, Introducing the djinni recruitment dataset: A corpus of
anonymized CVs and job postings, in: M. Romanyshyn, N. Romanyshyn, A. Hlybovets,
O. Ignatenko (Eds.), Proceedings of the Third Ukrainian Natural Language Processing
Workshop (UNLP) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 8–13. URL:
https://aclanthology.org/2024.unlp-1.2.
[11] O. Syvokon, O. Nahorna, P. Kuchmiichuk, N. Osidach, UA-GEC: Grammatical error correction
and fluency corpus for the Ukrainian language, in: M. Romanyshyn (Ed.), Proceedings of the
Second Ukrainian Natural Language Processing Workshop (UNLP), Association for Computational
Linguistics, Dubrovnik, Croatia, 2023, pp. 96–102. doi:10.18653/v1/2023.unlp-1.12.
[12] M. Shvedova, A. Lukashevskyi, Creating parallel corpora for Ukrainian: A German-Ukrainian
parallel corpus (ParaRook||DE-UK), in: M. Romanyshyn, N. Romanyshyn, A. Hlybovets, O.
Ignatenko (Eds.), Proceedings of the Third Ukrainian Natural Language Processing Workshop
(UNLP) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 14–22. URL: https:
//aclanthology.org/2024.unlp-1.3.
[13] FIdo AI, UA-SQuAD, 2022. URL: https://huggingface.co/datasets/FIdo-AI/ua-squad.
[14] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswerable questions for SQuAD,
in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics,
Melbourne, Australia, 2018, pp. 784–789. doi:10.18653/v1/P18-2124.
[15] A. Bauer, S. Trapp, M. Stenger, R. Leppich, S. Kounev, M. Leznik, K. Chard, I. Foster, Comprehensive
exploration of synthetic data generation: A survey, 2024. URL: https://arxiv.org/abs/2401.02524.
[16] Y. Lu, L. Chen, Y. Zhang, M. Shen, H. Wang, X. Wang, C. van Rechem, T. Fu, W. Wei, Machine
learning for synthetic data generation: A review, 2025. URL: https://arxiv.org/abs/2302.04062.
[17] M. Renze, The effect of sampling temperature on problem solving in large language models,
in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational
Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024,
pp. 7346–7356. doi:10.18653/v1/2024.findings-emnlp.432.
[18] W. Zhu, H. Liu, Q. Dong, J. Xu, S. Huang, L. Kong, J. Chen, L. Li, Multilingual machine translation
with large language models: Empirical results and analysis, in: K. Duh, H. Gomez, S. Bethard
(Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for
Computational Linguistics, Mexico City, Mexico, 2024, pp. 2765–2781. doi:10.18653/v1/2024.
findings-naacl.176.
[19] D. X. Long, N.-H. Nguyen, T. Sim, H. Dao, S. Joty, K. Kawaguchi, N. F. Chen, M.-Y. Kan, LLMs are
biased towards output formats! systematically evaluating and mitigating output format bias of
LLMs, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations
of the Americas Chapter of the Association for Computational Linguistics: Human Language
Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Albuquerque,
New Mexico, 2025, pp. 299–330. doi:10.18653/v1/2025.naacl-long.15.
[20] D. Maulud, S. Zeebaree, K. Jacksi, M. M.Sadeeq, K. Hussein, State of art for semantic analysis
of natural language processing, Qubahan Academic Journal 1 (2021) 21–28. doi:10.48161/qaj.
v1n2a40.
[21] W. Wang, G. Chen, H. Wang, Y. Han, Y. Chen, Multilingual sentence transformer as a multilingual
word aligner, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for
Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates, 2022, pp. 2952–2963. doi:10.18653/v1/2022.findings-emnlp.215.
[22] C. Malzer, M. Baum, A hybrid approach to hierarchical density-based cluster selection, in: 2020
IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems
(MFI), 2020, pp. 223–228. doi:10.1109/MFI49285.2020.9235263.
[23] B. Ghojogh, A. Ghodsi, F. Karray, M. Crowley, Uniform manifold approximation and projection
(umap) and its variants: Tutorial and survey, 2021. URL: https://arxiv.org/abs/2109.02508.
[24] S. Hahn, H. Choi, Self-knowledge distillation in natural language processing, in: R. Mitkov,
G. Angelova (Eds.), Proceedings of the International Conference on Recent Advances in Natural
Language Processing (RANLP 2019), INCOMA Ltd., Varna, Bulgaria, 2019, pp. 423–430. doi:10.
26615/978-954-452-056-4_050.
[25] P. Liu, X. Wang, L. Wang, W. Ye, X. Xi, S. Zhang, Distilling knowledge from bert into simple
fully connected neural networks for efficient vertical retrieval, in: Proceedings of the 30th ACM
International Conference on Information &amp; Knowledge Management, CIKM ’21, Association
for Computing Machinery, New York, NY, USA, 2021, p. 3965–3975. doi:10.1145/3459637.
3481909.
[26] I. Yurchuk, D. Boiko, Extending monolingual asymmetric semantic search models for
multilingual query processing using knowledge distillation, in: V. Snytyuk, V. Morozov, I. Javorskyj,
V. G. Levashenko (Eds.), Proceedings of the Information Technology and Implementation (IT&amp;I)
Workshop: Intelligent Systems and Security (IT&amp;I-WS 2024: ISS), Kyiv, Ukraine, November 20
- 21, 2024, volume 3933 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 1–10. URL:
https://ceur-ws.org/Vol-3933/Paper_1.pdf.
[27] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge
distillation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), Association for Computational
Linguistics, 2020, pp. 4512–4525. doi:10.18653/v1/2020.emnlp-main.365.
[28] T. Ge, X. Zhang, F. Wei, M. Zhou, Automatic grammatical error correction for sequence-to-sequence
text generation: An empirical study, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings
of the 57th Annual Meeting of the Association for Computational Linguistics, Association for
Computational Linguistics, Florence, Italy, 2019, pp. 6059–6064. doi:10.18653/v1/P19-1609.
[29] A. Saini, A. Chernodub, V. Raheja, V. Kulkarni, Spivavtor: An instruction tuned Ukrainian text
editing model, in: M. Romanyshyn, N. Romanyshyn, A. Hlybovets, O. Ignatenko (Eds.), Proceedings
of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024,
ELRA and ICCL, Torino, Italia, 2024, pp. 95–108. URL: https://aclanthology.org/2024.unlp-1.12.
[30] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. Le Scao, M. S. Bari, S. Shen, Z. X.
Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson,
E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers,
J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics,
Toronto, Canada, 2023, pp. 15991–16111. doi:10.18653/v1/2023.acl-long.891.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1]
<string-name><given-names>H. H.</given-names> <surname>Nigatu</surname></string-name>,
<string-name><given-names>A. L.</given-names> <surname>Tonja</surname></string-name>,
<string-name><given-names>B.</given-names> <surname>Rosman</surname></string-name>,
<string-name><given-names>T.</given-names> <surname>Solorio</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Choudhury</surname></string-name>,
<article-title>The Zeno's paradox of 'low-resource' languages</article-title>, in:
<string-name><given-names>Y.</given-names> <surname>Al-Onaizan</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Bansal</surname></string-name>,
<string-name><given-names>Y.-N.</given-names> <surname>Chen</surname></string-name>
(Eds.),
<source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>,
Association for Computational Linguistics, Miami, Florida, USA,
<year>2024</year>, pp.
<fpage>17753</fpage>–<lpage>17774</lpage>.
doi:10.18653/v1/2024.emnlp-main.983.
      </ref>
      <ref id="ref2">
        <mixed-citation>
[2]
<string-name><given-names>I. O.</given-names> <surname>Gallegos</surname></string-name>,
<string-name><given-names>R. A.</given-names> <surname>Rossi</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>Barrow</surname></string-name>,
<string-name><given-names>M. M.</given-names> <surname>Tanjim</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>,
<string-name><given-names>F.</given-names> <surname>Dernoncourt</surname></string-name>,
<string-name><given-names>T.</given-names> <surname>Yu</surname></string-name>,
<string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name>,
<string-name><given-names>N. K.</given-names> <surname>Ahmed</surname></string-name>,
<article-title>Bias and fairness in large language models: A survey</article-title>,
<source>Computational Linguistics</source> <volume>50</volume> (<year>2024</year>)
<fpage>1097</fpage>–<lpage>1179</lpage>.
doi:10.1162/coli_a_00524.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[3]
<string-name><given-names>M. F.</given-names> <surname>Adilazuarda</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Mukherjee</surname></string-name>,
<string-name><given-names>P.</given-names> <surname>Lavania</surname></string-name>,
<string-name><given-names>S. S.</given-names> <surname>Singh</surname></string-name>,
<string-name><given-names>A. F.</given-names> <surname>Aji</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>O'Neill</surname></string-name>,
<string-name><given-names>A.</given-names> <surname>Modi</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Choudhury</surname></string-name>,
<article-title>Towards measuring and modeling “culture” in LLMs: A survey</article-title>, in:
<string-name><given-names>Y.</given-names> <surname>Al-Onaizan</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Bansal</surname></string-name>,
<string-name><given-names>Y.-N.</given-names> <surname>Chen</surname></string-name>
(Eds.),
<source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>,
Association for Computational Linguistics, Miami, Florida, USA,
<year>2024</year>, pp.
<fpage>15763</fpage>–<lpage>15784</lpage>.
doi:10.18653/v1/2024.emnlp-main.882.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4]
<string-name><given-names>P.</given-names> <surname>Bajaj</surname></string-name>,
<string-name><given-names>D.</given-names> <surname>Campos</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Craswell</surname></string-name>,
<string-name><given-names>L.</given-names> <surname>Deng</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>,
<string-name><given-names>X.</given-names> <surname>Liu</surname></string-name>,
<string-name><given-names>R.</given-names> <surname>Majumder</surname></string-name>,
<string-name><given-names>A.</given-names> <surname>McNamara</surname></string-name>,
<string-name><given-names>B.</given-names> <surname>Mitra</surname></string-name>,
<string-name><given-names>T.</given-names> <surname>Nguyen</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Rosenberg</surname></string-name>,
<string-name><given-names>X.</given-names> <surname>Song</surname></string-name>,
<string-name><given-names>A.</given-names> <surname>Stoica</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Tiwary</surname></string-name>,
<string-name><given-names>T.</given-names> <surname>Wang</surname></string-name>,
<article-title>MS MARCO: A human generated machine reading comprehension dataset</article-title>,
<year>2018</year>. URL: https://arxiv.org/abs/1611.09268.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
[5]
<string-name><given-names>T.</given-names> <surname>Kwiatkowski</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>Palomaki</surname></string-name>,
<string-name><given-names>O.</given-names> <surname>Redfield</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Collins</surname></string-name>,
<string-name><given-names>A.</given-names> <surname>Parikh</surname></string-name>,
<string-name><given-names>C.</given-names> <surname>Alberti</surname></string-name>,
<string-name><given-names>D.</given-names> <surname>Epstein</surname></string-name>,
<string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
<string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
<string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
<string-name><given-names>L.</given-names> <surname>Jones</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Kelcey</surname></string-name>,
<string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>,
<string-name><given-names>A. M.</given-names> <surname>Dai</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>,
<string-name><given-names>Q.</given-names> <surname>Le</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Petrov</surname></string-name>,
<article-title>Natural Questions: A benchmark for question answering research</article-title>,
<source>Transactions of the Association for Computational Linguistics</source> <volume>7</volume> (<year>2019</year>)
<fpage>452</fpage>–<lpage>466</lpage>.
doi:10.1162/tacl_a_00276.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[6]
<string-name><given-names>H.</given-names> <surname>Zamani</surname></string-name>,
<string-name><given-names>G.</given-names> <surname>Lueck</surname></string-name>,
<string-name><given-names>E.</given-names> <surname>Chen</surname></string-name>,
<string-name><given-names>R.</given-names> <surname>Quispe</surname></string-name>,
<string-name><given-names>F.</given-names> <surname>Luu</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Craswell</surname></string-name>,
<article-title>MIMICS: A large-scale data collection for search clarification</article-title>,
<year>2020</year>. URL: https://arxiv.org/abs/2006.10174.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[7]
<string-name><given-names>D.</given-names> <surname>Chaplynskyi</surname></string-name>,
<article-title>Introducing UberText 2.0: A corpus of Modern Ukrainian at scale</article-title>, in:
<string-name><given-names>M.</given-names> <surname>Romanyshyn</surname></string-name>
(Ed.),
<source>Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)</source>,
Association for Computational Linguistics, Dubrovnik, Croatia,
<year>2023</year>, pp.
<fpage>1</fpage>–<lpage>10</lpage>.
doi:10.18653/v1/2023.unlp-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[8]
<string-name><given-names>A.</given-names> <surname>Conneau</surname></string-name>,
<string-name><given-names>K.</given-names> <surname>Khandelwal</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
<string-name><given-names>V.</given-names> <surname>Chaudhary</surname></string-name>,
<string-name><given-names>G.</given-names> <surname>Wenzek</surname></string-name>,
<string-name><given-names>F.</given-names> <surname>Guzmán</surname></string-name>,
<string-name><given-names>E.</given-names> <surname>Grave</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Ott</surname></string-name>,
<string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
<string-name><given-names>V.</given-names> <surname>Stoyanov</surname></string-name>,
<article-title>Unsupervised cross-lingual representation learning at scale</article-title>, in:
<string-name><given-names>D.</given-names> <surname>Jurafsky</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>Chai</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Schluter</surname></string-name>,
<string-name><given-names>J.</given-names> <surname>Tetreault</surname></string-name>
(Eds.),
<source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>,
Association for Computational Linguistics,
<year>2020</year>, pp.
<fpage>8440</fpage>–<lpage>8451</lpage>.
doi:10.18653/v1/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[9]
<string-name><given-names>V.</given-names> <surname>Starko</surname></string-name>,
<string-name><given-names>A.</given-names> <surname>Rysin</surname></string-name>,
<article-title>Creating a POS gold standard corpus of Modern Ukrainian</article-title>, in:
<string-name><given-names>M.</given-names> <surname>Romanyshyn</surname></string-name>
(Ed.),
<source>Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)</source>,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>