<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Wikipedia-grounded Dataset for Question Answering, Verification, and Text Generation for Ukrainian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emil Nasirov</string-name>
          <email>enasirov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Taranukha</string-name>
          <email>volodymyr.taranukha@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Marchenko</string-name>
          <email>omarchenko@univ.kiev.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anatoly Anisimov</string-name>
          <email>anatoly.v.anisimov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Technologies and Systems</institution>
          ,
          <addr-line>40, Acad. Glushkova av., Kyiv, 03187</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <fpage>129</fpage>
      <lpage>143</lpage>
      <abstract>
        <p>Information Technology and Implementation (IT&amp;I-2025), November 20-21, 2025, Kyiv, Ukraine * Corresponding author. These authors contributed equally.</p>
      </abstract>
      <kwd-group>
        <kwd>dataset</kwd>
        <kwd>Ukrainian language</kwd>
        <kwd>Wikipedia</kwd>
        <kwd>ODQA</kwd>
        <kwd>fact verification</kwd>
        <kwd>text generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Creating Wikipedia-grounded resources for question answering, fact verification, and text
generation for the Ukrainian language is an active research area with downstream use in both
academic and applied work. There are some good modern benchmarks and resources for this task.</p>
    </sec>
    <sec id="sec-2">
      <title>Related benchmarks and contributions</title>
      <p>
        The authors of [<xref ref-type="bibr" rid="ref1">1</xref>] show how to condition generation on
selected source material. [<xref ref-type="bibr" rid="ref2">2</xref>] gives an example of how
table-to-text items in a training sample have to include highlighted cell selections and lexical
constraints, which reduces hallucinations and teaches controlled generation from structured data.
KILT [<xref ref-type="bibr" rid="ref3">3</xref>] shows that each item can be bound to a stable
Wikipedia snapshot via page IDs and character offsets, enabling deterministic rebuilds,
disambiguated entity targets, and leakage checks between train/dev/test.
PAQ [<xref ref-type="bibr" rid="ref4">4</xref>] demonstrates large-scale automatic generation of
questions. Such generation must be used carefully, keeping source passages for each question so as
not to introduce spurious patterns. BEIR offers a heterogeneous retrieval benchmark spanning fact
checking, QA, and more, suitable for zero-shot retrieval evaluation
[<xref ref-type="bibr" rid="ref5">5</xref>]. MIRACL provides expert-annotated, multilingual
monolingual retrieval over Wikipedia across 18 languages (not including Ukrainian), aligning with
modern IR pipelines [<xref ref-type="bibr" rid="ref6">6</xref>]. Other benchmarks target multi-hop
questions, matching LLM reasoning evaluation [<xref ref-type="bibr" rid="ref7">7</xref>].
AmbigQA and its successors model questions with multiple valid interpretations and answers; recent
work on generative disambiguation sharpens evaluation under ambiguity
[<xref ref-type="bibr" rid="ref8">8</xref>]. FEVEROUS extends FEVER to require
structured + unstructured evidence, closer to real Wikipedia use
[<xref ref-type="bibr" rid="ref9">9</xref>]. AmbiFC introduces 10k claims
with fine-grained evidence, advancing robustness beyond single-evidence claims
[<xref ref-type="bibr" rid="ref10">10</xref>]. However, none of the mentioned
works supports the Ukrainian language.
      </p>
      <p>
        The authors have experience in designing linguistic resources and datasets [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. In order
to facilitate further research in computational linguistics for the Ukrainian language, we aim to design
and implement a reproducible, Ukrainian-centered Wikipedia-to-tasks pipeline that will do the following:
• robustly extract and clean content from MediaWiki snapshots;
• perform adaptive chunking that respects section/paragraph boundaries and augment queries with retrieval context;
• build lexical, dense, and hybrid indices with explicit provenance;
• generate and normalize instruction and QA pairs supported by retrieval evidence; and
• filter answer mismatches (by embedding similarity) with targeted human review.
      </p>
      <sec id="sec-2-1">
        <title>Related Works</title>
        <p>Nevertheless, several published resources can serve as practical surrogates.</p>
        <p>MKQA provides aligned question-answer pairs across 26 languages (including Ukrainian) for
open-domain QA [<xref ref-type="bibr" rid="ref14">14</xref>]. Its multilingual QA pairs are not bound
to a specific Wikipedia snapshot or evidence passages. For our task, we must mine evidence in
Ukrainian Wikipedia and bind each item to page/section IDs to make it usable for training as well as
for evaluation.</p>
        <p>UNLP offers Ukrainian exam questions (ZNO) [<xref ref-type="bibr" rid="ref15">15</xref>].</p>
        <p>
          The UNLP 1.0 suite publishes Ukrainian evaluation tasks for sanity checks and model
selection [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. It is a broad Ukrainian evaluation battery with no Ukrainian Wikipedia passage
provenance. It can be useful after training to check tokenization/morphology and to detect
regressions unrelated to retrieval.
        </p>
        <p>
          Another resource, published in ACL Findings 2021 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], supports entity span highlighting [<xref ref-type="bibr" rid="ref18">18</xref>]. It can
enrich entity fields in the index, supervise evidence span highlighting, and evaluate entity fidelity
in generated answers.
        </p>
        <p>
          TRWU contributes Ukrainian news messages with stance/sentiment/bias labels for robustness
experiments [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. It can be kept for robustness checks (behavior on noisy, non-encyclopedic text and
pages) if we later extend beyond QA.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Source acquisition and filtering</title>
        <p>The Wikimedia Dumps portal documents the pages-articles.xml.bz2 files, directory structure, and
checksums. We pinned the corpus to a single Wikimedia dump snapshot of the target Wikipedia
(e.g., ukwiki-20250401-pages-articles.xml.bz2) to guarantee reproducibility and leakage control. We
pulled the dump from https://dumps.wikimedia.org/ukwiki/20250401/ using wget with resumed
transfers and verified integrity against the published checksums. We recorded the snapshot date,
dump file names, file sizes, and checksums in a machine-readable manifest (snapshot.json). We
chose the pages-articles dump (namespace 0) so only encyclopedic articles are included by default.
We also archived the stub-meta-current.xml.gz for quick metadata lookups. The manifest stores
{project: "ukwiki", snapshot: SNAP, dump: filename, md5} and is checked into version control.</p>
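        <p>The acquisition step can be sketched as follows. This is a minimal illustration, not the production script; the size field and the error message are our own additions to the manifest fields listed above.</p>

```python
import hashlib
import json
from pathlib import Path

def md5_of(path: str) -> str:
    """Stream-hash a dump file so large archives never load fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(project: str, snapshot: str, dump_path: str, expected_md5: str) -> dict:
    """Verify the downloaded dump against the published checksum, then record
    the snapshot metadata in a machine-readable snapshot.json manifest."""
    actual = md5_of(dump_path)
    if actual != expected_md5:
        raise ValueError(f"checksum mismatch for {dump_path}: {actual}")
    manifest = {
        "project": project,
        "snapshot": snapshot,
        "dump": Path(dump_path).name,
        "md5": actual,
        "size": Path(dump_path).stat().st_size,
    }
    Path("snapshot.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

        <p>Failing loudly on a checksum mismatch keeps a truncated download from silently poisoning every later stage.</p>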
        <p>Grounding all tasks in a single Wikipedia snapshot is the standard practice that underpins
reproducibility, comparability, and leakage control in knowledge-intensive benchmarks (e.g.,
KILT). Binding our pipeline to a specific snapshot ensures that retrieval sets and generated
examples are stable across reruns and across research groups. Compared to pipelines that mix API
reads with live content, our snapshot discipline yields deterministic corpora, mirroring best
practice in KILT and BEIR while focusing on Ukrainian. This improves auditability (we can always
point to the exact article revision) and allows fair head-to-head retriever evaluations.</p>
        <p>
          We converted wikitext to cleaned plain text using WikiExtractor[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] with JSON output enabled,
then performed a light second-pass cleanup with mwparserfromhell for edge cases. Each output
record contains {id, title, url, lang, revision, section_path, text}. WikiExtractor is the de-facto tool
for transforming Wikipedia dumps into plain text, handling templates and producing scalable
shards, mwparserfromhell[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] adds robust programmatic markup handling for edge cases. Using
both minimizes noise without losing provenance. Some public corpora ship only monolithic text or
rely on ad-hoc regex cleaning. Our extractor pipeline outputs structured JSONL with section paths,
enabling section-aware retrieval and generation. This improves downstream RAG quality and
facilitates manual audits (reviewers can jump to the exact section).
        </p>
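        <p>A compressed sketch of the record-building step is shown below; for brevity it uses plain regexes where the pipeline applies mwparserfromhell, and the helper names are illustrative.</p>

```python
import json
import re

def clean_extracted_text(text: str) -> str:
    """Light second-pass cleanup after WikiExtractor: drop residual template
    braces and link brackets, then collapse whitespace. (Simple regexes stand
    in here for the mwparserfromhell pass used in the pipeline.)"""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # leftover {{...}} templates
    text = re.sub(r"\[\[|\]\]", "", text)        # leftover link brackets
    return re.sub(r"\s+", " ", text).strip()

def to_record(page: dict) -> str:
    """Emit one JSONL record with the fields listed above."""
    record = {
        "id": page["id"],
        "title": page["title"],
        "url": page["url"],
        "lang": "uk",
        "revision": page.get("revision"),
        "section_path": page.get("section_path", []),
        "text": clean_extracted_text(page["text"]),
    }
    return json.dumps(record, ensure_ascii=False)
```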
        <p>Although pages-articles already targets namespace 0, we applied additional page-type and
quality filters before indexing:
1. Redirects &amp; disambiguation: we dropped pages beginning with #REDIRECT and those
containing common disambiguation templates; we also filtered pages whose entire body is
a bulleted list of links (list-pages and set-index pages).
2. Minimum content: we removed very short stub articles and articles dominated by tables/lists
(&gt;70% non-prose).
3. Structural sanity: we excluded pages with extreme template density (ratio of markup tokens to
words) and pages with no alphabetic tokens in the first paragraph.</p>
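        <p>These filters can be approximated with simple heuristics. The sketch below uses illustrative template markers and thresholds; only the 70% non-prose ratio echoes the text above, and the bullet-ratio cutoff is our own assumption.</p>

```python
import re

DISAMBIG_MARKERS = ("{{disambiguation", "{{неоднозначність")  # illustrative template names

def is_low_quality(wikitext: str) -> bool:
    """Heuristic page filters mirroring the list above (thresholds illustrative)."""
    body = wikitext.strip()
    lower = body.lower()
    if lower.startswith("#redirect") or lower.startswith("#перенаправлення"):
        return True
    if any(m in lower for m in DISAMBIG_MARKERS):
        return True
    lines = [ln for ln in body.splitlines() if ln.strip()]
    # list-pages: almost every non-empty line is a bulleted link
    bullets = sum(1 for ln in lines if ln.lstrip().startswith("*"))
    if lines and bullets / len(lines) > 0.9:
        return True
    # non-prose dominance: table/list markup tokens vs. words
    markup = len(re.findall(r"\{\||\|\}|\|-|^\*", body, flags=re.M))
    words = len(re.findall(r"\w+", body, flags=re.UNICODE))
    if words == 0 or markup / (markup + words) > 0.7:
        return True
    return False
```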
        <p>Retrieval systems are sensitive to noisy or weakly structured pages, filtering out non-prose
content improves dense retriever effectiveness and reduces hallucination in RAG. Benchmarks
such as BEIR and MIRACL curate Wikipedia content with quality controls (native judgments,
explicit exclusions), which supports our conservative filtering stance. Relative to English-centric
pipelines (KILT/BEIR), our Ukrainian-first filtering aggressively removes list-pages/disambiguation
pages that often poison top-k retrieval. This is expected to lower the proportion of non-answerable
passages retrieved by &gt;10% while retaining coverage of long-tail topics.</p>
        <p>
          The next sub-step is provenance &amp; audit trail. Precise provenance enables error analysis and
deterministic rebuilds, key requirements echoed across KILT, BEIR, and FEVEROUS. Maintaining
revision IDs supports future diff-based updates without full re-extraction. Every passage now
carries strong provenance: {page_id, revision_id, title, url, section_path, snapshot_id, timestamp,
lang}. We computed a stable document hash (SHA-1 of normalized wikitext) and stored it in both
the JSONL[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and the manifest. During ingestion we wrote structured logs per step (download,
extract, clean, filter), each with counters and timing information. Many open releases expose only
raw text or opaque indices. Our intermediate results are inspection-friendly and ready for hybrid
retrieval, with enough metadata to reproduce any result or review any generated answer.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Chunking and Tokenization</title>
        <p>Our segmentation design converts article-level text into retrieval-ready passages with the
following requirements: indexability under subword token limits of modern encoders, contextual
coherence sufficient for sentence-level factual questions, and, if possible, consistency across
languages (Ukrainian primary, multilingual optional). We slice each article into fixed token
windows of length L (L = {100, 200, 300}) with overlap O (default O = 0.2L), using the tokenizer
aligned to the downstream encoder, so that consecutive windows start at deterministic token
indices.</p>
        <p>
          We adopted the tokenizer of the encoder actually used for passage representation to eliminate
train test mismatches in subword boundaries. Our default tokenizer is BERT-family
WordPiece[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] (30k vocabulary, using [CLS]/[SEP] specials), but the pipeline is model-agnostic, so
other models can be used also, such as SentencePiece (Unigram/BPE), byte-level BPE, and variants
required by multilingual encoders (e.g., XLM-R[
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], mDeBERTaV3[
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]) and sentence-embedding
models (e.g., E5[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]/BGE[
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]). (Discussion on other tokenizers is placed in Discussion section). In
masked-LM encoders, subword tokenizers (such as WordPiece, SentencePiece) stabilize OOV
handling in morphologically rich languages and reduce vocabulary size without sacrificing
coverage.
        </p>
        <p>The window family is motivated by prior retrieval-augmented systems that operate on short
passages (approx. 100-200 tokens), and by recent analyses of long-context behavior indicating
sensitivity to the position of evidence, which degrades away from the beginning or end of the
prompt. Overlap protects discourse continuity across boundaries by ensuring that the same
sentence can occur in more than one window during retrieval and generation. Sections shorter
than L form a single passage; very long sentences are allowed to straddle boundaries to preserve
token-level determinism. However, we ensure at least one boundary falls at whitespace. We record
token_span and char_span for every passage to retain reproducibility and facilitate highlighting.</p>
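        <p>A minimal sketch of the windowing arithmetic (defaults mirror L = 200, O = 0.2L; the function names are our own):</p>

```python
def window_spans(n_tokens: int, length: int = 200, overlap: float = 0.2):
    """Fixed token windows with fractional overlap; returns (start, end) token
    spans. A section shorter than `length` yields a single span."""
    stride = max(1, int(length * (1.0 - overlap)))
    spans, start = [], 0
    while n_tokens > start:
        end = min(n_tokens, start + length)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans

def chunk_tokens(token_ids, length=200, overlap=0.2):
    """Materialize the overlapping passages from a tokenized section."""
    return [token_ids[s:e] for s, e in window_spans(len(token_ids), length, overlap)]
```

        <p>Because spans are pure arithmetic over token counts, reruns on the same tokenizer output are deterministic, which is what makes the recorded token_span fields stable across regenerations.</p>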
        <p>We performed a simple test on a small subset of our corpus with bag-of-words representation and
queries derived from section leads. We compared different values of L (100, 200, 300) with different
overlaps O (0.1, 0.2, 0.3) under a simple retriever. Shorter windows (100-200) consistently showed
better retrieval precision on short queries, while 300-token windows slightly improved recall for
multi-sentence questions. Overlap of 0.2 was found to be the optimal point between index growth and
recall; going to 0.3 offered limited additional benefit. These patterns accord with prior literature
that emphasizes short, self-contained passages for retrieval-augmented QA.</p>
        <p>Fast tokenizers provide deterministic normalization and alignment maps, enabling exact
reconstruction of text spans for audits. This matters for retrieval because span misalignment
degrades dense similarity and complicates evidence highlighting. Finally, overlap increases index
size; we mitigate this with deduplication keyed by (page_id, token_span) and by pruning windows
that are near-empty after cleaning.</p>
        <p>There is a certain risk in this approach, since fixed windows can split discourse phenomena
(anaphora, enumerations). There are other options, but we decided against them, despite the
mentioned risks, due to their disadvantages.</p>
        <p>Sentence-aware splitting uses sentence boundaries to avoid mid-sentence cuts, optionally
packing adjacent sentences until a token budget is reached. It improves readability and leaves fewer
dangling coreferences, but it has disadvantages: reliance on sentence segmenters that may
degrade on titles, lists, or scientific notation and introduce language-specific errors.</p>
        <p>Semantic (embedding-aware) splitting provides adaptive boundaries chosen by embedding
similarity between adjacent sentences; chunks expand or contract to keep intra-chunk coherence
high. It often produces fewer irrelevant tokens per chunk, but it has disadvantages: higher
preprocessing cost and less deterministic chunk boundaries.</p>
        <p>Hierarchical or hybrid splitting organizes passages at multiple granularities (section to
paragraph to token windows) to enable coarse-to-fine retrieval. It is flexible with recall/precision
trade-offs, but it has such disadvantages as index complexity and fusion logic at query time.</p>
        <p>We adopted fixed token windows as the primary method for their determinism, speed, and
simplicity. However, our code path keeps sentence-aware
and semantic splitters as pluggable components for ablations.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Storing Data</title>
        <sec id="sec-2-4-1">
          <title>5.1. Storage of chunks in an indexed JSON format</title>
          <p>We adopted a line-oriented JSONL corpus as the basic store for all passage-level artifacts
produced by the pipeline. Line-granularity supports incremental processing,
graceful degradation (malformed lines quarantined without corrupting neighbors), and easy diffing
across releases. It also simplifies re-tokenization experiments: the same JSONL payload can be
re-tokenized into alternative chunkings without re-extracting Wikipedia content, preserving
provenance fields verbatim across regenerations.</p>
          <p>Although JSONL lacks intrinsic indexing, line boundaries permit offset indexing without
parsing the entire file. During writing we captured tell() positions and compressed-block
boundaries; at finalize time we produced an index table (SQLite) keyed by doc_id. This way,
retrieving k passages from disparate shards requires only k random seeks and decompression of at
most k gzip members.</p>
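          <p>The offset-index mechanism can be sketched as follows; for clarity this version writes an uncompressed JSONL file rather than per-record gzip members, and the table layout is illustrative.</p>

```python
import json
import sqlite3

def write_jsonl_with_index(records, jsonl_path: str, db_path: str) -> None:
    """Write passages as JSONL while capturing tell() offsets, then persist a
    doc_id to (path, byte_offset, byte_len) mapping in a SQLite sidecar."""
    rows = []
    with open(jsonl_path, "wb") as out:
        for rec in records:
            line = (json.dumps(rec, ensure_ascii=False) + "\n").encode("utf-8")
            rows.append((rec["doc_id"], jsonl_path, out.tell(), len(line)))
            out.write(line)
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS offsets"
               "(doc_id INTEGER PRIMARY KEY, path TEXT, off INTEGER, len INTEGER)")
    db.executemany("INSERT OR REPLACE INTO offsets VALUES (?,?,?,?)", rows)
    db.commit()
    db.close()

def fetch_passage(db_path: str, doc_id: int) -> dict:
    """O(1) lookup: one SQLite probe, one seek, one bounded read."""
    db = sqlite3.connect(db_path)
    path, off, ln = db.execute(
        "SELECT path, off, len FROM offsets WHERE doc_id = ?", (doc_id,)).fetchone()
    db.close()
    with open(path, "rb") as f:
        f.seek(off)
        return json.loads(f.read(ln))
```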
          <p>Each passage is recorded as a self-contained JSON object on a separate line, enabling stream
processing and line-granular recovery. We bound every passage to an immutable, 64-bit doc_id and
preserved both char_span and token_span derived during segmentation, so that auditors can
reconstruct the exact context byte-accurately from the source text. For efficient random access, we
generated a compact offset index that maps doc_id to a tuple of (file_path, byte_offset, byte_len)
and persisted it in a SQLite sidecar that provides supporting functionality. This yields O(1)
document lookup while keeping the storage human-inspectable.</p>
          <p>The extraction pipeline produced gzipped shards, partitioned by
lang/snapshot_id and by the first digits of doc_id. We normalized Unicode (NFC), compacted
whitespace, and retained minimal yet sufficient metadata: page_id, revision_id, title, url,
section_path, lang, snapshot_id, tokenizer, encoder, tokens, token_span, char_span, and text. The
resulting JSONL serves simultaneously as ground truth for all later layers, as an audit log of
segmentation decisions, and as a portable interchange format requiring only a text reader.</p>
          <p>Schema (abbreviated). {doc_id:int64, page_id:int64, revision_id:int64, title:str, url:str,
section_path:list&lt;str&gt;, lang:str, snapshot_id:str, tokenizer:str, encoder:str, tokens:int32,
token_span:list&lt;int32&gt;, char_span:list&lt;int32&gt;, text:str}. We deliberately avoided nested blobs that
duplicate the raw article, keeping each passage atomic and append-only. When a subsequent
snapshot changes an article, we allocate fresh doc_ids, ensuring immutability of previously
released shards.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>5.2. Building a vector index for RAG from JSON</title>
          <p>
            We constructed the initial retrieval subsystem entirely from the JSONL corpus, omitting any
intermediate database. Each text field was embedded using basic BERT [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] (default:
bert-base-multilingual-cased) with mean pooling over the last hidden layer gated by the attention
mask. We then L2-normalized the resulting vectors and stored them as a contiguous matrix
(float32, NumPy .npy). At query time, we compute a BERT embedding for the query with the same
procedure and perform brute-force cosine search against the matrix. Although asymptotically more
expensive than specialized ANN structures, this JSON-only baseline provides deterministic
alignment with the tokenizer used in segmentation, high auditability, and a strong sanity-check for
later, more complex indices.
          </p>
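          <p>The pooling and brute-force search procedure can be sketched in NumPy as follows. The encoder forward pass is assumed to have already produced last_hidden and attention_mask; the helper names are our own.</p>

```python
import numpy as np

def mean_pool(last_hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean pooling over the last hidden layer, gated by the attention mask,
    followed by L2 normalization (applied to passages and queries alike)."""
    mask = attention_mask[:, :, None].astype(np.float32)   # (batch, seq, 1)
    summed = (last_hidden * mask).sum(axis=1)              # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)          # avoid divide-by-zero
    vecs = summed / counts
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def cosine_top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force cosine search over a unit-normalized float32 matrix;
    partial sort via argpartition, then exact ordering of the k survivors."""
    scores = matrix @ query_vec   # cosine equals dot product on unit vectors
    top = np.argpartition(-scores, min(k, len(scores) - 1))[:k]
    return top[np.argsort(-scores[top])]
```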
          <p>For this purpose JSONL shards were streamed in batches, texts were prefixed with no special
markers (BERT does not require instruction prefixes), tokenized with the exact tokenizer pinned in
Part 2, and encoded on GPU where available. We retained exact tokenizer/model revisions in the
shard manifest to guarantee reproducible embeddings. The embedding matrix and a parallel
doc_ids.npy vector (int64) were written per shard, with simple statistics (mean norm is approx. 1.0,
zero NaNs) logged for each batch.</p>
          <p>While specialized retrieval encoders exist, BERT remains a robust baseline for multilingual
semantics when paired with an appropriate pooling strategy. Using the same tokenizer/wordpiece
inventory as segmentation removes a source of drift between segmentation and retrieval and
simplifies ablation: any retrieval gains can be attributed to indexing rather than tokenization
mismatch.</p>
          <p>For a query string, we apply the same BERT tokenizer and mean-pooling, normalize the vector,
and compute cosine similarities against the passage matrix using blocked matrix multiplication to
keep CPU caches warm. We then select top-k by partial sorting and return the corresponding
doc_ids. Empirically, on a single 16-core server, brute-force cosine over approx. 250k passages
remains within interactive latencies (approx. 10 60 ms) due to efficient BLAS kernels. Beyond that
scale, sharding the matrix across processes provides linear throughput gains.</p>
          <p>In parallel to embeddings, we distilled a small SQLite sidecar keyed by doc_id holding title, url,
section_path, lang, and char_span. Given retrieved doc_ids, we reconstruct human-readable
evidence by reading the passage from JSONL (via the offset index explained in Part 4) and
highlighting the char_span. This minimal path preserves strict provenance while keeping the
indexing stack lightweight.</p>
          <p>The JSON-only, BERT-based retrieval provides a transparent and reproducible baseline that
reaches acceptable Recall@k for definition and entity-centric queries on our snapshot while
maintaining deterministic evidence reconstruction. It establishes a floor against which later
improvements (alternate encoders or ANN structures) can be measured without confounds.</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>5.3. HNSW: index construction and optimization for fast nearest‑neighbor search</title>
          <p>
            We transitioned from the JSON-only, brute-force cosine baseline to a graph-based approximate
nearest-neighbor (ANN) index using Hierarchical Navigable Small Worlds (HNSW)[
            <xref ref-type="bibr" rid="ref28">28</xref>
            ]. The goal
was to maintain high Recall@k while driving down tail latency for interactive RAG. HNSW
organizes vectors into a multi-layer navigable small-world graph; queries descend levels greedily
and then perform local best-first search at the ground layer.
          </p>
          <p>We trained no additional model: with the BERT-derived, L2-normalized passage embeddings
produced before, we adopted inner product as equivalent to cosine on unit-norm vectors and
constructed HNSW indexes per shard (approx. 1-5M passages). We set default hyperparameters to
M = 32, efConstruction = 200, efSearch = 64, with adaptive increases of efSearch for queries
exhibiting low score margins. Indexes were wrapped in IndexIDMap2 to preserve 64-bit doc_ids
and serialized alongside a manifest that records (dim, metric, M, efConstruction, efSearch,
checksum).</p>
          <p>HNSW yields a favorable latency recall frontier on sentence-level embeddings and supports
incremental additions. Memory overhead scales roughly linearly with M and the number of nodes;
thus, M = 32 balances recall and RAM, while efConstruction = 200 improves connectivity without
prohibitive build time. Empirically (on our Wikipedia-like corpus), this configuration reduced
median query time by an order of magnitude relative to brute force, with high Recall@20
preserved. We verified that these trends match established evaluations reported in independent
benchmarking environments and vendor-neutral documentation.</p>
          <p>We parallelized construction across shards. Each worker (CPU-bound for HNSW) mapped
embeddings_bert.npy read-only, built an IndexHNSWFlat(d, M, METRIC_INNER_PRODUCT), set
efConstruction, and added vectors in contiguous ID order to favor cache locality. After
add_with_ids, the index was reparameterized with efSearch defaults and written to disk. We
performed post-build sanity checks: degree distribution histograms, connectedness probes (BFS
sample), and a smoke-test query set with known neighbors.</p>
          <p>Queries encoded by BERT are normalized and searched with HNSW k-NN. We monitor the score
margin (between top-1 and top-2) and the empty-hit rate. If a margin is small or the hit set is nearly
duplicate, efSearch is raised (e.g., from the default 64), trading latency for recall against
reranker cost in downstream generation.</p>
          <p>Search returns doc_ids, which we join to the SQLite sidecar to recover title, url, section_path,
lang, and exact char_span. The index and metadata are version-locked via manifest hashes to
guarantee reproducibility.</p>
          <p>HNSW is robust to moderate noise and local clustering; however, performance can degrade on
purely random vectors. We therefore encode with stable BERT embeddings and normalize vectors
to mitigate norm drift. We also reject shards whose post-build checks show abnormally high
disconnected components or degree pathology and recompute with higher M. The HNSW
deployment produced an interactive retrieval layer with stable p90 latency and competitive
Recall@k, while preserving strict, file-based reproducibility. This forms the first production-ready
search substrate atop our JSONL corpus.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>Templates for retrieval‑augmented generation</title>
        <sec id="sec-2-5-1">
          <title>6.1. Template choice</title>
          <p>We evaluated several template variants with the understanding that they are not mutually
exclusive.</p>
          <p>
            QA style (Question/Context/Answer) replaces Instruction with Question. It is natural for
factoid QA and pairs well with DPR retrievers [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ], but less expressive for summarization
or style-constrained tasks.
          </p>
          <p>ReAct-style RAG interleaves Thought and Act(Search) steps to fetch more passages during
inference. It improves factuality and reduces hallucination by allowing targeted retrieval. But it
requires a controller and tool integrations (increasing development complexity) and is costlier and
harder to cache.</p>
          <p>Demonstrate-Search-Predict bootstraps few-shot demonstrations with retrieval in the loop. The
approach is strong for tasks that benefit from pipeline-aware exemplars. However, it is even more
complex and manpower-intensive than ReAct RAG, and one must be careful with the curation needed to
avoid leakage, which raises manpower requirements even more.</p>
          <p>XML-constrained RAG wraps prompt segments in tags (e.g., an instruction tag, a context tag, and
an answer tag). It is machine-readable and plays well with validators, and it is helpful for structured
output requirements. Disadvantages are sensitivity to novel tags and somewhat reduced human
readability.</p>
          <p>JSON-schema RAG constrains the model to emit a JSON object conforming to a schema. Its output is
directly machine-readable and simplifies downstream scoring (same as XML-constrained).
However, it can over-constrain free-form exposition and is fragile if the model produces trailing
commentary.</p>
          <p>We decided to use the Instruction/Context/Answer (I/C/A) format to ground generation in retrieved
evidence because it is sufficient for the task at hand. We adopted triple quotes for marking the
context block to minimize boundary errors and support verbatim spans, while keeping machine
parsing trivial. Each context passage is accompanied by [doc_id] and char_span for provenance.
Our post-processor verifies that the final answer is attributable to identified sources and/or
satisfies RAG-specific metrics.</p>
          <p>System: You are a careful assistant. Use only the Context. If the Context is insufficient, say so.
Instruction:{instruction}
Context:
"""
[{doc_id_1}:{char0_1}-{char1_1}]{passage_1}
[{doc_id_2}:{char0_2}-{char1_2}]{passage_2}
"""
Answer:</p>
          <p>This format decouples Instruction (task control) from Context (evidence) and exposes explicit
provenance tags that our evaluator can match against citations.</p>
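          <p>Rendering this template is mechanical; a minimal sketch, assuming passage dicts carrying the doc_id, char_span, and text fields of the corpus schema:</p>

```python
SYSTEM = ("You are a careful assistant. Use only the Context. "
          "If the Context is insufficient, say so.")

def render_prompt(instruction: str, passages: list) -> str:
    """Assemble the Instruction/Context/Answer prompt, prefixing each passage
    with its [doc_id:char0-char1] provenance tag inside a triple-quoted block."""
    lines = []
    for p in passages:
        c0, c1 = p["char_span"]
        lines.append(f"[{p['doc_id']}:{c0}-{c1}]{p['text']}")
    context = "\n".join(lines)
    return (f"System: {SYSTEM}\n"
            f"Instruction:{instruction}\n"
            'Context:\n"""\n' + context + '\n"""\n'
            "Answer:")
```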
        </sec>
        <sec id="sec-2-5-2">
          <title>6.2. Test implementation</title>
          <p>On the technical side we implemented a context packer that assembles the top-k passages returned
by HNSW into a bounded-length window. Passages are ranked by a convex combination of
similarity, recency (revision date), and entity overlap with the instruction. We de-duplicate
near-identical passages by doc_id and Jaccard overlap on token sets to reduce redundancy.</p>
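          <p>The packer logic can be sketched as follows; the convex weights, the duplicate threshold, and the candidate field names are illustrative, not the tuned production values.</p>

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap on token sets, used here for near-duplicate detection."""
    inter = len(a.intersection(b))
    union = len(a.union(b))
    return inter / union if union else 0.0

def pack_context(candidates, budget_tokens: int, w_sim=0.7, w_rec=0.2, w_ent=0.1,
                 dup_threshold: float = 0.8):
    """Rank passages by a convex combination of similarity, recency, and entity
    overlap, drop near-duplicates, and stop at the token budget."""
    ranked = sorted(
        candidates,
        key=lambda c: w_sim * c["sim"] + w_rec * c["recency"] + w_ent * c["entity_overlap"],
        reverse=True)
    picked, used = [], 0
    for c in ranked:
        toks = set(c["text"].split())
        if any(jaccard(toks, set(p["text"].split())) > dup_threshold for p in picked):
            continue  # near-identical to an already-packed passage
        if used + c["tokens"] > budget_tokens:
            break     # bounded-length window is full
        picked.append(c)
        used += c["tokens"]
    return picked
```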
          <p>
            We used Mistral-7B-Instruct [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] as the default generator owing to its latency/quality balance
on commodity GPUs. The serving stack treats the generator as stateless: prompts are assembled per
request. To test that generated text is supported by the provided evidence, we integrated very simple
attribution-focused metrics. Each answer is accompanied by a list of predicted citations (doc IDs);
our scorer computes faithfulness and context relevance using per-token Jaccard similarity.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>Adaptive prompts for QA: template design, instantiation, and ablation</title>
        <p>Having established a high-recall nearest-neighbor index over BERT embeddings, we designed a
family of adaptive prompt templates that place the question in a normalized slot, optionally
elicit reasoning traces for more demanding tasks, and standardize the answer channel to facilitate
evaluation. We additionally support a template-insertion control that toggles the latent reasoning
mode without enforcing a particular trace length.</p>
        <sec id="sec-2-6-1">
          <title>7.1. Formats and design trade-offs</title>
          <p>Structured prompts reduce ambiguity and provide latent affordances for reasoning strategies
(decomposition, plan/execution separation). Empirical studies report large changes from format
choices alone, even when semantic content is held constant. Reasoning-eliciting templates
consistently improve success on compositional and arithmetic tasks. Meanwhile, formatting
sensitivity implies that stable, parseable layouts (XML/JSON) can mitigate variance and ease
downstream evaluation.</p>
          <p>We tried several forms to test robustness and accommodate different inference regimes (not
all of them became final templates):</p>
          <p>Minimalist (Question - Answer):
#Question:{question}
#Answer: &lt;final answer here&gt;.</p>
          <p>Such an approach has the lowest token overhead and is strong for simple factoid QA when the
retriever is accurate. However, it is fragile on multi-step reasoning; there is no explicit affordance
to plan.</p>
          <p>Self-Ask
#Question:{question}
#Sub-question 1: &lt;question&gt; #Answer 1: &lt;answer&gt;
#Sub-question 2: &lt;question&gt; #Answer 2: &lt;answer&gt;</p>
          <p>It encourages explicit decomposition and can integrate search actions between sub-questions.
The disadvantages are longer prompts and the need for reliable gating to avoid over-decomposition.</p>
          <p>Plan-and-Solve header
#Plan
#&lt;step 1&gt;
#&lt;step 2&gt;
#Solve &lt;reasoned answer&gt;</p>
          <p>This approach separates planning from execution and reduces missing-step errors. But it also
adds 10-25% token overhead and is sensitive to header labels.</p>
          <p>Least-to-Most
#Problem:{question}
#Easier subproblems: [s1, s2, ...]
#Solve in order and combine.</p>
          <p>This approach helps with compositional generalization, though authoring subproblems is
labor-intensive and there is a risk of leakage if subproblems reveal answers.</p>
          <p>Tree-of-Thought controller
#Goal:{question}
#Think in branches of up to B alternatives per step; depth D; evaluate with heuristic H.
#Return only the final answer.</p>
          <p>It enables breadth exploration and backtracking, but it has the highest inference cost and
requires a custom controller and stopping rules.</p>
          <p>Schema-constrained / XML-tagged
#&lt;q&gt;{question}&lt;/q&gt;&lt;r/&gt;
#&lt;format&gt;final-only&lt;/format&gt;</p>
          <p>Its advantages are the strongest: it is parseable, aligns with downstream extraction, and plays
well with safety filters. The only drawback is that LLMs are sensitive to unknown tags, so
consistency of tagging is paramount.</p>
          <p>In our ablations, the minimalist and Plan-and-Solve variants yielded the best latency/accuracy
balance under tight budgets, while Self-Ask and Tree-of-Thought offered higher ceilings on
multi-hop tasks at increased token cost. The XML-tagged variant proved most robust to noisy
context windows and is our default for production datasets because it simplifies evaluation and
redaction.</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>7.2. Template spec and serialization</title>
          <p>We formalize an Instance as a 4-tuple (question, reasoning, answer, template_label) with optional
reasoning and template_label. Each instance is serialized into two orthogonal views:
• a human-readable view (for this manuscript), and
• a machine template with explicit delimiters for injection into LLM contexts.</p>
          <p>Our default machine template for adaptive prompts:
&lt;task&gt;
&lt;question&gt;{question}&lt;/question&gt;
&lt;reasoning&gt;{reasoning}&lt;/reasoning&gt;&lt;!--optional--&gt;
&lt;answer&gt;{answer}&lt;/answer&gt;
&lt;/task&gt;
&lt;controls&gt;
&lt;template&gt;{template_label}&lt;/template&gt;
&lt;/controls&gt;
&lt;!--e.g.,</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Explain</title>
      <p>step
by
step:
[{question}]--&gt;</p>
      <p>The XML-like markers act as sentinel delimiters to reduce boundary errors during
concatenation and to ease post-hoc parsing. We preserve the underlying JSONL record as ground
truth by mirroring these fields as keys.</p>
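      <p>The serialization into the machine template can be sketched as follows. This is a minimal illustration, assuming dict-shaped JSONL records; the helper name is hypothetical.</p>

```python
import json

def serialize_instance(record: dict) -> str:
    """Render a JSONL record into the machine template.

    Optional fields (reasoning, template_label) are emitted only when
    present, mirroring the 4-tuple (question, reasoning, answer,
    template_label) described above.
    """
    parts = ["<task>", f"<question>{record['question']}</question>"]
    if record.get("reasoning"):
        parts.append(f"<reasoning>{record['reasoning']}</reasoning>")
    parts.append(f"<answer>{record.get('answer', '')}</answer>")
    parts.append("</task>")
    if record.get("template_label"):
        parts.append("<controls>")
        parts.append(f"<template>{record['template_label']}</template>")
        parts.append("</controls>")
    return "\n".join(parts)

record = json.loads('{"question": "What is the capital of Ukraine?", "answer": "Kyiv"}')
print(serialize_instance(record))
```

      <p>Keeping the JSONL record as ground truth and treating the template as a derived view lets the two stay in sync without re-annotation.</p>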
      <p>Prompt-format sensitivity is substantial, so structure and delimiters are not cosmetic. Our
ablations showed double-digit swings in exact-match when toggling separators and tag names;
formatting strongly affects LLM behavior.</p>
      <p>We normalized each user question by lowercasing only function words and preserving named
entities. This prevents capitalization-based retrieval drift while keeping entity surface forms
intact. Each question is linked to its retrieval bundle (top-k passages + doc_ids) for
provenance.</p>
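      <p>The normalization step can be sketched as below; this is a simplified illustration in which the function-word list and entity spans are stand-ins for the tagger outputs used in the pipeline.</p>

```python
# A sketch of question normalization: function words are lowercased while
# entity surface forms are preserved. FUNCTION_WORDS is a tiny illustrative
# subset; in the pipeline both the list and the entity spans come from the
# tagger, not from hard-coded data.
FUNCTION_WORDS = {"The", "Of", "In", "And", "A", "To"}

def normalize_question(question: str, entity_tokens: set[str]) -> str:
    out = []
    for token in question.split():
        if token in entity_tokens:
            out.append(token)            # keep named-entity surface form
        elif token in FUNCTION_WORDS:
            out.append(token.lower())    # lowercase function words only
        else:
            out.append(token)
    return " ".join(out)
```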
      <p>For easy questions (factual, single-hop), we leave the &lt;reasoning&gt; field empty and rely on
evidence citations during generation. For compositional or multi-hop questions, we generate a
concise, model-internal outline during dataset construction (two to four sentences) that expresses
salient relations or decompositions. We use the &lt;template_label&gt; to switch on an instruction
phrase (e.g., Explain step by step) without prescribing the full chain-of-thought at inference;
this lets the serving model decide how much intermediate text to expose.</p>
      <p>We keep &lt;answer&gt; empty at prompting time (for test splits) or fill it for training instances used
in instruction tuning. In both cases the template reserves the slot, and the post-processor knows
where to extract the answer from.</p>
      <p>Templates of this kind naturally support hidden reasoning or outline-level hints while allowing
evaluation of only the final answer against evidence.</p>
      <p>By consolidating on the XML-tagged template, we achieved consistent improvements on multi-hop
questions without exposing long reasoning traces, lowered annotation overhead compared with
Self-Ask and Tree-of-Thought, and obtained robust parsing in our evaluation harness. The template
also integrates seamlessly with our RAG context, as tags cleanly delimit query, evidence, and
expected outputs.</p>
      <sec id="sec-3-1">
        <title>8. Validation and filtering of examples</title>
        <p>After the preceding stages, the remaining bottleneck for dataset quality is example hygiene:
overly general queries must be removed, near-duplicate instructions must be collapsed, and
examples whose answers are not attributable to the retrieved context must be rejected.</p>
        <p>We adopt a three-gate policy executed in the following order: (G1) query specificity, (G2)
instruction de-duplication, (G3) context-to-answer consistency. The ordering is motivated by cost:
pre-retrieval processing (G1) is cheapest, near-duplicate detection (G2) amortizes its cost over
entire shards, and faithfulness checking (G3) is the most expensive. All gates leave a
machine-readable audit trail (JSON) that explains each decision with scalar scores and
thresholds.</p>
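        <p>The gate ordering and the audit trail can be sketched as follows; the gate functions and thresholds here are illustrative stand-ins for the real scorers.</p>

```python
import json

def run_gates(example: dict, gates) -> dict:
    """Run G1-G3 in cost order; stop at the first failing gate.

    Each gate returns a scalar score; the audit record makes every
    decision reproducible across snapshots.
    """
    audit = {"id": example["id"], "decisions": [], "accepted": True}
    for name, gate, threshold in gates:
        score = gate(example)
        passed = score >= threshold
        audit["decisions"].append(
            {"gate": name, "score": score, "threshold": threshold, "passed": passed}
        )
        if not passed:
            audit["accepted"] = False
            break   # later, more expensive gates are never run
    return audit

# Illustrative gate functions (stand-ins for the real scorers).
gates = [
    ("G1_specificity", lambda ex: ex["avg_idf"], 2.0),
    ("G2_uniqueness", lambda ex: 1.0 - ex["dup_sim"], 0.2),
    ("G3_consistency", lambda ex: ex["support"], 0.5),
]
record = run_gates({"id": "q1", "avg_idf": 3.1, "dup_sim": 0.1, "support": 0.8}, gates)
print(json.dumps(record, indent=2))
```

        <p>Because gates short-circuit, a vague query never reaches the expensive faithfulness check, which is what makes the cost-ordered design pay off at shard scale.</p>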
        <p>All gates produce structured audit records that log features, scores, thresholds, and the reason
for accept/reject. This allows re-running the validator across snapshots with different
hyperparameters without re-encoding texts. We deliberately bias toward precision (filtering too
much) in G1 and G2, because failures escalate to a rewrite/curation queue; in G3 we balance
precision/recall so as not to discard legitimate paraphrastic answers.</p>
        <p>Some issues remain. Entity-free queries that are still specific in closed domains (e.g.,
mathematics) may be over-filtered. Lexically diverse paraphrases may evade MinHash. Embedding
proximity misses subtle contradictions. Nevertheless, automation greatly reduces the amount of
manual filtering required.</p>
        <sec id="sec-3-1-1">
          <title>8.1. Gate 1 - Overly general queries</title>
          <p>The objective is to identify and remove prompts whose information need is under-specified and
thus ungroundable during evaluation. We formalize under-specification as low lexical
specificity.</p>
          <p>
            We combine pre-retrieval predictors from [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] with lightweight lexical features:
          </p>
          <p>• avgIDF and sumIDF over content tokens, computed from our passage collection; low values
indicate generic terms.</p>
          <p>• Stopword ratio and entity density (named entities per token), to penalize boilerplate
phrasings while retaining entity-anchored requests.</p>
          <p>• Length controls (minimum 3 content tokens after stopword removal), to rule out one-term
commands.</p>
          <p>We compute these signals per query and apply a learned decision rule trained on a small,
hand-labeled dev set. The classifier is a calibrated logistic regression over standardized features
(avgIDF, stopword ratio, lexical specificity, content length), chosen for transparency.</p>
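          <p>The avgIDF feature together with the length control can be sketched as follows, assuming precomputed document frequencies; the function name and cutoff values are illustrative, not the trained decision rule.</p>

```python
import math

def avg_idf(query_tokens, doc_freq, n_docs, stopwords):
    """Mean IDF over content tokens; low values flag generic queries.

    Queries with fewer than 3 content tokens after stopword removal are
    scored 0.0, implementing the length control described above.
    """
    content = [t for t in query_tokens if t not in stopwords]
    if len(content) < 3:          # length control: too short to be specific
        return 0.0
    idfs = [math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in content]
    return sum(idfs) / len(idfs)
```

          <p>In G1 this score feeds the calibrated logistic regression alongside the stopword ratio and entity-density features.</p>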
          <p>Applying G1 removes a consistent tail of vague prompts while preserving entity-centric and
compositional questions. This aligns with reports that pre-retrieval specificity predictors are
strong proxies for query quality, and it reduces the frequency of ungroundable generations in our
RAG evaluation.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>8.2. Gate 2 - Duplicate instruction detection</title>
          <p>The objective is to collapse near-duplicate Instruction and Question variants to avoid training and
test leakage, evaluation bias, and over-representation of popular intents.</p>
          <p>We combine lexical MinHash features, to capture surface-form duplicates, with semantic cosine
similarity over the same BERT mean-pooled embeddings used elsewhere. Each normalized instruction is
shingled (word 5-grams), hashed into a MinHash signature, and indexed. The BERT embedding
matrix of instructions is searched with HNSW to collect cosine neighbors. We merge close items
into clusters. This hybrid approach catches both trivial paraphrases and deep semantic duplicates.
It directly follows evidence that deduplication improves generalization and lowers memorization in
LLMs, and mirrors web-scale dataset construction practices that rely on MinHash at scale.</p>
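          <p>The MinHash signature over word 5-grams can be sketched as below; the hash family and permutation count are illustrative choices, not the production configuration.</p>

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 32, shingle_size: int = 5) -> list[int]:
    """MinHash over word 5-gram shingles, one seeded hash per signature slot."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles))
    return sig

def est_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching slots estimates shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

          <p>Instructions whose estimated Jaccard similarity exceeds a merge threshold are clustered; the BERT/HNSW pass then catches paraphrases that share no shingles at all.</p>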
        </sec>
        <sec id="sec-3-1-3">
          <title>8.3. Gate 3 - Context-answer consistency</title>
          <p>The objective is to detect and reject examples whose Answer is not supported by the Context
returned by the retriever, thereby enforcing attribution and reducing hallucinations.</p>
          <p>We combine a vector-proximity test with a textual alignment score:</p>
          <p>• Attribution compliance. If the template requires citations, we verify that cited doc_ids are a
subset of the retrieved doc_ids. This is a purely formal procedure.</p>
          <p>• Embedding proximity test. We split the answer into sentences, encode each with BERT
mean-pooling, and compute its maximum cosine similarity to any sentence from the packed
context. An example passes if a sufficient number of answer sentences exceed a similarity
threshold. This test is indexing-agnostic and efficiently batched.</p>
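          <p>The proximity test reduces to a max-cosine rule per answer sentence; the sketch below uses toy vectors and illustrative thresholds in place of BERT embeddings.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def passes_proximity(answer_vecs, context_vecs, threshold=0.7, min_frac=0.8):
    """Each answer sentence must find a close context sentence;
    the example passes when enough of them do. Thresholds are illustrative."""
    if not answer_vecs:
        return False
    hits = sum(
        max(cosine(a, c) for c in context_vecs) >= threshold
        for a in answer_vecs)
    return hits / len(answer_vecs) >= min_frac
```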
          <p>• Factual consistency. For curated subsets we compute a factual consistency score with an
alignment model between the context and the answer. Low alignment flags unsupported claims even
when embeddings yield false positives.</p>
          <p>We first run the proximity test; failing items are dropped immediately. Passing items are passed
to the alignment scorer. This two-stage test rejects obviously unsupported answers cheaply and
reserves expensive alignment modeling for ambiguous cases. In practice this reduces the
hallucination rate and raises attribution scores in our RAG evaluations without inflating latency
during dataset construction.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Discussion and future work</title>
        <p>While BERT-compatible WordPiece tokenization is our default for interoperability with BEIR-style
retrievers, we considered several alternatives. Each has strong advantages, listed below, but their
disadvantages made them untenable at this stage of development.</p>
        <p>
          XLM-R [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] is a SentencePiece Unigram model. Its advantages are: strong cross-lingual
performance, robustness to morphology, and training on a large CommonCrawl corpus covering 100
languages. Its disadvantages are: a 512-token limit; unigram segmentation that differs from
WordPiece, leading to different window statistics; and higher latency for larger models.
        </p>
        <p>
          mDeBERTaV3 [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] is a multilingual DeBERTa. Its advantages are: an improved pretraining
objective (RTD) with parameter sharing, strong multilingual transfer, and available HF checkpoints.
Its disadvantages are: a SentencePiece-like tokenizer, sequence limits and memory use comparable
to BERT-base, and fewer production-grade embedding heads than E5/BGE.
        </p>
        <p>
          MPNet[
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]/RoBERTa-family models use byte-level BPE. Their advantages are: byte-level resilience,
strong sentence-embedding variants (e.g., all-mpnet-base-v2), a larger maximum input length, and
fast tokenization. Their disadvantages are: a mismatch with WordPiece passage statistics, and
byte-level normalization that can change token counts for Cyrillic punctuation.
        </p>
        <p>
          Advanced embedding models such as E5-v2[
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] and BGE-M3 [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Their advantages are:
state-of-the-art retrieval transfer, multilingual coverage, explicit instructions for query/passage
encoding, and tolerance of moderately longer chunks. Their disadvantages are: they require careful
prompt formats (e.g., query:/passage: prefixes) and may expect different maximum lengths (e.g.,
512-1024 tokens) and pooling strategies.
        </p>
        <p>
          As for performance, there is another option for memory-sensitive applications. IVF+PQ [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] is a
technique that combines two indexing methods for efficient search in high-dimensional vector
data: Inverted File (IVF), which partitions data by similarity, and Product Quantization (PQ),
which compresses vectors by dividing them into subvectors and quantizing each. This allows
significantly faster search and reduced memory requirements for large datasets. IVF+PQ excels in
memory efficiency, while HNSW excels in the recall/latency balance on CPU, so the two can be
combined in deployment.
        </p>
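        <p>The PQ half of the scheme can be sketched as follows; the codebooks here are hand-picked for illustration, whereas real codebooks are learned (e.g., by k-means, as in FAISS).</p>

```python
# A toy product-quantization sketch: each vector is split into subvectors and
# each subvector is replaced by the index of its nearest centroid. Storage
# drops from len(vec) floats to len(codebooks) small integer codes.

def pq_encode(vec, codebooks):
    """codebooks[m] is the centroid list for subspace m."""
    m = len(codebooks)
    sub_len = len(vec) // m
    codes = []
    for i in range(m):
        sub = vec[i * sub_len:(i + 1) * sub_len]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in codebooks[i]]
        codes.append(dists.index(min(dists)))   # nearest centroid id
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction by concatenating the chosen centroids."""
    out = []
    for i, code in enumerate(codes):
        out.extend(codebooks[i][code])
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # subspace 0: two centroids
    [[0.0, 1.0], [1.0, 0.0]],   # subspace 1: two centroids
]
codes = pq_encode([0.9, 1.1, 0.1, 0.9], codebooks)
```

        <p>Combined with an IVF partition, queries only scan the compressed codes of a few nearby partitions, which is where the memory and speed gains come from.</p>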
      </sec>
      <sec id="sec-3-3">
        <title>Conclusions</title>
        <p>We successfully designed and implemented a reproducible, Ukrainian-language-centered
Wikipedia-to-Tasks pipeline that can do the following:
• robustly extract and clean content from MediaWiki snapshots;
• perform adaptive chunking that respects section/paragraph boundaries and augment
queries with retrieval context;
• build lexical, dense, and hybrid indices with explicit provenance;
• generate and normalize instruction and QA pairs supported by retrieval evidence; and
• filter answer mismatches (by embedding similarity) with targeted human review.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Acknowledgements</title>
        <p>We are very grateful for the support we received during this research from Department 165 of the
Institute of Information Technologies and Systems and from the Department of Mathematical
Informatics of the Faculty of Computer Sciences and Cybernetics of Taras Shevchenko National
University of Kyiv. The research was funded by the National Research Foundation of Ukraine.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used ChatGPT-5 for text translation and for
grammar and spelling checking. After using this tool, the authors reviewed and edited the content
as needed and take full responsibility for the publication's content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wiseman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections</article-title>
          . In:
          <article-title>Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</article-title>
          .
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <year>2021</year>
          , pp.
          <fpage>193</fpage>
          <lpage>209</lpage>
          . doi:10.18653/v1/2021.findings-acl.17. URL: https://aclanthology.org/2021.findings-acl.17/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gehrmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Faruqui,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ;
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>ToTTo: A Controlled Table-to-Text Generation Dataset</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP</source>
          <year>2020</year>
          ).
          <source>Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1173</fpage>
          <lpage>1186</lpage>
          . doi:10.18653/v1/2020.emnlp-main.89. URL: https://aclanthology.org/2020.emnlp-main.89/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Petroni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Piktus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yazdani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; De Cao, N.;
          <string-name>
            <surname>Thorne</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Jernite,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ;
            <surname>Maillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ;
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>KILT: a Benchmark for Knowledge-Intensive Language Tasks</article-title>
          . In:
          <article-title>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT</article-title>
          <year>2021</year>
          ).
          <source>Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2523</fpage>
          <lpage>2544</lpage>
          . doi:10.18653/v1/2021.naacl-main.200. URL: https://aclanthology.org/2021.naacl-main.200/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Minervini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Küttler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Piktus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stenetorp</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>9</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>1098</fpage>
          <lpage>1115</lpage>
          . doi:10.1162/tacl_a_00415. URL: https://aclanthology.org/2021.tacl-1.65/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Thakur</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rücklé</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models</article-title>
          .
          <article-title>NeurIPS 2021 Datasets and Benchmarks</article-title>
          . URL: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/65b9eea6e1cc6bb9f0cd2a47751a186f-Paper-round2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thakur</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogundepo</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamalloo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alfonso-Hermelo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezagholizadeh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages</article-title>
          . TACL,
          <volume>11</volume>
          ,
          <fpage>1114</fpage>
          <lpage>1131</lpage>
          . doi:10.1162/tacl_a_00595.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Trivedi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khot</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabharwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>MuSiQue: Multihop Questions via Single-hop Question Composition</article-title>
          . TACL,
          <volume>10</volume>
          ,
          <fpage>539</fpage>
          <lpage>554</lpage>
          . URL: https://aclanthology.org/2022.tacl-1.31.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>N. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Berant</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Answering Ambiguous Questions through Generative Evidence Fusion and Disambiguation</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2021</year>
          <article-title>(Long)</article-title>
          . URL: https://aclanthology.org/2021.acl-long.253.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Aly</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al. (
          <year>2021</year>
          ).
          <article-title>FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information</article-title>
          .
          <article-title>NeurIPS 2021 Datasets and Benchmarks</article-title>
          . URL: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/68d30a9594728bc39aa24be94b319d21-Paper-round1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Glockner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al. (
          <year>2024</year>
          ).
          <article-title>AmbiFC: Fact-Checking Ambiguous Claims with Evidence</article-title>
          .
          <source>TACL</source>
          ,
          <volume>12</volume>
          ,
          <fpage>1792</fpage>
          <lpage>1811</lpage>
          . doi:10.1162/tacl_a_00629.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Marchenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Anisimov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nykonenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rossada</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Melnikov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>Authorship Attribution System</article-title>
          . In: Frasincar,
          <string-name>
            <given-names>F.</given-names>
            ;
            <surname>Ittoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ;
            <surname>Métais</surname>
          </string-name>
          ,
          <string-name>
            <surname>E</surname>
          </string-name>
          . (eds)
          <source>Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science</source>
          , vol.
          <volume>10260</volume>
          . Springer, Cham,
          <year>2017</year>
          , pp.
          <fpage>227</fpage>
          <lpage>231</lpage>
          . doi:10.1007/978-3-319-59569-6_27.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><surname>Anisimov</surname>, <given-names>A. V.</given-names></string-name>;
          <string-name><surname>Marchenko</surname>, <given-names>O. O.</given-names></string-name>;
          <string-name><surname>Vozniuk</surname>, <given-names>T. G.</given-names></string-name>
          <article-title>Determining Semantic Valences of Ontology Concepts by Means of Nonnegative Factorization of Tensors of Large Text Corpora</article-title>.
          <source>Cybernetics and Systems Analysis</source>,
          <volume>50</volume>,
          <fpage>327</fpage>–<lpage>337</lpage>,
          <year>2014</year>. doi: 10.1007/s10559-014-9621-9.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><surname>Taranukha</surname>, <given-names>V.</given-names></string-name>;
          <string-name><surname>Horokhova</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Linder</surname>, <given-names>Y.</given-names></string-name>
          <article-title>On Semi-Automatic Creation of Dataset for Multi-Document Automatic Summarization of News Articles and Forum Threads</article-title>. In:
          <source>Selected Papers of IT&amp;I-2021, Workshop Proceedings (CEUR Workshop Proceedings</source>, vol.
          <volume>3179</volume>). CEUR-WS.org,
          <year>2022</year>, pp.
          <fpage>15</fpage>–<lpage>24</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><surname>Longpre</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Lu</surname>, <given-names>Y.</given-names></string-name>;
          <string-name><surname>Daiber</surname>, <given-names>J.</given-names></string-name>
          <article-title>MKQA: A Linguistically Diverse Benchmark for Multilingual Open-Domain Question Answering</article-title>.
          <source>Transactions of the Association for Computational Linguistics (TACL)</source>,
          <volume>9</volume>,
          <fpage>1389</fpage>–<lpage>1406</lpage>,
          <year>2021</year>. doi: 10.1162/tacl_a_00433.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>-Tuning and Retrieval Augmented Generation for the UNLP 2024 Task on Ukrainian Question Answering</article-title>. In:
          <source>Proceedings of UNLP @ LREC-COLING 2024</source>,
          <year>2024</year>. URL: https://aclanthology.org/2024.unlp-1.10.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><surname>Hamotskyi</surname>, <given-names>S.</given-names></string-name>, et al.
          <article-title>Eval-UA-tion 1.0: Benchmark for Evaluating Ukrainian Language Models</article-title>. In:
          <source>Proceedings of UNLP 2024</source>,
          <year>2024</year>. URL: https://aclanthology.org/2024.unlp-1.13.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><surname>Hasan</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Bhattacharjee</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Islam</surname>, <given-names>M. S.</given-names></string-name>;
          <string-name><surname>Mubasshir</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Li</surname>, <given-names>Y.-F.</given-names></string-name>;
          <string-name><surname>Kang</surname>, <given-names>Y.-B.</given-names></string-name>;
          <string-name><surname>Rahman</surname>, <given-names>M. S.</given-names></string-name>;
          <string-name><surname>Shahriyar</surname>, <given-names>R.</given-names></string-name>
          <article-title>XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages</article-title>. In:
          <source>Findings of ACL-IJCNLP 2021</source>,
          <year>2021</year>. URL: https://aclanthology.org/2021.findings-acl.413/.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><surname>Chaplynskyi</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Romanyshyn</surname>, <given-names>M.</given-names></string-name>
          <article-title>Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian</article-title>. In:
          <source>Proceedings of UNLP 2024</source>,
          <year>2024</year>. URL: https://aclanthology.org/2024.unlp-1.4.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><surname>Barbosa</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Ustyianovych</surname>, <given-names>T.</given-names></string-name>
          <article-title>TRWU: Instant Messaging Platforms News Multi-Task Classification for Stance, Sentiment, and Discrimination Detection</article-title>. In:
          <source>Proceedings of UNLP 2024</source>,
          <year>2024</year>. URL: https://aclanthology.org/2024.unlp-1.5.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><surname>Attardi</surname>, <given-names>G.</given-names></string-name>
          <article-title>WikiExtractor (project site)</article-title>.
          <source>GitHub Pages</source>,
          <year>2012</year>–2025. URL: https://attardi.github.io/wikiextractor/.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><surname>Kurtovic</surname>, <given-names>B.</given-names></string-name>
          <article-title>MWParserFromHell v0.7: Documentation</article-title>.
          <source>Read the Docs</source>. URL: https://mwparserfromhell.readthedocs.io/.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <collab>JSON Lines</collab>.
          <article-title>JSON Lines: A Text Format for Storing Structured Data</article-title>,
          <year>2014</year>–2025. URL: https://jsonlines.org/.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name><surname>Devlin</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Chang</surname>, <given-names>M.-W.</given-names></string-name>;
          <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Toutanova</surname>, <given-names>K.</given-names></string-name>
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>. In:
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 1 (Long and Short Papers)</source>.
          Association for Computational Linguistics,
          <year>2019</year>, pp.
          <fpage>4171</fpage>–<lpage>4186</lpage>. doi: 10.18653/v1/N19-1423. URL: https://aclanthology.org/N19-1423/.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Khandelwal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chaudhary</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wenzek</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guzmán</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Unsupervised Cross-lingual Representation Learning at Scale</article-title>. In:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)</source>.
          Association for Computational Linguistics,
          <year>2020</year>, pp.
          <fpage>8440</fpage>–<lpage>8451</lpage>. doi: 10.18653/v1/2020.acl-main.747. URL: https://aclanthology.org/2020.acl-main.747/. Also arXiv:1911.02116.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name><surname>He</surname>, <given-names>P.</given-names></string-name>;
          <string-name><surname>Gao</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Chen</surname>, <given-names>W.</given-names></string-name>
          <article-title>DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing</article-title>. In:
          <source>The Eleventh International Conference on Learning Representations (ICLR 2023)</source>, Kigali, Rwanda, May 1–5,
          <year>2023</year>. OpenReview.net,
          <year>2023</year>. URL: https://openreview.net/forum?id=sE7-XhLxHA.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name><surname>Wang</surname>, <given-names>L.</given-names></string-name>;
          <string-name><surname>Yang</surname>, <given-names>N.</given-names></string-name>;
          <string-name><surname>Huang</surname>, <given-names>X.</given-names></string-name>;
          <string-name><surname>Yang</surname>, <given-names>L.</given-names></string-name>;
          <string-name><surname>Majumder</surname>, <given-names>R.</given-names></string-name>;
          <string-name><surname>Wei</surname>, <given-names>F.</given-names></string-name>
          <article-title>Multilingual E5 Text Embeddings: A Technical Report</article-title>.
          <source>arXiv:2402.05672</source>,
          <year>2024</year>. doi: 10.48550/arXiv.2402.05672.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name><surname>Chen</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Xiao</surname>, <given-names>S.</given-names></string-name>;
          <string-name><surname>Zhang</surname>, <given-names>P.</given-names></string-name>;
          <string-name><surname>Luo</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Lian</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Liu</surname>, <given-names>Z.</given-names></string-name>
          <article-title>BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation</article-title>.
          <source>arXiv:2402.03216</source>,
          <year>2024</year>. doi: 10.48550/arXiv.2402.03216.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name><surname>Malkov</surname>, <given-names>Y. A.</given-names></string-name>;
          <string-name><surname>Yashunin</surname>, <given-names>D. A.</given-names></string-name>
          <article-title>Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs</article-title>.
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>,
          <volume>42</volume>(<issue>4</issue>),
          <fpage>824</fpage>–<lpage>836</lpage>,
          <year>2020</year>. doi: 10.1109/TPAMI.2018.2889473.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name><surname>Karpukhin</surname>, <given-names>V.</given-names></string-name>; et al.
          <article-title>Dense Passage Retrieval for Open-Domain Question Answering</article-title>. In:
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)</source>.
          Association for Computational Linguistics,
          <year>2020</year>, pp.
          <fpage>6769</fpage>–<lpage>6781</lpage>. doi: 10.18653/v1/2020.emnlp-main.550. URL: https://aclanthology.org/2020.emnlp-main.550/.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name><surname>Jiang</surname>, <given-names>A. Q.</given-names></string-name>;
          <string-name><surname>Sablayrolles</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Mensch</surname>, <given-names>A.</given-names></string-name>;
          <string-name><surname>Bamford</surname>, <given-names>C.</given-names></string-name>;
          <string-name><surname>Chaplot</surname>, <given-names>D. S.</given-names></string-name>;
          <string-name><surname>de las Casas</surname>, <given-names>D.</given-names></string-name>;
          <string-name><surname>Bressand</surname>, <given-names>F.</given-names></string-name>;
          <string-name><surname>Lengyel</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Lample</surname>, <given-names>G.</given-names></string-name>;
          <string-name><surname>Saulnier</surname>, <given-names>L.</given-names></string-name>;
          <string-name><surname>Lavaud</surname>, <given-names>L. R.</given-names></string-name>;
          <string-name><surname>Lachaux</surname>, <given-names>M.-A.</given-names></string-name>;
          <string-name><surname>Stock</surname>, <given-names>P.</given-names></string-name>;
          <string-name><surname>Le Scao</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Lavril</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Wang</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Lacroix</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>El Sayed</surname>, <given-names>W.</given-names></string-name>
          <article-title>Mistral 7B</article-title>.
          <source>arXiv:2310.06825</source>,
          <year>2023</year>. doi: 10.48550/arXiv.2310.06825. URL: https://arxiv.org/abs/2310.06825.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name><surname>Zendel</surname>, <given-names>O.</given-names></string-name>; et al.
          <article-title>QPPTK@TIREx: Simplified Query Performance Prediction Toolkit</article-title>. In:
          <source>Proceedings of TIREx@CLEF 2024</source>. CEUR-WS,
          <year>2024</year>. URL: https://ceur-ws.org/Vol-3689/WOWS_2024_paper_6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name><surname>Song</surname>, <given-names>K.</given-names></string-name>;
          <string-name><surname>Tan</surname>, <given-names>X.</given-names></string-name>;
          <string-name><surname>Qin</surname>, <given-names>T.</given-names></string-name>;
          <string-name><surname>Lu</surname>, <given-names>J.</given-names></string-name>;
          <string-name><surname>Liu</surname>, <given-names>T.-Y.</given-names></string-name>
          <article-title>MPNet: Masked and Permuted Pre-training for Language Understanding</article-title>. In:
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume> (NeurIPS 2020),
          <year>2020</year>. URL: https://proceedings.neurips.cc/paper/2020/file/c3a690be93aa602ee2dc0ccab5b7b67e-Paper.pdf. Also arXiv:2004.09297.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Majumder</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>Improving Text Embeddings with Large Language Models</article-title>. In:
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>.
          Association for Computational Linguistics,
          <year>2024</year>, pp.
          <fpage>11897</fpage>–<lpage>11916</lpage>. doi: 10.18653/v1/2024.acl-long.642. URL: https://aclanthology.org/2024.acl-long.642/.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <collab>Facebook Research</collab>.
          <article-title>The index_factory</article-title>.
          <source>FAISS Wiki (GitHub)</source>, updated
          <year>2025</year>. URL: https://github.com/facebookresearch/faiss/wiki/The-index-factory (accessed 20 Sep 2025).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>