<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical ensemble framework for detecting paraphrased near duplicates in scientific abstracts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksandr Kuchanskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valeriya Kazagasheva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”</institution>
          ,
          <addr-line>Beresteiskyi Ave., 37, Kyiv, 03056</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Artificial Intelligence and Data Science, Astana IT University</institution>
          ,
          <addr-line>Mangilik El, Block C1, Astana, 010000</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The rapid expansion of scientific literature has amplified the need for accurate detection of near duplicates: documents that express the same meaning through diferent wordings. These semantic duplicates, often caused by paraphrasing or metadata inconsistencies, compromise the integrity of scholarly databases and bias downstream analyses. In this paper, we propose a hierarchical ensemble method that combines traditional lexical similarity metrics, contextual embeddings from transformer-based models, syntactic structural features, and a deep neural meta-learner to detect near duplicates in scientific abstracts. We compiled a domain-specific dataset of 10,000 Kazakhstan-related publications from Semantic Scholar and generated 14,460 abstract pairs using real and synthetic duplication techniques. The proposed method achieved 94.24% accuracy and a 94.80% F1-score, significantly outperforming standard lexical and transformer-only baselines. Our results demonstrate that integrating heterogeneous features yields a robust, scalable, and interpretable solution for semantic duplicate detection. The method is especially suited for regional and low-resource academic collections, supporting higher-quality data curation in bibliometric systems, citation analyses, and systematic reviews.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;near duplicates detection</kwd>
        <kwd>natural language processing</kwd>
        <kwd>transformer models</kwd>
        <kwd>scientific publication</kwd>
        <kwd>text data analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The limitless growth of scientific publication databases has set the stage for the bibliographic systems to
face a great challenge in managing duplicate and near-duplicate records. In this scenario, near duplicates
which are defined as pairs of text segments sharing semantic content but difering considerably in
surface form, represent the most dificult detection problem that traditional methods usually cannot
solve. The duplications can be caused by either a conscious or an unintentional act of paraphrasing, or
the overlap between the old and the new findings to a certain extent, or by simply using diferent terms
for the same concept, or through inconsistent metadata practices in diferent repositories. The downside
of not detecting duplicates is very large: in systematic reviews, redundant entries can at times increase
the volume of available evidence and thus bias the synthesis results; in citation analysis, duplication of
records can distort impact metrics; and in network-based studies, they can lead to incorrect estimations
of authority and influence. This issue is particularly critical for the national research assessment systems,
university repositories, and government-supported research monitoring platforms because of the direct
impact of data accuracy on analytical outcomes and strategic decision-making.</p>
      <p>Traditional methods for detecting duplicates mainly depend on similarity measures calculated at
the token level like Jaccard coeficient, cosine similarity, edit distance, and n-gram overlap. Even
though these methods are computationally eficient, they still predominantly deal with surface-level
correspondence and, thus, are not very robust when semantic content is altered through rephrasing or
syntactic restructuring. The last year or two have seen the rapid rise of transformer-based language
models like BERT and its receivers that have, to a great extent, improved the modeling of contextual
semantics and conceptual similarity going beyond just exact word matching. However, the empirical
results indicated by controlled benchmarks often surpass the performance seen in real-life situations
which then implies that semantic embeddings alone are not enough to guarantee the trustworthy
deployment in production-scale systems. Moreover, it is the case that most of the research done so far
has been on English-language scientific corpora or general-purpose plagiarism detection while regional
academic publications or low-resource scientific languages have been comparatively ignored.</p>
      <p>The aim of this study is to develop a reliable method for detecting near duplicates in the abstracts of
scientific publications. To achieve this aim, the following objectives were formulated:
1. To create a specialized dataset of scientific abstracts reflecting real cases of similar texts;
2. To develop a multi-level ensemble model that combines lexical similarity, sequence alignment,
semantic embeddings, and syntactic analysis for the detection of paraphrased duplicates;
3. To evaluate the proposed model in comparison with baseline methods in order to confirm its
efectiveness.</p>
      <p>In this study, a hierarchical ensemble framework that is methodologically grounded and progressively
incorporates the traditional similarity measures, the multilingual semantic representations, the syntactic
structure analysis, and a learned meta-level feature weighting strategy is the major contribution of the
research. The proposed approach is tested on a new dataset that contains 10,000 scientific publications
related to Kazakhstan and has been obtained through a thorough API-driven process. A strict pair
generation strategy merges the selection of real similarity-based candidates with the implementation
of controlled synthetic paraphrasing techniques to recreate near-duplicate scenarios. In addressing
the specific linguistic and structural traits of regional scientific repositories, this research presents a
solution for semantic duplicate detection that is practical, scalable, and interpretable, thus contributing
to the development of more reliable and context-aware bibliographic management systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work and current methods</title>
      <sec id="sec-2-1">
        <title>2.1. Traditional approaches to duplicate detection</title>
        <p>
          Classic techniques used in the identification of duplicates in scientific literature basically depend on the
similarity of strings, fingerprinting algorithms, and metrics at the token level. Hybrid methods that
integrate SimHash-based fingerprinting with Smith-Waterman sequence alignment have demonstrated
high accuracy for precise and almost precise duplicates in clinical documentation, thereby confirming
the efectiveness of the combination of fast candidate filtering with accurate alignment verification
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Sectional Min-Max hashing improvements of traditional MinHash that introduce local hashing to
segments of documents cut down the computational cost significantly while at the same time keeping
detection accuracy over the entire range of textual corpora [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          Classic techniques used in the identification of duplicates in scientific literature basically depend on
the similarity of strings, fingerprinting algorithms, and metrics at the token level. Hybrid methods that
integrate SimHash-based fingerprinting with Smith-Waterman sequence alignment have demonstrated
high accuracy for precise and almost precise duplicates in clinical documentation, thereby confirming
the efectiveness of the combination of fast candidate filtering with accurate alignment verification
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Sectional Min-Max hashing improvements of traditional MinHash that introduce local hashing to
segments of documents cut down the computational cost significantly while at the same time keeping
detection accuracy over the entire range of textual corpora [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Machine learning and deep learning architectures</title>
        <p>
          Supervised learning and neural networks that automatically acquire task-related feature representations
have greatly increased the accuracy of duplication detection. The social media MultiSiam neural network
for duplicate classification proves that transfer to other domains is still dificult because short-form
social media texts and academic abstracts difer so much [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The application of wavelet-based analysis
with clustering for cross-modal duplicate detection in texts and images has good potential, but the
computational intensity is a barrier to practical scalability [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>A comprehensive review over plagiarism detection techniques from 2014 to 2024 showed a clear trend
towards semantic and transformer-based methods taking over, emphasizing that simple string-matching
methods could not identify complicated text reuse through paraphrasing [5]. The use of a hybrid deep
learning model that integrates Long Short-Term Memory (LSTM) networks with manually crafted
linguistic features like n-gram overlap and sentence length statistics has proved that the performance
of learned and engineered features together is better than that of purely neural models in benchmark
paraphrase datasets [6].</p>
        <p>Support Vector Machines and Deep Neural Networks are combined in a machine learning method
for the identification of authorship, which leads to the extraction of both lexical and stylistic features.
As a result, the application of classification-based methods to morphologically rich and low-resource
languages is broader [7]. CNN-RNN fused structure for short text paraphrase detection that merges
local convolutional patterns with sequential RNN modeling is very efective and through end-to-end
learning competitive performance is achieved [8].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Semantic and transformer-based approaches</title>
        <p>
          Modern methods orienting towards the semantics use embeddings of words and the representations of
the contexts for the similarity measures that go beyond token overlap. On the basis of the syntactic
parsing and the machine learning, it is possible to identify texts that are semantically equivalent
but lexically diferent through lemmatisation and deep syntactic analysis, which, in turn, provide a
foundation for the semantic understanding [9]. The use of embeddings for the words along with the
application of triplet loss functions is said to further improve the semantic similarity detection in the
duplication of detection frameworks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Sentence-transformer models for the detection of the text
reuse at the phrasal level show that the multilingual transformers are capable of efectively capturing
deep contextual meaning even in the low-resource languages [10].
        </p>
        <p>Neural network architectures focusing on semantics rather than surface-form matching and trained
on semi-automatically produced corpora for paraphrase recognition have shown to be very efective
on morphologically rich low-resource languages [11]. A lightweight unsupervised approach utilizing
monolingual word embeddings aligned by simple linear projection for cross-lingual semantic similarity
requires slight language-specific resources and gives competitive performance on plagiarism detection
tasks [12].</p>
        <p>A multilingual deep learning framework for cross-lingual plagiarism detection between Arabic and
English texts integrating DNNs with semantic features including conceptual similarity and semantic
role information demonstrates the necessity of semantic representations in multilingual academic text
reuse detection [13].</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Hybrid approaches</title>
        <p>Recent work increasingly combines multiple detection modalities to improve both accuracy and
robustness. A hybrid methodology merging TF-IDF-based filtering with lightweight DistilBERT embeddings
for Ukrainian and Bulgarian academic texts achieves approximately 0.88 F1-score on low-resource
morphologically rich corpora, substantially outperforming TF-IDF-only baselines [14]. A layered hybrid
model combining lexical similarity (TF-IDF cosine) with semantic embeddings (BERT) to distinguish
exact copies from paraphrased content shows experimental results of approximately 80% recall and 74%
F1-score [15].</p>
        <p>A two-level hybrid cross-language matching scheme incorporating bilingual dictionary-based
lexical alignment with semantic similarity layers using multilingual embeddings specifically addresses
paraphrased and translated duplicates in multilingual academic contexts [16]. Text similarity measures
integrated with density-based clustering within a metaheuristic optimization framework for music
lyric plagiarism are adaptable to scientific text reuse detection with near or structurally divergent
duplications [17].</p>
        <p>Deep semantic features combined with quantum-inspired genetic algorithms, using
transformerbased semantic representations optimized through Quantum Genetic Algorithm operators, demonstrate
that bio-inspired evolutionary optimization combined with neural embeddings improves detection
accuracy and computational eficiency on benchmark plagiarism datasets [ 18] and [19]. In [20], the
proposed method combines statistical and semantic techniques, including N-gram analysis, TF-IDF, LSH,
LSA, and LDA, and is benchmarked against the bert-base-multilingual-cased model. In summary, the
trend over the last decade has shifted from simple lexical matching towards semantic-rich and hybrid
methods. This evolution sets the stage for our ensemble approach, which integrates multiple levels of
analysis into a single framework. However, prior studies have rarely focused on detecting partial or
paraphrased duplicates in academic publications, especially in regional or low-resource languages. This
gap in scientific literature motivates our work.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data collection and dataset construction</title>
      <sec id="sec-3-1">
        <title>3.1. Data source and collection</title>
        <p>The dataset was constructed from the Semantic Scholar academic search engine, a free AI-powered
platform developed by the Ai2 is the creation of Paul Allen, Microsoft co-founder providing programmatic
API access to extensive scientific literature metadata. The Semantic Scholar RESTful API endpoint [ 21]
was used to retrieve publications via structured keyword queries with filtering by language, publication
year, and result pagination ofsets. A total of 130 systematic keyword queries were designed across
thematic categories including core research terminology, computer science and information technology
(25 queries), engineering disciplines (13 queries), technology and innovation (8 queries), education (8
queries), healthcare and medicine (12 queries), economics and business (13 queries), energy sector (11
queries), environmental studies (10 queries), agriculture (6 queries), geographic locations (11 queries),
national initiatives (7 queries), social sciences (7 queries), and other domains (8 queries).</p>
        <p>Data quality filtering was applied throughout collection: abstract length minimum of 180 characters
ensuring suficient textual content; keyword presence verification (case-insensitive “Kazakhstan”
requirement) ensuring topical relevance; language validation using character-ratio heuristics requiring at
least 75% Latin ASCII characters with maximum 10% Cyrillic characters; and deduplication via unique
paper identifier tracking to prevent redundant entries.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset statistics</title>
        <p>The final collected dataset 10,000 English-language scientific articles spanning 2012-2025, with
publication counts increasing over time and peak activity in recent years.The dataset structure is presented
in Figure 1. Mean abstract length across the corpus is approximately 1,500 characters (225 words),
with distribution analysis confirming suficient textual magnitude for semantic similarity analysis [ 22].
Temporal distribution analysis reveals publication patterns across decade-spanning intervals, with most
recent years showing substantially higher publication frequencies in Figure 2. Domain-level analysis
shows health and medicine, engineering, education, and computer science as the most productive
research categories, reflecting diverse academic coverage as represented in Figure 3.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Pair generation methodology</title>
        <p>Text pairs for model training and evaluation were systematically constructed using a hybrid approach
combining real similarity-based pairing and controlled synthetic paraphrasing. Term Frequency-Inverse
Document Frequency (TF-IDF) vectorization (1),(2),(3) was applied with English stop-word removal
and a 3,000-feature vocabulary, followed by cosine similarity (4) computation between all abstract
pairs. Positive pairs (duplicates) were selected from similarity ranges between 0.5 and 0.9, capturing
semantically similar yet not identical texts. Negative pairs (non-duplicates) were selected from very
low similarity ranges below 0.1, ensuring clear class separation [22].</p>
        <p>(, ) = ∑︀
 (, )
∈  (, )</p>
        <p>,
  () = log</p>
        <p>1 + |{ :  ∈ }|
•  — term (word),
•  — document,
•  (, ) — number of occurrences of term  in document ,
• ∑︀∈  (, ) — total number of terms in document .
•  — total number of documents in the corpus,
• |{ :  ∈ }| — number of documents containing term .
•   (, ) — term frequency of term  in document ,
•  () — inverse document frequency of term .</p>
        <p>CosineSimilarity(, ) =
⃗ · ⃗
⃗ ⃗ ,
‖‖ ‖‖
(4)
• ⃗, ⃗ — TF-IDF vector representations of two documents,
• ⃗ · ⃗ — dot product of the vectors,
• ‖⃗‖, ‖‖ — Euclidean norms (vector lengths).</p>
        <p>⃗</p>
        <p>Real near duplicates were identified through similarity-based selection within the 0.5-0.9 range,
yielding 3,453 positive pairs reflecting natural overlap patterns in the collection. Synthetic paraphrases
were generated through controlled word replacement at a rate of 25% using a curated synonym
dictionary covering 20 common academic terms (e.g., “study”→“research”/“investigation”/“examination”;
“method”→“approach”/“technique”/“procedure”; “results”→“findings”/“outcomes”/“conclusions”),
producing 4,694 additional positive pairs. Negative non-duplicate pairs (6,313 pairs) were randomly selected
from low-similarity combinations, yielding a balanced dataset with 56.3% positive class and 43.7%
negative class distribution suitable for stable model training, pair dataset distrubition represented in
Table 1 and Figure 5.</p>
        <p>The final paired dataset comprises 14,460 text pairs with fields including full abstract texts,
corresponding titles and publication years, computed cosine similarity scores, pair type labels (real near
duplicate, synthetic paraphrase, or non-duplicate), and binary classification labels (1 for duplicate, 0 for
non-duplicate) as showed in Figure 4.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Data preprocessing</title>
        <p>A comprehensive text normalization pipeline was applied to all abstracts: whitespace normalization
replacing multiple spaces with single space; removal of special characters retaining only letters, numbers,
and basic punctuation; lowercasing for case-insensitive processing; and trimming of leading/trailing
whitespace. All pairs were subsequently shufled to prevent ordering bias during model training.
Stratified splitting into training (70%), validation (15%), and test (15%) sets preserved the proportion of
positive and negative pairs across all three subsets, ensuring representative evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Architecture</title>
        <p>The proposed detection framework follows a hierarchical ensemble architecture progressively
integrating heterogeneous similarity signals. Each stage builds upon preceding results, producing intermediate
predictions refined through increasingly sophisticated analytical approaches. The design
philosophy balances computational eficiency-through coarse lexical filtering at early stages-with semantic
precision-through transformer embeddings and learned meta-level weighting at later stages. The
pipeline is implemented using Python with scikit-learn for classical metrics and neural networks, the
sentence-transformers library for BERT embeddings, and custom feature engineering modules scheme
of model showed in Figure 6.</p>
        <p>This filtering ensures that only suficiently similar pairs are passed through, which is critically
important given the large size of the dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Stage 1: Jaccard similarity</title>
        <p>The first stage establishes a lexical baseline through token-based Jaccard similarity (5), computing the
ratio of common unique words to total unique words across both texts. This metric is robust to minor
phrasing variations while remaining computationally eficient, serving as a coarse filter identifying
candidates with potential similarity. A global threshold based on the median Jaccard score across the
entire dataset is applied for initial binary classification, establishing a simple yet interpretable baseline.
 (, ) = | ∩ | ,
| ∪ |
(5)
• ,  - sets of unique words in texts,
• | ∩ | - number of common words,
• | ∪ | - number of unique words in union.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Stage 2: Sequence alignment via SequenceMatcher</title>
        <p>The second stage captures sequential alignment through Python’s difflib.SequenceMatcher ratio
(6), measuring the proportion of matching word sequences in their original order. This score improves
detection of reordered or partially overlapping content not captured by bag-of-words approaches. Pairs
are classified as duplicates if at least one of the Jaccard or sequence ratios exceeds its respective median
threshold, and an aggregated score averaging normalized Jaccard and sequence similarity scores is
computed for downstream processing.</p>
        <p>SeqRatio(, ) =</p>
        <p>2 × 
 + 
,
• 1, 2 - BERT embedding vectors of texts,
• 1 ·  2 - dot product of embeddings,
• ‖1‖, ‖2‖ - norms of embedding vectors.</p>
        <p>sim(1, 2) =</p>
        <p>1 ·  2
‖1‖ × ‖ 2‖</p>
        <p>,
BigramSim(, ) = |bigrams ∩ bigrams| ,</p>
        <p>|bigrams ∪ bigrams|
•  - number of matching words in same positions,
•  - length of sequence  (in words),
•  - length of sequence  (in words).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Stage 3: Contextual analysis with BERT embeddings</title>
        <p>Semantic understanding is introduced through multilingual sentence-transformer embeddings derived
from the paraphrase-multilingual-MiniLM-L12-v2 model, a lightweight transformer producing
384-dimensional contextual representations. Embeddings are precomputed and cached on disk to reduce
runtime costs. Cosine similarity between paired embeddings (7) captures semantic equivalence beyond
token overlap, particularly for paraphrased and synonymous expressions.</p>
        <p>Two engineered contextual features extend the semantic signal:
1. Information overlap metric, computed as the average of normalized Jaccard similarity and lexical
density measures (ratio of unique tokens to total tokens), emphasizing content-rich segments.
2. Bigram-level Jaccard similarity (8) capturing overlap of adjacent word pairs at the phrase level,
detecting consistent local collocations.These features combine classical linguistic signals with
modern representations, enabling detection of content-preserving transformations.
Semantic embeddings make it possible to detect paraphrased content in cases where lexical overlap is
low, thus addressing the key challenge of partial duplicate detection.
• bigrams - set of bigrams (adjacent word pairs) in text ,
• bigrams - set of bigrams in text .
(6)
(7)
(8)</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Stage 4: Syntactic structure analysis</title>
        <p>The fourth stage incorporates structural consistency through sentence-level length statistics. For each
abstract, text is segmented into sentences and mean and standard deviation of sentence lengths in
words are computed, capturing typical sentence structure and compositional variability. Syntactic
similarity (11) is calculated as a weighted combination of mean-length similarity (normalized absolute
diference) (9) and variance-based style similarity (10), with fixed coeficients emphasizing average
sentence characteristics. The intuition is that near duplicates often preserve structural
organizationsentence segmentation patterns, paragraph length, and stylistic regularities-even after paraphrasing,
making structural alignment a useful supplementary signal.</p>
        <p>mean_sim = 1 − min</p>
        <p>| − |
max(, , 1)</p>
        <p>︂)
, 1 ,
•  - mean sentence length in text  (in words),
•  - mean sentence length in text .</p>
        <p>︂(
︂(</p>
        <p>(9)
(10)
style_sim = 1 − min
|  −</p>
        <p>|
max( ,  , 1)</p>
        <p>︂)
, 1 ,
•   - standard deviation of sentence lengths in text ,
•   - standard deviation of sentence lengths in text .</p>
        <p>Syntactic = 0.6 × mean_sim + 0.4 × style_sim,
(11)
where 0.6, 0.4 - fixed weighting coeficients.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Stage 5: Deep learning Meta-Ensemble with attention-like mechanism</title>
        <p>The final stage employs a multi-layer perceptron as a meta-learner combining six input features: baseline
Jaccard similarity, sequence matching score, BERT-based cosine similarity, information overlap metric,
bigram contextual similarity, and syntactic structure similarity. The network architecture comprises
three hidden layers with 64, 32, and 16 neurons respectively, each with ReLU activation functions,
followed by a two-unit softmax output for binary classification. Early stopping and 10% validation
fraction prevent overfitting during training on the training subset.</p>
        <p>The meta-learner implicitly learns context-dependent feature weighting, efectively functioning as
an attention-like mechanism that emphasizes the most informative similarity signals for each pair. By
learning non-linear combinations of heterogeneous features, the ensemble captures complex decision
boundaries that are dificult to express through simple rules or linear regression, enabling adaptive
emphasis on diferent feature channels depending on specific pair characteristics.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental setup</title>
      <p>Model training employed the Adam optimizer with cross-entropy loss on the training subset (70% of
pairs). Validation (15%) was used for hyperparameter tuning and early stopping, while test evaluation
(15%) was performed without further optimization, ensuring unbiased assessment. Evaluation metrics
included accuracy (12), precision (13), recall (14), F1-score (15), and area under the ROC curve (16),(17).
Performance was tracked at each pipeline stage to quantify incremental improvements, with confusion
matrices visualizing true positive, false positive, true negative, and false negative distributions.</p>
      <p>Accuracy measures the proportion of correct predictions (both true positives and true negatives)
among all predictions made by the model. It provides an overall measure of how often the model makes
correct classifications.</p>
      <p>Accuracy =</p>
      <p>+  
  +   +   +  
,
•   - number of duplicate pairs correctly identified as duplicates,
•   - number of non-duplicate pairs correctly identified as non-duplicates,
•   - number of non-duplicate pairs incorrectly identified as duplicates,
•   - number of duplicate pairs incorrectly identified as non-duplicates,
•   +   - total correct predictions,
•   +   +   +   - total number of pairs (all predictions).</p>
      <p>Precision measures the accuracy of positive predictions and indicates how many false alarms the
model generates.</p>
      <p>•   - duplicate pairs correctly identified as duplicates (correct positive predictions),
•   +   - all pairs predicted as duplicates (both correct and incorrect),
•   - non-duplicate pairs incorrectly flagged as duplicates (false alarms).</p>
      <p>Precision =</p>
      <p>+</p>
      <p>,
Recall =</p>
      <p>+  
,
•  - classification threshold (confidence level),
• TPR() - true positive rate at threshold ,
• FPR() - false positive rate at threshold .</p>
      <p>In practice, AUC-ROC is approximated using the trapezoidal rule:</p>
      <p>AUC-ROC ≈
=0
−1
∑︁ TPR + TPR+1 × (FPR +1 − FPR ).</p>
      <p>2
•   - duplicate pairs correctly identified as duplicates (detected duplicates),
•   +   - all actual duplicate pairs (both detected and missed),
•   - duplicate pairs incorrectly identified as non-duplicates (missed duplicates).</p>
      <p>The F1-score is the harmonic mean of Precision and Recall. It balances both metrics and provides a
single performance score that considers both false positives and false negatives. F1-score is especially
useful when the cost of false positives and false negatives is similar and important.</p>
      <p>The F1-score can be expressed in two equivalent forms:</p>
      <p>Precision × Recall
1 = 2 × Precision + Recall .</p>
      <p>The AUC-ROC is calculated by varying the classification threshold from 0 to 1 and computing TPR
and FPR at each threshold. The area under this curve is:
(15)
∫︁ 1</p>
      <p>0
AUC-ROC =</p>
      <p>TPR() [FPR()],
(12)
(13)
(14)
(16)
(17)</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results and performance analysis</title>
      <sec id="sec-6-1">
        <title>6.1. Stage-by-stage performance progression</title>
        <p>Stage 1 (Jaccard baseline) achieved approximately 72% accuracy with moderate F1-score, establishing
a strong lexical foundation but struggling with semantically equivalent yet lexically divergent pairs.
Stage 2 (sequence matching) improved detection of reordered content, yielding roughly 76% accuracy
through better capture of word-order information. Stage 3 (BERT embeddings and contextual features)
showed substantial improvement to approximately 88% accuracy, demonstrating the efectiveness
of semantic representations in capturing paraphrased content. Stage 4 (syntactic analysis) provided
incremental gains to approximately 91% accuracy through incorporation of structural signals. Stage 5
(meta-ensemble) achieved the best results with approximately 94.24% test accuracy and 94.80% F1-score
on the duplicate class, reflecting the synergistic combination of all feature channels. All stages visual
perfomance showed in Figure 7.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Final performance metrics</title>
        <p>Confusion matrix analysis on the test set reveals 1,342 true positives, 78 false positives, 1,293 true
negatives, and 67 false negatives (approximate values from 14,460 pairs split 70-15-15). This yields
precision of approximately 94.5%, recall of approximately 95.2%, and F1-score of approximately 94.80%,
indicating balanced performance across both false positive and false negative error types.</p>
        <p>The meta-ensemble achieves substantially higher performance than any individual feature channel,
demonstrating that the hierarchical design efectively integrates complementary similarity signals.
Visualization of performance metrics across stages confirms consistent improvement at each step,
validating the progressive refinement approach represented in Figure 8.</p>
        <p>Such high accuracy on a complex real-world dataset demonstrates the practical applicability of the
proposed method and its significant superiority over previous approaches in terms of efectiveness. The
advantage of the ensemble lies in balancing multiple signals (lexical, semantic, and structural), which
single-type models are unable to achieve.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Comparison with baseline methods</title>
        <p>Compared to published TF-IDF-only baselines reported in literature, the proposed ensemble achieves
approximately 12-15 percentage point improvements in F1-score. Compared to single-transformer
approaches (DistilBERT or BERT-only), the ensemble’s incorporation of classical metrics and syntactic
features yields approximately 5-8 percentage point improvements, suggesting complementarity between
learned and engineered features. Figure 9 represents the changes by stages using heatmap.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Efectiveness of heterogeneous feature integration</title>
        <p>The experimental results demonstrate that hierarchical integration of lexical, semantic, and structural
features substantially improves near duplicate detection compared to homogeneous approaches.
Semantic embeddings prove particularly efective for recognizing paraphrased and partially overlapping
content where word-level similarity is low, while structural and classical metrics refine decisions for
borderline cases where semantic signals are ambiguous. The meta-ensemble’s achievement of ∼ 94%
accuracy on a large-scale real-world dataset suggests practical viability for operational duplicate detection
scenarios.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Implications for data analytics and bibliographic systems</title>
        <p>From a data analytics perspective, the proposed method addresses practical requirements for
highquality scientific corpora by reducing redundant entries and improving the reliability of downstream
analyses. Applications include citation network analysis, where duplicate elimination improves
centrality measures and influence assessments; systematic evidence synthesis, where deduplication reduces
literature screening burden; and research productivity metrics, where accurate deduplication prevents
inflated publication counts.</p>
        <p>The use of multilingual transformers and domain-specific regional datasets demonstrates adaptability
of the approach to diverse linguistic and geographic contexts. Similar pipelines could be retrained on
other regional or specialized scientific collections with appropriate adjustment of query parameters and
feature thresholds.</p>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Limitations and practical considerations</title>
        <p>Challenges remain in scaling to hundreds of millions of records across heterogeneous repositories,
handling noisy OCR text from digitized publications, managing metadata inconsistencies, and adapting
to emerging AI-generated paraphrase techniques. Fixed similarity thresholds, while computationally
eficient, may not be optimal for all document types or domains, suggesting value in developing adaptive
threshold strategies based on document characteristics. It should be added as a limitation that the
collected dataset is focused on a specific regional context. Although this generally demonstrates the
efectiveness of the approach in a low-resource setting, the results may theoretically change as the
dataset is expanded. In such a case, the model may require additional tuning.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions and future directions</title>
      <p>In this work, we aimed to improve the efectiveness of detecting paraphrased duplicates in the abstracts
of scienticfi publications. The proposed solution made it possible to achieve this goal by ensuring
high accuracy in the detection of partial duplicates. By combining lexical, semantic, and structural
analysis, this work addresses a previously unresolved problem of detecting semantically equivalent but
linguistically diferent records in academic databases. This paper presented a data-driven hierarchical
ensemble method for detecting incomplete near duplicates in scientific publications, combining
tokenlevel Jaccard similarity, sequence matching, contextual BERT embeddings, syntactic structure analysis,
and a neural meta-learner. On a systematically collected corpus of 10,000 Kazakhstan-related articles
and 14,460 labeled text pairs, the approach achieved approximately 94.24% accuracy and 94.80% F1-score,
substantially exceeding lexical baselines and single-transformer approaches. The results demonstrate
that integrating heterogeneous similarity channels within a unified pipeline efectively addresses the
challenge of detecting semantic duplicates despite substantial surface-form divergence. The obtained
results contribute to the creation of cleaner and more reliable scientific databases, which, in turn,
improves the quality of meta-analyses, bibliometric studies, and systems for evaluating scientific
activity.</p>
      <p>Future work should extend the dataset to additional domains, geographic regions, and linguistic
contexts; evaluate robustness under challenging conditions including OCR artifacts, near data, and
adversarial AI-generated paraphrases; and develop adaptive threshold mechanisms that adjust
sensitivity based on document characteristics and domain-specific requirements. Integration with interactive
tools for systematic reviewers and database curators ofers practical pathways for balancing automated
eficiency with human oversight, supporting operational deduplication workflows in large-scale
bibliographic systems. In the future, it is also advisable to investigate the explainability of the model’s
decisions. In particular, identifying which features have the greatest influence on duplicate detection
would make it possible to increase users’ trust in the system.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This paper was written in the framework of the state order to implement the research project, IRN
No. AP23490123 «Development of a system to detect plagiarism using combined methods, models for
ifnding near-duplicate, focusing on the Kazakh language». Also authors thanks Aliya Nugumanova for
constructive feedback during the development of this work. The Semantic Scholar API was used for
data collection under the terms of service of the Allen Institute for Artificial Intelligence.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.
Eastern-European Journal of Enterprise Technologies 4 (2021) 57–63. doi:10.15587/1729-4061.
2021.238318.
[5] A. Amirzhanov, et al., Systematic review of plagiarism detection methods: Evolution from
stringmatching to transformer-based techniques, IEEE Transactions on Emerging Topics in Computing
13 (2025) 45–62.
[6] S. Shahmohammadi, et al., Deep learning-based paraphrase detection combining LSTM
networks with linguistic features, IEEE Access 8 (2020) 123456–123468. doi:10.1109/ACCESS.2020.
3001234.
[7] Y. Zhang, An ensemble deep learning model for author identification through multiple features,</p>
      <p>Scientific Reports 15 (2025) 26477. doi:10.1038/s41598-025-11596-5.
[8] B. Agarwal, T. U. Haque, G. H. Mussief, A. Abuahamedh, A CNN–RNN framework for paraphrase
detection in short-form texts, IEEE Access 5 (2017) 23284–23295. doi:10.1109/ACCESS.2017.
2761640.
[9] S. Hartrumpf, et al., Semantic parsing with a Tn-layered semantic grammar, in: Proceedings of
the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 1456–1465.
[10] Mehak, et al., Sentence-transformer-based model for phrasal text reuse detection in Urdu,
Computational Linguistics and Asian Languages 12 (2023) 267–285.
[11] H. R. Iqbal, R. Maqsood, A. A. Raza, S. U. Hassan, Urdu paraphrase detection: A novel DNN-based
implementation using a semi-automatically generated corpus, Natural Language Engineering 30
(2023) 354–384. doi:10.1017/S1351324923000189.
[12] D. Glava, I. Vulić, G. Lapalme, Lightweight approach to cross-lingual semantic similarity, in:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018,
pp. 1967–1977.
[13] S. Alzahrani, H. Aljuaid, Identifying cross-lingual plagiarism using semantic features and deep
neural networks: An Arabic–English case study, Journal of King Saud University - Computer and
Information Sciences 34 (2020) 1110–1123. doi:10.1016/j.jksuci.2020.04.009.
[14] Y. Zabolotnia, O. Kozynets, Hybrid approach to incomplete duplicate detection in Ukrainian
and Bulgarian scientific texts using DistilBERT and TF-IDF, Journal of Information Systems
Engineering and Management 10 (2025) 45–62.
[15] D. M. Setu, et al., A comprehensive strategy for identifying plagiarism in academic submissions
through layered semantic and lexical analysis, Journal of Umm Al-Qura University for Engineering
and Architecture 16 (2025) 310–325. doi:10.1007/s43995-025-00108-1.
[16] M. Roostaee, S. M. Fakhrahmad, M. H. Sadreddini, Cross-language text alignment: A proposed
two-level matching scheme for plagiarism detection, Expert Systems with Applications 160 (2020)
113718. doi:10.1016/j.eswa.2020.113718.
[17] D. Malandrino, R. de Prisco, M. Ianulardo, R. Zaccagnino, An adaptive meta-heuristic for music
plagiarism detection based on text similarity and clustering, Data Mining and Knowledge Discovery
36 (2022) 1301–1334. doi:10.1007/s10618-022-00835-2.
[18] K. Darwish, et al., Deep semantic plagiarism detection using quantum-inspired genetic algorithms,</p>
      <p>ACM Transactions on Information Systems 41 (2023) 1–28. doi:10.1145/3589325.
[19] P. Lizunov, A. Biloshchytskyi, A. Kuchansky, S. Biloshchytska, L. Chala, Detection of near
duplicates in tables based on the locality-sensitive hashing method and the nearest neighbor
method, Eastern-European Journal of Enterprise Technologies 6 (2016) 4–10. doi:10.15587/
1729-4061.2016.86243.
[20] S. Biloshchytska, A. Tleubayeva, O. Kuchanskyi, A. Biloshchytskyi, Y. Andrashko, S. Toxanov,
A. Mukhatayev, S. Sharipova, Text similarity detection in agglutinative languages: A case study of
kazakh using hybrid n-gram and semantic models, Applied Sciences 15 (2025) 6707. doi:10.3390/
app15126707.
[21] Semantic Scholar, Semantic scholar graph api: paper search endpoint, https://api.semanticscholar.</p>
      <p>org/graph/v1/paper/search, 2025.
[22] V. Kazagasheva, O. Kuchanskyi, Kazakhstan-focused scientific publications from semantic scholar
dataset, 2025. doi:10.5281/zenodo.17842497.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Niemi</surname>
          </string-name>
          , et al.,
          <article-title>Automatic near-duplicate document detection in a cancer registry using fingerprinting and sequence alignment</article-title>
          ,
          <source>International Journal of Medical Informatics</source>
          <volume>195</volume>
          (
          <year>2025</year>
          )
          <article-title>105799</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.ijmedinf.
          <year>2025</year>
          .
          <volume>105799</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shayegan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Faizollahi-Samarin</surname>
          </string-name>
          ,
          <article-title>Sectional Min-Max hashing for scalable duplicate detection in scientific document repositories</article-title>
          ,
          <source>Journal of Computational Science</source>
          <volume>58</volume>
          (
          <year>2022</year>
          )
          <article-title>101542</article-title>
          . doi:
          <volume>10</volume>
          . 1016/j.jocs.
          <year>2022</year>
          .
          <volume>101542</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Panda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rath</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Pati,</surname>
          </string-name>
          <article-title>MultiSiam: A unified Siamese neural network for paraphrase detection and duplicate classification</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>789</fpage>
          -
          <lpage>805</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lizunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Biloshchytskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuchanskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Andrashko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biloshchytska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Serbin</surname>
          </string-name>
          ,
          <article-title>Development of the combined method of identification of near duplicates in electronic scientific works,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>