<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Beyond Traditional OCR: Exploring the Efficiency of LLMs in Document Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Narbona</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvador Ros</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Nacional de Educación a Distancia</institution>
          ,
          <addr-line>UNED</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Alcalá de Henares, UAH, Spain. Centro de Estudios Universitarios Ramón Areces</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This work explores two distinct OCR architectures for historical document processing, comparing a complex, multi-stage pipeline with a streamlined approach based solely on large language models (LLMs). Our findings demonstrate that the LLM-based architecture significantly outperforms the more traditional, complex systems, achieving superior accuracy with lower Character and Word Error Rates (CER and WER). The results highlight that, without the need for supplementary preprocessing or fusion steps, LLMs represent the current state of the art in OCR, offering both high precision and efficiency. This approach marks a new era in historical document transcription, establishing LLMs as the most effective solution for heritage digitization workflows in the digital humanities.</p>
      </abstract>
      <kwd-group>
        <kwd>OCR</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Document Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The digitization and analysis of historical documents written in Spanish from the 17th to the 20th
century are of critical importance for historical, linguistic, and cultural research. These primary sources
offer unique insights into the evolution of language, social customs, and the historical context of past
centuries. Optical Character Recognition (OCR) technology plays a pivotal role in the transformation
of these physical documents into machine-readable text, enabling large-scale access, searchability,
and computational analysis. Recent advances in artificial intelligence have significantly enhanced the
potential of OCR, facilitating the automated processing of fragile archival materials without physical
manipulation and converting them into structured, searchable digital formats.</p>
      <p>Despite these advancements, applying OCR to historical documents presents persistent challenges,
Manrique-Gómez et al. (2024). The physical degradation of paper—such as yellowing, brittleness, and
ink fading—combined with printing inconsistencies, often complicates accurate character recognition.
Historical Spanish texts frequently employ archaic or decorative typefaces that deviate markedly from
modern fonts, and handwritten marginalia or annotations further impede recognition accuracy. These
issues commonly result in OCR errors, including character misrecognition, incorrect punctuation, and
the introduction of extraneous symbols or artifacts.</p>
      <p>The quality of OCR output directly affects the efficiency and accuracy of subsequent text correction
workflows. High error rates require more intensive post-processing, which can be time-consuming and
computationally costly, particularly when correction systems struggle with distinguishing OCR-induced
noise from valid but rare historical language patterns. Moreover, significant linguistic challenges
arise when working with pre-modern Spanish. Between the 17th and 19th centuries, the Spanish
language underwent considerable orthographic and grammatical evolution. Variations in spelling,
archaic vocabulary, obsolete verb forms, and shifting syntactic norms present obstacles for both OCR
systems and modern natural language processing (NLP) tools.</p>
      <p>
        Addressing these challenges requires correction methodologies that are both linguistically informed
and computationally efficient. Large Language Models (LLMs) offer promising capabilities in this domain.
Users increasingly seek to leverage LLMs to correct OCR outputs with high precision while adapting
to historical language variation. Two directions stand out: prompt-based correction
methods, which are cost-effective and fast to deploy, and fine-tuning strategies, which may yield
greater accuracy at the cost of higher computational demands. The optimal approach will depend on the
user’s constraints, including processing time, accuracy requirements, and available resources. Recent
work in historical OCR correction combines rule-based postprocessing with deep learning approaches,
leveraging domain-specific corpora, adaptive tokenization, and few-shot learning paradigms. These
hybrid models show improved performance in tasks involving diachronic language variation and high
noise levels typical of historical documents Subramani et al. (2021). In this study, we focus on the
PastReaders task, conducted within the framework of IberLEF2025, González-Barba et al. (
        <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
        ), a shared
evaluation initiative aimed at benchmarking Natural Language Processing systems for the Spanish
language Montejo-Ráez et al. (2025b).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The Spanish language of the 17th to 19th centuries exhibits a dynamic and evolving character, with
significant grammatical and orthographic differences from its modern counterpart. Orthographic
variants such as avía instead of había, fierro instead of hierro, and estoi instead of estoy were common.
Accentuation rules also diverged, with accents appearing more frequently or being placed differently.
While grammatical evolution was less drastic, variations in verb conjugation and pronoun usage—such
as the occasional use of vosotros in parts of Latin America—highlight the linguistic heterogeneity of the
period.</p>
      <p>These historical linguistic characteristics pose a considerable challenge for Optical Character
Recognition (OCR) systems. The interaction between OCR-induced typographic noise and authentic historical
variation complicates efforts to accurately reconstruct the original text. Moreover, systematic
orthographic transformations further confound correction efforts. The Historical Ink project has cataloged
recurrent graphemic alternations, such as: á and a (e.g., hara → hará), é and e (fué → fue), í and i (decia
→ decía), ó and o (ocasion → ocasión), ú and u (ningun → ningún), i and y (mui → muy), j and g (jente
→ gente), v and b (gravado → grabado), s and x (espiró → expiró), j and x (méjico → méxico), c and s
(faces → fases), and s and z (dies → diez).</p>
      <p>A robust correction system must be intelligent enough to distinguish these historical variants from
spurious OCR artifacts. Misidentifying genuine linguistic features as errors may result in overcorrection
and the unwanted modernization of texts, compromising their historical authenticity. The methodology
developed by the Historical Ink project is particularly useful in this context, as it classifies changes—such
as accentuation—as surface-level features rather than errors, thereby preserving typographic norms
relevant to the period, Manrique-Gómez et al. (2024).</p>
      <p>Early attempts to address this problem relied on hybrid systems, handcrafted rules, and prompt
engineering to guide general-purpose models. These methods sought to compensate for the inability of
early language models to parse historical variation in Spanish. Although prompt-based strategies could
yield satisfactory results when contextual parameters were carefully defined, these systems often failed
to distinguish legitimate constructions from OCR noise, leading to distortion or loss of original forms,
Boros et al. (2024).</p>
      <p>
        Fine-tuning large language models (LLMs) on corpora like Latam-XIX marked a significant
improvement. Models trained on historical data better preserved grammatical and orthographic patterns, but
the approach remained resource-intensive and was limited by access to adequate training data and
computational infrastructure, Yenigün (
        <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
        ). Specialized systems developed in the Historical Ink
project further improved precision by learning to differentiate between diachronic variation and OCR
noise.
      </p>
      <p>
        Despite these advances, a paradigm shift is underway. Recent state-of-the-art LLMs demonstrate the
ability to perform high-quality corrections directly from raw OCR input using only well-formulated
prompts, Greif et al. (
        <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
        ), Kim et al. (
        <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
        ). These models consistently match or exceed the performance
of fine-tuned or hybrid systems without requiring extensive adaptation. Our findings confirm that
these next-generation models possess a remarkable capacity to understand typographic noise, recognize
historical forms, and preserve textual authenticity—thus rendering many traditional support techniques
obsolete. In this context, the role of LLMs transitions from compensating for model limitations to
capitalizing on their generalizability and contextual fluency. Their simplicity, efficiency, and precision
establish a new standard in the digitization and correction of historical Spanish texts.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In the present work we use a methodological approach grounded in the need to reconcile the inherent
complexity of historical Spanish texts with the evolving capacities of OCR and large language models. At
the heart of this process lies a progressive rethinking of architecture: from a comprehensive, agent-based
system designed to maximize redundancy and robustness, to a more streamlined and efficient model
driven by advances in model fluency and OCR reliability.</p>
      <p>We began by establishing a baseline architecture that could respond to the multifaceted challenges of
digitizing historical documents. This system was conceived not as a linear pipeline but as a constellation
of loosely coupled agents, each responsible for a distinct facet of the correction process. The notion of
an "agent" is deployed here in its most general and flexible sense—any discrete process, tool, or model
that can operate independently yet collaboratively within the broader architecture. This design allowed
us to capture the granular complexity of historical variation while providing a modular framework for
experimentation.</p>
      <p>
        At its core, the baseline architecture (see Figure 1) integrated a preprocessing module with multiple
OCR systems whose outputs were then synthesized through a large language model acting as a fusion
engine. The selection of three OCR systems as core components of the pipeline reflects a deliberate strategy
grounded in complementarity rather than reliance on a single solution. No single OCR system offers
consistently optimal performance across the diverse range of document types, languages, and layouts
encountered in large-scale digitization projects. Each tool was chosen for its distinct strengths. The
OCR systems selected were Surya, OlmOCR, and Gemini 2.0. OlmOCR, developed by AllenAI,
is a context-aware toolkit for linearising PDF documents into text, integrating visual–textual co-training
and layout modelling to tackle noisy scans and typographic variation with high fidelity, CAllenai
(
        <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
        ). Its adaptive pipeline excels in multilingual archives and early modern prints, being particularly
effective for complex or historical documents. Surya is an open-source OCR framework optimised
for 90+ languages, offering robust line-level text detection, layout analysis (tables, images, headers),
and reading-order inference—even on degraded or low-resource scans, CibinQuadance (
        <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
        ). Its
lightweight architecture balances accuracy and speed, making it ideal for transnational DH workflows.
Finally, Gemini 2.0, Google DeepMind’s multimodal large language model, augments OCR pipelines by
jointly processing text and images, enabling advanced post-OCR tasks such as semantic annotation,
layout-aware classification, and multilingual reasoning, without replacing dedicated recognition engines,
Google (
        <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
        ). Together, these systems exemplify the convergence of linguistic insight and machine
learning in service of scholarly digitisation. Their integration into Digital Humanities pipelines not only
enhances transcription accuracy but also supports interpretability and methodological transparency
in AI-assisted research. The fusion model was GPT-4o, which was tasked with
reconciling divergent OCR hypotheses and producing a single, coherent transcription. The fusion
model operates by employing a form of Chain of Thought prompting, which guides the language model
through a structured sequence of reasoning steps. This approach significantly improves its ability
to synthesize fragmented or partially erroneous textual inputs, especially when dealing with noisy
OCR outputs. In this study, we designed and tested several prompts tailored to this purpose. One
representative example is the following:
Prompt=
      </p>
      <preformat>"You are an expert in text correction and OCR error fixing.
Your task is to combine and correct several OCR outputs of the same text.
Here are the texts:
[Insert OCR outputs here]
Instructions:
1. Combine the texts, correcting any OCR errors.
2. Provide only the corrected text, without any additional commentary.
3. Maintain the original structure and formatting.
4. Do not add any new information or explanations.
5. Join any words that have been separated by a hyphen at the end of a line.
If there’re blank spaces after the hyphen,
remove them so the two parts of the word get joined correctly.
6. The text is written using archaic Spanish spelling.
7. Maintain all diacritical marks, old-fashioned spellings,
and historical punctuation, such as the use of ’fué’ instead of ’fue’,
’dió’ instead of ’dio’, ’ví’ instead of ’vi’, ’á’ instead of ’a’ in prepositions.
Do not replace older words or grammatical structures with modern equivalents.
8. Ensure that all words retain their original diacritics,
such as accents (é, á, ó), tildes (ñ), and umlauts (ü), without alteration.
9. Focus on fixing spelling and obvious OCR mistakes.
10. End your response with ’===END===’ on a new line.
Corrected text:"</preformat>
      <p>*All prompts used in this study are available at:
https://github.com/sros-UNED/pastreader</p>
      <p>This prompt proved particularly effective for aligning semantically equivalent text fragments and
filtering out common OCR errors, while simultaneously preserving the historical orthography and
structural fidelity of the source material. By embedding explicit constraints and a stepwise reasoning
process, the model is not only more accurate in its outputs but also more robust to noise, misrecognition,
and orthographic irregularities in historical documents. Finally, the postprocessing stage corrected
residual noise while preserving historically plausible forms, informed by typographic norms and
diachronic linguistic data. This architecture is further strengthened by a dedicated preprocessing stage
aimed at enhancing image quality prior to OCR. Specifically, adaptive thresholding techniques were
employed to improve text-background contrast, while Gaussian corrections were applied to mitigate
noise and uneven illumination. These image enhancement methods ensure more reliable character
recognition, particularly in degraded or visually inconsistent scans, thereby increasing the overall
accuracy and stability of the downstream pipeline.</p>
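      <p>As an illustration of the fusion step, the following minimal sketch shows one way the prompt could be assembled from several OCR hypotheses and how the end marker could be used to trim the model response. The function names, the bracketed engine labels, and the abbreviated instruction list are assumptions made for this example; the call to the language model itself is deliberately left out, and the full prompts are those published in the repository cited above.</p>
      <preformat><![CDATA[
```python
END_MARKER = "===END==="

# Abbreviated for the sketch: only three of the ten instructions are reproduced.
INSTRUCTIONS = (
    "1. Combine the texts, correcting any OCR errors.\n"
    "2. Provide only the corrected text, without any additional commentary.\n"
    "10. End your response with '===END===' on a new line."
)


def build_fusion_prompt(ocr_outputs):
    """Assemble one prompt from several OCR hypotheses of the same page.

    ocr_outputs maps a (hypothetical) engine label to its raw transcription.
    """
    blocks = [f"[{name}]\n{text.strip()}" for name, text in ocr_outputs.items()]
    return (
        "You are an expert in text correction and OCR error fixing.\n"
        "Your task is to combine and correct several OCR outputs of the same text.\n"
        "Here are the texts:\n"
        + "\n\n".join(blocks)
        + "\nInstructions:\n"
        + INSTRUCTIONS
        + "\nCorrected text:"
    )


def extract_corrected_text(response):
    """Keep only what precedes the end marker; tolerate a missing marker."""
    head, _, _ = response.partition(END_MARKER)
    return head.strip()
```
]]></preformat>
      <p>Asking for an explicit end marker makes it straightforward to discard anything the model appends after the transcription, which is the role instruction 10 appears to play in the prompt.</p>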
      <p>Yet, while this architecture proved effective in principle, its operational overhead prompted us to
explore a leaner alternative. Our revised architecture (see Figure 2), motivated by the empirical results
and performance bottlenecks of the baseline, dispenses with multi-engine redundancy and fusion.
Instead, it relies on a single high-performance recognition agent: Google’s Gemini 2.5 Pro model, an evolution of
the previously used Gemini 2.0 model. In this particular implementation,
it was observed that even a relatively simple prompt can yield highly effective results when working
with advanced multimodal large language models for post-OCR enhancement. The following prompt,
for instance, was used with consistent success to extract and normalize historical Spanish text while
preserving its linguistic integrity:
Prompt=</p>
      <preformat>"Perform OCR (Optical Character Recognition) on this image.
Extract ALL visible text without modernizing or modifying Old Spanish.
Correct spelling and punctuation while preserving the original language and format.
Respond ONLY with the extracted text, without additional comments"</preformat>
      <p>Despite its simplicity, this instruction proved sufficient to achieve high-quality transcription results.
The model demonstrated strong alignment with the task goals, requiring minimal additional tuning to
handle historical orthography and formatting. This highlights the potential of prompt-based
architectures to simplify complex processing pipelines through carefully designed language instructions. A
preprocessing stage is not necessary either: the model handles the page image directly, and its output is passed
directly to a postprocessing module, which fulfills the same curatorial role as before, filtering OCR
artifacts without imposing anachronistic normalization. This final architecture, while deceptively simple,
emerged as remarkably robust, leveraging the maturity of modern OCR engines and the precision of
post-hoc linguistic correction to achieve results that rival more complex configurations.</p>
      <p>Finally, both proposed architectures underscore the critical need for a robust postprocessing layer to
ensure the usability and fidelity of the OCR-derived textual output. Despite advances in recognition
accuracy, raw OCR outputs often include extraneous elements that obscure the primary content and
hinder downstream processing. To address this, a comprehensive postprocessing routine was implemented
to systematically clean and normalize the extracted text. This includes the removal of institutional
attributions (e.g., repeated references to the Biblioteca Nacional de España), typographic artifacts such
as copyright symbols (©), pilcrows (¶), or decorative punctuation, and numerical strings often linked
to cataloging or pagination. Moreover, the procedure targets frequent OCR errors at the beginning of
lines—such as spurious punctuation marks or repeated characters—using regular expressions for precise
correction. By trimming superfluous whitespace and blank lines, the postprocessing layer ensures a
coherent and structured output, thereby facilitating more reliable linguistic, semantic, or downstream
machine learning analyses.</p>
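      <p>The cleaning routine described above can be sketched with a handful of regular expressions. The patterns below are hypothetical and written only to mirror the categories named in the text (institutional attributions, stray symbols, numeric strings, leading punctuation, superfluous whitespace); they are not the expressions used in the study, and the handling of repeated characters is omitted.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical patterns; the actual expressions are not given in the text.
INSTITUTION = re.compile(r"Biblioteca Nacional de España", re.IGNORECASE)
SYMBOLS = re.compile(r"[©¶]")
DIGITS_ONLY_LINE = re.compile(r"^\s*\d+\s*$")
LEADING_NOISE = re.compile(r"^[\s.,;:·_|\-]+")


def postprocess(text):
    """Remove non-content residue line by line, leaving spelling untouched."""
    cleaned = []
    for line in text.splitlines():
        line = INSTITUTION.sub("", line)
        line = SYMBOLS.sub("", line)
        if DIGITS_ONLY_LINE.match(line):
            continue  # catalogue or page number standing alone on a line
        line = LEADING_NOISE.sub("", line)          # spurious marks at line start
        line = re.sub(r"[ \t]+", " ", line).strip()  # collapse runs of spaces
        if line:                                     # drop blank lines
            cleaned.append(line)
    return "\n".join(cleaned)
```
]]></preformat>
      <p>Note that none of these substitutions touches accents or archaic spellings, which is consistent with the requirement that the layer remove artifacts without modernizing the text.</p>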
      <sec id="sec-3-1">
        <title>3.1. Dataset Description</title>
        <p>The dataset used in this study is derived from the historical press collection digitized by the National
Library of Spain (Biblioteca Nacional de España, BNE) and accessible via the Hemeroteca Digital platform.
As of this writing, the full collection comprises 298 press titles, 88,748 issues, and over 8 million digitized
pages. This repository spans publications from the 17th to the 20th centuries and continues to expand
through ongoing digitization efforts by the BNE.</p>
        <p>For the purpose of development and evaluation, a dedicated development partition was curated
from this broader corpus. This subset consists of 500 representative pages selected to ensure a
stratified sampling across publication types, time periods, and layout complexities. Each item in the
development set includes the following components:</p>
        <p>• The original scanned PDF page.</p>
        <p>• The corresponding OCR output (as generated by the legacy OCR system used in Hemeroteca Digital).</p>
        <p>• The manually corrected transcription serving as ground truth.</p>
        <p>These corrected transcriptions were prepared under the BNE’s collaborative initiative ComunidadBNE,
which enables users to contribute to OCR correction on selected publications. Correction projects are
chosen based on factors such as historical significance, consultation frequency, and technical feasibility
of user-driven annotation.</p>
        <p>The OCR quality in the collection varies substantially due to factors like digitization date, scanning
technology, page layout complexity, and the physical condition of the originals. The development set,
in particular, was assembled to reflect this heterogeneity, thereby ensuring that evaluation metrics are
representative of real-world variability in historical press digitization.</p>
        <p>The full development dataset is publicly available and can be accessed via the Hemeroteca Digital
platform. It forms a critical component of our pipeline design and evaluation, allowing for robust
benchmarking of OCR and postprocessing strategies under controlled yet realistic conditions.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>To evaluate this system, we adopted a pragmatic sampling strategy grounded in cluster analysis based
on the k-means algorithm. This process allowed us to simulate a full-corpus evaluation while dramatically
reducing computational cost and review time. For this purpose we used the dataset prepared for this
task, Montejo-Ráez et al. (2025a). All reference documents used in the clustering process were first
inspected to identify their distinct text–image characteristics. With this insight into the structure of
the different documents in the corpus, we built a feature vector for each document based on the CER
and WER metrics computed using three traditional OCR engines. According to the elbow method,
the optimal number of clusters was six. Therefore, the corpus, composed of heterogeneous sources
spanning three centuries, was algorithmically partitioned into six clusters. For each cluster, the centroid
document was selected as the single representative, an exemplar that distilled the
dominant characteristics of its group. These six clusters agree with our previous exploratory analysis of
the corpus.</p>
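      <p>The sampling procedure can be illustrated with a small, dependency-free sketch of k-means and nearest-to-centroid selection. It assumes each document is represented by a numeric vector (for instance, the CER and WER obtained with each of three engines, giving six values per document) and that the "centroid document" is the member closest to the cluster mean; the initialisation, iteration count, and function names are choices made for this example rather than details reported in the study.</p>
      <preformat><![CDATA[
```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )
    return centroids, labels, inertia


def representatives(points, centroids, labels):
    """Index of the document nearest to each non-empty cluster's centroid."""
    reps = []
    for c, centre in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps.append(min(members, key=lambda i: math.dist(points[i], centre)))
    return reps
```
]]></preformat>
      <p>For an elbow analysis, one would run kmeans for increasing values of k and look for the value beyond which the returned inertia stops decreasing sharply; in the study this inspection pointed to six clusters.</p>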
      <p>Figures 3 and 4 illustrate pages containing only textual lines: the former exhibits the typical
yellowing associated with paper aging, while the latter shows minor geometric distortions in the text
baselines despite retaining a predominantly white background. Figures 5 and 6 depict pages that combine
text with graphical elements and complex structures—titles interleaved with illustrations or
multicolumn tables—posing challenges for both segmentation and layout analysis. Finally, Figures 7 and 8
present characteristic two-column pages: one cluster features uniform background textures, whereas
the other contains varying background tones, necessitating adaptive thresholding and column-detection
strategies.</p>
      <p>Extending this analysis, each cluster suggests tailored preprocessing and recognition workflows.
For the homogeneously aged and white-background text clusters (Figs. 3–4), binarization methods
must compensate both for color degradation and slight skew, ensuring accurate line segmentation. The
mixed-media clusters (Figs. 5–6) require hybrid layout engines capable of discriminating between textual
and graphical regions, often combining connected-component analysis with rule-based heuristics to
preserve reading order. Two-column pages with consistent backgrounds (Fig. 7) can leverage fixed grid
models to detect column boundaries, whereas those with uneven backgrounds (Fig. 8) benefit from
locally adaptive thresholding and morphological filtering to achieve reliable text extraction. Together,
these six representative image types form a comprehensive basis for optimizing OCR pipelines and
evaluating model robustness across diverse document archetypes.</p>
      <p>Each version of the architecture—baseline and final—was applied to these representative texts,
enabling us to assess the comparative performance of the systems across a spectrum of historical,
typographic and linguistic conditions. Metrics such as Word Error Rate (WER) and Character Error
Rate (CER) were calculated to quantify system output.</p>
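      <p>For reference, the two metrics can be computed as follows. This sketch assumes the standard definitions, namely the Levenshtein distance between hypothesis and ground truth divided by the length of the ground truth, counted in characters for CER and in whitespace-separated tokens for WER. The text does not state whether any normalization (case, whitespace, punctuation) was applied before scoring, so none is applied here.</p>
      <preformat><![CDATA[
```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate; the reference must be non-empty."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate over whitespace-separated tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```
]]></preformat>
      <p>Under these definitions, a transcription that modernizes "fué á la villa" to "fue a la villa" is penalized with two character substitutions out of fourteen characters and two wrong words out of four, which shows why preserving historical orthography matters for the reported scores.</p>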
      <p>In Table 1, the first three column-pairs (Gemini 2.0, Surya, and OlmOCR) report the Character Error
Rate (CER) and Word Error Rate (WER) immediately after each OCR engine has processed the document
images through the dedicated preprocessing pipeline (binarization, deskewing, layout normalization,
etc.). These values thus capture the raw recognition performance of each system on cleaned inputs.
The fourth column-pair, labeled GPT-4o, presents CER and WER measured after a fusion step in which
OCR outputs are combined and disambiguated by GPT-4o’s multimodal reasoning. By leveraging
contextual cues across line segments and graphical regions, this fusion reduces insertion, deletion,
and substitution errors, yielding consistently lower error rates than any individual OCR source. The
POSTPRO column-pair shows CER and WER following a post-processing stage.</p>
      <p>Taken together, these results illustrate how successive stages (preprocessing, multimodal fusion,
and linguistic post-processing) each contribute to progressive error reduction, demonstrating a clear
trajectory of improvement in both character- and word-level accuracy.</p>
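      <p>
        <disp-formula><tex-math><![CDATA[
\underbrace{\text{Preprocessing}}_{\text{removes image artifacts}}
\;\longrightarrow\;
\underbrace{\text{Fusion}}_{\text{leverages complementary outputs}}
\;\longrightarrow\;
\underbrace{\text{Postprocessing}}_{\text{enforces grammatical consistency}}
\;\longrightarrow\;
\begin{cases}
\text{CER} < 2\% \\
\text{WER} \approx 4\%\text{--}8\%
\end{cases}
]]></tex-math></disp-formula>
      </p>
      <sec id="sec-4-2">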
        <p>Finally, the fifth column-pair, labeled Gemini 2.5 Pro Only, presents the CER and WER of the
second architecture, in which a single simple prompt and this LLM handle the full process. The results
obtained from the second architecture demonstrate performance comparable to the final output of the
initial architecture. The only case in which the model is outperformed
by the first architecture involves a document featuring a title and two-column layout. This discrepancy
arises because Gemini extracts the information in a structurally different manner compared to the
ground-truth annotations of that document. A simple post-processing step would be sufficient to
resolve this issue. Compared with the first architecture, the second approach therefore offers significant advantages in
terms of cost-efficiency and implementation simplicity. Models such as Gemini 2.5 Pro
rightfully deserve recognition as state-of-the-art performers, combining high accuracy with streamlined
deployment. What emerges from this iterative methodological refinement is not merely a more effective
pipeline, but a new paradigm for historical OCR correction. The shift from complex fusion to efficient
minimalism mirrors a broader transition in the field: one in which redundancy is no longer required to
compensate for the limits of AI, but becomes an unnecessary burden in the face of increasingly fluent
and historically aware models.</p>
        <sec id="sec-4-2-1">
          <title>4.1. OCR system selection for evaluation process</title>
          <p>In the present study, we adopted a two-tiered approach to OCR-based document analysis, whereby
different models were strategically employed across distinct stages of the experimental workflow.
Specifically, PaddleOCR, Doctr, and Surya were utilized during the clustering and document selection
phase, whereas the evaluation of OCR architectures was conducted using Surya, Gemini 2.0, and
OlmOCR. This methodological divergence is grounded in both technical rationale and the differentiated
objectives intrinsic to each stage of analysis. During the initial phase—dedicated to document clustering
and the identification of key representative texts—we deliberately employed a heterogeneous ensemble
of OCR systems. By integrating PaddleOCR, Doctr, and Surya, we sought to harness the complementary
biases and strengths of each engine. This ensemble strategy allowed for a richer and more nuanced
textual representation of the corpus, thereby enhancing the granularity and fidelity of the clustering
process. Such an approach is supported by the principle of complementary redundancy, whereby
aggregating outputs from diverse OCR systems mitigates model-specific artifacts and augments the
representativeness of selected documents. In this context, OCR outputs were not evaluated per se, but
served as intermediary linguistic proxies, enabling the unsupervised grouping of documents along
latent textual patterns.</p>
          <p>In contrast, the subsequent evaluation phase necessitated a different set of criteria, oriented toward the
quantitative and qualitative assessment of OCR performance. Here, the focus shifted to a comparative
study of advanced architectures, namely Surya, Gemini 2.0, and OlmOCR. Surya was retained as a
baseline to preserve methodological continuity across phases, while Gemini 2.0 and OlmOCR were
selected for their capacity to address domain-specific challenges—such as degraded text, historical
typography, and script variability. This targeted selection reflects the need for precision and robustness
in OCR tasks involving complex or noisy input, particularly within heritage and archival contexts.
Moreover, the evaluation was carried out on a curated subset of documents identified during the
clustering phase, ensuring that the testbed was both representative and sufficiently challenging.</p>
          <p>The use of distinct OCR models across the pipeline also reflects a broader methodological stance:
that of functional differentiation. In complex document processing workflows, it is methodologically
sound—and increasingly common—to delegate distinct tasks to specialized components. Clustering, by
its exploratory and heuristic nature, benefits from model plurality and interpretative breadth; evaluation,
by contrast, demands controlled variables and analytic precision. In summary, the bifurcation in model
selection reflects a deliberate alignment between methodological purpose and technical affordance.
A diverse OCR ensemble facilitated robust document clustering and selection, while the evaluation
phase privileged state-of-the-art architectures to benchmark recognition performance under challenging
conditions. This stratified approach enabled a more nuanced and effective exploration of OCR capabilities
across the document corpus.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Task resolution</title>
      <p>Following the experiments presented above, and after confirming that the architecture based solely on
a large multimodal language model (LLM) achieved the best Character Error Rate (CER) and Word Error
Rate (WER), we applied this architecture to the text dataset provided by the PastReaders task. The
officially reported results are shown in Table 2.</p>
      <p>Our system outperformed the proposed baseline across all benchmarks. It also surpassed all other
participants in the task, demonstrating that multimodal LLMs offer remarkable performance combined
with exceptional ease of use. In accordance with the requirements of the task, we also report the
CO<sub>2</sub> emissions of our system in Table 3.</p>
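      <p>For readers unfamiliar with the two metrics, the following minimal sketch shows how CER and WER are
commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the length of
the reference, at character and word level respectively. This is an illustrative implementation, not the
official scorer of the task; the function names are our own, and any normalization (case, whitespace,
punctuation) applied by the official evaluation is not reproduced here.</p>
      <preformat>
```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # previous[j] holds the distance between the processed prefix of ref and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Toy example (not taken from the task data): one wrong character in one word.
print(cer("la casa de papel", "la cosa de papel"))  # 1 edit over 16 characters
print(wer("la casa de papel", "la cosa de papel"))  # 1 edit over 4 words
```
      </preformat>
      <p>Because both rates are normalized by the reference length, a single wrong character affects WER far
more than CER, which is why WER values are typically higher than CER values for the same transcription.</p>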
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study demonstrates the effectiveness of a tiered, empirically driven approach to OCR pipeline
design for historical Spanish documents, leveraging both traditional recognition engines and multimodal
large language models. By first establishing a modular, agent-based architecture and subsequently
simplifying it through the empirical evaluation of its components, we illustrate a trajectory from
redundancy-oriented robustness to streamlined efficiency without sacrificing transcription quality.</p>
      <p>Our results confirm that the integration of advanced OCR engines such as Gemini 2.0, alongside fusion
via GPT-4o and postprocessing, yields significant improvements in character and word recognition
rates across heterogeneous document types. Notably, each stage of the pipeline—preprocessing, fusion,
and postprocessing—contributed cumulatively to error reduction, with the final output exhibiting CER
below 2% and WER consistently between 4% and 8%.</p>
      <p>Through cluster-based sampling and representative document selection, we ensured that system
evaluation was both computationally tractable and methodologically representative. This stratified
approach uncovered critical layout-dependent challenges, especially in documents with complex structures
(e.g., multi-column layouts or mixed graphical-text content), which informed the design of adaptive
recognition strategies.</p>
      <p>Crucially, our comparative analysis of the second, leaner architecture, based solely on the Gemini 2.5
Pro model, shows that its performance closely matches, and in most cases surpasses, that of the more complex
baseline architecture. The only notable discrepancy, observed in a single two-column layout case,
results from a structural misalignment between the model’s output and the ground-truth segmentation,
a difference that can be resolved with minimal postprocessing. This finding underscores the growing
maturity of modern OCR-LM integrations and their capacity to deliver state-of-the-art results with
minimal configuration overhead.</p>
      <p>Beyond performance, the implications of this work are methodological. The use of distinct OCR
systems across pipeline stages (ensemble models for clustering and high-precision engines for
evaluation) highlights the value of functional differentiation. By aligning model capabilities with the specific
affordances and requirements of each task, we achieved both interpretive depth and evaluative rigor.
The results advocate for a modular yet adaptive approach to document processing in digital humanities
workflows, one that balances the need for precision with the practicalities of scalability and cost.</p>
      <p>Ultimately, our findings position models such as Gemini 2.5 Pro not merely as enhancements
to traditional OCR workflows but as viable standalone solutions for historical document
transcription that combine accuracy, efficiency, and accessibility. As large language models continue to evolve,
their integration into scholarly digitization pipelines promises to redefine both the technical boundaries
and methodological assumptions of textual heritage research.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research has been supported by the CLS INFRA project (Grant Agreement ID: 101004984), funded
by the European Union. Special thanks to the funding agency for providing the necessary resources to
make this work possible.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used LLMs to develop and evaluate the
architectures and to check grammar and spelling. After using these tools, the authors reviewed
and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>Online Resources</title>
      <sec id="sec-9-1">
        <title>The sources for this work are available at: Pastreaders Code.</title>
      </sec>
      <sec id="sec-9-2">
        <title>The full code and prompts used are included in the repository.</title>
        <p>Montejo-Ráez, A., Sánchez Nogales, E., Expósito Álvarez, G., Ureña López, A., Martín-Valdivia,
M. T., Collado-Montañez, J., Cabrera de Castro, I., Cantero Romero, M. V., García Serrano, A.,
Ortuño Casanova, R., and Torterolo Orta, Y. A. (2025a). Pastreader 2025. https://doi.org/10.5281/zenodo.15084265.
[Data set].</p>
        <p>Montejo-Ráez, A., Sánchez-Nogales, E., Expósito-Álvarez, G., Ureña-López, L. A., Martín-Valdivia, M. T.,
Collado-Montañez, J., Cabrera-de Castro, I., Cantero-Romero, M. V., and Ortuño-Casanova, R. (2025b).
Overview of PastReader shared task in IberLEF 2025: Transcribing texts from the past. Procesamiento
del Lenguaje Natural, 75.</p>
        <p>Subramani, N., Matton, A., Greaves, M., and Lam, A. (2021). A Survey of Deep Learning Approaches for
OCR and Document Understanding. arXiv:2011.13534 [cs].</p>
        <p>
          Yenigün, O. (
          <xref ref-type="bibr" rid="ref2 ref5">2025</xref>
          ). Fine-Tuning T5 for Grammar Correction: A Step-by-Step Guide.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Boros</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ehrmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romanello</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Najem-Meyer, S., and
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Post-Correction of Historical Text Transcripts with Large Language Models: An Exploratory Study</article-title>
          . In Bizzoni, Y.,
          <string-name>
            <surname>Degaetano-Ortlieb</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazantseva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Szpakowicz</surname>
          </string-name>
          , S., editors,
          <source>Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)</source>
          , pages
          <fpage>133</fpage>
          -
          <lpage>159</lpage>
          , St. Julians, Malta. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>CAllenai</surname>
          </string-name>
          (
          <year>2025</year>
          ). OlmOCR. original-date: 2024-09-17T14:53:40Z.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>CibinQuadance</surname>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>surya-OCR</article-title>
          . original-date: 2024-02-21T07:43:20Z.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>González-Barba</surname>
            ,
            <given-names>J. Á.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Jiménez-Zafra</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          .
          In
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025)</source>
          , CEUR-WS.org
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Google</surname>
          </string-name>
          (
          <year>2025</year>
          ). Gemini.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Greif</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griesshaber</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Greif</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents</article-title>
          .
          <source>arXiv:2504.00414 [cs] version: 1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baudru</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ryckbosch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bersini</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ginis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Early evidence of how LLMs outperform traditional systems on OCR/HTR tasks for historical records</article-title>
          .
          <source>arXiv:2501.11623 [cs] version: 1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Manrique-Gómez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez-Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Manrique</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities</source>
          , pages
          <fpage>132</fpage>
          -
          <lpage>139</lpage>
          . arXiv:2407.12838 [cs].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>