<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Quality of Corpora in the LLM Era</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucía Sevilla-Requena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Corpus Linguistics</institution>
          ,
          <addr-line>Natural Language Processing, Artificial Intelligence, Quality Corpora, Corpus Evaluation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Software and Computing Systems, University of Alicante, Apdo. de Correos 99</institution>
          ,
          <addr-line>E-03080, Alicante</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Doctoral Symposium on Natural Language Processing</institution>
          ,
          <addr-line>25</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The present study addresses the need for a systematic and scalable framework to determine the quality of linguistic corpora in the era of Large Language Models (LLMs). As the success of LLMs increasingly depends on the quality of their training data, traditional corpus evaluation methods have become obsolete. This research analyses and proposes a new methodology that defines comprehensive quality criteria, introduces a classification system, and designs interpretable and partially automatable metrics. The resulting framework will be validated through empirical experiments that assess how corpus quality impacts model performance, robustness, and fairness. Therefore, this work aims to bridge the methodological gap in Natural Language Processing (NLP) by providing an updated, reproducible, and practical corpus assessment and creation tool.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Justification of the Research</title>
      <p>
        Evaluating the quality of linguistic corpora is a fundamental aspect of their design and usage, as it
plays a key role in determining the reliability and overall usefulness of the data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In an era marked
by the spread of digital content, access to information is no longer an obstacle. However, the
vast abundance of sources presents a new challenge: distinguishing quality data from the overwhelming
amount of information available. According to Austermühl [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], accessing online data is relatively easy, but finding
accurate information can be complex and often frustrating. This highlights the need to establish solid
criteria that allow the development and assessment of the quality of digital resources.
      </p>
      <p>
        In recent years, large language models (LLMs) have marked a profound transformation in the field of
Artificial Intelligence (AI) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Although much of this progress is attributed to innovations in model
design and training techniques, another crucial factor has gained prominence: revising the criteria used
to determine the validity and usefulness of the data employed to train these systems.
      </p>
      <p>
        Early LLMs highlighted the importance of having coherent and high-quality textual data for training
models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To achieve this, they began using structured document-level corpora drawn from specific
domains, such as Wikipedia and BookCorpus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], thus moving away from earlier approaches based on
minimal linguistic units, like individual sentences [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This shift responded to the need for longer and
more contextually cohesive data. As these models increased in scale and complexity, large-scale web
scraping became a widely adopted strategy for collecting massive volumes of textual data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The limitations of web-scraped data without human supervision soon became apparent. Studies
such as Radford et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] emphasised the importance of data curation and cleaning, showing that
carefully selected datasets consistently outperformed raw Web content. This insight led to the creation
of so-called “high-quality” corpora such as The Pile [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which integrates web data with books, scientific
articles, and conversations from social media.
      </p>
      <p>
        In addition, data cleaning has become a critical step in corpus preparation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], offering additional benefits such as reduced dataset size [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and more efficient training cycles. Along these lines, a
recent study by Eldan and Li on TinyStories [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a synthetically generated dataset for training neural
networks in English, demonstrates the potential impact of high-quality data. Their results suggest that
well-crafted, high-quality data can alter scaling laws, allowing smaller models to achieve performance
levels traditionally associated with much larger systems.
      </p>
      <p>
        Beyond this, developing domain-specific and multimodal models further reinforces the importance
of using specialised data. Corpora focused on specific topics have been successfully used to build
biomedical models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and conversational models [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>In this context, the present thesis addresses a critical challenge in the field of Natural Language
Processing, which is the need for a systematic and scalable framework for assessing corpus quality
in the era of LLMs. As LLMs increasingly advance, it has become clear that the quality of training
data is a decisive factor in model performance, fairness, and reliability. Nevertheless, current corpus
evaluation practices are often based on outdated or fragmented criteria that are insufficient for the
complexity and demands of modern models. This PhD thesis proposes a structured set of quality criteria,
a classification methodology, and interpretable metrics that will guide the critical assessment of existing
corpora, particularly regarding their suitability for fine-tuning LLMs.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background and Related Work</title>
      <p>The present section provides an overview of the existing literature and frameworks relevant to corpus
quality evaluation, starting with a conceptual clarification of the term “quality” and progressing through
the main criteria developed in the pre-LLM era. Finally, it examines recent advances in data curation
and quality assessment in the context of LLMs training, highlighting how modern developments have
reshaped traditional evaluation paradigms.</p>
      <sec id="sec-3-1">
        <title>2.1. The concept of quality corpus</title>
        <p>
          Before reviewing the existing criteria for evaluating high-quality corpora, it is essential to clarify what
“quality” means and what constitutes a quality corpus. The notion of quality, much like the concept
of information, is employed in various contexts, often without a clear or consistent definition [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
According to the Oxford English Dictionary, quality is “the standard of something when it is compared
to other things like it; how good or bad something is” [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It is a synonym for the term “excellence”,
but defining quality solely as excellence offers limited practical guidance for establishing and applying
specific criteria in corpus evaluation.
        </p>
        <p>
          The British Standards Institution (BSI) provides a more functional definition which describes quality
as “the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs”
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In other words, the quality reflects how much a product or service fulfils its purpose [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Because
those needs vary from context to context, there is no absolute quality standard. Instead, the focus is on
specific attributes, such as reliability, efficiency, or robustness, that directly influence a resource’s ability
to satisfy its intended audience. This highlights the importance of establishing clear and appropriate
evaluation parameters while building a corpus and ensuring its suitability for the intended purpose.
        </p>
        <p>
          When turning this towards linguistic corpora, the notion of “quality” again depends on the project’s
aims. In the words of Wieczorkowska [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], “to obtain a representative and balanced corpus, the purpose
of creating a given corpus should be defined first, as should the target user”.
        </p>
        <p>
          In empirical theoretical linguistics, for example, researchers prioritise rigorously controlled sampling
and minimally invasive data cleaning to preserve authenticity. In contrast, computational linguistics and
language technology applications often require more aggressive preprocessing to optimise algorithmic
performance [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. These different priorities illustrate that a “quality corpus” is defined by a set of context-driven
criteria.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Pre-LLM Evaluation Criteria</title>
        <p>
          Several scholars have proposed criteria frameworks aimed at standardising corpora evaluation (see
Appendix). To start, Cooke [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] proposed a comprehensive set of ten parameters: (1) purpose, (2)
coverage, (3) authority and reputation, (4) accuracy, (5) currency and maintenance, (6) accessibility,
(7) information presentation and layout, (8) ease of use, (9) comparison with other sources, and (10)
overall quality. This extensive list reflected an integrative approach to the evaluation of digital content,
encompassing both technical and content-related considerations.
        </p>
        <p>
          In addition, other authors advocated for simpler models, such as Alexander and Tate [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], who
suggested five core parameters: authority, coverage, objectivity, accuracy, and currency. These dimensions
balance reliability and relevance while ensuring that the content reflects real and current
perspectives. Codina [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], by contrast, proposed a slightly different configuration, adding usability-oriented
parameters such as ergonomics, brightness, and visibility to more traditional factors like authority and
content.
        </p>
        <p>
          Jiménez Piano and Ortiz-Repiso Jiménez [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] offered another variation by identifying five core
aspects in evaluating digital resources: search and retrieval functionalities, authority, content, resource
management, and design. These categories highlighted the interplay between user interaction and
content quality. Added to this, the comparative analysis conducted by Gordon-Murnane [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] on twelve
web evaluation services also underscored this variability. The findings revealed that content was the
only universally recognised evaluation parameter, suggesting a significant divergence in how evaluation
priorities are conceived and implemented.
        </p>
        <p>
          Across the literature, certain parameters, such as authority, accuracy, presentation, and currency,
emerged as consistently valued, as shown below in Table 1. These are often cited either as stand-alone
criteria [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] or as components of broader constructs like “content quality” [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Additionally, more
recent contributions highlighted the growing importance of design in digital environments. Other
scholars, such as Gaffney [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], Adreon et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], and the Pearl K. Wise Library [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], point to visual
and structural design as critical factors influencing users’ perceptions of quality and usability [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
        </p>
        <p>
          However, there is still a lack of consensus on which criteria are the most essential or how they
should be prioritised, according to Buendía Castro and Ureña Gómez-Moreno [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. The variability
among proposed frameworks reflects different priorities and evaluative traditions, often shaped by the
technological and academic contexts in which they have emerged over the years.
        </p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Evaluation criteria across pre-LLM frameworks.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Evaluation criteria</th>
                <th>Cooke (1999)</th>
                <th>Alexander &amp; Tate (1999)</th>
                <th>Codina (2000)</th>
                <th>Jiménez Piano &amp; Ortiz-Repiso (2007)</th>
                <th>Other authors*</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>Purpose</td><td>x</td><td/><td/><td/><td/></tr>
              <tr><td>Coverage</td><td>x</td><td>x</td><td/><td/><td/></tr>
              <tr><td>Authority and reputation</td><td>x</td><td>x</td><td>x</td><td>x</td><td/></tr>
              <tr><td>Accuracy</td><td>x</td><td>x</td><td/><td/><td/></tr>
              <tr><td>Currency and maintenance</td><td>x</td><td>x</td><td/><td/><td/></tr>
              <tr><td>Accessibility</td><td>x</td><td/><td/><td>x (search/retrieval)</td><td/></tr>
              <tr><td>Presentation / Layout Design</td><td>x</td><td/><td>x (visibility, brightness)</td><td>x (design)</td><td>x</td></tr>
              <tr><td>Ease of use / Usability</td><td>x</td><td/><td>x (ergonomics)</td><td/><td/></tr>
              <tr><td>Objectivity</td><td/><td>x</td><td/><td/><td/></tr>
              <tr><td>Content</td><td/><td/><td>x</td><td>x</td><td/></tr>
              <tr><td>Resource management</td><td/><td/><td/><td>x</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Lastly, it is crucial to recognise that these evaluation frameworks were developed in a pre-Large
Language Model (LLM) era. As valuable as these criteria may be, they emerged before the widespread
adoption of Artificial Intelligence tools capable of generating, summarising, interpreting, and even
creating corpora and digital content with unprecedented fluency and scale. The emergence of LLMs
represents a paradigm shift in how linguistic corpora are produced, accessed, and used.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Quality corpus criteria in the LLM era</title>
        <p>
          Training LLMs requires huge amounts of textual data, but quantity alone is not enough: Data quality
plays a decisive role in achieving good model performance. Although LLMs are often trained on massive
aggregated corpora [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], these datasets must strike a careful balance between data volume and quality.
Raw sources such as CommonCrawl [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], while abundant, are often noisy and unstructured, making
them inefficient and less effective for immediate use in pre-training.
        </p>
        <p>
          To address this issue, the NLP community has developed refined corpora that apply rigorous filtering
and cleaning processes to transform raw data into high-quality, structured training resources. Examples
include C4 [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], RedPajama [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], SlimPajama [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] and DCLM-baseline [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], which use techniques such
as scoring models, deduplication with MinHash, and heuristic rule-based filtering. These datasets
represent a major step forward in addressing issues such as redundancy, low linguistic quality, and the
presence of irrelevant or toxic content.
        </p>
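      <p>
        As a concrete illustration, the MinHash deduplication step mentioned above can be sketched in a few
lines of Python. This is a minimal sketch for exposition only: the shingle size, signature length, and
hashing scheme are illustrative assumptions, not the settings used by any of the cited datasets.
      </p>
      <preformat>
```python
import hashlib

NUM_HASHES = 64   # illustrative signature length (assumed)
SHINGLE_SIZE = 3  # word n-gram size used for shingling (assumed)

def shingles(text):
    """Split a document into overlapping word n-grams."""
    words = text.lower().split()
    n = SHINGLE_SIZE
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text):
    """One value per seed: the smallest hash observed over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        best = None
        for sh in shingles(text):
            h = int(hashlib.md5(f"{seed}:{sh}".encode()).hexdigest(), 16)
            if best is None or best > h:
                best = h
        sig.append(best)
    return sig

def jaccard_estimate(sig_a, sig_b):
    """The fraction of matching positions approximates Jaccard similarity."""
    same = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return same / len(sig_a)

s1 = minhash_signature("the quick brown fox jumps over the lazy dog again and again")
s2 = minhash_signature("the quick brown fox jumps over the lazy dog again and again today")
s3 = minhash_signature("completely unrelated text about corpus quality evaluation methods")
print(jaccard_estimate(s1, s2))  # high: near-duplicate pair
print(jaccard_estimate(s1, s3))  # low: unrelated documents
```
      </preformat>
      <p>
        In production pipelines this estimate is combined with locality-sensitive hashing, so that only candidate
pairs whose signature bands collide are ever compared, avoiding a quadratic number of pairwise comparisons.
      </p>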
        <p>
          More recently, datasets such as RefinedWeb [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], FineWeb [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] and FineWeb-2 [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] have set new
benchmarks for data quality, significantly improving the efficiency and effectiveness of LLM pre-training.
FineWeb, for instance, was created by Hugging Face using a multi-stage pipeline that includes URL
filtering, language classification and custom quality filters targeting line punctuation ratios, short line
prevalence, average words per line, n-gram repetition, and document length [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ].
        </p>
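      <p>
        Heuristics of this kind are straightforward to express at the document level. The sketch below is a
hedged approximation in the spirit of the filters just described; every threshold is an assumed placeholder,
not a value published with FineWeb.
      </p>
      <preformat>
```python
# Document-level heuristics in the spirit of the filters described above.
# All thresholds are assumed placeholders, not FineWeb's published values.
MIN_WORDS = 50              # minimum document length (assumed)
MIN_AVG_WORDS_PER_LINE = 3  # assumed
MAX_SHORT_LINE_RATIO = 0.5  # share of lines with fewer than 4 words (assumed)
MIN_PUNCT_LINE_RATIO = 0.3  # share of lines ending in terminal punctuation (assumed)

def passes_heuristics(text):
    """Return True if a document clears every quality heuristic."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    total_words = sum(len(ln.split()) for ln in lines)
    if MIN_WORDS > total_words:
        return False
    if MIN_AVG_WORDS_PER_LINE > total_words / len(lines):
        return False
    short = sum(1 for ln in lines if 4 > len(ln.split()))
    if short / len(lines) > MAX_SHORT_LINE_RATIO:
        return False
    punct = sum(1 for ln in lines if ln.endswith((".", "!", "?")))
    if MIN_PUNCT_LINE_RATIO > punct / len(lines):
        return False
    return True

article = "\n".join(["This line reads like a complete, well punctuated sentence."] * 10)
menu = "Home\nAbout\nContact\nLogin\nRegister"
print(passes_heuristics(article))  # True
print(passes_heuristics(menu))     # False
```
      </preformat>
      <p>
        Running such checks before training discards navigation menus, boilerplate, and fragmentary pages
while keeping prose-like documents, which is precisely the behaviour the custom filters above target.
      </p>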
        <p>
          Additionally, a seminal contribution by Zhou et al. [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] is the development of LIMA (Less Is More for
Alignment), an LLM designed to investigate the art of instruction tuning: rather than the volume of
data, it is the quality that dictates the model’s performance. Remarkably, LIMA demonstrates that even
a limited amount of carefully curated, high-quality data can significantly enhance a model’s ability to
follow instructions. While this underscores the critical role of data quality, the question of automatically
identifying high-quality data from a vast ocean of available datasets remains under investigation.
        </p>
        <p>
          These advances reflect a broader consensus: while data volume remains important, high-quality,
diverse and well-selected sources, often drawn from specific domains such as scientific literature, books
or encyclopaedic content, are critical for improving the capabilities of models [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ],[
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]. Content scraped
from the web alone is no longer considered sufficient; rather, careful data selection has become an
essential component of the training pipeline.
        </p>
        <p>Lastly, these developments highlight a shift in the way the NLP field approaches data: no longer as a
mere volume-driven input, but as a carefully designed and evaluated component that is central to the
success of modern LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Main Hypothesis and Objectives</title>
      <p>The present thesis addresses a critical methodological gap in NLP: the lack of a systematic, interpretable,
and scalable framework for evaluating the quality of linguistic corpora used in the fine-tuning of Large
Language Models (LLMs). While the field has significantly progressed in model architecture, scale, and
capabilities, corpus evaluation practices have failed to keep pace. Current methods are often based on
paradigms that precede LLMs and do not capture the nuanced aspects of data quality essential for the
effective adaptation of these models to specific domains and tasks.</p>
      <p>This study is grounded in the hypothesis that corpus quality is a key determinant of model
performance. A well-constructed, high-quality corpus can significantly enhance a language model’s accuracy,
robustness, and fairness, whereas low-quality data can result in biased, unreliable, or ineffective systems.
Consequently, developing rigorous and clearly defined evaluation methodologies is essential to ensure
that existing corpora meet the standards required for fine-tuning LLMs responsibly and effectively.</p>
      <p>Given the fundamental role of the corpora’s quality in model results, there is an urgent need for an
evaluation system that goes beyond technical metrics. Such a system must also integrate linguistic
knowledge to identify potential weaknesses or representational gaps in the data, factors that can
profoundly influence model performance. The challenge is to create an evaluation framework that is
robust, reproducible, scalable and adaptable to the increasing complexity of LLMs.</p>
      <p>To this end, the following specific objectives are proposed:
• O1: To define and systematise a comprehensive set of quality criteria for linguistic corpora used
in fine-tuning LLMs.
• O2: To propose a classification methodology that categorises corpora into quality levels (raw,
bronze, silver, and gold) according to the defined criteria.
• O3: To design an interpretable and, as far as possible, automatable corpus quality metric that
operationalises these criteria in measurable terms.
• O4: To apply the proposed framework to evaluate existing corpora, thereby demonstrating its
applicability.
• O5: To validate the framework’s utility by analysing existing corpora and conducting fine-tuning
experiments to explore the relationship between corpus quality and model performance.</p>
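      <p>
        The aggregation envisaged in O2 and O3 can be pictured with a small sketch. The criterion names,
weights, and cut-off points below are hypothetical placeholders used only to show how per-criterion scores
could be mapped to the raw, bronze, silver, and gold levels; the thesis's actual metric remains to be defined.
      </p>
      <preformat>
```python
# Hypothetical aggregation sketch for objectives O2 and O3. Criterion
# names, weights, and cut-off points are illustrative assumptions only.
WEIGHTS = {"authority": 0.25, "accuracy": 0.35, "coverage": 0.20, "currency": 0.20}
LEVELS = [(0.85, "gold"), (0.65, "silver"), (0.40, "bronze"), (0.00, "raw")]

def quality_level(scores):
    """Map per-criterion scores (each in the 0-1 range) to a quality level."""
    total = sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    for threshold, label in LEVELS:
        if total >= threshold:
            return label
    return "raw"

print(quality_level({"authority": 0.9, "accuracy": 0.9, "coverage": 0.8, "currency": 0.9}))  # gold
print(quality_level({"authority": 0.5, "accuracy": 0.5, "coverage": 0.5, "currency": 0.5}))  # bronze
print(quality_level({}))  # raw
```
      </preformat>
      <p>
        Keeping the mapping this transparent is what makes the metric interpretable: a corpus's level can
always be traced back to the individual criterion scores that produced it.
      </p>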
      <p>To summarise, the present thesis aims to address a notable gap in the literature through these
interconnected objectives. It seeks to provide researchers with a scalable, linguistically grounded tool
for ensuring the quality of fine-tuning datasets. This applies regardless of the language in which the
datasets are created or the domain for which they are intended, ensuring compliance with the stringent
requirements of contemporary LLM architectures.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Methodology</title>
      <p>
        Following Creswell and Plano Clark [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], this thesis will adopt a methodology based on a mixed
approach, combining the theoretical definition of quality criteria with their practical application and
empirical validation through experiments with language models fine-tuned on corpora of different
quality levels. The research unfolds through the following successive stages:
1. Literature Review. This study begins with a comprehensive review of the state of the art of
corpus evaluation and data quality in the field of NLP, covering both pre-LLMs frameworks and
subsequent emerging proposals. Through this review, it is intended to identify methodological
gaps, define the most recurrent quality criteria, and discover possible indicators specifically
relevant for corpora used in fine-tuning LLMs.
2. Examining Corpus Quality Criteria. Based on the literature review, the most commonly used
quality criteria in corpus evaluation (such as authority, accuracy, and coverage) will be identified
and systematised. In addition, new indicators will be proposed, specifically designed to address
the challenges and particularities posed by the use of corpora in contexts involving large-scale language
models (LLMs). Each criterion, both traditional and newly introduced, will be precisely defined
and accompanied by a reasoned justification outlining its relevance and impact on the overall
quality of the corpus.
3. Design of a Quality Metric. Based on the defined criteria, a quantitative evaluation methodology
will be developed to assign each corpus a quality level: raw, bronze, silver or gold. This metric will
combine automated measures with supervised linguistic evaluations to ensure interpretability
and reproducibility.
4. Evaluation of Existing Corpora. The proposed metrics will be applied to a selection of reference
datasets ranking them according to established quality levels. This will test the metric’s behaviour
on various types of resources and calibrate the threshold values separating each quality level.
5. Empirical Validation through Fine-Tuning. Finally, lightweight, parameter-efficient
fine-tuning experiments (e.g., using LoRA or adapters) may be conducted on corpora of different quality
levels. The goal is to explore the relationship between corpus quality and LLM behaviour by
assessing how differences in data quality influence model performance, robustness, and fairness.
      </p>
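      <p>
        For stage 5, a parameter-efficient set-up of the kind mentioned (LoRA) might look as follows, assuming
the Hugging Face peft library; the hyperparameters and target modules are illustrative assumptions, not the
thesis's planned configuration.
      </p>
      <preformat>
```python
# Illustrative LoRA configuration for the stage-5 fine-tuning experiments,
# assuming the Hugging Face peft library is available. All hyperparameters
# and target modules are placeholder assumptions.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # low-rank adapter dimension (assumed)
    lora_alpha=16,       # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
```
      </preformat>
      <p>
        The same adapter configuration would be trained once per quality level (raw, bronze, silver, and gold),
keeping the base model and optimiser fixed, so that differences in downstream performance, robustness, and
fairness can be attributed to the data rather than to the training set-up.
      </p>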
      <p>Through this methodology, the present thesis will provide a solid conceptual model for corpus
evaluation in the LLM era and will equip the NLP community with effective and empirically validated tools
to improve the reliability and fairness of modern language technologies.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Research Issues to Discuss</title>
      <p>To conclude, this section poses key questions to guide the research, identify gaps and develop a
comprehensive framework for assessing corpus quality. Addressing these questions will help refine the
focus of the study and contribute to the broader field of NLP by exploring more effective approaches to
corpus evaluation for fine-tuning LLMs.</p>
      <p>• RQ1. How can the quality of a corpus be evaluated? Corpus quality evaluation is central to
this research. What are the most effective ways to assess corpus quality, particularly in the
context of fine-tuning large language models? Can existing evaluation methods be adapted to
reflect modern datasets’ complexities and model requirements more accurately? It is crucial to
explore whether traditional evaluation metrics can capture the nuanced aspects of data quality
that affect the effectiveness of LLMs, especially as the scale and diversity of corpora continue to
grow.
• RQ2. What methodologies exist for corpus evaluation? Numerous methodologies have been
proposed for evaluating corpus quality, but many are outdated or insufficient for the needs of large
language models. What are the strengths and limitations of current evaluation methodologies,
and are there other, potentially more suitable methods for assessing corpus quality? This question
emphasises the need for an updated and comprehensive approach incorporating technical and
linguistic considerations to determine a corpus’s suitability for LLM fine-tuning.
• RQ3. How does corpus quality impact the performance of language models? The
relationship between corpus quality and model performance is crucial. To what extent do variations
in data quality affect a model’s effectiveness, robustness, and fairness during fine-tuning? Can
improvements in corpus quality result in significant gains in the overall performance of a language
model? Exploring this dynamic will help determine the role of corpus evaluation in responsible
and effective LLM adaptation.</p>
      <p>The discussion generated by these questions and additional considerations that may arise throughout
the research process will play an essential role in enriching the direction of the PhD thesis.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author has not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>Appendix: Definitions of Evaluation Criteria</title>
      <p>
        Purpose: Refers to its goals and communicative intent, including the intended audience, thematic
scope, and any declared limitations in coverage; this helps assess whether the resource meets the user’s
information needs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Coverage: Refers to the subject areas and types of materials included, as well as the breadth (variety
of topics), depth (level of detail), and any stated limitations that define the resource’s scope [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Authority and Reputation: Evaluates the identity and credibility of the authors or institutions
responsible for the resource, considering their expertise, professional affiliation, prior publications, and
recognised standing in the relevant field [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref18">18</xref>
        ][
        <xref ref-type="bibr" rid="ref19">19</xref>
        ][
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Accuracy: Refers to how factually correct and verifiable the information in a resource is, including
the presence of reliable sources, data, and objective evidence to support claims [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Currency and Maintenance: Assesses whether the resource is regularly updated, includes recent
information, removes outdated content, and offers clear mechanisms for indicating revisions or accessing
previous versions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Accessibility: Encompasses the ease with which a user can access and use the resource, considering
technical factors such as URL stability, device compatibility, open or restricted access, and the absence
of unnecessary usage barriers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Presentation/Layout Design: Refers to the visual and structural design of the resource, including
the use of colours, images, the layout of text, and navigation tools, and the extent to which these
elements support understanding and improve the delivery of information [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref19">19</xref>
        ][
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Luminosity: Refers to the number of external links a website contains to other web pages. It
measures how many outgoing references the site provides, contributing to its integration within the
broader web ecosystem [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Visibility: Refers to the number of other websites that link to the site being analysed. Also known
as “popularity”, it is a factor used by some search engines to estimate the relevance of a website [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Ease of Use/Usability: Refers to the overall user experience while navigating the resource, including
structural clarity, availability of help tools, intuitive browsing, and design choices that reduce cognitive
effort, closely related to accessibility and design [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Ergonomics: Refers to the ease of reading and using a website, considering factors such as the
appropriate contrast between text and background that facilitate the readability of the information [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Objectivity: Assesses whether the information is presented in a neutral and balanced way, especially
on controversial topics, and whether the author’s position is clearly stated without manipulating facts
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Content: Refers to the core information provided by the resource, evaluated based on its validity,
accuracy, completeness, originality, intellectual organisation, timeliness, and relevance to the intended
audience [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Resource Management: Includes the planning, task assignment, maintenance protocols, quality
control, and organisational policies that support the long-term stability and continuous improvement
of the web resource [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bildhauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Evert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goldhahn</surname>
          </string-name>
, U. Quasthoff,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          , T. Zesch,
          <article-title>Scalable construction of high-quality web corpora</article-title>
          ,
          <source>Journal for Language Technology and Computational Linguistics</source>
          <volume>28</volume>
          (
          <year>2013</year>
          )
          <fpage>23</fpage>
          -
          <lpage>59</lpage>
. doi:10.21248/jlcl.28.2013.175.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Austermühl</surname>
          </string-name>
          , Electronic Tools for Translators, St. Jerome, Manchester,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample,
<article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Penedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Malartic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hesslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cojocaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Alobeidli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pannier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Almazrouei</surname>
          </string-name>
          ,
<string-name>
  <given-names>J.</given-names>
  <surname>Launay</surname>
</string-name>
,
          <article-title>The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only</article-title>
          ,
          <year>2023</year>
. URL: https://arxiv.org/abs/2306.01116. arXiv:2306.01116.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
<article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
,
          <year>2015</year>
. URL: https://arxiv.org/abs/1506.06724. arXiv:1506.06724.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chelba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brants</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          , T. Robinson,
          <article-title>One billion word benchmark for measuring progress in statistical language modeling</article-title>
          ,
          <year>2014</year>
. URL: https://arxiv.org/abs/1312.3005. arXiv:1312.3005.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
<surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <year>2023</year>
. URL: https://arxiv.org/abs/1910.10683. arXiv:1910.10683.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
<given-names>J.</given-names>
<surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
<source>OpenAI blog</source>
<volume>1</volume>
(
<year>2019</year>
)
<fpage>9</fpage>
.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Phang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nabeshima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Presser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leahy</surname>
          </string-name>
          ,
<article-title>The Pile: An 800GB dataset of diverse text for language modeling</article-title>
          ,
          <year>2020</year>
. URL: https://arxiv.org/abs/2101.00027. arXiv:2101.00027.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yauney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
<article-title>A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, &amp; toxicity</article-title>
,
          <year>2023</year>
. URL: https://arxiv.org/abs/2305.13169. arXiv:2305.13169.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
<article-title>Tinystories: How small can language models be and still speak coherent english?</article-title>
,
          <year>2023</year>
. URL: https://arxiv.org/abs/2305.07759. arXiv:2305.07759.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Scibert: A pretrained language model for scientific text</article-title>
          ,
          <year>2019</year>
. URL: https://arxiv.org/abs/1903.10676. arXiv:1903.10676.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Thoppilan</surname>
          </string-name>
          ,
D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Meier-Hellstern, M. R. Morris, T. Doshi, R. Delos Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera-Arcas, C. Cui, M. Croak, E. Chi, Q. Le
          ,
          <article-title>Lamda: Language models for dialog applications</article-title>
          ,
          <year>2022</year>
. URL: https://arxiv.org/abs/2201.08239. arXiv:2201.08239.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cooke</surname>
          </string-name>
          ,
          <article-title>A Guide to Finding Quality Information on the Internet: Selection and Evaluation Strategies</article-title>
          , 2nd ed.,
          <source>Library Association Publishing</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
[15] Oxford English Dictionary, Quality, https://www.oed.com/dictionary/quality_n, n.d. Retrieved April 15,
<year>2025</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
British Standards Institution
          ,
          <article-title>Quality management and quality assurance: vocabulary</article-title>
          , British Standards Institution, London,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wieczorkowska</surname>
          </string-name>
          ,
          <article-title>Methodology for obtaining high-quality speech corpora</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
<fpage>1848</fpage>
. URL: https://doi.org/10.3390/app15041848. doi:10.3390/app15041848.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
<string-name>
  <given-names>M.</given-names>
  <surname>Tate</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Alexander</surname>
</string-name>
,
<article-title>Web Wisdom: How To Evaluate and Create Information Quality on the Web (1st ed.)</article-title>
, CRC Press,
<year>1999</year>
. URL: https://doi.org/10.1201/9780429195556. doi:10.1201/9780429195556.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Codina</surname>
          </string-name>
, Parámetros e indicadores de calidad para la evaluación de recursos digitales, in:
<source>Actas de las VII Jornadas Españolas de Documentación. La gestión del conocimiento: retos y soluciones de los profesionales de la información</source>
, Bilbao, España,
          <year>2000</year>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
<string-name>
  <given-names>M.</given-names>
  <surname>Jiménez Piano</surname>
</string-name>
,
<string-name>
  <given-names>V.</given-names>
  <surname>Ortiz-Repiso Jiménez</surname>
</string-name>
,
<source>Evaluación y calidad de sedes web</source>
, Ediciones Trea, S.L., Gijón,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
<surname>Gordon-Murnane</surname>
          </string-name>
          ,
          <article-title>Evaluating net evaluators</article-title>
          ,
          <source>Searcher</source>
          <volume>7</volume>
          (
          <year>1999</year>
          )
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
<string-name>
  <given-names>P. F.</given-names>
  <surname>Anderson</surname>
</string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>Allee</surname>
</string-name>
,
<string-name>
  <given-names>S.</given-names>
  <surname>Grove</surname>
</string-name>
,
<string-name>
  <given-names>S.</given-names>
  <surname>Hill</surname>
</string-name>
          ,
          <article-title>Development of a web evaluation tool in a clinical environment</article-title>
          , http://www-personal.umich.edu/~pfa/pro/courses/WebEvalNew.pdf,
          <year>1999</year>
. Accessed: 2025-04-09.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
<surname>Gaffney</surname>
          </string-name>
          ,
          <source>Website evaluation checklist v1.1</source>
          , http://www.infodesign.com.au/ftp/WebCheck.pdf,
          <year>1998</year>
. Accessed: 2025-04-12.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Adreon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Catey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Strysick</surname>
          </string-name>
          ,
          <article-title>An educator's guide to credibility and web evaluation</article-title>
          , http://www.ed.uiuc.edu/wp/credibility-2002/index.html,
          <year>2002</year>
          . Accessed: 2025-04-16.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <institution>Pearl K. Wise Library</institution>
          ,
          <article-title>Web evaluation form</article-title>
          , http://www.cpsd.us/CRLS/Library/PDFs/WebEvaluationForm.pdf,
          <year>2006</year>
          . Accessed: 2025-04-16.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buendía Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Ureña Gómez-Moreno</surname>
          </string-name>
          ,
          <article-title>¿Cómo diseñar un corpus de calidad? Parámetros de evaluación [How to design a quality corpus? Evaluation parameters]</article-title>
          ,
          <source>Sendebar: Revista de la Facultad de Traducción e Interpretación</source>
          (
          <year>2010</year>
          )
          <fpage>165</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967411061d95457d756-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <collab>Qwen Team</collab>
          ,
          <article-title>Qwen2.5: A party of foundation models!</article-title>
          , http://qwenlm.github.io/blog/qwen2.5/,
          <year>2024</year>
          . Blog; accessed 2025.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Agnew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Groeneveld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <article-title>Documenting large webtext corpora: A case study on the colossal clean crawled corpus</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08758</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <collab>Together Computer</collab>
          ,
          <article-title>RedPajama: An open source recipe to reproduce LLaMA training dataset</article-title>
          , https://github.com/togethercomputer/RedPajama-Data,
          <year>2023</year>
          . Accessed 2025.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>D.</given-names>
            <surname>Soboleva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Al-Khateeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Steeves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hestness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <article-title>SlimPajama: A 627B token cleaned and deduplicated version of RedPajama</article-title>
          , https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama,
          <year>2023</year>
          . URL: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Smyrnis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ivgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gadre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Keh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heckel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mercat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Albalak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bitton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nezhurina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kilian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pratt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Daras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Marathe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gokaslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chandu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vasiljevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kakade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanghavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Faghri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pouransari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Groeneveld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soldaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jitsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kollar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Dimakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Carmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <article-title>DataComp-LM: In search of the next generation of training sets for language models</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>G.</given-names>
            <surname>Penedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kydlíček</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Ben</given-names>
            <surname>Allal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozhkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Von</given-names>
            <surname>Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>The fineweb datasets: Decanting the web for the finest text data at scale</article-title>
          , in:
          <source>The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=n6SCkn2QaG.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>G.</given-names>
            <surname>Penedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kydlíček</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sabolcec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Messmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Foroutan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>FineWeb2: A sparkling update with 1000s of languages</article-title>
          , https://huggingface.co/datasets/HuggingFaceFW/fineweb-2, December
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Efrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>LIMA: Less is more for alignment</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2305.11206. arXiv:2305.11206.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Le Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hesslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bekman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Elsahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Phang</surname>
          </string-name>
          , et al.,
          <article-title>What language model to train if you have one million gpu hours?</article-title>
          ,
          <source>arXiv preprint arXiv:2210.15424</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2210.15424.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          ,
          <source>arXiv preprint arXiv:1907.11692</source>
          (
          <year>2019</year>
          ). URL: https://arxiv.org/abs/1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Creswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L. P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <source>Designing and Conducting Mixed Methods Research</source>
          , 3rd ed.,
          <publisher-name>Sage Publications</publisher-name>
          , Thousand Oaks, CA,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>