<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Survey of Web Content Control for Generative AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Dinzinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Heß</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Granitzer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Passau</institution>
          ,
          <addr-line>Innstraße 41, 94032 Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The groundbreaking advancements around generative AI have recently caused a wave of concern culminating in a series of lawsuits, including high-profile actions against Stability AI and OpenAI. This legal uncertainty has sparked a broad discussion on the rights of content creators and publishers to protect their intellectual property on the web. European as well as US law already provides rough guidelines, setting a direction for technical solutions to regulate web data use. In this context, researchers and practitioners have worked on numerous web standards and opt-out formats that empower publishers to keep their data out of the development of generative AI models. The emerging AI/ML opt-out protocols are valuable with regard to data sovereignty, but they also create an adverse situation for site owners, who are overwhelmed by the multitude of recent ad hoc standards to consider. In our work, we survey the different proposals, ideas and initiatives, and provide a comprehensive legal and technical background in the context of the current discussion on web publishers’ control.</p>
      </abstract>
      <kwd-group>
        <kwd>Web crawling</kwd>
        <kwd>Text &amp; Data Mining</kwd>
        <kwd>Content control</kwd>
        <kwd>Data protection</kwd>
        <kwd>Generative AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The recent advancements in text- and image-generating Large Language Models (LLMs), such as
ChatGPT, LLaMA, and Stable Diffusion, have brought Artificial Intelligence (AI) to the forefront of
our daily lives [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. This technological breakthrough has sparked widespread excitement regarding
its potential for increased productivity and societal progress. However, it has also raised significant
concerns due to its potentially disruptive effects [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. One of the primary issues is that LLMs are trained
on vast amounts of data from the web, often collected without the explicit permission of the authors.
Furthermore, these models cannot trace back and cite the original sources of the human-generated
content they were trained on. This poses a particular problem for publishers and content creators who
want to preserve the authenticity and economic value of their original content. As a result, they are
increasingly seeking ways to keep their web content out of the training datasets for these models, in
an effort to safeguard their intellectual property on the web. The mechanisms currently available to
enforce such restrictions are, however, inadequate, leading to a significant gap in web publishers’ ability
to control how their data is used downstream.
      </p>
      <p>Regulations for online data providers and consumers are a key measure for overcoming this
shortcoming, and the existing legal framework already provides a rough direction. It is now up to the web
community to establish simple and practical solutions that meet these requirements. New technical
standards may fill this gap and give rightsholders finer control over how their data may be
used by well-intentioned practitioners. In this context, a web standard refers to a technical specification
(protocol specification, RFC, etc.) endorsed by a recognized web standards organization such as the W3C
or the IETF. In contrast, pseudo standards are widely used specifications not formally recognized by such
organizations, while ad hoc standards are emerging ideas or practices concerning web technologies that
are not yet widely adopted or officially standardized.</p>
      <p>There has been a notable effort among researchers and practitioners to bridge the communication
gap between data providers and users, leading to the development of various ad hoc standards. These
initiatives range from small software solutions to large-scale community projects. Our research aims
to contribute to these efforts by evaluating recent approaches and examining the legal and technical
aspects of this complex issue. Our work is focused on three main areas:
• The legal background around intellectual property and data protection, including the EU’s 2019 DSM
Directive, which offers a regulatory framework for opting out of Text &amp; Data Mining (Section 2).
• A comprehensive review of past and present standards for controlling web data usage (Section 3).
• An in-depth evaluation of recent ad hoc standards and an analysis of their practical application
(Section 4).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Legal background</title>
      <p>This section details the legal background in which any standard on web data protection has to be embedded.
The protection itself arises mainly from intellectual property and data protection law. Unfortunately,
as a general rule, each country’s legislation applies exclusively within its own borders, so that this
background varies from one country to another. Harmonization has only been achieved to a limited
extent. We therefore focus on EU and US law due to their major global importance.</p>
      <sec id="sec-2-1">
        <title>2.1. Intellectual property</title>
        <p>2.1.1. Scope</p>
        <p>
          Web usage standards are mostly tailored towards intellectual property. This comprises copyright and
so-called related rights. For copyright protection, EU and US law provide essentially the same criteria. EU
law has no codified definition, but the ECJ (European Court of Justice) derives a two-step test from the
overall framework [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. A work must be, on the one hand, sufficiently original (an own intellectual creation
representing the person’s personality) and, on the other hand, an expression of such originality
(identifiable with sufficient precision and objectivity) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The standard for quality and quantity required
for a work to be considered copyright protected is, however, low: as few as eleven words can
suffice [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Most content will thus be protected by copyright.
        </p>
        <p>This protection generally starts at the moment of creation and the registration of the corresponding
work is not necessary. In the US, registration with the Register of Copyrights is an option but not
mandatory for basic protection. The initial rightsholder is the person who physically brought the
work into existence. In copyright terms, this person is referred to as the author, irrespective of whether
they created art, books, music, videos, or any other work. In the area of social media, the initial
rightsholders are thus content creators, not host platforms. This applies even if an author has only
realized someone else’s concept. These rightsholders are conferred certain exclusive rights, inter alia,
the right to reproduction, i.e., creating a copy by any means and in any form, no matter how small or
limited in time. When someone wishes to reproduce a given work, they need either the consent of the
rightsholder or a copyright exception to apply.</p>
        <p>Regardless of the purpose, during web crawling and scraping, the HTML file is downloaded, the full
text extracted and analysed. In each case, a local copy and thus a reproduction is created. Justification
for the reproductions in the crawling and scraping process is therefore always required. The applicable
law depends on the location of the server that hosts the content. Further steps depend on the concrete
purpose that may include more reproductions or other implications with copyright. One such case is
the training of generative AI models, in which parts of the training data are potentially imitated or
reproduced by the model output.</p>
        <p>Apart from copyright, intellectual property also includes related rights. These are either based on
the creation of new content (e.g. performers) or an economic investment (e.g. databases or press
publications). While EU law provides a range of rights, traditionally US law tends to be less expansive.</p>
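        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Provision</th><th>Purpose</th><th>Objections?</th><th>Waive by contract?</th></tr>
            </thead>
            <tbody>
              <tr><td>Art. 4 DSMD</td><td>Irrelevant</td><td>Yes</td><td>Yes</td></tr>
              <tr><td>Art. 3 DSMD</td><td>Scientific research</td><td>No</td><td>No</td></tr>
            </tbody>
          </table>
        </table-wrap>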
        <sec id="sec-2-1-1">
          <title>2.1.2. European law</title>
          <p>
            EU copyright law provides various statutory exceptions that used to cover some types of web crawling
and scraping but not all of them [
            <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
            ]. The remaining cases were decided by national law. For instance,
crawling for conventional web indexing was considered legitimate because it is beneficial to all parties
involved [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. The German Supreme Court interpreted this as an implied consent, e.g. in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. Websites
largely endorse crawling by Google, Microsoft and the like because they indirectly profit
from the user traffic that results from displaying the site on search result pages. If an author does not
object (e.g. via the Robots Exclusion Protocol), it must be assumed that they wish to participate in this
practice.
          </p>
          <p>This lack of a specific rule was addressed in 2019 when the EU enacted the Directive on Copyright
in the Digital Single Market (DSMD). Art. 2 (2) DSMD defines Text &amp; Data Mining (TDM) as “any
automated analytical technique aimed at analysing text and data in digital form in order to generate
information which includes but is not limited to patterns, trends and correlations”. This encompasses
all types of TDM, including web crawling and scraping, but also further techniques such as training an ML
model.</p>
          <p>Art. 4 DSMD lays down the general rule that there shall be an exception or limitation for reproductions
and extractions of lawfully accessible works and other subject matter (see Table 1). Reproductions are
thus permissible if carried out with lawful access (which does not presuppose a lawful upload beforehand). The
reproductions may be retained for as long as it is necessary for the purposes of Text &amp; Data Mining.
Rightsholders can object if they expressly reserve the use of the works in an appropriate manner
(“opt-out”), e.g. through machine-readable metadata on web pages in case the content has been made
publicly available online.</p>
          <p>Additionally, Art. 3 DSMD provides specific rules for TDM for the purposes of scientific research.
These are more permissive since there is no possibility to object or waive by contract. It applies however
just to research organisations and cultural heritage institutions. Research organisations are any entity,
the primary goal of which is to conduct scientific research on a not-for-profit basis, by reinvesting all
the profits in its scientific research or pursuant to a public interest mission. Cultural heritage institution
means a publicly accessible library or museum, an archive or a film or audio heritage institution.</p>
          <p>Under the general rule, any web crawling and scraping is thus permissible by default. Authors can,
however, choose to opt out. The problem is that there is not yet a web standard for opting out of certain
uses only. The REP, for instance, is a suitable, machine-readable objection, but it merely allows for
a complete reservation without distinguishing between different purposes. This may be due to the
fact that the domain of crawling was decisively shaped by Search Engines. The protocol was
formed as an open standard with the collaboration of Google and consequently evolved towards the
requirements of web search and Search Engine Optimization (SEO). With the increasing prevalence of
AI technology, rightsholders may not wish to object to TDM as a whole but only to the training of
generative AI. From an EU perspective, web control for authors thus depends on the establishment of
more fine-grained standards.</p>
          <p>2.1.3. US law</p>
          <p>Under US law, there are different justifications related to Text &amp; Data Mining, but the so-called fair use
principle in § 107 Title 17 of the US Code is considered to be most suitable. In determining whether the
use made of a work is fair, four factors should be weighed:
(1) the purpose and character of the use,
(2) the nature of the copyrighted work,
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and
(4) the effect of the use upon the potential market for or value of the copyrighted work.</p>
          <p>
            Unlike EU law, US law offers no reliable opt-out mechanism; courts instead make an overall assessment.
TDM for Web indexing [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] or preservation purposes [11] was considered permissible. Concerning
TDM for AI training, the discussion is ongoing and there are still cases to come [12]. Both OpenAI
and Stability AI are currently involved in court proceedings. Rightsholders also dispute the first factor
(especially whether the use is transformative, i.e. whether something new is added), but the crux of the
matter will most probably lie within the (intertwined) fourth factor. Reproductions for web indexing or
research purposes were decided to have positive impacts on the market. For AI, the dynamics are likely
to differ, but the debate already starts with the question of what the relevant market even is: the market for
the end product or, beforehand, the market for training data. The latter can at least not be extended too
far, as theoretically every work can be used as training data.
          </p>
          <p>So while EU law is open to new standards for protection, US law has no statutory need. The situation
is still more open.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data protection</title>
        <p>In addition to copyright law, content may be protected by data protection law.</p>
        <p>In EU law, any processing of personal data needs a legal basis listed in Art. 6 GDPR. Processing
means any operation performed on personal data. A legal basis is thus needed for downloading, extracting
information, etc., i.e. for every single step in web crawling and scraping, but also later, e.g. when training
an AI. This applies not only to processors established within the EU (establishment rule), but also to any
processing activities where the data subject is within the EU and the processing is aimed at individuals
in the EU (marketplace rule).</p>
        <p>Consent could serve as a legal basis but will, most commonly, not be provided. It would have to be
given unambiguously, which cannot be assumed by the mere upload. Legitimacy therefore depends on
a balancing of interests as an alternative legal basis. The interests of the controller or other third parties
(including the general public) must be weighed against those of (a person in the place of) the data
subject, i.e. the person identified by the information contained in the content (which is not necessarily
the author in a copyright sense). If the data subject has further individual reasons, they can object at any
time. This then requires a reevaluation of the particular case. Unlike in intellectual property law, there is thus
no direct legal anchor for a protocol. The preference it expresses, however, is to be considered in the
weighing of interests. For conventional web indexing there are already differentiated lines of case law
(for delisting upon takedown request), but for the training of AI the situation is once again more open.</p>
        <p>In the US, there is no comprehensive federal data protection regulation. On a federal level, only
certain categories or industries are regulated (e.g. by the Health Insurance Portability and Accountability
Act). On a state level, some have enacted further regulations (e.g. the California Consumer Privacy Act).
Legality depends thus on the type of data or individual state laws.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Technical background</title>
      <p>The following section introduces the most important practices, ideas and initiatives that are relevant to
web data content control.</p>
      <sec id="sec-3-1">
        <title>3.1. Robots Exclusion Protocol</title>
        <p>In the context of web crawling, one protocol has prevailed as the dominant mechanism for content control.
The Robots Exclusion Protocol (REP) is a common standard among websites and regulates the access
of autonomous bots. The protocol encourages webmasters to state access rules for any non-human
visitors in a robots.txt file that is placed in the root directory of the web server. These files mainly
consist of allow and disallow instructions referring to URL paths, which are structured in groups
and assigned to user agents. REP was initially introduced in 1994 by Martijn Koster at Nexor and finally
codified as an IETF RFC specification in 2022 [13]. As a common tool for both webmasters and Search
Engine operators, REP restrains disproportionate server traffic and improves the efficiency of crawlers.</p>
        <p>The HTML meta tag robots and the HTTP response header X-Robots-Tag declare instructions
addressing autonomous web agents, similar to robots.txt. However, the tags are not formally included
in the REP standard and furthermore apply to a lower level of structural granularity. Whereas robots.txt
relates to the entire site, the embedded robots tags are specific to a single delivered HTML document.
Their instructions guide Search Engines on how to crawl, index and display information from this particular
web page, e.g., nofollow, noindex and max-snippet.</p>
        <p>Notably, the protocol does not specify any enforcement mechanisms. This lack of
enforcement, given the limited traceability of web agents, underlines the central role of trust on the
internet. Nonetheless, all major Search Engine operators generally respect the Robots Exclusion
standards and previous studies have shown that the protocol is widely adopted among websites [14, 15].</p>
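        <p>As a minimal sketch of how a compliant crawler might evaluate such grouped rules, the following Python snippet uses the standard-library module urllib.robotparser. The robots.txt content, the agent names ExampleBot and OtherBot, and the example.org URLs are illustrative assumptions and are not taken from any real site; the module's matching behaviour is only an approximation of what individual crawlers implement.</p>
        <preformat>
```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content with two groups of rules.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.org/index.html"),
        ("ExampleBot", "https://example.org/private/a.html"),
        ("OtherBot", "https://example.org/index.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```
        </preformat>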
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Relevant protocols and initiatives</title>
        <p>Due to its central role, there have been numerous efforts to extend REP and refine the communication
between web agents and servers. In 2007, organizations from the publishing industry proposed the
Automated Content Access Protocol (ACAP).<sup>1</sup> The protocol provides Search Engines with detailed information
on thumbnails and text snippets to be used for web indexing. The initial version, ACAP 1.0, was never
used extensively because it lacked the support of some major companies. From 2011 onwards,
the International Press Telecommunications Council (IPTC) governed the maintenance of ACAP, whose
subsequent version, ACAP 2.0, was rebranded as RightsML.<sup>2</sup> This protocol was tailored to the domain of
digital news media and not commonly used in the broad context of web crawling [16].</p>
        <p>The ambition of these earlier efforts by the publishing industry partially overlaps with the scope of
C2PA (Coalition for Content Provenance and Authenticity).<sup>3</sup> This recent initiative establishes new
technical standards for content authentication that verify the origin and history of web data. Among
other things, the C2PA specification defines so-called Training and Data Mining assertions. These assertions
can be associated with individual web assets and communicate the right to use the asset for Data Mining
and the training of AI models.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Relevant metadata standards</title>
        <p>In the last two decades, the domain of the Semantic Web has yielded numerous metadata standards for
the machine-readable annotation of web documents. These efforts have resulted in several formats
for expressing copyright and licensing information. Most notably, markups like the Dublin Core
Metadata Initiative (DCMI) Terms,<sup>4</sup> the WHATWG-specified Meta Extension rights-standard,<sup>5</sup> the
copyright Meta Tag<sup>6</sup> or the license Microformat<sup>7</sup> are used to specify relevant meta information as
structured data within the web page. Moreover, the WHATWG-specified work<sup>8</sup> Microdata type and the
Creative Commons Rights Expression Language (ccREL)<sup>9</sup> allow authors to include license information
directly in the body of HTML documents.</p>
        <p>A major drawback of many of these metadata standards is their lack of adoption among
wider parts of the web community. Such an ill fate also befell Do Not Track (DNT), whose
community group publicly announced its termination in 2019. DNT and its spiritual successor Global
Privacy Control (GPC)<sup>10</sup> are designed as optional HTTP request headers that allow internet users to
opt out from website tracking. Similar to ACAP, the DNT initiative was promising, but eventually
abandoned by the major players. This led to a lack of incentives for websites to respect it, resulting
in DNT’s inevitable failure.</p>
        <p><sup>1</sup> https://web.archive.org/web/20211011020458/http://the-acap.org/</p>
        <p><sup>2</sup> https://iptc.org/std/RightsML/2.0/RightsML_2.0-specification.html</p>
        <p><sup>3</sup> https://c2pa.org</p>
        <p><sup>4</sup> https://www.dublincore.org/specifications/dublin-core/dcmi-terms/</p>
        <p><sup>5</sup> https://wiki.whatwg.org/wiki/MetaExtensions</p>
        <p><sup>6</sup> https://www.metatags.org/all-meta-tags-overview/meta-name-copyright/</p>
        <p><sup>7</sup> https://microformats.org/wiki/rel-license</p>
        <p><sup>8</sup> https://html.spec.whatwg.org/multipage/microdata.html#licensing-works</p>
        <p><sup>9</sup> https://opensource.creativecommons.org/ccrel/</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Recent ad hoc standards</title>
      <p>Following the recent advancements in generative AI technology, there has been an increased focus on
providing opt-out mechanisms for content creators. These mechanisms, conceptually similar to the Do
Not Track header, may act as a safeguard for web users.</p>
      <p>The challenge now lies in establishing a commonly accepted technical format for the opt-out. Various
protocols and metadata standards, including ACAP, RightsML, C2PA, rights-standard, copyright,
and license, present viable but overly elaborate solutions to this issue. The past few months have,
however, yielded new, more practical techniques specifically designed to address opt-outs from the
training and inference of generative AI models. This section will introduce and evaluate these ad hoc
standards based on a structured schema. Moreover, an empirical study will assess the current web
adoption rate of these approaches.</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluation schema</title>
        <p>The structured schema illustrated in Figure 1 guides our evaluation of technical ad hoc standards in the
context of web publishers’ control. It comprises the following four criteria:
• Technique defines the technical means through which a rightsholder’s preferences are
communicated. The three distinct techniques each cater to different scenarios and technical skills a
publisher may have.
• Level indicates the degree of granularity at which the rightsholder’s preferences are specified,
ranging from Site-, Page-, to Content-level. This differentiation introduces a clear hierarchy in
opt-out directives: a Content-level specification is prioritized over those made at the Page- or
Site-level. Additionally, the ability of a rightsholder to modify settings across these levels may
be limited; for example, on a collaborative or social media platform, a rightsholder might not be
allowed to alter the entire site configuration, but only more granular elements like paragraphs,
images, or individual web pages.
• Terms elaborates on the terms and conditions associated with the opt-out. This can be a simple
binary choice (either permitting or prohibiting the use of data) or a more detailed license agreement
specifying conditions of data use, such as limitations to non-commercial purposes, financial
compensation, etc. While policies themselves do not impose obligations or conditions, they
facilitate the automated acquisition of online licenses. The W3C-standardized Open Digital Rights
Language (ODRL) is the most prominent format intended for automated license acquisition.
• The Scope of the opt-out ranges from a complete ban on any form of data utilization to specific
use cases, like Text &amp; Data Mining or the development of AI tools. For instance, site owners may
wish to restrict the use of their content in AI/ML applications while still enabling web indexing
and search engine discoverability.</p>
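        <p>The hierarchy described under the Level criterion can be illustrated with a short Python sketch. The level names, the tri-state encoding (True for opted out, False for use permitted, None for no statement) and the function name effective_opt_out are assumptions made for illustration; none of the surveyed standards prescribes this representation.</p>
        <preformat>
```python
from typing import Optional

# Most specific level first, mirroring the hierarchy described above:
# Content-level overrides Page-level, which overrides Site-level.
PRECEDENCE = ("content", "page", "site")


def effective_opt_out(signals: dict[str, Optional[bool]]) -> Optional[bool]:
    """Return the decision of the most specific level that states one.

    signals maps a level name to True (opted out), False (use permitted)
    or None (no statement at this level). Missing levels count as None.
    """
    for level in PRECEDENCE:
        decision = signals.get(level)
        if decision is not None:
            return decision
    # No level expresses a preference.
    return None


if __name__ == "__main__":
    # The site opts out, but one image explicitly permits use.
    print(effective_opt_out({"site": True, "content": False}))
```
        </preformat>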
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Overview</title>
        <p>The following section presents six recent proposals that allow publishers to opt out from ML training.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. robots.txt</title>
          <p>Technique: TXT file in the web server’s root directory
Level: Site-level
Terms: Binary signal (allow / disallow)
Scope: Any use</p>
          <p>The initial approach relies on the well-established Robots Exclusion Protocol and strictly adopts
its basic allow and disallow commands, as detailed in the IETF RFC document for REP [13]. Below
is an example of a robots.txt file section designed to prevent the crawling of any image
media content:</p>
          <preformat>user-agent: *
disallow: *.bmp
disallow: *.gif
disallow: *.ico
disallow: *.png
disallow: *.svg
disallow: *.tif
disallow: *.jpeg
disallow: *.webp
disallow: *.tiff</preformat>
          <p>This snippet was generated by an online tool of the initiative Spawning, which advocates against
unauthorized AI Data Mining.<sup>11</sup> Spawning provides a web service featuring a simple generator for
disallow directives. Its aim is to protect specific media content types from being crawled. It thus
offers website administrators a basic mechanism to block the exploitation of their site’s texts, images,
or videos. However, instructions in the robots.txt file do not allow for the definition of specific scopes,
revealing a significant limitation of the current REP standard. Moreover, implementing more complex
restrictions leads to an increase in the file’s size, making it overly verbose for scenarios such as excluding
only certain text or video files. As a result, the robots.txt file may become too lengthy and intricate for
administrators to manage effectively, contradicting the REP guideline, which suggests keeping the
file size under 500 KiB.</p>
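          <p>To make the verbosity concern concrete, the following Python sketch generates a group of disallow directives in the same style as the snippet above. The function disallow_rules is a hypothetical helper written for this illustration and is not Spawning's implementation; whether a given crawler honours such extension patterns depends on that crawler.</p>
          <preformat>
```python
def disallow_rules(extensions: list[str], user_agent: str = "*") -> str:
    """Build a robots.txt group that disallows the given file extensions."""
    lines = [f"user-agent: {user_agent}"]
    # One directive per extension; a leading dot in the input is tolerated.
    lines += [f"disallow: *.{ext.lstrip('.')}" for ext in extensions]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The group grows by one line for every additional media type.
    print(disallow_rules(["bmp", "gif", "png", "svg", "webp"]))
```
          </preformat>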
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Usage-specific agent names</title>
          <p>As previously discussed, the Robots Exclusion Protocol does not differentiate between scenarios of data
use. To address this limitation, some web crawl operators have introduced specific product tokens for
particular data use cases. Notably, Google has unveiled a new user agent named Google-Extended,<sup>12</sup>
which joins the lineup of 14 identifiers for Google’s crawling mechanisms, including Googlebot,
AdsBot-Google, and FeedFetcher-Google, among others. When Google-Extended is specified to
restrict access to certain website sections, Google ensures that the blocked content will not contribute
to the enhancement of its flagship AI products, Bard and Vertex AI.</p>
          <p><sup>11</sup> https://site.spawning.ai/spawning-ai-txt</p>
          <p><sup>12</sup> https://blog.google/technology/ai/an-update-on-web-publisher-controls/</p>
          <p>Such usage-specific product tokens provide a means to opt out from particular data applications,
such as the development of certain AI tools, without completely prohibiting all crawling activities.
However, as Google pointed out in their statement, it is necessary “to explore additional
machine-readable approaches to choice and control for web publishers”. The company’s VP of Trust expresses the
legitimate concern that “as AI applications expand, web publishers will face the increasing complexity of
managing different uses at scale”. Although Google’s market influence will probably persuade stakeholders
to adopt its extended user agent, this approach does not yield a sufficient solution in the long term.
The necessity to declare multiple user agents, each representing a different data use by individual crawl
operators, may result in an excessively complicated set of directives in the robots.txt file, placing
an undue burden on webmasters.</p>
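          <p>A usage-specific opt-out of this kind might look as follows. The sketch, written in Python with the standard-library module urllib.robotparser, disallows the Google-Extended token while leaving all other agents unrestricted. The robots.txt content and the example.org URL are illustrative assumptions, and the parser only approximates how Google itself interprets the token.</p>
          <preformat>
```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of Google-Extended only.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each agent name to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        ROBOTS_TXT,
        ["Googlebot", "Google-Extended"],
        "https://example.org/article.html",
    )
    print(result)
```
          </preformat>
          <p>Every further operator-specific token would add another group of this kind, which is the scaling problem described above.</p>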
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. learners.txt</title>
          <p>The subsequent suggestion, introduced in July 2023, does not modify but rather replicates the
robots.txt file [17]. It introduces a second file on the web server, similar to robots.txt, named learners.txt.
The instruction set within both files remains the same (allow / disallow), intentionally
avoiding increased complexity for webmasters and preventing any confusion. It allows webmasters to
distinctly separate instructions intended for general web search crawling from those targeting AI/ML
data collection, dividing them into two separate documents. For example, a traditional search engine
crawler would adhere to the Robots Exclusion Protocol, while a tool designed for compiling ML training
data would follow the directives specified in the ‘learners’ file. A significant downside of this proposal
is the necessity for an additional retrieval action by the crawler. Moreover, as the adoption rate of
learners.txt is likely to be low at first, many crawlers may not recognize the file, leaving the
learners’ instructions largely unnoticed.</p>
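          <p>The separation into two files can be sketched in Python as follows. The purpose labels search and ml-training, the helper names and the example rules are assumptions made for illustration; the proposal itself only names the file and its shared allow / disallow syntax. Because the syntax is shared, the same standard-library parser is used for both files.</p>
          <preformat>
```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical purpose labels mapped to the file a crawler would consult.
POLICY_FILES = {"search": "/robots.txt", "ml-training": "/learners.txt"}


def policy_url(page_url: str, purpose: str) -> str:
    """Return the URL of the rule file relevant for the crawl purpose."""
    return urljoin(page_url, POLICY_FILES[purpose])


def may_fetch(rules: str, user_agent: str, url: str) -> bool:
    """Evaluate allow/disallow rules; both files share the same syntax."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/blog/post.html"
    print(policy_url(page, "search"))
    print(policy_url(page, "ml-training"))
```
          </preformat>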
        </sec>
        <sec id="sec-4-2-4">
          <title>4.2.4. NoAI, NoArchive and NoCache Meta Tags</title>
          <p>While not officially part of the REP specification, the robots HTML meta tags and X-Robots-Tag
HTTP response header are widely utilized methods for conveying machine-readable instructions to web
crawlers. In November 2022, the US-based online art community DeviantArt introduced the directives
noai and noimageai for the exclusion of AI training.<sup>13</sup> Similarly, the search engine Mojeek supported
the idea of a noml meta tag,<sup>14</sup> aligning with the core ideas of noai and noimageai.</p>
          <p>The launch of noai marked an early attempt to give web publishers more control over the use of their
content. However, its impact within the web search and publishing sectors has largely been
symbolic, with limited practical adoption beyond the platforms initially involved. This limited usage is
partly due to the lack of support from key industry players, who have proposed their own mechanisms
instead. For example, Microsoft Bing has announced that it will use the existing meta tags noarchive and
nocache to offer webmasters more control over their content.<sup>15</sup> According to
Bing, using the nocache tag ensures that the content is excluded from training Microsoft’s generative
AI models, and the noarchive tag prevents content from being referenced in Bing Chat responses. The
reinterpretation of the existing instructions, however, presents a significant challenge for webmasters
due to the ambiguity of the two tags and the increased complexity of managing a multitude of tags.</p>
          <p><sup>13</sup> https://www.deviantart.com/team/journal/update-all-deviations-are-opted-out-of-ai-datasets-934500371</p>
          <p><sup>14</sup> https://noml.info</p>
          <p><sup>15</sup> https://blogs.bing.com/webmaster/september-2023/Announcing-new-options-for-webmasters-to-control-u</p>
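          <p>How a crawler might read these directives can be sketched as follows. The example is a simplified illustration rather than any vendor's implementation: it assumes comma-separated directives in a robots meta tag or X-Robots-Tag header without user-agent prefixes, and the class and function names are hypothetical.</p>
          <preformat><![CDATA[
```python
from html.parser import HTMLParser

AI_OPT_OUT = {"noai", "noimageai", "noml"}      # dedicated AI/ML directives
BING_REINTERPRETED = {"noarchive", "nocache"}   # existing directives reused by Bing


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                token.strip().lower() for token in content.split(",") if token.strip()
            )


def classify(html, x_robots_tag=""):
    """Group the directives found in the page and in the X-Robots-Tag header value."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    directives.update(t.strip().lower() for t in x_robots_tag.split(",") if t.strip())
    return {
        "ai_opt_out": sorted(directives & AI_OPT_OUT),
        "bing_signals": sorted(directives & BING_REINTERPRETED),
    }
```
]]></preformat>
          <p>Keeping the two groups apart in the result reflects the ambiguity discussed above: a page carrying noarchive may predate its AI-related reinterpretation, so the tag alone does not reveal the publisher's intent.</p>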
        </sec>
        <sec id="sec-4-2-5">
          <title>4.2.5. NO_TRAIN Metadata Field</title>
          <p>A further strategy for enhancing control over the use of web resources involves embedding meta
information directly within the resources. The DoNotTrain Metadata Standard introduces a concept in
this vein, featuring a NO_TRAIN tag specifically for image metadata [17]. This tag is designed to clearly
indicate that the use of the image (or any digital medium it is applied to) for Machine Learning training
is explicitly prohibited. Importantly, the NO_TRAIN metadata can be incorporated into various file types
prevalent on the web, including audio, video and text files, although it is most directly applicable to
images. Given that the first accusations of potential copyright violations against AI companies relate to
images [12], starting with image metadata is a logical choice that is likely to gain significant
attention and support from the practitioner community.</p>
        </sec>
        <sec id="sec-4-2-6">
          <title>4.2.6. TDM Reservation Protocol</title>
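          <list list-type="simple">
            <list-item><p>Technique: JSON file in the web server’s .well-known directory; Robots meta directives</p></list-item>
            <list-item><p>Level: Site- or Page-level</p></list-item>
            <list-item><p>Terms: Binary signal (no tag / tdm-reservation); TDM Policy</p></list-item>
            <list-item><p>Scope: Text &amp; Data Mining</p></list-item>
          </list>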
          <p>The TDM Reservation Protocol (TDM Rep), a web standard developed within a W3C Community Group, departs from the
Robots Exclusion Protocol in order to better fit today’s increased awareness of data sovereignty [18]. This
protocol gives publishers the ability to specify their preferences regarding the Text &amp; Data Mining of
online resources under their control. Finalized in 2022, it aligns with the EU’s DSM
Directive of 2019, which allows rightsholders to opt out of TDM activities by reserving
their rights in an appropriate, machine-readable manner. Recognizing the lack of such formats for TDM,
the creators of this protocol introduced a new standard that meets the needs of web publishers while
remaining straightforward and practical.</p>
          <p>TDM Rep allows webmasters to declare their TDM rights preferences by marking individual
documents with the tdm-reservation tag. This tag is binary: a value of 1 indicates a reservation of
rights, while 0 signifies no reservation, permitting web agents to mine the content without further
consultation with the rightsholder, which is the default under Article 4 of the DSM Directive. Additionally, the
protocol defines the tdm-policy tag, which points to a TDM Policy. Its value is a
URL that hosts an ODRL policy document in JSON format, facilitating the automatic licensing of web
resources by detailing the rightsholder’s contact information and the terms and conditions of use.</p>
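          <p>The binary signal and the optional policy link can be sketched as follows. The sketch assumes that the tdm-reservation and tdm-policy values have already been extracted from HTTP headers or meta tags into a mapping; the type TdmDecision and the function read_tdm_signal are hypothetical names, and treating the policy as relevant only when rights are reserved is an assumption of this example.</p>
          <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class TdmDecision:
    reserved: bool                # True if TDM rights are reserved
    policy_url: Optional[str]     # URL of the TDM Policy, if one is declared


def read_tdm_signal(fields: Mapping[str, str]) -> TdmDecision:
    """Interpret tdm-reservation / tdm-policy values taken from headers or meta tags."""
    normalized = {key.lower(): value.strip() for key, value in fields.items()}
    # A missing tag or the value 0 both mean that no reservation was expressed.
    reserved = normalized.get("tdm-reservation") == "1"
    # Assumption: the policy link only matters when rights are reserved.
    policy_url = normalized.get("tdm-policy") if reserved else None
    return TdmDecision(reserved=reserved, policy_url=policy_url)
```
]]></preformat>
          <p>A web agent would call read_tdm_signal before mining a document and, if rights are reserved, follow policy_url to learn the licensing terms.</p>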
          <p>The protocol outlines three methods for communicating the decisions of rightsholders. The first two
mirror the use of robots HTML meta tags and X-Robots-Tag HTTP response headers. The third involves
creating a JSON document named tdmrep.json, placed within the .well-known directory of the
web server. This document lists JSON objects, each representing a rule with the properties location,
tdm-reservation and, optionally, tdm-policy. The location property, similar to the directives
in a robots.txt file, specifies a URL path within the site.</p>
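          <p>A site-level rule file of this kind might be evaluated as in the following sketch. It assumes that tdmrep.json holds a JSON array of rule objects and that a rule applies when its location is a prefix of the requested path, with the first matching rule taking effect; the exact matching semantics should be taken from the specification [18]. The file contents and function names are hypothetical.</p>
          <preformat><![CDATA[
```python
import json
from typing import Optional

# Hypothetical contents of /.well-known/tdmrep.json
TDMREP_JSON = """
[
  {"location": "/articles/", "tdm-reservation": 1,
   "tdm-policy": "https://example.org/policies/tdm.json"},
  {"location": "/press/", "tdm-reservation": 0}
]
"""


def find_rule(path: str, rules: list) -> Optional[dict]:
    """Return the first rule whose location is a prefix of the path (assumed semantics)."""
    for rule in rules:
        if path.startswith(rule["location"]):
            return rule
    return None


def is_reserved(path: str, rules: list) -> bool:
    """True if a matching rule reserves TDM rights for this path."""
    rule = find_rule(path, rules)
    # No matching rule means that no reservation was expressed for this path.
    return bool(rule) and rule.get("tdm-reservation") == 1


RULES = json.loads(TDMREP_JSON)
```
]]></preformat>
          <p>With these rules, paths under /articles/ are reserved and carry a policy link, whereas /press/ and all unlisted paths remain open to mining.</p>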
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Empirical study</title>
        <p>This study assesses the adoption rate of the previously discussed ad hoc standards. Overall, we
analyzed 60 million regular web pages along with their respective tdmrep.json files, where present, as well
as 42 million robots.txt files. The documents were sourced from the web archive of Common Crawl,<sup>16</sup>
which offers a broad and randomly selected cross-section of the internet. All files were collected in
November and December 2023. Further details on the experimental framework and comprehensive
crawl statistics can be found in [19].</p>
        <p><bold>REP.</bold> Around 56.0 % of crawled websites currently provide a valid robots.txt file. This figure has
remained relatively unchanged for the past eight years, reflecting the Robots Exclusion Protocol’s
solidified role as the primary regulation tool for web crawlers. The robots HTML meta tag is present
in about 52.7 % of web pages, while the X-Robots-Tag header appears in merely 0.6 % of the HTTP
responses we analyzed.</p>
        <p><bold>User agents.</bold> Specific user agents that are used to exclude AI training, such as Google-Extended,
GPTBot or CCBot, have seen considerable adoption across websites. Google-Extended, for instance, is
mentioned in over 650,000 of the 42 million robots.txt files we examined, a remarkable increase since
its launch in September 2023. Notably, Google-Extended frequently appears alongside a disallow-all
directive (Disallow: /). As of December 2023, approximately 653,800 websites categorically block the
web agent, accounting for 99.9 % of the around 654,300 robots.txt files that mention Google-Extended
and 1.6 % of all 42 million robots.txt files analyzed. Most prominently, news websites like lemonde.fr,
washingtonpost.com and nytimes.com, the latter of which is currently in a lawsuit against OpenAI, use these user
agents to opt out of AI training. Search engine bots such as Googlebot and Bingbot, on the other
hand, are generally more welcome among webmasters [19]. This trend suggests a preference among many
site owners to be indexed by major search engines while excluding their content from AI model training
and inference.</p>
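        <p>The distinction between mentioning a user agent and categorically blocking it can be made concrete with a small sketch. It is not the measurement pipeline of this study, which is described in [19]; it approximates a disallow-all rule by testing whether the site root is fetchable, and the function names and the sample file are hypothetical. The last two lines recompute the reported shares from the counts stated above.</p>
        <preformat><![CDATA[
```python
from urllib.robotparser import RobotFileParser


def mentions_agent(robots_txt: str, agent: str) -> bool:
    """True if some 'User-agent:' line names the agent (case-insensitive)."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        key, _, value = line.partition(":")
        if key.strip().lower() == "user-agent" and value.strip().lower() == agent.lower():
            return True
    return False


def blocks_entirely(robots_txt: str, agent: str) -> bool:
    """Approximate 'Disallow: /' for the agent: the site root is not fetchable."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return mentions_agent(robots_txt, agent) and not parser.can_fetch(agent, "/")


SAMPLE = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

# Shares reported in the text, recomputed from the stated counts.
share_of_mentions = 653_800 / 654_300      # files blocking / files mentioning
share_of_all = 653_800 / 42_000_000        # files blocking / all files analyzed
```
]]></preformat>
        <p>The sample file reflects the pattern observed most often: the AI-related agent is blocked entirely, while all other crawlers, including search engine bots, remain allowed.</p>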
        <p><bold>Meta tags.</bold> The noai, noimageai and noml meta tags have yet to gain significant
traction within the online community. These tags were detected on the HTML pages of only 82 out
of around 1.4 million distinct hosts. The noarchive and nocache tags, however, are used by up
to 1.27 % of web pages, which leads to an exclusion from Bing Chat’s response generation and training
processes. Microsoft’s decision nevertheless remains questionable: the company profits from the low adoption
of these meta tags among websites, because more data consequently remains available to run and
improve its AI products. To that end, it accepts the increased ambiguity that comes with
reinterpreting the existing robots instructions for this new purpose.</p>
        <p><bold>TDM Rep.</bold> The adoption of the TDM Reservation Protocol is still low. By January 2024, we found
only 45 hosts serving a tdmrep.json file within the .well-known directory of their
servers. Around 60 domains have implemented TDM Rep by including the tdm-reservation tag.
French websites in particular, e.g. lefigaro.fr, appear to be leading the charge in committing to this new
protocol.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This survey examines the current discussion on web publishers’ control, which calls for new web
standards for opting out of AI training. We highlight both the legal frameworks
and the existing technical standards. Regarding the legal aspects, the protection of intellectual property is
well-defined in EU and US law. These regulations generally apply to web crawling and scraping,
which in any case require a justification under copyright law. Crawling for regular web indexing is
mostly considered legitimate because it is beneficial to all parties involved. Hence, the overall assessment
assumes implied consent by site owners, who profit from having their website indexed and included
in search results. In 2019, the EU enacted the DSMD, which regulates Text &amp; Data Mining and therefore
also any web search-related crawling, rendering the common practice of implied consent redundant.
According to Art. 4 DSMD, reproductions of web content are permissible by default, but rightsholders are
given the opportunity to reserve the use of their works by means of a machine-readable opt-out. Additionally,
TDM for the purposes of scientific research is exempted from the opt-out regulation and thus always
permissible.</p>
      <p>The recent advancements in generative AI have yielded numerous ad hoc standards that accommodate
web publishers’ wish for more data sovereignty. These techniques are specifically designed to address
the opt-out from training and inference of generative AI models. Our paper evaluates the most relevant
of these proposals based on a structured schema. The proposals build on conventional protocols,
such as the Robots Exclusion Protocol, or bring forward new initiatives, such as the TDM Reservation
Protocol. They were introduced by the online art, publishing and research communities as well as by
search engines like Google, Microsoft Bing and Mojeek. Most of the ad hoc standards have had, at least
initially, little impact beyond their symbolic value. Only the usage-specific web agents,
like Google-Extended or GPTBot, appear relatively frequently in robots.txt files, expressing the publishers’
intention to exclude AI training through the Robots Exclusion Protocol.</p>
      <p>In summary, the presented approaches are either idealistic and thus poorly adopted, such as the
noai/noml meta tags or the learners.txt file, or they are specific to certain AI applications, such as the
Google-Extended user agent for Google’s Bard and Vertex AI, the user agents GPTBot and
ChatGPT-User for OpenAI’s products, and the nocache and noarchive meta tags for Microsoft’s Bing Chat.
Furthermore, AI companies will always retain the option of not crawling themselves and instead using
existing open data such as Common Crawl, effectively evading any site-level control measures based
on robots.txt. As a result, webmasters face the overwhelming task of implementing all these
technical micro-standards, or at least of considering the most crucial of them, in order to protect their data
from any unwanted AI/ML-related use.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has received funding from the European Union’s Horizon Europe research and innovation
program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).</p>
      <p>[11] US Court of Appeals (2d Cir.), Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, Judgement of June 10, 2014.</p>
      <p>[12] P. Samuelson, Generative AI meets copyright, Science 381 (2023) 158–161. URL: https://www.science.org/doi/abs/10.1126/science.adi0656.</p>
      <p>[13] M. Koster, G. Illyes, H. Zeller, L. Sassman, Robots Exclusion Protocol, RFC 9309, 2022. URL: https://www.rfc-editor.org/info/rfc9309. doi:10.17487/RFC9309.</p>
      <p>[14] S. Kolay, P. D’Alberto, A. Dasdan, A. Bhattacharjee, A larger scale study of robots.txt, in: Proceedings of the 17th Intern. Conf. on World Wide Web, WWW ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 1171–1172. URL: https://doi.org/10.1145/1367497.1367711.</p>
      <p>[15] Y. Sun, I. G. Councill, C. L. Giles, The ethicality of web crawlers, in: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE, 2010. URL: https://doi.org/10.1109/wi-iat.2010.316.</p>
      <p>[16] Text and Data Mining Reservation Protocol Community Group, Past and existing initiatives, 2022. URL: https://w3c.github.io/tdm-reservation-protocol/docs/initiatives.html.</p>
      <p>[17] D. Ippolito, Y. W. Yu, DONOTTRAIN: A Metadata Standard for Indicating Consent for Machine Learning, in: Workshop on Generative AI and Law, 2023.</p>
      <p>[18] Text and Data Mining Reservation Protocol Community Group, TDM Reservation Protocol (TDM Rep), 2022. URL: https://www.w3.org/2022/tdmrep/.</p>
      <p>[19] M. Dinzinger, M. Granitzer, A longitudinal study of content control mechanisms, in: Companion Proceedings of the ACM Web Conference 2024, WWW ’24, ACM, 2024. URL: https://doi.org/10.1145/3589335.3651893, accepted for publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sébastien</given-names>
            <surname>Bubeck</surname>
          </string-name>
          et al.,
          <source>Sparks of artificial general intelligence: Early experiments with gpt-4</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.12712.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Touvron</surname>
          </string-name>
          et al.,
          <source>Llama: Open and efficient foundation language models</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Wayne Xin</given-names>
            <surname>Zhao</surname>
          </string-name>
          et al.,
          <source>A survey of large language models</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yogesh K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          et al.,
          <article-title>Opinion paper: “so what if ChatGPT wrote it?” multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy</article-title>
          ,
          <source>International Journal of Information Management</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1016/j.ijinfomgt.2023.102642.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          ECJ, Infopaq, C-5/08, ECLI:EU:C:2009:465, Judgement of July 16,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          ECJ, Levola Hengelo, C-310/17, ECLI:EU:C:2018:899, Judgement of November 13,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pawel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vladimir</surname>
          </string-name>
          ,
          <source>ELRC report on legal issues in web crawling</source>
          ,
          <year>2018</year>
          (revised in 2021). URL: http://www.elra.info/media/filer_public/2021/02/12/elrc-legal-analysis-webcrawling_report-v11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Truyens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Van Eecke</surname>
          </string-name>
          ,
          <article-title>Legal aspects of text mining</article-title>
          ,
          <source>Comput. Law Secur. Rev.</source>
          <volume>30</volume>
          (
          <year>2014</year>
          )
          <fpage>153</fpage>
          -
          <lpage>170</lpage>
          . URL: http://dx.doi.org/10.1016/j.clsr.2014.01.009.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          Bundesgerichtshof, Vorschaubilder I, Judgement of April 29,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          US District Court Nevada, Field v. Google Inc., 412 F. Supp. 2d 1106, Judgement of January 19,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>