<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Do Origin and Facts Identify Automatically Generated Text?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Judita Preiss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monica Lestari Paramita</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Sheffield, Information School</institution>
          ,
          <addr-line>The Wave, 2 Whitham Road, Sheffield S10 2AH</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a proof of concept investigating whether native language identification and fact checking information improve a language model (GPT-2) classifier which determines whether a piece of text was written by a human or a machine. Since automatic text generation is trained on the writings of many individuals, we hypothesize that there will not be a clear native language for 'the writer', and therefore that a native language identification module can be used in reverse: when a native language cannot be identified, the probability of automatic generation is higher. Automatic text generation is also known to hallucinate, making up content; to detect this, we integrate a Wikipedia fact checking module. Both pieces of information are simply added to the input of the GPT-2 classifier, and result in an improvement over its baseline performance in the English-language human-or-generated subtask of the Automated Text Identification (AuTexTification) shared task [1].</p>
      </abstract>
      <kwd-group>
        <kwd>GPT-2 classifier</kwd>
        <kwd>native language identification</kwd>
        <kwd>Wikipedia fact checking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Detectors of human versus machine written text are often trained on large quantities of data
from various sources, potentially with domain-specific fine-tuning. However, constant
advances in text generation suggest that successful detection systems may need to incorporate
more information than classifiers built from language models alone. In
this work, we prototype the use of native language identification and fact checking within
a language model classifier.</p>
      <p>
        Text generation has been used in a range of applications, such as radiology report
generation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or conversational response generation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, there is potential for
the misuse of text generation, such as fake news [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or spam [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] generation. The ability to
create a classifier capable of distinguishing text generated by a human from that produced
by an AI system would therefore be widely useful. To this end, the Automated Text
Identification (AuTexTification) shared task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was set up as part of IberLEF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with
subtask 1 of the exercise evaluating submitted systems performing detection of automated
text in English or Spanish on a standard dataset. In this paper, we focus on the English
portion of this task.
      </p>
      <p>
        Numerous large language models have been employed for automatic text generation,
with the quantity and sources of training data varying. In general, text generation models
based on transformer architectures tend to produce text which is grammatically correct
and coherent. Specifically, we focus on GPT-2 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which is trained on a large collection
of internet articles, and its successor GPT-3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], trained on petabytes of data collected
over years of web crawling.
      </p>
      <p>
        With the advent of text generation came the need for systems which detect
automatically generated text. Examples of such models include the Giant Language model Test
Room (GLTR) tool [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which uses statistical methods to exploit differences between
text generated by GPT-2 and human written text. Such differences include, for example,
the quantity of rare word usage, which is lower in text generated by GPT-2 than in
human written text. Using a linear classifier on top of an existing generation model
(such as GROVER [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) has proved very successful, sparking discussion around
the need to make generation models public. The closest work to that presented in this
paper is the RoBERTa detector [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which fine-tunes a RoBERTa model to achieve a
higher accuracy in detection than a fine-tuned GPT-2 model. However, the RoBERTa
detector needs a large quantity of examples – 200K examples are needed to attain 90%
accuracy – which was not available in this work. In addition, its accuracy may be an
upper bound for a trained model.
      </p>
      <p>
        In the search for alternatives to a purely text-trained classifier, it is necessary to explore
the errors made by generative models. Xu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] discuss these in terms of machine
translation’s multidimensional quality metrics [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], including measures such as accuracy
(addition or omission) and fluency (punctuation, spelling or grammar). We propose that
the integration of a fact checking component alongside a language model classifier may
allow errors of accuracy to be detected and therefore that its integration will lead to an
increase in the overall accuracy of detection. While the contribution of fact checking is
not clear, as humans may also introduce errors into their writing, it is hypothesized that
in combination with other features (such as those detected using the language model),
the additional information may be beneficial.
      </p>
      <p>A second component is also investigated: native language identification. Native language
identification automatically determines the first language of the writer. The motivation
behind this component is based on the manner in which generated text is produced: it is
generated by a model trained on many texts, written by many different authors. It is
therefore expected that automatically generated text will often not show a predominant,
or a clearly predominating, native language.</p>
      <p>The novel contributions of the work are therefore the integration of a fact checking
and a native language identification component into a GPT-2 based classifier detecting
generated (vs human created) English text, with results reported on AuTexTification
subtask_1_en. The paper structure is as follows: methodology is described in Section 2,
results are presented in Section 3 and the conclusions and future work are outlined in
Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. GPT-2 classifier</title>
        <p>
          GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained
on 8 million web pages [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Given the source of the AuTexTification datasets, a language
model pre-trained on a large quantity of web pages was deemed a suitable choice. The
pretrained model available from Huggingface (https://huggingface.co/gpt2) was fine-tuned
for a classification (rather than generation) task using the English training data of subtask
1 (which comprises 33,845 training instances with a binary, human or generated label).
No hyperparameter optimization was explored in this work, with default parameters used
while training for 4 epochs, with batch size 32 and max length 60.
        </p>
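          <p>The fine-tuning step above can be sketched as follows. This is a minimal,
hedged sketch assuming the Huggingface Transformers and Datasets libraries; the
dataset column names ("text", "label") and the output directory are illustrative
assumptions, not taken from the shared task files.</p>
          <preformat>
```python
# Sketch of fine-tuning the Huggingface "gpt2" checkpoint as a binary
# human-vs-generated classifier, using the settings stated in the text:
# 4 epochs, batch size 32, max length 60, default hyperparameters otherwise.
TRAIN_CONFIG = {"epochs": 4, "batch_size": 32, "max_length": 60, "num_labels": 2}

def build_trainer(train_dataset):
    """Assemble a Trainer for GPT-2 sequence classification.

    `train_dataset` is assumed to be a datasets.Dataset with "text" and
    "label" columns (an assumption; the AuTexTification files may differ).
    """
    from transformers import (GPT2ForSequenceClassification, GPT2TokenizerFast,
                              Trainer, TrainingArguments)

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token
    model = GPT2ForSequenceClassification.from_pretrained(
        "gpt2", num_labels=TRAIN_CONFIG["num_labels"])
    model.config.pad_token_id = tokenizer.pad_token_id

    def encode(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length",
                         max_length=TRAIN_CONFIG["max_length"])

    args = TrainingArguments(output_dir="gpt2-detector",
                             num_train_epochs=TRAIN_CONFIG["epochs"],
                             per_device_train_batch_size=TRAIN_CONFIG["batch_size"])
    return Trainer(model=model, args=args,
                   train_dataset=train_dataset.map(encode, batched=True))
```
          </preformat>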
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Native language identification</title>
        <p>
          Native language identification (NLI) can be viewed as a (multi-class) classification task:
given a text segment, assign a class based on the author’s first language. The publicly
available implementation of NLI, BERT-NLI (https://github.com/stianste/BERT-NLI),
can be trained on languages of one’s own choosing. The original work produced a classifier
for a subset of Indo-European languages, specifically 31 languages. For each language, the
writings of 104 native speakers were collected from Reddit with user origin identified from
the poster’s flairs [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The methodology is reused in this work to create training data for
a larger set of language families, specifically:
1. Subreddits likely to contain native speakers are manually identified for each language
sought. This includes selecting subreddits corresponding to countries and main cities
speaking the language, as well as information and language learning subreddits.
2. Using the Pushshift Multithread API Wrapper (available from pip as pmaw), the
last 5 years of posts and comments are gathered from these subreddits.
3. The flair attribute, which can be set by the user on a per subreddit basis, is explored
to find instances where the user is believed to be identifying their country of origin.
4. Any users whose countries are identified from their flair, and are in our desired
country / language list, have their public Reddit footprint gathered over a period
of the last year.
        </p>
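        <p>Steps 3 and 4 above hinge on mapping free-text flairs to countries. A minimal
sketch of that matching follows; the flair formats and the country list are
illustrative assumptions, since real flairs vary by subreddit:</p>
        <preformat>
```python
# Hypothetical helper for the flair step: return a target country named in a
# user's flair, or None. Real flairs are free text set per subreddit, so this
# token match is only a sketch of the idea.
TARGET_COUNTRIES = {"germany", "france", "poland"}  # illustrative subset

def country_from_flair(flair):
    """Return the first target country mentioned in `flair`, else None."""
    if not flair:
        return None
    # Flairs often look like "Germany", ":flag-de: Germany" or "Berlin, Germany".
    for token in flair.lower().replace(",", " ").split():
        if token in TARGET_COUNTRIES:
            return token
    return None
```
        </preformat>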
        <p>Identifying a writer’s native language requires a large quantity of training data. While
this is not always available for each of the languages of interest, it is possible to group
languages into their corresponding language families and thus reduce the number of
‘languages’ sought while retaining language traits. Reddit users’ texts are selected at
random such that multiple languages from the language family are represented in the
sample. The list of language families used in this work, and the corresponding languages
represented in the training data of the NLI module, can be found in Appendix A. Reducing
the dataset to language families, rather than individual languages, also has the benefit of
reducing the classifier’s training time.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Wikipedia fact checking</title>
        <p>
          Wikipedia has been utilised to fact-check information from the Web. In this study, we
utilised the WikiCheck API [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Given a statement (claim) in the dataset as the input, the
API returns the most relevant sentences from the Wikipedia corpus (using MediaWiki API
from https://www.mediawiki.org/wiki/API:Search) to be used as evidence for the claim.
The API returns whether each sentence supports the statement, refutes the statement, or
does not provide enough information. A probability score is also given for each decision.
        </p>
        <p>Given the length limitation imposed by the API, a maximum of 300 characters
can be used as the input statement. Therefore, when a sentence exceeds 300
characters, we used only the first 300 characters, with any incomplete word at the
end removed, as the input statement.</p>
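        <p>The truncation just described can be sketched as a small helper; the
splitting-on-spaces heuristic for detecting an incomplete final word is our
assumption about what counts as a word boundary:</p>
        <preformat>
```python
# Cap a claim at 300 characters for the WikiCheck API, dropping any word
# that the cut-off point splits in half.
MAX_CLAIM_CHARS = 300

def truncate_claim(text, limit=MAX_CLAIM_CHARS):
    """First `limit` characters of `text`, minus any word cut at the boundary."""
    if limit >= len(text):
        return text
    clipped = text[:limit]
    if text[limit].isspace() or clipped.endswith(" "):
        return clipped.rstrip()
    # The boundary falls inside a word: drop the incomplete final word.
    head, _, _tail = clipped.rpartition(" ")
    return head if head else clipped
```
        </preformat>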
        <p>
          Previous studies [
          <xref ref-type="bibr" rid="ref14">14, 15</xref>
          ] incorporated a CatBoost learning-to-rank model to decide the
most relevant evidence (Wikipedia sentence) for each claim and used the judgment for that
sentence only. However, due to the limited resources and time constraints, we utilised
a simpler approach. First, we selected the sentence which refutes/supports the claim with
the highest probability score as the most relevant evidence. We extracted the label
(SUPPORTS/REFUTES) and the probability score. If no statements were assigned as
supporting or refuting the claim, or if no relevant statements were returned by the API,
the sentence is labeled as ‘NOT ENOUGH INFO’ and the probability score is set to 0.
        </p>
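        <p>The simplified evidence selection can be sketched as below. The judgment
format (a list of label/probability pairs per candidate sentence) is an assumption
about the WikiCheck response, made for illustration:</p>
        <preformat>
```python
# Keep the SUPPORTS/REFUTES judgment with the highest probability; if the API
# returned nothing usable, fall back to NOT ENOUGH INFO with score 0.
def pick_evidence(judgments):
    """`judgments`: iterable of (label, probability) pairs, one per sentence."""
    relevant = [(label, prob) for label, prob in judgments
                if label in ("SUPPORTS", "REFUTES")]
    if not relevant:
        return ("NOT ENOUGH INFO", 0.0)
    return max(relevant, key=lambda pair: pair[1])
```
        </preformat>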
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Integration of additional information with GPT-2</title>
        <p>While there are many approaches to building ensemble models, a simple approach is
taken in this work: to show that an improvement can be obtained by integrating NLI
and Wikipedia fact information, the information was simply prepended to the input text.
The following methods for encoding of the additional inputs were explored:
1. Information from the NLI component:
a) Listing of probabilities for each language family, i.e. a sequence of 15 numbers,
rounded to 2 decimal places (to allow generalization).
b) Listing the top language with its probability (2dp).
c) Listing the top language alongside the difference between the probabilities
of the top two suggested languages (2dp) – in some cases, the prediction is
confident about the native language, assigning a probability greater than 0.9.
However, sometimes the system suggests multiple languages with probabilities
between 0.3 and 0.5. Incorporating the difference of the probabilities reflects
such uncertainty.
d) Listing the top three languages with the difference between the top two values.</p>
        <p>As above, the difference between the top two probabilities reflects the system’s
uncertainty in the top prediction.
2. Information from Wikipedia fact checking:
a) The decision generated by Wikipedia, i.e. one of supports, refutes, insufficient
(representing not enough info) or failed (when the API failed to return a result).
b) The decision along with the associated probability – again, this
allows the system to incorporate a measure of uncertainty into its model.</p>
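        <p>Prepending the encoded signals to the classifier input can be sketched as
follows, here for encodings 1(d) and 2(b); the exact prefix wording and ordering
are illustrative assumptions:</p>
        <preformat>
```python
# Build the classifier input by prepending the NLI and fact-checking signals
# to the raw text, all as plain tokens the GPT-2 classifier can consume.
def build_input(text, top_languages, top_gap, fact_label, fact_prob):
    """top_languages: three most probable language families;
    top_gap: probability difference between the top two, 2 dp."""
    nli_part = " ".join(top_languages) + " {:.2f}".format(top_gap)
    fact_part = "{} {:.2f}".format(fact_label, fact_prob)
    return nli_part + " " + fact_part + " " + text
```
        </preformat>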
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Native language identification</title>
        <p>The training data consists of 100 users for each of the 15 language families (distributed
equally among the languages contributing to the language family). A maximum of 20
* 100 sentences are randomly selected for a user from their posts and comments. The
median across the selected users is 4 * 100 sentences. The average F1 over a 10-fold
cross-validation is 0.58 and a confusion matrix can be seen in Figure 1. The confusion
matrix suggests that no systematic mistakes were made, with the system performing well
overall.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Wikipedia fact checking</title>
        <p>As shown in Table 1, WikiCheck API labelled the majority of the sentences as “Not
enough info”. One possible reason is that many statements in the AuTexTification
dataset may not contain any facts that could be fact-checked using Wikipedia, such
as sentence ID 20327: “@Jonasbrothers Guys... U rock!! I love all your songs! Thanks
for the love! We love hearing that our music brings”. In the training data, a quarter of
the data were labeled as supports/refutes. In the test data, the number of sentences that
were labelled as supports/refutes was higher (45%). In both datasets, there were a small
number of cases (&lt;1%) where the API returned no responses due to a system error.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Overall systems</title>
        <p>The system used 60% of the 33,845 English language examples for training, 20% for
validation and 20% for testing to select the best system(s). This revealed the top systems
to be those submitted, i.e. the top three predicted native languages (NLs) with the difference
in probability between the top two languages, together with the information provided by the
Wikipedia fact checking module. Either approach alone was found to perform worse than
the combination. The combined approach also outperformed the baseline (GPT-2 only)
system. The results on the test portion of the training data were supported on the 21,833
instance AuTexTification test set (Table 2) where the relative performance of the three
submitted approaches mirrored their performance on the test portion of the training data.
Most importantly, the additional information improved the performance of the GPT-2
classifier by over 5%.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and future work</title>
      <p>We prototype the use of fact checking and native language identification for the purpose
of improving a language model classifier distinguishing automatically generated text from
text written by a human. The experiment proved successful, with a combination of the
two components yielding better results (F1) than the baseline system (or the components
separately) on AuTexTification subtask 1 (human vs generated) in English.</p>
      <p>There are many potential avenues for future work: no hyperparameter optimization was
performed on the underlying GPT-2 system, which may yield better performance on the
task overall. Aside from experiments after hyperparameter optimization, the effect of
the components should also be investigated when different baseline models are employed,
and significance evaluated. Different ways of integrating the component information with the
baseline model could also be explored; in particular, logistic regression (which performed
exceedingly well in the task) could be combined with the Wikipedia fact checking and NLI
components.</p>
      <p>The fact checking module was applied to a (shortened) sample of input text, whether
or not this contained facts. An initial check for fact content would likely increase the
performance of the module, as well as give the trained model additional information,
with an additional feature value of “lacking” when a text segment is found to be lacking
any facts.</p>
      <p>Most importantly, the text segments used are relatively short. It has been stated that
the OpenAI Text Classifier requires a minimum of 1,000 characters (about 150-250 words).
Some of the text segments are close to this boundary, and it would be interesting to see
the performance of the approach on texts of different lengths.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Ethical review</title>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The data for the NLI component was gathered and managed as specified in ethical
approval 052236, granted by the University of Shefield on 29/03/2023.</p>
      <p>Thanks to the Information Retrieval group at the University of Sheffield Information
School, through which JP learnt of the existence of the evaluation exercise.</p>
      <p>[14] M. Trokhymovych, D. Saez-Trumper, WikiCheck: An end-to-end open source
automatic fact-checking API based on Wikipedia, in: Proceedings of the 30th ACM
International Conference on Information &amp; Knowledge Management, 2021, pp.
4155–4164.
[15] A. Chernyavskiy, D. Ilvovsky, P. Nakov, WhatTheWikiFact: Fact-checking claims
against Wikipedia, in: Proceedings of the 30th ACM International Conference on
Information &amp; Knowledge Management, 2021, pp. 4690–4695.</p>
      <p>A. Language families used in NLI component
[Table: the 15 language families and the languages represented in the NLI training data.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sarvazyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Franco</given-names>
            <surname>Salvador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Overview of AuTexTification at IberLEF 2023:
          <article-title>Detection and attribution of machinegenerated text in multiple domains</article-title>
          ,
          <source>in: Procesamiento del Lenguaje Natural</source>
          , Jaén, Spain,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , V. Stoyanov,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Sun,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brockett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          , DIALOGPT :
          <article-title>Large-scale generative pre-training for conversational response generation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>270</fpage>
          -
          <lpage>278</lpage>
          . URL: https://aclanthology.org/2020.acl-demos.30. doi:10.18653/v1/2020.acl-demos.30.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , CoRR abs/2005.14165 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <article-title>Deepfake bot submissions to federal public comment websites cannot be distinguished from human submissions</article-title>
          ,
          <source>Technology Science</source>
          (
          <year>2019</year>
          ). URL: https://techscience.org/a/2019121801/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y</surname>
          </string-name>
          <string-name>
            <surname>Gómez</surname>
          </string-name>
          ,
          <source>Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog</source>
          <volume>8</volume>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Strobelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>GLTR: statistical detection and visualization of generated text</article-title>
          ,
          <source>CoRR abs/1906.04043</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1906.04043. arXiv:1906.04043.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rashkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Roesner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Defending against neural fake news</article-title>
          ,
          <source>CoRR abs/1905.12616</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1905.12616. arXiv:1905.12616.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Solaiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brundage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Release strategies and the social impacts of language models</article-title>
          ,
          <source>CoRR abs/1908.09203</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Tuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saxon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Not all errors are equal: Learning text generation metrics using stratified error synthesis</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. F.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ratnakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Macherey</surname>
          </string-name>
          ,
          <article-title>Experts, errors, and context: A large-scale study of human evaluation for machine translation</article-title>
          ,
          <source>CoRR abs/2104.14478</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2104.14478. arXiv:2104.14478.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Goldin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wintner</surname>
          </string-name>
          ,
          <article-title>Native language identification with user generated content</article-title>
          ,
          <source>in: Proceedings of Empirical Methods in Natural Language Processing</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3591</fpage>
          -
          <lpage>3601</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Trokhymovych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Saez-Trumper</surname>
          </string-name>
          ,
          <article-title>WikiCheck: An end-to-end open source automatic fact-checking API based on Wikipedia</article-title>
          ,
          <source>in: Proceedings of the 30th</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>