<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Text Representations for Detecting Automatically Generated Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zayra Villegas-Trejo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio-Luis Ojeda-Trueba</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Ciencias, Universidad Nacional Autónoma de México</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Ingeniería, Universidad Nacional Autónoma de México</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In today's rapidly advancing world of technology, artificial intelligence (AI) models have emerged that can generate text automatically. It has become increasingly challenging to discern the difference between machine-generated text and human-written text simply by reading it. This capability of AI poses a problem when these models are used maliciously or to create fake content. This article presents our approach to the AuTexTification task at IberLEF 2023, focusing on two subtasks. The first subtask involves binary classification, distinguishing between text written by humans and text generated by AI. The second subtask is a multi-class problem involving six text generation models (A, B, C, D, E, and F). Both subtasks are conducted in English and Spanish. Our objective is to accurately determine whether a given text is authored by a human or generated by AI and also to detect the text generation model used. We extract features such as Bag-of-Words (BoW), N-gram structure, and others. Experimental evaluation is performed using Logistic Regression, Random Forest, and Support Vector Machine algorithms. Our results demonstrate that incorporating additional features improves the accuracy of text identification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>skills during the experiments [4]. Other researchers used the same methods and reached similar
conclusions [5, 6]. Nevertheless, many people relied on their linguistic knowledge, ideas about language,
and personal preferences during the experiments to judge the texts [5]. As a consequence,
their assessments are subjective. These kinds of experiments produced inefficient results [5, 7];
as Clark et al. [6] mention, "some evaluators focused on surface-level text qualities to make their
decisions and underestimated current NLG models capabilities". Therefore, the reliability of the
evaluations depends on the evaluators' qualifications for making high-quality annotations [4] and on
the personal perception and knowledge of the participants.</p>
      <p>Another way to detect automatically generated text is to use automatic systems, which
are themselves built on the latest generation of models. Some companies, such as Google and OpenAI, are
developing these tools.</p>
      <p>Recently, OpenAI created a system to detect text written by ChatGPT and other AIs [8]. It
consists of a classification tool built on a model that was adapted using samples from multiple sources.
On the other hand, there is no certainty that Google can detect generated text: in a much-cited
interview in April 2022, John Mueller replied "I can't claim that" when asked whether
Google can differentiate between human-written text and text produced by an AI algorithm. Although there are ways
to identify where generated text comes from, they are not fully accurate or reliable,
all the more so because models keep improving and bots are advancing quickly. To probe this
limitation, Feizi and other researchers from the University of Maryland used AI-based paraphrasing
tools to rephrase AI-generated text and fed it into various detectors. The accuracy
of most detectors dropped to nearly 50% [7]. In addition, they present an "impossibility
result" to show that, as AI models become more human-like in the distribution of words in the
generated text, detectors will have a hard time handling them [7].</p>
      <p>Our objective is to use traditional machine learning models to distinguish text written by
humans from text generated by different machines, studying word frequencies and
some patterns in punctuation and document length.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Corpus</title>
      <p>AuTexTification [2] is a shared task on identifying texts written by AI. It is carried
out along with a series of NLP-related tasks at the IberLEF 2023 workshop [3], and its objective
is to promote research on the detection of text produced automatically by text generation
models. The AuTexTification task comprises two subtasks, subtask_1 (binary classification) and
subtask_2 (multiclass classification). The corpus for subtask_1 has 33845 instances in
English and 32062 in Spanish, while the corpus for subtask_2 has 22416 and 21935 texts,
respectively.</p>
      <p>In both corpora, the information is presented in three columns: “id”, “text” and “label”. The
classes available in subtask_1 are “generated” and “human”, and subtask_2 includes the classes
“A”, “B”, “C”, “D”, “E” and “F”, which represent six different language models. Tables 1 and 2
present the class distribution of both subtasks. For each subtask we have two subsets of the
corpus: train data and test data. The train data contains the information structured as
described above and is used to train the models. The test data contains only the “id” and “text”
columns and is used to make the predictions. Visualizations of the data are available in the
Appendix (Section 6).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We approached both subtasks as supervised learning problems: subtask_1 as a binary
classification problem and subtask_2 as a multiclass classification problem. First, we performed
some basic pre-processing of the texts, then performed feature extraction to obtain different
representations of the texts, and finally experimented with various machine learning
algorithms. We evaluated the models with the F1-score.</p>
      <sec id="sec-3-1">
        <title>3.1. Pre-processing</title>
        <p>First, we analyzed the data to decide which information to keep. We considered characteristics
such as stopwords, symbols, digits, capital letters, the number of punctuation marks, and other
linguistic properties relevant for distinguishing human-written from generated texts. The style of the text is an
important factor in achieving our goal.</p>
        <p>Although we hypothesized that leaving the data unprocessed would yield better results,
we followed both procedures to compare them. The experiments were performed on two datasets:
pre-processed data and raw data. The pre-processed data was cleaned using only the
following steps:
• Remove all special characters
• Lowercase all the words
• Tokenize
• Remove stopwords</p>
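        <p>The four cleaning steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the exact pipeline used in the experiments: the function name preprocess, the regular expression, the whitespace tokenizer, and the tiny stop-word list are all assumptions made for the example.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word list used only for illustration;
# the list actually used in the experiments is not specified here.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def preprocess(text, stopwords=STOPWORDS):
    """Apply the four cleaning steps in order and return the remaining tokens."""
    # 1. Remove special characters: anything that is neither a word character
    #    (letter, digit, underscore) nor whitespace becomes a space.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # 2. Lowercase all the words.
    cleaned = cleaned.lower()
    # 3. Tokenize on whitespace (an assumed, very simple tokenizer).
    tokens = cleaned.split()
    # 4. Remove stop-words.
    return [token for token in tokens if token not in stopwords]


print(preprocess("The model, trained in 2023, is GREAT!"))
# prints ['model', 'trained', '2023', 'great']
```
]]></preformat>
        <p>Running the same classifier once on the output of such a function and once on the untouched text is what allows the pre-processed and raw settings to be compared.</p>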
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Train and Test Split</title>
        <p>We employed Stratified K-Fold as implemented in the Scikit-learn [9] library to split the data
into five equally sized groups (folds) while maintaining the proportion of samples from each class in each
fold, reflecting the distribution of classes in the original dataset. This ensures that the training and
testing subsets contain representative samples from all classes.</p>
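        <p>To make the idea of stratification concrete, the sketch below builds five folds in plain Python by dealing the samples of each class round-robin across the folds. It is a simplified stand-in written for illustration (no shuffling, hypothetical function names), not the Scikit-learn implementation; in Scikit-learn the corresponding class is StratifiedKFold with n_splits=5.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Assign each sample index to one of n_splits folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        # Deal the indices of this class round-robin, so that every fold
        # receives (almost) the same share of the class.
        for position, idx in enumerate(by_class[label]):
            folds[position % n_splits].append(idx)
    return folds


def train_test_splits(labels, n_splits=5):
    """Yield (train_indices, test_indices), using each fold once for testing."""
    folds = stratified_folds(labels, n_splits)
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Toy labels with a 2:1 class ratio (made up for the example).
labels = ["human"] * 10 + ["generated"] * 5
for train, test in train_test_splits(labels):
    print(len(train), len(test), [labels[i] for i in test])
```
]]></preformat>
        <p>With the toy labels above, every test fold keeps the 2:1 ratio of the full set, which is the property the stratified split is meant to guarantee.</p>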
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Feature Extraction</title>
        <p>
          We applied different techniques to find patterns in the text. We used a Bag-of-Words (BoW) representation, which
splits the text into words and represents each document by its word frequencies. We also experimented with
character N-Grams, for example, Tri-Grams, (2,3)-Grams, and (2,5)-Grams.
Finally, we complemented the best N-Gram feature set with additional stylometric features to
train the model. The stylometric features are:
• number of digits (\d)
• number of other characters (\W)
• number of spaces (\s)
• number of stop-words
• length in characters
• number of commas (,)
        </p>
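        <p>The two kinds of features described above can be sketched with the standard library alone. The code below counts character N-Grams over a range of lengths and computes the six extra features. It assumes that the symbols d, W, and s in the list denote the regular-expression classes \d, \W, and \s; the function names and the small stop-word list are hypothetical. In Scikit-learn, character N-Gram counts over the same range can be obtained with CountVectorizer(analyzer="char", ngram_range=(2, 5)).</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def char_ngrams(text, n_min=2, n_max=5):
    """Count every character n-gram whose length lies in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def stylometric_features(text, stopwords=STOPWORDS):
    """Return the six extra features as a dictionary of counts."""
    words = re.findall(r"\w+", text.lower())
    return {
        "digits": len(re.findall(r"\d", text)),
        # Assumption: "others" means \W, i.e. every non-word character,
        # which includes whitespace and punctuation.
        "others": len(re.findall(r"\W", text)),
        "spaces": len(re.findall(r"\s", text)),
        "stopwords": sum(1 for word in words if word in stopwords),
        "length": len(text),
        "commas": text.count(","),
    }


sample = "In 2023, AI wrote 2 texts."
print(char_ngrams(sample).most_common(3))
print(stylometric_features(sample))
```
]]></preformat>
        <p>For a classifier, the N-Gram counts and the six stylometric values of a document would be concatenated into a single feature vector.</p>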
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Machine Learning Algorithms</title>
        <p>In our experiments, we evaluated widely used supervised machine learning algorithms:
• Logistic Regression (LG)
• Random Forest (RF)
• Linear Support Vector Machine (SVC)
• Gradient Boosting (GB)
• XGBoost (XGB)
In both cases, we evaluated these models with the F1-score to compare the feature representations
described above; we then kept only the best-performing configurations for further experiments.</p>
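        <p>Since every model is compared through the F1-score, a small worked example may help. The sketch below computes the per-class F1 (the harmonic mean of precision and recall) and averages it over the classes. Macro averaging is an assumption made here for illustration, since the text only says "F1-score"; the function names and the toy labels are likewise made up.</p>
        <preformat><![CDATA[
```python
def f1_for_class(y_true, y_pred, label):
    """F1 of one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No correct prediction for this class: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Toy predictions, invented for the example.
y_true = ["human", "human", "generated", "generated"]
y_pred = ["human", "generated", "generated", "generated"]
print(round(f1_for_class(y_true, y_pred, "human"), 4))
print(round(f1_for_class(y_true, y_pred, "generated"), 4))
print(round(macro_f1(y_true, y_pred), 4))
```
]]></preformat>
        <p>In this toy case the "human" class has perfect precision but only half its samples recalled, so its F1 is lower than that of the "generated" class; the macro average weights both classes equally regardless of their size.</p>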
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>As mentioned before, we employed techniques such as Bag-of-Words (BoW) and N-grams
to conduct our experiments. To enhance our model, we incorporated additional features,
specifically six stylometric features. Table 3 presents the average occurrence of each feature
in the class “generated” and in the class “human”, together with the difference in occurrence
between the generated and human texts. A smaller difference indicates a lesser
contribution of that particular feature to our model. We only present the results for subtask_1_en,
as similar findings were observed in the other tasks.</p>
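      <p>The per-class comparison described above amounts to averaging each feature separately over the generated and the human texts and subtracting the two means. The sketch below shows that computation on a few invented rows; the values are not taken from Table 3, and the function name and the sign convention (generated minus human) are assumptions.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def class_mean_difference(rows, feature):
    """Return (mean in generated, mean in human, generated minus human)."""
    generated = mean(row[feature] for row in rows if row["label"] == "generated")
    human = mean(row[feature] for row in rows if row["label"] == "human")
    return generated, human, generated - human


# Invented toy rows: one dictionary of stylometric counts per document.
rows = [
    {"label": "generated", "commas": 2, "digits": 0},
    {"label": "generated", "commas": 4, "digits": 1},
    {"label": "human", "commas": 1, "digits": 3},
    {"label": "human", "commas": 1, "digits": 1},
]

for feature in ("commas", "digits"):
    print(feature, class_mean_difference(rows, feature))
```
]]></preformat>
      <p>A feature whose difference is close to zero behaves almost identically in both classes and is therefore expected to help the classifier less.</p>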
      <p>The experiments were conducted in two phases. In the first phase, we trained the models
and prepared the data for testing. The second phase involved performing the actual tests and
obtaining predictions to be submitted to the AuTexTification task.</p>
      <p>Regarding the results obtained from the pre-processed data, the Support Vector Machine
(SVC) model achieved the highest F1 score of 82.1% using the Tri-Grams representation scheme.
This result was specifically observed for subtask_1 in English, but a similar trend was observed
for subtask_1 in Spanish.</p>
      <p>In the case of raw data, the scores of most models showed improvement, except for SVC,
which actually decreased to an F1 score of 80.2%. Consequently, we focused our subsequent
experiments on the improved models obtained from the second set of cases for each subtask.
The scores for each case can be found in Table 4 and Table 5.</p>
      <p>The results varied depending on the language being analyzed. In certain instances, specific
types of characteristics did not have a significant impact or improve the results. For instance,
the character length did not enhance the performance of the models in English or Spanish.
Therefore, the inclusion of this characteristic was unnecessary for both languages. Another
example pertains to the use of stop-words. Initially, we hypothesized that stop-words would
have a significant influence, but we did not observe any substantial difference in their usage
and their impact on the results.</p>
      <p>
        The most effective approach was to use (3,3)-Grams for SVC and (2,5)-Grams with extra
features for the other models. For subtask_1 in English, Table 5 indicates that the XGB
algorithm achieved the highest score of 84.7% using (2,5)-Grams plus ExtraFeatures. For
subtask_1 in Spanish, the XGB algorithm attained a score of 85.9% (see Table 6). In subtask_2,
we achieved maximum scores of 49.5% and 52.2% for English and Spanish, respectively
(see Tables 7 and 8).
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this study, we conducted experiments to determine the optimal algorithm and feature
extraction method for identifying texts generated by artificial intelligence or written by humans.</p>
      <p>
        Additionally, we examined texts authored by different artificial intelligence systems. Our
findings revealed that the XGB model using (2,5)-Grams plus stylometric features performed
best across all subtasks. However, while the results were favorable for Subtask 1, they were less
promising for Subtask 2.
      </p>
      <p>Moreover, in the experiments conducted by Daphne Ippolito, the highest achieved score was
70% [4]. Using machine learning, we obtained better results in distinguishing human-written from
machine-generated texts than those reported with human annotators. However, we struggled to differentiate
between various machine-generated texts, with scores of 49.5% and 52.2%. These results indicate
a reliance on probability rather than the model’s ability to accurately classify the information.</p>
      <p>These outcomes emphasize the need to continue improving our experiments and exploring
new strategies for identifying human-written and generated texts. Using linguistic features, we
observed that factors such as the number of punctuation marks, digits, symbols, capital letters,
etc., contributed to achieving improved results.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Appendix</title>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been carried out with the support of DGAPA-UNAM PAPIIT project number
TA101722. The authors also thank CONACYT for the computing resources provided through the
Deep Learning Platform for Language Technologies of the INAOE Supercomputing Laboratory.
We also want to thank Eng. Roman Osorio for supporting the student administration of the
project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>ChatGPT</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL: https://chat.openai.com/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sarvazyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Franco</given-names>
            <surname>Salvador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of AuTexTification at IberLEF 2023: Detection and attribution of machine-generated text in multiple domains</article-title>
          ,
          <source>in: Procesamiento del Lenguaje Natural</source>
          , Jaén, Spain,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y-Gómez</surname>
          </string-name>
          ,
          <source>Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dugan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirubarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <article-title>RoFT: A tool for evaluating human detection of machine-generated text</article-title>
          ,
          <source>arXiv preprint arXiv:2010.03070</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <article-title>The authenticity gap in human evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2205.11930</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>August</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Serrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Haduong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>All that's 'human' is not gold: Evaluating human evaluation of generated text</article-title>
          ,
          <source>arXiv preprint arXiv:2107.00061</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sadasivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balasubramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feizi</surname>
          </string-name>
          ,
          <article-title>Can AI-generated text be reliably detected?</article-title>
          ,
          <source>arXiv preprint arXiv:2303.11156</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] OpenAI, AI text classifier,
          <year>2023</year>
          . URL: https://beta.openai.com/ai-text-classifier.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Scikit-learn. URL: https://scikit-learn.org/stable/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>