<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring the Distinction: Investigating the Recognition of Automatic Text Generation Systems and Differentiating Human Text from Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manex Agirrezabal</string-name>
          <email>manex.aguirrezabal@hum.ku.dk</email>
          <uri>https://manexagirrezabal.github.io/</uri>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Language Technology (CST), Department of Nordic Studies and Linguistics, University of Copenhagen</institution>
          ,
          <addr-line>Emil Holms kanal 2, 2300 Copenhagen</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In this paper we present our approach for detecting text written by Language Models (task 1). We further train models to classify which specific Language Model created a given text (task 2). Our final model ranks 49/76 for task 1 and 30/38 for task 2. Although our results are not among the best, we provide an in-depth analysis of the employed features, which shows the importance of character-level features and the low importance of TF-IDF weighted word-level features.</p>
      </abstract>
      <kwd-group>
        <kwd>text classification</kwd>
        <kwd>language model detection</kwd>
        <kwd>human-generated text</kwd>
        <kwd>automatically generated text</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>we can see that the text resembles John Milton’s words. Without attempting to compare
this tiny language model to a Large Language Model such as GPT, we tried to obtain similar
text using the popular chatbot ChatGPT. We provided the following prompt and received the
following answer on the 22nd of May, 2023:</p>
      <p>Prompt: In John Milton’s style, how would you follow the sentence "And Adam"?
Response: And Adam, with steadfast gaze and resolute countenance, beheld the
vast expanse before him, a tapestry of untamed wilderness, teeming with uncharted
wonders and boundless potential.</p>
      <p>We further continued the interaction with ChatGPT and asked the following: "can you do it
in iambic pentameter?". The response was:</p>
      <p>And Adam, with his steadfast gaze and mien,
Surveyed the vast expanse before his eyes,
A tapestry of wilds yet unexplored,
Where wonders dwelled, untamed and unrevealed,</p>
      <p>A realm of boundless promise, yet unknown.</p>
      <p>There is no doubt that recent language models sound notably more natural and fluent,
especially when compared to older models. This brings many opportunities, but it also poses
many dilemmas and dangers.</p>
      <p>On the bright side, many repetitive tasks that involve reading similar data and writing very
similar reports, such as table report writing, could be simplified by the use of these models.
These models also seem to work very well for style transfer, for instance rewriting a paragraph
in a very different register, such as adapting an academic text for teenagers.</p>
      <p>What seems to be a helpful tool, though, can also be seen as a dangerous step. For instance,
until recently, spam messages or scams were rather easy to spot (poor language, typical repeated
scenarios, and so on). But with the advent of new LLMs, we, e-mail recipients, can now be tricked
much more easily. For the prompt "Can you write a catchy SMS message to buy an umbrella?" we
got the following answer on the 23rd of May, 2023 (the answer even included emojis that could
not be incorporated here because of encoding issues):</p>
      <p>Don’t let the rain spoil your day! Stay dry and stylish with our fabulous umbrellas!
Whether it’s a sudden shower or a drizzle, our umbrellas have got you covered. Get
yours now and dance in the rain with confidence! Shop today and embrace the rainy
season with a touch of elegance. Hurry, limited stock available! Don’t miss out on this
weather-defying deal. #StayDryStayHappy</p>
      <p>Thus, even though these models bring many benefits, we believe it is important to be
able to detect whether a text was created by a human or a computer.</p>
      <p>
        In this paper, we present our approach to detect whether a text was written by a Language
Model or by a human, as part of the AuTexTification shared task [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Our model is part of an
attempt to create generic text classification models for a number of tasks, in which the majority
of the features are inspired by authorship attribution and profiling literature (task 1 involves
2 classes, task 2 involves 6). The paper is
structured as follows. In the next section we give a brief overview of some related works. After
that, we present the data and our employed methodology. Then we present our results followed
by a more in-depth analysis of the results and the models. We then conclude the paper and
propose some possible future directions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>With the improvement of language models, it is no surprise that recognizing whether a text
was written by a human or a computational model is becoming more challenging. We review
some works that have attempted to solve this problem below.</p>
      <p>
        Fagni and colleagues [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] worked on deepfake detection in tweets. They collected
text from 23 bots and from the 17 human accounts they imitate. The texts from those
accounts were generated using different techniques, such as GPT-2, RNNs, LSTMs and Markov
Chains. Their experiments show that generative methods based on the Transformer architecture,
for instance GPT-2, can produce high-quality short texts that are difficult to distinguish even
for expert human annotators. They employ common features, including Bag-of-Words, TF-IDF
weighted BOWs, BERT representations and character-level representations, and train several
classifiers such as Logistic Regression, Random Forest and Support Vector Machines. As the best
results in detecting GPT-2 based texts were obtained by character-level models, they mention
that these features might be worth exploring further.
      </p>
      <p>
        Many general classification approaches rely on the availability of large language models. In
contrast, Fröhling and colleagues [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] propose a simple feature-based classifier for the detection of
language-model-generated text, using carefully crafted features that attempt to capture
the main differences between human- and machine-generated text. The main aspects that they
model are: (1) lack of syntactic and lexical diversity, (2) lack of coherence, and (3) lack of purpose.
They employ a number of features to model each of those aspects and, in addition, make use of
common author profiling features, such as character, syllable or word counts, among others.
They train commonly used classifiers, namely Logistic Regression, Support Vector Machines,
Neural Networks (NN) and Random Forests.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data and Method</title>
      <p>
        We use the data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provided by the organizers of the shared task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Relevant figures can be found in the table below. As can be seen, we have around 44,000
instances for the first task and around 30,000 instances for the second task. Both datasets are
balanced.
      </p>
      <p>
        This task is tackled following a general methodology that we expected to work for many tasks,
including bot vs. human discrimination, subjectivity detection, sexism detection, and many
others. Our original goal in this framework is to faithfully compare feature importance across
different tasks, but in this article we will emphasize the characteristics of the bot vs. human
and bot attribution problems. Our feature set is mainly inspired by work on authorship
attribution and profiling [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. We employed commonly used stylometric features, such as
word-level bag-of-words (unigrams), TF-IDF weighted bag-of-words (unigrams), character-level
bag-of-words (1-4 grams), POS tag bag-of-words (1-4 grams), morphological features as
returned by the Stanza package, and the BERT encoding of the input text by the
bert-base-cased model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
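      <p>As a rough illustration (not our exact pipeline), the lexical part of this feature set can be sketched with scikit-learn; the toy texts and the exact vectorizer settings below are our own assumptions:</p>

```python
# Sketch of the stylometric vectorizers described above (scikit-learn assumed);
# the example texts and settings are illustrative, not the shared-task setup.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "And Adam, with steadfast gaze, beheld the vast expanse before him.",
    "Don't let the rain spoil your day! Get yours now!",
]

word_bow = CountVectorizer(analyzer="word", ngram_range=(1, 1))    # word unigrams
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))  # TF-IDF weighted unigrams
char_bow = CountVectorizer(analyzer="char", ngram_range=(1, 4))    # character 1-4 grams

# Concatenate the sparse feature blocks into a single design matrix.
X = hstack([v.fit_transform(texts) for v in (word_bow, word_tfidf, char_bow)])
print(X.shape[0])  # one row per document
```

      <p>The BERT and morphological blocks would be concatenated in the same way after encoding each text as a dense vector.</p>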
      <p>In this study, we employ seven distinct classifiers. These classifiers
consist of K-Nearest Neighbors (with k = 5), Logistic Regression, Linear Support Vector
Machine, Multilayer Perceptron, Decision Tree, Random Forest, and Gradient Boosting.</p>
      <p>To evaluate the models, we adopt a train/test validation procedure. Specifically, we
allocate two thirds of the available data for training, while the remaining one third is
reserved for testing. This allows for a reliable assessment of the classifiers’ performance
and generalization capabilities.</p>
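      <p>A minimal sketch of this protocol, assuming scikit-learn and using random synthetic data as a stand-in for the shared-task features (only three of the seven classifiers are shown):</p>

```python
# Sketch of the 2/3-1/3 train/test protocol described above (scikit-learn assumed);
# make_classification stands in for the real shared-task feature matrix.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Two thirds for training, one third held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=0)

classifiers = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    # Weighted F1 on the held-out third, as reported in the results tables.
    scores[name] = f1_score(y_te, clf.predict(X_te), average="weighted")
```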
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The results for task 1 and task 2, obtained from various classifiers, are provided in Tables 2
and 3, respectively. The evaluation metrics considered include the weighted F1-score, accuracy,
and training time. We believe that training time holds significant importance in practical
applications of such models, particularly in real-world scenarios involving private companies
or public institutions.</p>
      <p>Inference time is not included in the reported results, as it is relatively similar
across all classifiers, with the exception of the K-Nearest Neighbors classifier. However,
given that classifier’s comparatively low performance, its inference time is less relevant in this
context.</p>
      <p>For task 1, our submission ranked 49th out of 76, with a macro F1-score of 57.35. For
task 2, it ranked 30th out of 38, with a macro F1-score of 45.72. These metrics provide an
overview of the classifiers’ performance in the respective tasks.</p>
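      <p>Note that the shared-task ranking uses macro F1, while our validation tables report weighted F1; the two can diverge on imbalanced predictions. A minimal illustration with made-up labels:</p>

```python
# Macro F1 averages per-class F1 uniformly; weighted F1 weights by class support.
# The labels here are invented purely to show the difference.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

macro = f1_score(y_true, y_pred, average="macro")        # (0.75 + 0.5) / 2 = 0.625
weighted = f1_score(y_true, y_pred, average="weighted")  # (4*0.75 + 2*0.5) / 6
```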
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Rather unexpectedly, our test results for task 1 are rather low compared
to the validation results in Table 2. This could be a case of overfitting to the training
data, or the domain shift of the test data may have affected performance more than we expected,
making our model domain dependent. In task 2 we observed that the Gradient Boosting (GB)
method showed slightly better results but a significantly larger training time (as can be seen
in Table 3). Consequently, we ultimately opted for the Multilayer Perceptron as our classifier
for the test set, as it demonstrated similar results to GB.</p>
      <p>Based on the feature importances of the trained Logistic Regression models, BERT-related
features seem useful for determining whether a text was written by a human or a bot.
But they do not seem to hold the same power for identifying the system used (task 2).</p>
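      <p>One way to read such importances, sketched here with scikit-learn on hypothetical data (the block boundaries and the synthetic signal are our own assumptions, not our actual feature layout), is to average the absolute Logistic Regression coefficients within each feature block:</p>

```python
# Crude block-level feature importance from Logistic Regression coefficients
# (scikit-learn assumed); the data and block split are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
# Put the signal in the first block (columns 0-4), mimicking an informative
# feature group; the second block (columns 5-9) is pure noise.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
abs_coef = np.abs(clf.coef_[0])

# Mean absolute coefficient per block as a rough importance score.
block_a = abs_coef[:5].mean()  # informative block
block_b = abs_coef[5:].mean()  # noise block
```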
      <p>The analysis of feature performance in our study reveals certain trends. First, Character
Bag-of-Words features exhibit favorable predictive capability for both tasks under
investigation. On the other hand, the morphological features extracted through
the Stanza library do not appear to provide predictive value for either task.</p>
      <p>Notably, POS unigrams and n-grams demonstrate relatively strong predictive potential as
feature representations for the tasks at hand.</p>
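      <p>For concreteness, POS-tag bag-of-words features can be built by treating each tag as a token; in our pipeline the tags come from a tagger, whereas in this sketch the pre-tagged sequences are invented for illustration:</p>

```python
# POS-tag 1-4 gram bag-of-words (scikit-learn assumed); the tag sequences
# below are hypothetical stand-ins for tagger output.
from sklearn.feature_extraction.text import CountVectorizer

pos_sequences = [
    "DET NOUN VERB DET ADJ NOUN PUNCT",
    "PRON VERB ADV ADJ PUNCT",
]

# Treat each tag as a token and count tag 1-4 grams; lowercase=False keeps
# the uppercase tag names intact.
pos_vec = CountVectorizer(
    analyzer="word", ngram_range=(1, 4), token_pattern=r"\S+", lowercase=False
)
X_pos = pos_vec.fit_transform(pos_sequences)
```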
      <p>However, our findings indicate that TF-IDF normalization does not significantly contribute
to improved performance: the TF-IDF weighted BOW features do not appear to be substantially
relevant for the tasks examined, especially when compared to the regular BOW features.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future work</title>
      <p>
        In this paper, we presented our approach for tasks 1 and 2 of the AuTexTification Shared
Task [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], where we trained models to distinguish between human- and Language Model-generated
text and, in addition, models to predict the actual language model that was used to generate
the text. Our models were inspired by the authorship attribution and profiling literature, and
we believe that the results were fair, considering their simplicity.
      </p>
      <p>
        Considering that our results are quite low, there is still room for improvement. We
would like to include more complex features, such as the ones proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and check their
importance with respect to other purely stylistic features. Besides, we included basic BERT
features by making use of the bert-base-cased model. We believe that there might be more
relevant models for the current tasks. Considering that POS-tag n-grams showed relatively
high importance, a possibility could be to use the hidden representations of a Transformer-based
POS tagger, assuming that those hidden representations encode structural information
as vectors.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sarvazyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Franco</given-names>
            <surname>Salvador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Overview of AuTexTification at IberLEF 2023:
          <article-title>Detection and attribution of machine-generated text in multiple domains</article-title>
          ,
          <source>in: Procesamiento del Lenguaje Natural</source>
          , Jaén, Spain,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y-Gómez</surname>
          </string-name>
          ,
          <source>Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gambini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          , TweepFake:
          <article-title>About detecting deepfake tweets</article-title>
          ,
          <source>PLOS ONE 16</source>
          (
          <year>2021</year>
          )
          e0251415
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fröhling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Feature-based detection of automated language models: tackling gpt-2, gpt-3 and grover</article-title>
          ,
          <source>PeerJ Computer Science</source>
          <volume>7</volume>
          (
          <year>2021</year>
          )
          <article-title>e443</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. Ángel</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarvazyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Franco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Autextification,
          <year>2023</year>
          . URL: https://doi.org/10.5281/zenodo.7692961. doi:10.5281/zenodo.7692961.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Shimoni</surname>
          </string-name>
          ,
          <article-title>Automatically categorizing written texts by author gender</article-title>
          ,
          <source>Literary and linguistic computing 17</source>
          (
          <year>2002</year>
          )
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <article-title>A survey of modern authorship attribution methods</article-title>
          ,
          <source>Journal of the American Society for information Science and Technology</source>
          <volume>60</volume>
          (
          <year>2009</year>
          )
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>