<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UPB at IberLEF-2023 AuTexTification: Detection of Machine-Generated Text using Transformer Ensembles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrei-Alexandru Preda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru-Clementin Cercel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Traian Rebedea</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Costin-Gabriel Chiru</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest</institution>
          ,
          <addr-line>Bucharest 060042</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the solutions submitted by the UPB team to the AuTexTification shared task, featured as part of IberLEF-2023. Our team participated in the first subtask, identifying text documents produced by large language models instead of humans. The organizers provided a bilingual dataset for this subtask, comprising English and Spanish texts covering multiple domains, such as legal texts, social media posts, and how-to articles. We experimented mostly with deep learning models based on Transformers, as well as training techniques such as multi-task learning and virtual adversarial training to obtain better results. We submitted three runs, two of which consisted of ensemble models. Our best-performing model achieved macro F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Machine-Generated Text</kwd>
        <kwd>Transformer</kwd>
        <kwd>Multi-Task Learning</kwd>
        <kwd>Virtual Adversarial Training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recently, computer-generated content started growing in presence on the Internet. With the
public release of powerful Large Language Models (LLMs) such as the Generative Pre-trained
Transformer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and its derivative systems such as ChatGPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], manufacturing texts is easier
than ever and probably harder to detect than ever. This phenomenon has already raised several
ethical issues that society must answer soon. This efort can be helped by finding mechanisms
to automatically and reliably detect computer-generated text.
      </p>
      <p>
        The AuTexTification: Automated Text Identification shared task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a natural language
processing (NLP) competition at IberLEF-2023 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Its main focus is detecting and understanding
the computer-generated text, especially that of LLMs. The competition presents two subtasks:
(1) Subtask 1 is a binary classification problem in which we have to detect whether a human or
an artificial intelligence model wrote a document, and (2) Subtask 2 is a multi-class classification
problem in which you have to select which LLM generated a given document from a list of
several LLMs. To address these subtasks, the organizers made available a bilingual dataset
of documents produced by humans and computers in English and Spanish, covering several
domains.
      </p>
      <p>
        Our team participated only in the first subtask. We first experimented with more standard
machine learning methods before moving to deep learning models, where we explored
techniques such as multi-task learning (MTL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and virtual adversarial training (VAT) [6]. Finally,
we combined multiple models trained independently to form ensembles, which we used to
generate our submissions, since they performed the best.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>While text classification is one of NLP’s fundamental and most well-established tasks, detecting
computer-generated text is a relatively novel task. This is probably because, until recently, few
systems could produce text realistic enough to fool humans. Creating such texts is commonly
called natural language generation [7].</p>
      <p>Currently, there seem to be various ways of addressing this problem, which can be classified
into black-box and white-box methods [8]. White-box techniques require access to the target
language model, and they can involve concepts such as watermarks which the models could
embed into their outputs to make detection easier. As such, black-box methods are more relevant
to the previously mentioned task since we only have access to the model’s output, but we do
not even know which model produced it.</p>
      <p>Black-box methods can involve both classical machine learning classification algorithms, as
well as ones based on deep learning [8]. To make predictions, traditional algorithms combine
statistical features and linguistic patterns with classifiers such as Support Vector Machines
(SVMs) [9]. On the other hand, deep learning methods usually involve fine-tuning pre-trained
language models using supervised learning in order to make predictions. These deep learning
approaches often obtain state-of-the-art results but are harder to interpret, which means they
are also harder to trust, as well.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>This section describes the diferent classification methods we tried and the final ensemble
architectures we submitted.</p>
      <sec id="sec-3-1">
        <title>3.1. Shallow Learning Models</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Readability Scores</title>
          <p>Similar to Stodden and Venugopal [10], we combined several linguistic features with pre-trained
embeddings. Specifically, we computed the following readability scores: the Flesch reading ease
score [11], the Gunning-Fog index [12], and the SMOG index [13]. The intuition behind this
choice was that LLMs might not consider the ease of comprehension when generating texts.
For example, the generated legal texts might be harder to understand than those written by
humans. To compute the aforementioned scores, we used the Readability Python library [14],
which ofered 35 such features.</p>
          <p>Then, we concatenated the readability scores with document-level pre-trained embeddings
ofered by the spaCy library [ 15], which are 300-dimensional and language-specific. The
English embeddings are based on GloVe [16], while the Spanish ones are based on
FastText [17]. Finally, all features were scaled to have zero mean and unit variance with scikit-learn’s
StandardScaler [18], before using them to train two classifiers, namely XGBoost [ 19] and
k-Nearest Neighbors (kNN) [18].</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. String Kernels</title>
          <p>We also experimented with string kernels [20], which are kernel functions that measure the
degree of similarity between two strings. An example of a simple string kernel counts the
number of n-grams shared by the two strings without considering duplicates. Such a function
can be computed for multiple sizes of n-grams, and used as the kernel function of a classifier
such as an SVM.</p>
          <p>We performed common natural language preprocessing operations on the input text:
removing punctuation, removing stopwords, lowercasing all letters, and stemming the words. We
used n-gram sizes between 3 and 5, and the SVM classifier implemented in scikit-learn [ 18].
Since custom kernels might need to be computed between each pair of input samples, using the
entire training dataset would have taken a long time, so we tested the method only on a small
slice of it, comprising several thousand samples.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Deep Learning Models</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Transformers</title>
          <p>The Transformer architecture was introduced in 2017 by Vaswani et al. [21] and is currently
powering numerous state-of-the-art solutions for many tasks. Transformers usually feature two main
components: an encoder and a decoder. However, these two parts can be useful by themselves as
well. One example is the Bidirectional Encoder Representations from Transformers (BERT) [22]
model family, which can encode input text into contextual embeddings.</p>
          <p>We experimented with several BERT versions: multilingual ones (i.e., XLM-RoBERTa [23]
and multilingual BERT [22]), and one pre-trained on tweets (i.e., TwHIN-BERT [24]). Since
BERT models can be large and typically require large amounts of data to train from scratch, we
utilize transfer learning [25] instead, by fine-tuning pre-trained models.</p>
          <p>We experimented with Transformer-based models to encode the raw input text into
embeddings, which we then connected to a linear layer, followed by a dropout layer [26] before the
ifnal prediction head. The last layer produces a probability of the document being
computergenerated, and the binary prediction is chosen by comparing it with a threshold.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Multi-Task Learning</title>
          <p>As an additional method of preventing overfitting, we used the technique of multi-task learning.
MTL refers to training a model to solve multiple tasks simultaneously. As such, these models
typically feature a set of parameters shared for all tasks, and separate prediction heads for each
task. Intuitively, multi-task learning should make the task harder to solve, thus adding extra
complexity, which the model has to adapt to.</p>
          <p>In our case, an extra task that is easy to derive is predicting the language of a given input
document. More precisely, apart from predicting the human/computer label of a document, the
model has to detect whether it is written in English or Spanish. This means that, for training,
we combined the two datasets supplied for Subtask 1. However, we did not use any of the data
provided for Subtask 2.</p>
          <p>The MTL architecture is very similar to the one presented in the previous section, only adding
an extra classification head. The architecture can be seen in Figure 1a. Since both tasks involve
binary classification, we compute a binary cross-entropy loss for each of them, namely ℒbot for
the human/computer classification task, and ℒlang for the language detection task. The final
loss of the model is a combination of these two losses, as given by the following formula:
ℒ =  ℒbot + (1 −  )ℒlang
(1)
where hyperparameter  controls how much attention is paid to each task.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Virtual Adversarial Training</title>
          <p>VAT [27] is another regularization technique for deep learning models. It aims to help models
generalize better by perturbing the inputs to maximize the loss function. For our models, the
inputs refer to the embeddings of the raw documents, not to the token IDs.</p>
          <p>In our case, this method implies performing the forward and backward passes multiple times
in order to compute the gradients. Then, the loss function specific to VAT gets added to the
regular loss function, the final loss being the sum of the two. We added VAT to our models
using the VAT-pytorch Python library [28], which implements the distributional smoothing
technique described by Miyato et al. [6].</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ensemble Learning</title>
        <p>Ensemble techniques combine multiple diferent models to make better predictions. Intuitively,
they should make the weaknesses of each model matter less since if one model happens to have
poor performance on a certain edge case, all the other models will probably give better results,
thus negating the impact of the incorrect prediction.</p>
        <p>While there are multiple ways of combining models into ensembles (such as majority voting
or bagging) [29], we decided to use the stacking technique inspired by the work of Gaman [20].
Thus, we train an extra meta-learner model, which learns to make predictions based on the
outputs of each model in the ensemble. We experimented with an XGBoost classifier, which
takes as input the probabilities produced by each model, and the binary predictions they make.
The submitted final ensembles can be seen in Figure 1b.
(a) The MTL architecture used
for run 1.</p>
        <p>(b) Summary of the ensemble architectures used for runs 2 and 3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The training dataset provided for Subtask 1 consists of approximately 33,845 English documents,
and 32,062 Spanish documents. For both languages, the ratio of computer-generated to
humangenerated texts was roughly 50%. This suggested that the dataset we worked with was fairly
well-balanced, and we did not attempt to use any techniques for dealing with imbalanced data.</p>
        <p>While the individual BERT classifier (MLP) was trained for the two languages separately,
the multi-task learning experiments merged the two slices into a single dataset. In both cases,
we used 70% of the labeled data to train the models, and we set aside the other 30% for
validation. This split was fixed at the beginning of the project to allow us to compare the models’
performance. However, when creating the final predictions, we retrained all models on the full
labeled dataset, only using 2% of it (around 1,300 samples) as a validation set to monitor the
training process.</p>
        <p>As seen in Figure 2, there is a relatively large number of documents with fewer than 200
characters. Since the task organizers described one of the domains in the dataset as social media,
and Twitter usually limits user posts to approximately 300 characters at most, we assumed
these samples to be tweets. For this reason, we experimented with a language model pre-trained
specifically on tweets, namely TwHIN-BERT.</p>
        <p>We did not perform any data preprocessing for the deep learning models, using the raw
documents as input instead.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Hyperparameters</title>
        <p>For the deep learning models, we used a hidden layer of size 64, and a dropout rate of 0.2. For
MTL, we assigned the same weight to the two loss values, i.e.,  = 0.5. We used the default
values for VAT, namely  = 1,  = 1, and  = 10.</p>
        <p>One of the parameters we did not set to a fixed value was the threshold for turning the
probabilities into binary predictions. To choose this threshold, we compute the true positive
rate (TPR) and false positive rate (FPR) of the Receiver Operating Characteristic curve for the
validation set. Since the dataset was balanced, we did not want to favor either precision or
recall, so we picked the threshold which brings the sum of TPR and FPR closest to 1. In our
experiments, the threshold was often greater than 0.9, sometimes over 0.95.</p>
        <p>We used the AdamW optimizer [30] implemented in PyTorch [31] with a learning rate of
10− 5, and 2-4 epochs for training. In order to avoid overfitting, we used both a dropout layer
and the early stopping technique. We used batch sizes between 24 and 48, depending on the
model size.</p>
        <p>For the kNN shallow model, we set the parameter  to 10. For the final ensembles, we
performed a grid search to find the hyperparameters of the XGBoost model, which maximized the
macro F1-score on a small validation set for each of the two languages. We searched for
estimators between 2 and 30, depths between 3 and 10, and learning rates between 10− 5 and 10− 1. The
best hyperparameters were: n_estimators = 3, max_depth = 5, and learning_rate = 10− 3.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>The results obtained for Subtask 1 can be seen in Tables 1 and 2, alongside the baselines
provided by the task organizers, the best results obtained by other participant teams, and our
other experiments, which we did not submit.</p>
        <p>As expected, the ensembles performed better than the MTL model (i.e., run1) on both
datasets. However, the scores obtained on the test set are much smaller than those obtained
on our validation set. This fact could indicate that our chosen validation set was too small or
poorly chosen, or that the distribution of the test set is diferent from that of the training set. It
would have been interesting to see if methods such as cross-validation would have produced
other validation scores closer to the real performance.</p>
        <p>Our experiments indicated that the choice of Transformer mattered as well, with multilingual
models being generally better for this use case. Similarly, the fine-tuned embeddings produced
during training were better than the pre-trained ones provided by spaCy, and hence, we did not
explore the shallow learning direction more.</p>
        <p>We did not perform a grid search or any other type of hyperparameter search for the deep
learning models, so our approaches could obtain better results simply by choosing appropriate
hyperparameter values. Another thing to note is that while the full ensemble performed better
on the English dataset, removing the TwHIN-BERT Transformer slightly improved the results
on the Spanish dataset.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, we proposed multiple methods for addressing the task of detecting LLM-generated
text. While we experimented briefly with more classical machine learning classifiers, we saw
that Transformer-based models performed better for this task. We described our experiments
with several classification models powered by BERT and detailed some regularization techniques,
which improved their performance slightly. Finally, we stacked multiple such models to form
ensembles, leading to even better performance and achieving our best macro F1-scores of 66.63%
on the English dataset, and 67.10% on the Spanish dataset of the AuTexTification shared task,
Subtask 1.</p>
      <p>Regarding future work, we could improve the choice of hyperparameters since they are
often crucial for achieving good performance, and techniques such as grid search should find
better values. Similarly, the training process can be improved by using better methods to avoid
overfitting, and increasing the number of training epochs.
[6] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, S. Ishii, Distributional smoothing with virtual
adversarial training, arXiv preprint arXiv:1507.00677 (2015).
[7] C. L. Paris, W. R. Swartout, W. C. Mann, Natural language generation in artificial
intelligence and computational linguistics, volume 119, Springer Science &amp; Business Media,
2013.
[8] R. Tang, Y.-N. Chuang, X. Hu, The science of detecting llm-generated texts, arXiv preprint
arXiv:2303.07205 (2023).
[9] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector machines, IEEE</p>
      <p>Intelligent Systems and their applications 13 (1998) 18–28.
[10] R. Stodden, G. Venugopal, Rs_gv at semeval-2021 task 1: Sense relative lexical complexity
prediction, in: Proceedings of the 15th International Workshop on Semantic Evaluation
(SemEval-2021), 2021, pp. 640–649.
[11] R. Flesch, A new readability yardstick., Journal of applied psychology 32 (1948) 221.
[12] R. Gunning, Technique of clear writing, McGraw-Hill (1952).
[13] G. H. Mc Laughlin, Smog grading-a new readability formula, Journal of reading 12 (1969)
639–646.
[14] A. van Cranenburgh, Readability, https://github.com/andreasvc/readability, 2022. Accessed
15 Jun 2023.
[15] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural</p>
      <p>Language Processing in Python (2020). doi:10.5281/zenodo.1212303.
[16] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in:
Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), 2014, pp. 1532–1543.
[17] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training
distributed word representations, in: Proceedings of the International Conference on
Language Resources and Evaluation (LREC 2018), 2018.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python,
Journal of Machine Learning Research 12 (2011) 2825–2830.
[19] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the
22nd acm sigkdd international conference on knowledge discovery and data mining, 2016,
pp. 785–794.
[20] M. Gaman, Using ensemble learning in language variety identification, in: Tenth Workshop
on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), 2023, pp. 230–240.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I.
Polosukhin, Attention is all you need, Advances in neural information processing systems 30
(2017).
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[23] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave,
M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, 2020, pp. 8440–8451.
[24] X. Zhang, Y. Malkov, O. Florez, S. Park, B. McWilliams, J. Han, A. El-Kishky, Twhin-bert:
A socially-enriched pre-trained language model for multilingual tweet representations,
arXiv preprint arXiv:2209.07562 (2022).
[25] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning, Journal of Big data
3 (2016) 1–40.
[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple
way to prevent neural networks from overfitting, The journal of machine learning research
15 (2014) 1929–1958.
[27] T. Miyato, S.-i. Maeda, M. Koyama, S. Ishii, Virtual adversarial training: a regularization
method for supervised and semi-supervised learning, IEEE transactions on pattern analysis
and machine intelligence 41 (2018) 1979–1993.
[28] S. Yokoo, Vat-pytorch, https://github.com/lyakaap/VAT-pytorch, 2018. Accessed 15 Jun
2023.
[29] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles
for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE
Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42
(2011) 463–484.
[30] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun
(Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego,
CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,
L. Antiga, A. Lerer, Automatic diferentiation in pytorch, in: NIPS 2017 Workshop on
Autodif, 2017. URL: https://openreview.net/forum?id=BJJsrmfCZ.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Improving language understanding by generative pre-training</article-title>
          ,
          <source>OpenAI</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Chatgpt: Optimizing language models for dialogue</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sarvazyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Franco</given-names>
            <surname>Salvador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Overview of autextification at iberlef 2023:
          <article-title>Detection and attribution of machine-generated text in multiple domains</article-title>
          ,
          <source>in: Procesamiento del Lenguaje Natural</source>
          , Jaén, Spain,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y</surname>
          </string-name>
          <string-name>
            <surname>Gómez</surname>
          </string-name>
          ,
          <source>Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>A unified architecture for natural language processing: Deep neural networks with multitask learning</article-title>
          ,
          <source>in: Proceedings of the 25th international conference on Machine learning</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>167</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>