<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Team Chen at PAN: Integrating R-Drop and Pre-trained Language Model for Multi-author Writing Style Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhaotian Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yusheng Yi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This paper presents our experiments on the PAN Multi-Author Writing Style Analysis task at CLEF 2024. The task is divided into three increasingly difficult subtasks according to the topic consistency between paragraphs. At the easy level, style changes must be detected between paragraphs that cover multiple topics. At the medium level, topic diversity is small, forcing the method to focus more on style. At the hard level, subtle style differences must be identified between paragraphs on the same topic. The task therefore requires not only distinguishing different topics but also capturing changes in writing style within the same topic. To address the task, we select the pre-trained language model RoBERTa as the foundation model and fine-tune it to detect the styles and topics of texts. Additionally, we employ R-Drop regularization to reduce overfitting during fine-tuning, thereby enhancing generalization to unseen texts. Experimental results show that our model achieves F1 scores of 0.968, 0.822, and 0.807 on the test sets of the three difficulty levels, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2024</kwd>
        <kwd>Multi-Author Writing Style Analysis</kwd>
        <kwd>Regularization</kwd>
        <kwd>R-Drop</kwd>
        <kwd>Pre-trained Language Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The task of multi-author writing style analysis aims to find all positions of writing style change at the
paragraph level in a given multi-author document [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Detecting style changes can therefore help
identify the current author, verify claimed authorship, and detect the risk
of plagiarism in documents. Particularly when no comparison text is available, detecting
style changes becomes the sole method for identifying plagiarism in documents.
      </p>
      <p>
        In recent years, PAN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has organized a series of tasks to detect writing style changes in text, ranging
from determining the actual number of authors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and identifying style changes between two consecutive
paragraphs [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] to detecting style changes at the sentence level [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This year, the Style Change Detection task
focuses on paragraphs: it detects writing style changes between every pair of consecutive paragraphs in a
given text.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Large-scale pre-trained models, such as BERT[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], RoBERTa[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], etc., often contain millions or even
billions of parameters. Although larger models tend to exhibit better performance, they are highly
susceptible to overfitting. During the fine-tuning process of the pre-trained model for the style change
detection task, it was observed that despite a continuous decrease in training loss, the F1 score on the
validation set remained unsatisfactory. Upon closer examination of both the training and validation
losses, it was revealed that while the training loss was steadily declining, the validation loss was
progressively increasing. To address this issue, researchers have proposed various regularization
methods, including weight decay [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ], dropout [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14, 15</xref>
        ], normalization [16, 17, 18], adding
noise [19], layer-wise pre-training and initialization [20, 21], label smoothing [22], and more. Among
these methods, dropout and its variants have garnered significant attention due to their effectiveness
and compatibility with other regularization techniques.
      </p>
      <p>Dropout enhances generalization by inhibiting the co-adaptation of neurons and
implicitly creating an ensemble of multiple sub-models. R-Drop regularization [23] is a variant of dropout.
In contrast to the conventional dropout strategy in neural network training, its core idea is to enforce
consistent predictions from sub-models with different dropout masks during training.
To do this, R-Drop minimizes the Kullback-Leibler divergence between the outputs of any two sub-models
sampled with different dropout masks. The method improves the model's ability
to generalize and lowers the risk of overfitting by effectively reducing the degrees of freedom of the
model parameters. Consequently, it enhances the stability and generalization capability of
inference.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The task presents three datasets of varying difficulty levels, categorized by the diversity of topics
and the topical consistency within the documents. Each dataset corresponds to one subtask.</p>
      <list list-type="bullet">
        <list-item><p>Easy: The paragraphs of a document cover a variety of topics, allowing approaches to make use
of topic information to detect authorship changes.</p></list-item>
        <list-item><p>Medium: The topical variety in a document is small (though still present), forcing the approaches
to focus more on style to effectively solve the detection task.</p></list-item>
        <list-item><p>Hard: All paragraphs in a document are on the same topic.</p></list-item>
      </list>
      <p>Each dataset is divided into three parts: training set, validation set and testing set. The training set
and the validation set include ground truth data, while the testing set does not provide ground truth
data. Table 1 provides statistical information about the datasets. Note that "Samples" specifically refers
to data units composed of two consecutive paragraphs from the documents, used to analyze whether
there is a style change between the two paragraphs. For details on how the samples were constructed,
please refer to Section 4.1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>The methodology presented in this paper comprises three primary steps: 1) data preparation, 2)
R-Drop regularization, and 3) model fine-tuning. The goal is to attain high precision and recall
when classifying unseen data. To accomplish this, a pre-trained language model is fine-tuned
for the downstream task, and R-Drop
regularization is employed to enhance the model’s generalization capability.</p>
      <sec id="sec-4-1">
        <title>4.1. Data Preparation</title>
        <p>To create the samples, we first marked the junction between two consecutive paragraphs in each
document using delimiters. Subsequently, we assigned binary labels indicating whether there was
a style change between the two paragraphs. This enabled us to transform the task into a binary
classification problem. In order to prepare the samples for fine-tuning the pre-trained RoBERTa model,
we adopted the corresponding tokenizer for RoBERTa. RoBERTa has a limit on the maximum input
sequence length, typically 512 tokens. Upon analyzing the dataset, we found that only a few samples
exceeded the maximum token limit. Therefore, we opted for a truncation strategy to handle samples
exceeding the maximum input sequence length.</p>
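        <p>The pairing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact preprocessing code: the function name <monospace>build_samples</monospace>, the delimiter string, and the in-memory input format (a list of paragraphs plus one 0/1 label per paragraph boundary) are hypothetical choices made for the example.</p>
        <preformat>
```python
from typing import List, Tuple

# Hypothetical delimiter marking the junction between two paragraphs;
# the actual delimiter string is not specified in the text.
DELIMITER = " [SEP] "


def build_samples(paragraphs: List[str], changes: List[int]) -> List[Tuple[str, int]]:
    """Turn one document into (paragraph-pair text, label) samples.

    changes[i] is 1 if the style changes between paragraphs[i] and
    paragraphs[i + 1], and 0 otherwise.
    """
    if len(changes) != len(paragraphs) - 1:
        raise ValueError("expected one label per pair of consecutive paragraphs")
    samples = []
    for i, label in enumerate(changes):
        # Join the two neighbouring paragraphs around the delimiter.
        text = paragraphs[i] + DELIMITER + paragraphs[i + 1]
        samples.append((text, label))
    return samples


# Toy document with three paragraphs and therefore two boundaries.
doc = ["First paragraph.", "Second paragraph.", "Third paragraph."]
samples = build_samples(doc, [1, 0])
```
        </preformat>
        <p>Each resulting sample is a single string with a binary label, so it can be passed to a tokenizer and a binary classifier without further restructuring.</p>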
      </sec>
      <sec id="sec-4-2">
        <title>4.2. R-Drop Regularization</title>
        <p>Dropout randomly drops a fraction of the units in each layer of the neural network to avoid co-adaptation and
overfitting. Dropout also approximately combines exponentially many different
neural network architectures efficiently, and such model combination generally improves
performance. Despite its simplicity and efficacy, dropout introduces a significant inconsistency between
the training and inference phases, which can impede model performance. To address this
issue, adding the R-Drop regularization term to the training objective encourages consistent
predictions for identical inputs across different dropout masks, which regulates
the inconsistency arising from dropout during training. Specifically, for each training batch,
two forward passes with distinct dropout masks are performed on the same data batch.
Subsequently, the Kullback-Leibler (KL) divergence, a widely used measure of the disparity
between two probability distributions, is computed between the two prediction outcomes.</p>
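        <p>Given the input $x$, $P_1(y \mid x)$ and $P_2(y \mid x)$ denote the probability distributions of $y$ predicted
by the model under different sets of parameters (caused by dropout, such as $\theta_1$ and $\theta_2$), respectively.
The KL divergence between $P_1(y \mid x)$ and $P_2(y \mid x)$ is given by:</p>
        <disp-formula><label>(1)</label><tex-math>\mathrm{KL}\big(P_1(y \mid x) \,\|\, P_2(y \mid x)\big) = \sum_{y} P_1(y \mid x) \log \frac{P_1(y \mid x)}{P_2(y \mid x)}</tex-math></disp-formula>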
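        <p>As a small worked example of the KL divergence between two discrete distributions, the sketch below evaluates it for two hypothetical two-class predictions. The function name <monospace>kl_divergence</monospace> and the example probabilities are illustrative assumptions, not values from our experiments.</p>
        <preformat>
```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) = sum_y p(y) * log(p(y) / q(y)) for discrete distributions.

    Assumes q(y) is positive wherever p(y) is positive.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    total = 0.0
    for p_y, q_y in zip(p, q):
        if p_y > 0.0:  # terms with p(y) = 0 contribute nothing
            total += p_y * math.log(p_y / q_y)
    return total


# Hypothetical predictions for the same input under two dropout masks.
p1 = [0.9, 0.1]
p2 = [0.6, 0.4]
kl_12 = kl_divergence(p1, p2)
kl_21 = kl_divergence(p2, p1)
```
        </preformat>
        <p>The two directions give different values because the KL divergence is not symmetric, and the divergence of a distribution from itself is zero.</p>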
        <p>To incorporate this discrepancy into the training process, R-Drop adds the calculated KL divergence as
a regularization term to the loss function and uses the parameter $\alpha$ to control the
weight of the KL divergence. The loss function employed by R-Drop is given in Equation 2
below, where $P^{(1)}$ and $P^{(2)}$ are the two dropout-perturbed predictive distributions of Equation 1. The terms $-\log P^{(1)}(y \mid x)$ and $-\log P^{(2)}(y \mid x)$ are the negative
log probabilities of correctly predicting the label $y$ conditional on the input $x$. These probabilities
are obtained from two sub-models, both generated by applying different dropout masks to
the same neural network.</p>
        <disp-formula><label>(2)</label><tex-math>\mathcal{L} = -\log P^{(1)}(y \mid x) - \log P^{(2)}(y \mid x) + \alpha \, \mathrm{KL}\big(P^{(1)}(y \mid x) \,\|\, P^{(2)}(y \mid x)\big)</tex-math></disp-formula>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model Fine-Tuning</title>
        <p>In the task, the RoBERTa pre-trained model was chosen as the base model to leverage the rich linguistic
representations it has learned from a large corpus, enhancing the performance of downstream tasks. To
reduce the risk of model overfitting, a Dropout layer was introduced after the output of the RoBERTa
model. Dropout is a widely used regularization technique that randomly discards neurons to reduce
model complexity and enhance model generalization. During model fine-tuning, R-Drop regularization
was employed to further improve the model’s generalization capability. To adapt the pre-trained model
to the style change detection task, a fully connected linear output layer was added on top of the model.
This output layer uses the softmax activation function to generate a probability distribution for each
category, enabling the model to learn and classify whether there is a style change between consecutive
paragraphs.</p>
        <p>Algorithm 1 details the implementation of R-Drop in the model fine-tuning process. Specifically, for
each training batch, two forward passes with different dropout masks are performed on the same batch of
data, and the Kullback-Leibler (KL) divergence between the two prediction results is calculated. Through
this approach, the model not only takes full advantage of the language representation capabilities of
RoBERTa but also mitigates overfitting. This enables the model to effectively identify
and classify style changes in unseen texts, thereby enhancing overall performance on the task.</p>
        <preformat>Algorithm 1: Fine-tuning with integrated R-Drop
Require: Training dataset $D = \{(x_i, y_i)\}$
Require: Neural network model $f$ with parameters $\theta$
Require: Number of training epochs $E$
Require: Learning rate $\eta$
Require: Balance factor $\alpha$ for KL divergence loss
Ensure: Trained model $f$ with optimized parameters $\theta^*$
1: Initialize model parameters $\theta$ randomly or with pre-trained weights
2: Initialize optimizer with learning rate $\eta$
3: for epoch $e = 1$ to $E$ do
4:   for each batch $(x, y)$ in $D$ do
5:     Forward pass the model $f$ twice with $x$ to obtain two sets of outputs: $p_1$ and $p_2$
6:     Compute cross-entropy losses: $L_1 = \mathrm{CE}(y, p_1)$, $L_2 = \mathrm{CE}(y, p_2)$
7:     Compute KL divergence loss: $L_{KL} = \alpha \cdot \mathrm{KL}(p_1 \| p_2) + \alpha \cdot \mathrm{KL}(p_2 \| p_1)$
8:     Compute total loss: $L = L_1 + L_2 + L_{KL}$
9:     Backpropagate the total loss $L$ to update model parameters $\theta$
10:   end for
11: end for
12: return trained model $f$ with optimized parameters $\theta^*$</preformat>
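        <p>Steps 5 to 8 of Algorithm 1 can be illustrated with the following sketch in plain Python. It is not our training code: a toy linear classifier with dropout on the input features stands in for RoBERTa, all function names and the example weights are hypothetical, and the gradient update of step 9 is left out because it would normally be handled by an automatic differentiation framework. The dropout rate 0.5 and the balance factor 5 follow the settings reported in Section 5.1.</p>
        <preformat>
```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def dropout(x: List[float], rate: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if keep > rng.random() else 0.0 for v in x]


def forward(x, weights, rate, rng) -> List[float]:
    """Toy classifier: dropout on the features, then a linear layer and softmax."""
    h = dropout(x, rate, rng)
    logits = [sum(w * v for w, v in zip(row, h)) for row in weights]
    return softmax(logits)


def kl(p: List[float], q: List[float]) -> float:
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def rdrop_loss(x, y, weights, rate, alpha, rng) -> float:
    p1 = forward(x, weights, rate, rng)  # step 5: first pass
    p2 = forward(x, weights, rate, rng)  # step 5: second pass, new dropout mask
    ce1 = -math.log(p1[y])               # step 6: cross-entropy of each pass
    ce2 = -math.log(p2[y])
    l_kl = alpha * kl(p1, p2) + alpha * kl(p2, p1)  # step 7: both KL directions
    return ce1 + ce2 + l_kl              # step 8: total loss


# Hypothetical 2-class model over 4 input features.
weights = [[0.2, -0.1, 0.4, 0.0], [-0.3, 0.5, 0.1, 0.2]]
x = [1.0, 2.0, -1.0, 0.5]
loss = rdrop_loss(x, 1, weights, rate=0.5, alpha=5.0, rng=random.Random(0))
```
        </preformat>
        <p>When the dropout rate is set to 0, the two passes coincide, the KL term vanishes, and the loss reduces to twice the ordinary cross-entropy, which makes the role of the regularization term easy to see.</p>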
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental settings</title>
        <p>We use a RoBERTa model comprising 12 transformer layers, 768 hidden units, and
12 attention heads. The hyperparameters are set as follows: the maximum sequence length is
512, the learning rate is 0.00001, the batch size is 32, the number of epochs is
7, and the dropout rate is 0.5. The coefficient weight $\alpha$ for R-Drop is set to 5.</p>
        <p>To evaluate the effectiveness of the model for each subtask, performance is assessed by calculating
the F1 score on the provided validation set. After conducting experiments and obtaining results on the
validation set, the best-performing model for each subtask is selected.</p>
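        <p>For reference, the F1 score combines precision and recall as their harmonic mean. The sketch below computes it for the positive class (style change) of a binary labelling. It is an illustrative assumption: the function name and the toy labels are hypothetical, and the averaging variant used by the official evaluation is not stated in this section.</p>
        <preformat>
```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = style change, 0 = no change.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
```
        </preformat>
        <p>In this toy example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.</p>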
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Result</title>
        <p>The best-performing model for each subtask was submitted to TIRA [24] for execution to obtain
the final performance figures. Table 2 reports the F1 scores achieved by
the model on the official test sets. Compared to the baseline approaches provided by PAN, which predict
either no style change (all-0) or all style changes (all-1), our method achieves a minimum improvement
of 1-fold and a maximum improvement of up to 7-fold.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Ablation experiments</title>
        <p>To demonstrate the effectiveness of R-Drop in this task, we conducted experiments while keeping the other
parts of the model unchanged. We compared the model's performance on the validation set
with and without R-Drop. Table 3 presents the experimental results.</p>
        <p>[Figure 1. Vertical axis: F1 Score.]</p>
        <p>Figure 1 shows the effect of adjusting the dropout parameter while keeping the other parameters unchanged,
which we did in order to optimize the model’s performance on complex datasets.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper briefly presents our work on the PAN 2024 multi-author writing style
analysis task. We fine-tune the RoBERTa model with R-Drop regularization to obtain the
final results. This approach achieved promising outcomes across three subtasks of varying difficulty
levels, demonstrating its effectiveness in tackling complex writing style analysis challenges. An ablation
study further validated the contribution of R-Drop regularization to preventing overfitting and enhancing
model performance.</p>
      <p>However, the lack of analysis on error cases limits our understanding of the model’s limitations and
potential for improvement. Future research should delve deeper into error case analyses to identify the
model’s weaknesses and devise targeted solutions.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (No. 62276064).</p>
      <p>[15] S. Wang, C. Manning, Fast dropout training, International Conference on Machine Learning (2013).</p>
      <p>[16] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal
covariate shift, arXiv: Learning (2015).</p>
      <p>[17] L. Huang, X. Liu, B. Liu, et al., Orthogonal weight normalization: Solution to optimization over
multiple dependent Stiefel manifolds in deep neural networks, National Conference on Artificial
Intelligence (2017).</p>
      <p>[18] Y. Wu, K. He, Group normalization, International Journal of Computer Vision (2020) 742–755.
URL: http://dx.doi.org/10.1007/s11263-019-01198-w. doi:10.1007/s11263-019-01198-w.</p>
      <p>[19] S. Hochreiter, J. Schmidhuber, Simplifying neural nets by discovering flat minima, Neural
Information Processing Systems (1994).</p>
      <p>[20] D. Erhan, P.-A. Manzagol, Y. Bengio, et al., The difficulty of training deep architectures and the
effect of unsupervised pre-training (2009).</p>
      <p>[21] K. He, X. Zhang, S. Ren, et al., Delving deep into rectifiers: Surpassing human-level performance
on ImageNet classification, in: 2015 IEEE International Conference on Computer Vision (ICCV),
2015. URL: http://dx.doi.org/10.1109/iccv.2015.123. doi:10.1109/iccv.2015.123.</p>
      <p>[22] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., Rethinking the inception architecture for computer
vision, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. URL:
http://dx.doi.org/10.1109/cvpr.2016.308. doi:10.1109/cvpr.2016.308.</p>
      <p>[23] L. Wu, J. Li, Y. Wang, et al., R-Drop: Regularized dropout for neural networks, Advances in Neural
Information Processing Systems 34 (2021) 10890–10905.</p>
      <p>[24] M. Fröbe, M. Wiegmann, N. Kolyada, et al., Continuous Integration for Reproducible Shared Tasks
with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research
(ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp.
236–241. doi:10.1007/978-3-031-28241-6_20.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , et al.,
          <source>Overview of the Multi-Author Writing Style Analysis Task at PAN</source>
          <year>2024</year>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-WS</article-title>
          .org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          , et al.,
          <source>Overview of PAN</source>
          <year>2024</year>
          <article-title>: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschuggnall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Specht</surname>
          </string-name>
          , et al.,
          <source>Overview of the Style Change Detection Task at PAN</source>
          <year>2019</year>
          , in: L.
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Losada</surname>
          </string-name>
          , H. Müller (Eds.),
          <article-title>CLEF 2019 Labs and Workshops, Notebook Papers, CEUR-WS</article-title>
          .org,
          <year>2019</year>
          . URL: http://ceur-ws.org/Vol-2380/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Specht</surname>
          </string-name>
          , et al.,
          <source>Overview of the Style Change Detection Task at PAN</source>
          <year>2020</year>
          , in: L.
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Névéol (Eds.),
          <article-title>CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS</article-title>
          .org,
          <year>2020</year>
          . URL: http://ceur-ws.org/Vol-2696/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , et al.,
          <source>Overview of the Style Change Detection Task at PAN</source>
          <year>2021</year>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          (Eds.),
          <article-title>CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS</article-title>
          .org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , et al.,
          <source>Overview of the Style Change Detection Task at PAN</source>
          <year>2022</year>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , M. Potthast (Eds.),
          <article-title>CLEF 2022 Labs and Workshops, Notebook Papers, CEUR-WS</article-title>
          .org,
          <year>2022</year>
          . URL: http://ceur-ws.org/Vol-3180/paper-186.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North</source>
          ,
          <year>2019</year>
          . URL: http://dx.doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          , et al.,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , Cornell University - arXiv (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hertz</surname>
          </string-name>
          ,
          <article-title>A simple weight decay can improve generalization</article-title>
          ,
          <source>Neural Information Processing Systems</source>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          ,
          <source>Communications of the ACM</source>
          (
          <year>2017</year>
          )
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          . URL: http://dx.doi.org/10.1145/3065386. doi:10.1145/3065386.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Learning structured sparsity in deep neural networks</article-title>
          ,
          <source>Neural Information Processing Systems</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , et al.,
          <article-title>Improving neural networks by preventing coadaptation of feature detectors</article-title>
          , Cornell University - arXiv (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Regularization of neural networks using dropconnect</article-title>
          ,
          <source>International Conference on Machine Learning</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <article-title>Adaptive dropout for training deep neural networks</article-title>
          ,
          <source>Neural Information Processing Systems</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>