<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Citation Recommendation: A Reproducibility Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Michael Färber</string-name>
          <email>michael.faerber@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timo Klein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joan Sigloch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Karlsruhe Institute of Technology (KIT)</institution>
          ,
          <addr-line>Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>66</fpage>
      <lpage>74</lpage>
      <abstract>
<p>Context-aware citation recommendation is used to overcome the process of manually searching for relevant citations by automatically recommending suitable papers as citations for a specified input text. In this paper, we examine the reproducibility of a state-of-the-art approach to context-aware citation recommendation, namely the neural citation network (NCN) by Ebesu and Fang [1]. We re-implement the network and run evaluations on both RefSeer, the originally used data set, and arXiv CS, as an additional data set. We provide insights into how the different hyperparameters of the neural network affect the model performance of the NCN and thus can be used to improve the model's performance. In this way, we contribute to making citation recommendation approaches and their evaluations more transparent and to creating more effective neural network-based models in the future.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>bibliometrics</kwd>
        <kwd>citation context</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
Citing sources is an essential part of academia to guarantee transparency and
truthfulness. However, the process of finding relevant and appropriate citations
is becoming increasingly time-consuming and difficult due to the sheer amount
of new literature published every year [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Citation recommendation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been
proposed to overcome this issue. This task refers to the idea of generating a
ranked list of potentially suitable citations in an automated way, thus facilitating
the process of choosing correct citations.
      </p>
      <p>
According to He et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], there are two types of citation recommendation
tasks, namely global citation recommendation and local citation recommendation.
The former is used to propose candidates for the bibliography of a given
scientific manuscript that does not yet have a bibliography. Local citation
recommendation, on the other hand, proposes candidates for a given citation
placeholder (e.g., "[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]") located in the written text of a scientific document. In
order to generate recommendations, the text surrounding the placeholder, often
referred to as the citation context, is used as an input into the recommender
system. The output consists of a ranked list containing candidates for the query
placeholder.
      </p>
      <p>
In recent years, several approaches to global and local citation
recommendation have been proposed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this paper, we analyze the
reproducibility of one specific local citation recommendation approach, namely
the neural citation network (NCN) by Ebesu and Fang [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We chose this
approach due to its recency, its promising results on a large data set, and
its wide acceptance in the scientific community (based on citation counts). Note
that we were unable to run the source code published online by Ebesu and
Fang. Furthermore, the Python version used by Ebesu and Fang is outdated.
Thus, after re-implementing the network, we used both the original data set and
another data set for training and evaluating the NCN in order to examine its
performance under varying circumstances.
      </p>
      <p>Overall, we make the following contributions in this paper:
1. We re-implement the NCN by Ebesu and Fang [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a state-of-the-art approach to local citation recommendation.
2. We run extensive experiments based on the NCN using the original data set RefSeer and arXiv CS as a further data set.
3. We analyze the evaluation results and give noteworthy conclusions for the future development of local citation recommendation approaches.</p>
      <p>The rest of this paper is structured as follows: We give an overview of the
NCN architecture in Sec. 2. In Sec. 3, we present our experimental setup and
the evaluation results. We conclude in Sec. 4.</p>
    </sec>
    <sec id="sec-2">
      <title>The Neural Citation Network</title>
      <p>
The NCN proposed by Ebesu and Fang [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] consists of an encoder-decoder model
coupled with an attention mechanism (see Fig. 1).
      </p>
      <p>
        Encoders. Encoders are deployed as part of the NCN in order to turn the
raw citation context and the citing/cited authors' names into feature tensors
holding important information about the context and the authors, respectively.
1. Context encoder. The part of the NCN that is responsible for encoding
the citation context is a time-delay neural network (TDNN) introduced by
Collobert and Weston [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It allows multiple forward propagations through the
network at once, leading to all feature maps being calculated in parallel. The
TDNN used by Ebesu and Fang consists of a convolutional layer followed by
both a pooling layer and a fully connected layer.
2. Author encoder. In order to include author information when generating
citation recommendations, the NCN comprises an author encoder, which
uses the same architecture as the context encoder (outlined above). It is
separately applied to (1) the embeddings of the authors' names Aq of
the document from which the query context originated as well as (2) the
embeddings of the authors' names Ad of all documents in the database. The
author encoder is applied multiple times using TDNNs with varying region
filter sizes in the convolutional layer.
      </p>
      <p>[Fig. 1: Architecture of the NCN. The embeddings of the citation context Xq, the citing authors Aq, and the cited authors Ad each pass through a TDNN encoder (citation context encoder, citing author encoder, cited author encoder). An attention mechanism weights the encoder outputs into a context vector ci, which the RNN decoder combines with its hidden state hi-1 and the cited paper's title (given as word embeddings) to output, via a softmax, the probability for each cited document yi and the updated hidden state hi.]</p>
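      <p>To make the encoder architecture concrete, the following is a minimal PyTorch sketch of such a TDNN-style encoder (convolutional layer, max-over-time pooling, fully connected layer), applied once per filter region size. The class name and the default sizes (64 dimensions, region sizes 4, 4, 5) are our assumptions for illustration, not taken from the original code.</p>
      <preformat>
# Hypothetical TDNN-style encoder sketch; names and defaults are illustrative.
import torch
import torch.nn as nn

class TDNNEncoder(nn.Module):
    def __init__(self, embed_dim=64, num_filters=64, filter_sizes=(4, 4, 5)):
        super().__init__()
        # One 1D convolution per region size; all feature maps are computed
        # in parallel, as described above.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k)
             for k in filter_sizes])
        self.fc = nn.Linear(num_filters, num_filters)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim); Conv1d expects channels first.
        x = x.transpose(1, 2)
        feats = []
        for conv in self.convs:
            h = torch.tanh(conv(x))      # convolutional layer
            h = h.max(dim=2).values      # max-over-time pooling
            feats.append(self.fc(h))     # fully connected layer
        # One encoder output per filter region size.
        return torch.stack(feats, dim=1)  # (batch, n_regions, num_filters)
      </preformat>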
      <p>The final representation which results from applying the context encoder
and the author encoders is the concatenation
s = [f(Xq); f(Aq); f(Ad)],
with a given citation context representation Xq.</p>
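      <p>Under the same assumptions as the sketch above, the concatenation can be expressed as follows; context_enc, citing_enc, and cited_enc stand for three such encoder instances, and Xq, Aq, Ad for the embedded inputs:</p>
      <preformat>
# Illustrative only: builds s as the concatenated sequence of encoder outputs.
import torch

def combine(context_enc, citing_enc, cited_enc, X_q, A_q, A_d):
    return torch.cat([context_enc(X_q),   # f(Xq): context encodings
                      citing_enc(A_q),    # f(Aq): citing author encodings
                      cited_enc(A_d)],    # f(Ad): cited author encodings
                     dim=1)               # one attendable encoding s_j each
      </preformat>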
      <p>
Decoder. The NCN's decoder is a recurrent neural network (RNN) that
makes use of the gated recurrent unit (GRU) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as a gating mechanism as well
as the attention mechanism [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It is applied to the title of every document that
can be used as a citation for the query citation context (for very large
databases, a pre-selection algorithm may make sense to save computing time;
see Sec. 3.2). The purpose of the decoder is to generate scores for every
document in the database indicating its suitability as a citation for the given
query context. The scores can ultimately be used to generate citation
recommendations for the query context.
      </p>
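      <p>The following is a hedged sketch of one decoder step under the assumptions above: a single-layer GRU consumes the attention context vector together with the embedding of the previous title word and produces scores over the title vocabulary. The actual network may differ in layer count and sizes.</p>
      <preformat>
# Illustrative GRU decoder step; sizes and names are assumptions.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, embed_dim=64, enc_dim=64, hidden=64, vocab_size=20000):
        super().__init__()
        self.gru = nn.GRU(embed_dim + enc_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def step(self, word_emb, c, h_prev):
        # word_emb: (batch, embed_dim) embedding of the previous title word
        # c:        (batch, enc_dim)   attention context vector c_i
        # h_prev:   (1, batch, hidden) previous hidden state h_{i-1}
        x = torch.cat([word_emb, c], dim=1).unsqueeze(1)
        out, h = self.gru(x, h_prev)
        scores = self.out(out.squeeze(1))  # scores over the title vocabulary
        return scores, h                   # updated hidden state h_i
      </preformat>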
      <p>
Attention mechanism. The NCN makes use of the attention mechanism
originally introduced by Bahdanau et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. With the help of attention, the
encodings sj that originate from the context and author encoders are given
weights dependent on the decoder output hi-1 for the word prior to i. The
result is a context vector ci which is made up of a weighted sum of the encoder
outputs sj in accordance with their relevance. Attention is used to put emphasis
on encodings that are particularly important for the current time step. The
attention mechanism is implemented as a feed-forward neural network that
concludes with a softmax layer converting attention vectors aij into attention
scores αij. These indicate the importance of the encoder output sj for the ith
word in the title of the document currently being decoded.
      </p>
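      <p>A minimal sketch of this attention mechanism, assuming the Bahdanau-style formulation [6] (a small feed-forward network followed by a softmax); dimensions and names are ours:</p>
      <preformat>
# Illustrative Bahdanau-style attention: scores each encoder output s_j
# against the previous decoder state h_{i-1} and returns the context vector.
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, enc_dim=64, dec_dim=64, attn_dim=64):
        super().__init__()
        self.W = nn.Linear(enc_dim + dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h_prev, enc_outputs):
        # h_prev: (batch, dec_dim); enc_outputs: (batch, n, enc_dim)
        n = enc_outputs.size(1)
        h_rep = h_prev.unsqueeze(1).expand(-1, n, -1)
        # Feed-forward network produces the attention vectors a_ij ...
        a = self.v(torch.tanh(self.W(torch.cat([enc_outputs, h_rep], dim=2))))
        # ... and the softmax layer converts them into attention scores alpha_ij.
        alpha = torch.softmax(a.squeeze(2), dim=1)
        c = (alpha.unsqueeze(2) * enc_outputs).sum(dim=1)  # context vector c_i
        return c, alpha
      </preformat>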
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <sec id="sec-3-1">
        <title>Data Sets</title>
        <p>
We used two data sets in our evaluation.
1. RefSeer. Following Ebesu and Fang [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], we used RefSeer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as our first
data set. Although we followed Ebesu and Fang's instructions on creating
their evaluation data set, we were unable to generate exactly the same data set
based on the original RefSeer data, as we were unable to find any information
about citing authors within the data set, only cited authors. For comparison,
we decided to randomly select 4.5 M out of the generated 14.9 M citation
contexts in order to end up with the same data set size as the one used by
Ebesu and Fang. Note that the data set we reused did not contain author
information of the citing papers. We thus expected poorer performance than
that of the model published by Ebesu and Fang.
2. arXiv CS. We used the arXiv.org publications in the computer science
domain as our second data set, as proposed by Färber et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for
citation-based tasks. We cut off the citation contexts and citation titles at
lengths of 100 and 30 words, respectively, to achieve a trade-off between
model performance and training time (see Fig. 3). Overall, we used 502,353
pairs of citations and citation contexts. We chose this data set in order
to obtain insights into how well our models perform under different
circumstances than the ones presented by Ebesu and Fang. Thus, our paper
is not only a replicability paper (with a focus on repeating prior experiments
to see when the methods work) but also a reproducibility paper (repeating
experiments in new contexts).
        </p>
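        <p>The cut-off itself is plain truncation; a short sketch (function and variable names are illustrative):</p>
        <preformat>
# Illustrative cut-off of contexts and titles (lengths from the paper).
def truncate(tokens, max_len):
    return tokens[:max_len]

def make_example(context_tokens, title_tokens):
    return (truncate(context_tokens, 100),  # citation context: 100 words
            truncate(title_tokens, 30))     # citation title: 30 words
        </preformat>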
        <p>For model training and evaluation, we split the data sets into 80% training, 10%
validation, and 10% test data sets and set a seed to ensure reproducibility.</p>
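        <p>A minimal sketch of such a seeded 80/10/10 split; the seed value and the helper function are illustrative:</p>
        <preformat>
# Reproducible split: a fixed seed makes the shuffle, and thus the split,
# deterministic across runs.
import random

def split_dataset(examples, seed=42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    train = shuffled[:int(0.8 * n)]
    valid = shuffled[int(0.8 * n):int(0.9 * n)]
    test = shuffled[int(0.9 * n):]
    return train, valid, test
        </preformat>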
      </sec>
      <sec id="sec-3-2">
        <title>Model Re-Implementation</title>
        <p>We rebuilt the NCN from scratch. Our final code is available on GitHub
(https://github.com/X3N4/neural_citation). We used PyTorch to reimplement
the network, which was originally coded in TensorFlow version r0.11. We used
the torchtext package to convert the data set into a suitable format for PyTorch
and to facilitate the preprocessing steps.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Evaluation results per hyperparameter configuration: number of parameters and recall@10 on RefSeer, and number of parameters on arXiv CS.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Configuration</th>
                <th># Param. (RefSeer)</th>
                <th>Recall@10 (RefSeer)</th>
                <th># Param. (arXiv CS)</th>
              </tr>
            </thead>
            <tbody>
              <tr><td colspan="4">Embedding, conv. filters, hidden: 64</td></tr>
              <tr><td>Ebesu &amp; Fang [<xref ref-type="bibr" rid="ref1">1</xref>]</td><td>7,890,916</td><td>0.0929</td><td>7,919,716</td></tr>
              <tr><td>Batch size: 32</td><td>7,890,916</td><td>0.0876</td><td>7,919,716</td></tr>
              <tr><td>Vocab size: 30k</td><td>11,740,916</td><td>0.0945</td><td>11,769,716</td></tr>
              <tr><td>Filters: [4,4,5,6,7]</td><td>8,009,828</td><td>0.0916</td><td>8,038,628</td></tr>
              <tr><td>GRU layers: 1</td><td>7,865,956</td><td>0.0914</td><td>7,894,756</td></tr>
              <tr><td>GRU layers: 3</td><td>7,915,876</td><td>0.0846</td><td>7,944,676</td></tr>
              <tr><td>Combined improvements</td><td>11,884,788</td><td>0.0925</td><td>11,913,588</td></tr>
              <tr><td colspan="4">Embedding, conv. filters, hidden: 128</td></tr>
              <tr><td>Size: 128</td><td>16,138,660</td><td>0.0878</td><td>16,253,604</td></tr>
              <tr><td>Filters: [4,4,5,6,7]</td><td>16,614,052</td><td>0.0849</td><td>16,728,996</td></tr>
              <tr><td>Impr. filters, batch size: 32</td><td>16,614,052</td><td>0.0835</td><td>16,728,996</td></tr>
              <tr><td>Vocab size: 30k</td><td>23,828,660</td><td>0.0911</td><td>23,943,604</td></tr>
              <tr><td>Combined improvements</td><td>24,304,052</td><td>0.0871</td><td>24,518,068</td></tr>
              <tr><td colspan="4">Embedding, conv. filters, hidden: 256</td></tr>
              <tr><td>Batch size: 32</td><td>33,764,644</td><td>0.0877</td><td>34,223,908</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Furthermore, we used the SpaCy library in combination with torchtext to
tokenize the data set. After lemmatizing the data and removing stopwords using
the combined SpaCy and NLTK stopword corpora, we numericalized the data set
using a vocabulary size of 20,000 tokens for citation contexts, citation titles, and
authors. To facilitate propagating batches through the network, we made use of
the bucketing technique that Ebesu and Fang used as well. Like Ebesu and Fang,
we further use the BM25 ranking function in the decoder part of the network to
preselect citation titles for a given citation context.</p>
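        <p>As one possible realization of the BM25 pre-selection step, the sketch below uses the rank_bm25 package; the package choice and the candidate count are our assumptions, not necessarily those of the original code:</p>
        <preformat>
# Pre-select candidate titles for a query context with BM25, so the decoder
# only has to score a short list instead of the whole database.
from rank_bm25 import BM25Okapi

def preselect(query_tokens, titles_tokenized, k=2048):
    bm25 = BM25Okapi(titles_tokenized)      # index all candidate titles
    scores = bm25.get_scores(query_tokens)  # one BM25 score per title
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]                       # indices of the top-k candidates
        </preformat>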
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation Results</title>
        <p>Citation recommendation approaches are difficult to evaluate, as the citation
provided by the original authors cannot be seen as the unequivocal ground
truth. Therefore, we did not consider ranking metrics but solely recall@k as
our evaluation metric. Table 1 shows the evaluation results.</p>
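        <p>Recall@k here is the fraction of test queries for which the originally cited paper appears among the top k recommendations; a minimal sketch:</p>
        <preformat>
# recall@k over ranked recommendation lists.
def recall_at_k(ranked_lists, ground_truth, k=10):
    hits = sum(1 for ranking, truth in zip(ranked_lists, ground_truth)
               if truth in ranking[:k])
    return hits / len(ground_truth)

# Example: one of two queries has its true citation in the top 10.
assert recall_at_k([[1, 2, 3], [4, 5, 6]], [2, 9], k=10) == 0.5
        </preformat>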
        <p>RefSeer. We were unable to run our code on exactly the same data set as
Ebesu and Fang did, and our model for RefSeer does not include citing authors'
information (see Sec. 3.1), leading to a slightly different number of parameters.
Presumably due to the missing citing author information, our results are worse
than the ones reported by Ebesu and Fang (namely, recall@10 of 0.0929 instead
of around 0.29). Overall, all of the recall@10 values were in a similar range.
However, using setups other than the one proposed by Ebesu and Fang seems
promising.
arXiv CS. We evaluated our trained models on the first 20,000 of the
50,235 test examples, which significantly reduced the evaluation running time
and allowed us to perform detailed ablation studies.</p>
        <p>
By applying the hyperparameters used by Ebesu and Fang, our
reimplemented NCN yielded a recall@10 of 0.1637, as compared to 0.29 in
the original paper. Thus, we were unable to replicate the performance of the
original model. We hypothesize that this is a result of our significantly smaller
data set, which comprised only 9.44% of the original paper's training examples
(401,882 examples compared to 4,258,383 in the original paper). In order to tune
performance, we used differing hyperparameter settings and evaluated our model
after every modification. Our changes included the use of different vocabulary
sizes when preprocessing the data set as well as varying batch and embedding
sizes when propagating data through the network. We also altered the number
of filters in the convolutional layer of the TDNN encoder and the number of
GRU layers in the RNN decoder. Table 1 shows that the best configuration
achieved a 9.77% improvement compared to Ebesu and Fang's hyperparameter
values (recall@10 of 0.1797 vs. 0.1637). While the NCN's performance increases
with larger capacity in general, this effect only persists up to a certain size
(we use the term "model size" to refer to the embedding dimension, the number
of convolutional filters, and the GRU dimension; these parameters are set to the
same value in most configurations). In particular, enlarging the embedding size
past 128 dimensions and increasing the vocabulary to more than 20,000 tokens
did not guarantee an improved recall@10 value. We suspect this to be the result
of our small data set, as compared to the model's increased capacity [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
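        <p>Expressed as a configuration grid, the single-change variations from Table 1 look roughly as follows; baseline values not stated explicitly in the paper (e.g., the default batch size, GRU layer count, and filter sizes) are assumptions:</p>
        <preformat>
# Baseline configuration (partly assumed) and the evaluated single-change
# variants from Table 1.
base = dict(embed_dim=64, num_filters=64, gru_dim=64,  # "model size": 64
            vocab_size=20000, gru_layers=2,
            batch_size=64, filter_sizes=[4, 4, 5])     # assumed defaults

variants = [
    {"batch_size": 32},
    {"vocab_size": 30000},
    {"filter_sizes": [4, 4, 5, 6, 7]},
    {"gru_layers": 1},
    {"gru_layers": 3},
]

configs = [dict(base, **v) for v in variants]  # one run per modification
        </preformat>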
        <p>
In addition to experimenting with various architectural changes, we also
tried different batch sizes. Masters and Luschi [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] showed that training with smaller
mini-batches can lead to improved test performance. However, we were unable
to replicate these results for our best configuration. For the larger NCN models,
a decreased batch size instead led to inferior test performance. On the other
hand, our extended filter region sizes for the TDNN context encoder consistently
boosted the model's performance. At the same time, this modification is
computationally cheap, in terms of both additional parameters and wall time,
as the TDNN encoders run in parallel.
        </p>
        <p>We observed during the evaluation runs that models with a lower validation
loss generally achieved a better recall@10 value (given equal batch sizes). While
this intuitively makes sense, as we use the loss function to re-rank the top titles,
we can also find counterexamples.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Discussion</title>
        <p>
We believe that there is still room to improve the NCN, in terms of the model's
hyperparameters and architecture. Our research shows that changing the filter
lengths in the convolutional layer of the network's encoder leads to consistently
better results. Further investigation into their effects on model improvement
may thus be rewarding. The original architecture only used dropout [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to
regularize the network. It may be worthwhile to investigate other regularization
techniques such as batch normalization [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] for convolutional layers or layer
normalization [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for recurrent layers.
        </p>
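        <p>A sketch of the two suggested variants, assuming standard PyTorch modules; the placement and sizes are illustrative, not a tested configuration:</p>
        <preformat>
# Batch normalization after a convolutional encoder layer, and layer
# normalization applied to recurrent decoder outputs.
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv1d(64, 64, kernel_size=4),
    nn.BatchNorm1d(64),     # batch normalization for the convolutional layer
    nn.Tanh(),
)

gru = nn.GRU(64, 64, batch_first=True)
layer_norm = nn.LayerNorm(64)  # applied to the GRU outputs per time step
        </preformat>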
        <p>
We conclude that the NCN leads to reasonable results even when applied to a
smaller data set, like the arXiv CS subset used in our paper. We believe a major
reason for not being able to achieve similar performance results on another data
set (arXiv CS) was the significantly smaller number of training examples [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Thus,
for the future, it might be more important to use large data sets than to further
tune model hyperparameters in order to obtain better recall@10 scores.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
For this paper, we re-implemented the neural citation network [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for citation
recommendation and ran evaluations on both RefSeer, the originally used data
set, and arXiv CS, as the second evaluation data set. We were unable to achieve
the same model performance as Ebesu and Fang did. However, we provided
insights into how the different hyperparameters can affect the NCN's model
performance and how these insights can be used to further improve the model.
In this way, we exemplified how to make citation recommendation approaches
and their evaluations more transparent, facilitating the creation of more effective
models in the future.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ebesu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Neural Citation Network for Context-Aware Citation Recommendation</article-title>
          .
          <source>In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . SIGIR'
          <volume>17</volume>
          (
          <year>2017</year>
          )
          <volume>1093</volume>
          {
          <fpage>1096</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Farber,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Jatowt</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Citation Recommendation: Approaches and Datasets</article-title>
          . CoRR abs/
          <year>2002</year>
          .06961 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kifer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Context-aware Citation Recommendation</article-title>
          .
          <source>In: Proceedings of the 19th International Conference on World Wide Web. WWW '10</source>
          (
          <year>2010</year>
          )
          <volume>421</volume>
          {
          <fpage>430</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A Uni ed Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on Machine Learning. ICML'08</source>
          (
          <year>2008</year>
          )
          <volume>160</volume>
          {
          <fpage>167</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cho</surname>
          </string-name>
          , K.,
          <string-name>
            <surname>van Merrienboer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Gulcehre,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing</source>
          . EMNLP'
          <volume>14</volume>
          (
          <year>2014</year>
          )
          <volume>1724</volume>
          {
          <fpage>1734</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          .
          <source>In: Proceedings of the 3rd International Conference on Learning Representations. ICLR'15</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>RefSeer: A citation recommendation system</article-title>
          .
          <source>In: Proceedings of the 2014 IEEE/ACM Joint Conference on Digital Libraries</source>
          . JCDL'
          <volume>14</volume>
          (
          <year>2014</year>
          )
          <volume>371</volume>
          {
          <fpage>374</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Farber,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Thiemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Jatowt</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>A High-Quality Gold Standard for Citation-based Tasks</article-title>
          .
          <source>In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation</source>
          . LREC'
          <volume>18</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shrivastava</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Revisiting Unreasonable E ectiveness of Data in Deep Learning Era</article-title>
          .
          <source>In: Proceedings of the 2017 IEEE International Conference on Computer Vision</source>
          . ICCV'
          <volume>17</volume>
          (
          <year>2017</year>
          )
          <volume>843</volume>
          {
          <fpage>852</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Masters</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luschi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Revisiting Small Batch Training for Deep Neural Networks</article-title>
          . CoRR abs/
          <year>1804</year>
          .07612 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: A Simple Way to Prevent Neural Networks from Over tting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ) (
          <year>2014</year>
          )
          <year>1929</year>
          {
          <fpage>1958</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Io e, S.,
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Batch Normalization:
          <article-title>Accelerating Deep Network Training by Reducing Internal Covariate Shift</article-title>
          .
          <source>In: Proceedings of the 32nd International Conference on Machine Learning. ICML'15</source>
          (
          <year>2015</year>
          )
          <volume>448</volume>
          {
          <fpage>456</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.: Layer</given-names>
          </string-name>
          <string-name>
            <surname>Normalization</surname>
          </string-name>
          .
          <source>CoRR abs/1607</source>
          .06450 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>