<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Product Categorization with LSTMs and Balanced Pooling Views</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Skinner</string-name>
          <email>research@mcskinner.com</email>
          <aff>Seattle, Washington</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Recurrent Neural Networks (RNNs), and LSTMs in particular, have proven competitive in a wide variety of language modeling and classification tasks. We explore the application of such models to the Rakuten Data Challenge, a large-scale product classification task. We show that a straightforward network architecture and ensembling strategy can achieve state-of-the-art results when trained effectively. We also demonstrate the positive impact of tightening the connections between recurrent and output layers through the use of pooling layers, and introduce the Balanced Pooling View architecture to take advantage of this. Our final solution, produced with a bidirectional ensemble of 6 such models, achieved a weighted F1 score of 0.8513 and won the challenge by a wide margin.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The Rakuten Data Challenge. The goal of the challenge is to
classify each product into one of 3008 categories given the title of
the product. As an additional challenge, the categories are highly
unbalanced and intrinsically noisy as sellers take responsibility for
categorizing their own products.</p>
      <p>The dataset contains 1 million categorized products, of which 80%
were available for training and the remaining 20% were reserved
for a test set. The test set categories were not available during
the challenge, except through a provisional leaderboard to score
solutions on a subset of them.</p>
      <p>Solutions were scored using the category-weighted average of F1
scores, with the leaderboard also showing weighted precision and
recall. That is to say, if there are some labeled documents D, with a
set of documents Dc per category c in C, and a corresponding set of
documents D̂c per predicted category, then the per-category metrics
and the weighted F1-score are:1</p>
      <p>recallc = |Dc ∩ D̂c| / |Dc|</p>
      <p>precisionc = |Dc ∩ D̂c| / |D̂c|</p>
      <p>F1c = 2 · (precisionc · recallc) / (precisionc + recallc)</p>
      <p>weighted-F1 = (1/|D|) · Σc ∈C |Dc| · F1c (1)</p>
      <p>Note that the weighted recall reduces to overall accuracy:
(1/|D|) · Σc ∈C |Dc| · recallc = (1/|D|) · Σc ∈C |Dc ∩ D̂c| = |D ∩ D̂| / |D| (2)</p>
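As a concrete check of the metric, the scoring can be sketched in a few lines of Python; the function name and list-of-labels encoding are our own, not from the challenge tooling:

```python
from collections import Counter

def weighted_f1(true_labels, pred_labels):
    """Weighted F1 as defined above: per-category F1 scores,
    weighted by each category's share of the labeled documents."""
    d = len(true_labels)
    true_counts = Counter(true_labels)   # |Dc| per category
    pred_counts = Counter(pred_labels)   # |D̂c| per predicted category
    tp = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)
    total = 0.0
    for c, n_c in true_counts.items():
        rec = tp[c] / n_c
        prec = tp[c] / pred_counts[c] if pred_counts[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += n_c * f1
    return total / d
```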
      <p>
        Our contributions. Our contributions are as follows: 1) We
achieve state-of-the-art results for this dataset, applying the
methodology of [
        <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
        ] by focusing on regularizing and training
straightforward LSTM architectures effectively. 2) We extend the above
methodology to the problem of sizing a configurable model
architecture and evaluating potential design improvements. We note
that the resulting model architectures are robust to changes to the
training schedule, and vice versa. 3) We introduce the Balanced
Pooling View architecture, an extension of concat pooling from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
to extract signal from variable-length recurrent outputs, and show
that this improves performance compared to existing architectures.
      </p>
      <p>Overall challenge results. Our approach won the challenge by
a wide margin. Our final submission achieved an F1 score of 0.8513
on the test set, maintaining the recall of the next best solution
while improving precision from 0.8425 to 0.8697, decreasing the
false positive rate by 17.5% from 15.75% to 13%.2 Our position at or
near the top of the leaderboard was held for much of the contest, an
advantage of the simple but methodical approach to model design,
training, and iteration. In addition, our models do not take long to
train. It takes less than 24 hours on a commodity GPU to train our
final 6-network ensemble.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Several recent papers have shown the broad applicability of
straightforward LSTM architectures. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] were able to use transfer learning
on top of the same pre-trained LSTM architecture to achieve
state-of-the-art results on several text classification tasks. That work
builds on the work of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who used a single AWD-LSTM
architecture to achieve state-of-the-art results on both character- and
word-level language modeling benchmarks.
1The numerator |Dc ∩ D̂c| is the number of true positives, i.e. correctly classified
documents for category c. The sum of this over all categories C is the number of
correctly classified examples in the entire dataset.
2While we can’t technically avoid false positives, we can effectively "not guess" by
picking a very rare category where they will not have a noteworthy impact on the
weighted scores.
      </p>
      <p>
        There is also increasing interest in the topic of robust training
schedules, often with aggressively high learning rates. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
introduced the cosine annealing schedule, which follows the descending
shape of a cosine wave. Training starts with a high learning rate,
then drops to lower learning rates, before plateauing to finish the
cycle. Results obtained after one or more iterations were found to
have better generalization error than those from other learning
rate schedules. A symmetrical low-high-low triangular learning
rate schedule from [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], dubbed Cyclical Learning Rates (CLR) was
shown to have similar performance and generalization properties.
It was promptly refined into the 1cycle policy in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which added
an additional annealing phase and suggested a single policy cycle
rather than several as in [
        <xref ref-type="bibr" rid="ref14 ref8">8, 14</xref>
        ]. The results from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] were obtained
with a Slanted Triangular Learning Rate (STLR), an asymmetrical
variant of CLR in which learning rate is increased over less time
and then decreased for more.
      </p>
    </sec>
    <sec id="sec-3">
      <title>DATA PREPARATION</title>
    </sec>
    <sec id="sec-4">
      <title>Exploratory Analysis</title>
      <p>
        With 800k examples available for training, this constitutes a fairly
large dataset. With 3008 categories to pick from, it also constitutes
a large-scale classification problem. For comparison, the text
classification datasets in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] ranged in size from 5.5k to 560k examples,
but with only 2 to 14 categories. The categories are also highly
unbalanced, although we see in Figure 1 that the tail is not quite as
long as if it were Zipfian.
      </p>
      <p>Due to time constraints, the product titles were not examined
closely. A brief overview is given in Figure 2, but only a few things
were relevant to the modeling: 1) The maximum sequence length
of 274 still allows for reasonably large batches to fit into GPU
memory. 2) The wide distribution of lengths implies some potential
for inefficiency if long titles are placed into batches next to short
titles, which results in a large number of meaningless pad tokens
to process.</p>
    </sec>
    <sec id="sec-5">
      <title>Tokenization</title>
      <p>Due to the high incidence of symbols and alphanumeric product
codes, we opted for character-level tokenization. Word-level
tokenizers are not well suited for this task, and poor tokenization leads
to an unnecessarily large and sparse parameter space to optimize
over. To further alleviate sparsity concerns, any characters that
occurred fewer than 10 times were replaced with an "unknown"
token. Adding in typical tokens for padding, beginning, and end of
string, the final vocabulary size was 128 tokens.</p>
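A minimal sketch of such a character tokenizer; the helper names, special-token strings, and exact frequency ordering of the vocabulary are our own assumptions:

```python
from collections import Counter

def build_char_vocab(titles, min_count=10):
    """Build a character vocabulary: special tokens first, then characters
    by descending frequency, with rare characters mapped to <unk>."""
    counts = Counter(ch for title in titles for ch in title)
    vocab = ["<pad>", "<bos>", "<eos>", "<unk>"]
    vocab += [ch for ch, n in counts.most_common() if n >= min_count]
    index = {ch: i for i, ch in enumerate(vocab)}

    def encode(title):
        # Wrap each title in beginning/end-of-string tokens.
        unk = index["<unk>"]
        return [index["<bos>"]] + [index.get(ch, unk) for ch in title] + [index["<eos>"]]

    return vocab, encode
```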
      <p>Categories were treated as an opaque classification label. No
attempts were made to leverage the existing taxonomy rather than
letting the networks learn their own. Representation learning is
quite powerful, and there is some risk of adversely biasing the
results by imposing a hierarchy in advance.</p>
      <p>
        The int-encoding of both categories and characters was
frequency ordered, such that tokens with higher frequencies had lower
indices. This is a common practice, and allows for simple
cutoff-based implementations of techniques like adaptive softmax from
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Training/Validation Split</title>
      <p>We set aside 25% of the training data as a validation set, using
stratified sampling to ensure good representation across categories.
We chose this set to have 200k examples, like the test set, in order to
ensure that all model changes could be accurately evaluated during
the challenge. This reduced reliance on the public leaderboard and
enabled a methodical process for new ideas from introduction to
deployment.</p>
      <p>At the end of the challenge, with the model architecture finalized
and training parameters dialed in, the validation data was used to
train one last batch of models for the final submission. The 33%
increase in training data improved the F1 score on the leaderboard
by nearly 0.01, which was enough to change the final standings
from a tossup to a decisive win.</p>
    </sec>
    <sec id="sec-7">
      <title>MODEL ARCHITECTURE</title>
    </sec>
    <sec id="sec-8">
      <title>Baseline Model</title>
      <p>
        Following the lead of [
        <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
        ], the baseline model uses a
straightforward LSTM architecture. Character tokens are looked up in an
embedding of size ne , which feeds into a multi-layer LSTM of width
nw and depth nd , which feeds into a linear layer of width nc = |C |
to produce a score for each category. In addition, there are dropout
layers after the embedding, between the recurrent LSTM layers, and
after the LSTM output. Those have respective dropout probabilities
de , dr , and do .
      </p>
      <p>The model sizing process began with a deliberately small
architecture. Following a rule of thumb, ne = min(|V |/2, 50) = 50 for
our 128-wide vocabulary. The values nw = 64 and nd = 2 were
chosen to be deliberately small, and because the latter is the PyTorch
default. Dropouts were chosen based on an existing example, with
de = 0.15, dr = 0.25, and do = 0.35.</p>
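A minimal PyTorch sketch of this baseline, using the paper's hyperparameter names; the exact wiring, e.g. decoding the final time step's output, is our reading of the description rather than the authors' code:

```python
import torch
import torch.nn as nn

class BaselineClassifier(nn.Module):
    """Embedding -> multi-layer LSTM -> linear decoder, with dropout after
    the embedding, between LSTM layers, and after the LSTM output.
    Defaults use the paper's final width nw = 512."""
    def __init__(self, n_vocab=128, n_e=50, n_w=512, n_d=2, n_c=3008,
                 d_e=0.15, d_r=0.25, d_o=0.35):
        super().__init__()
        self.embed = nn.Embedding(n_vocab, n_e)
        self.embed_drop = nn.Dropout(d_e)
        self.lstm = nn.LSTM(n_e, n_w, num_layers=n_d,
                            dropout=d_r, batch_first=True)
        self.out_drop = nn.Dropout(d_o)
        self.decode = nn.Linear(n_w, n_c)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of character indices
        x = self.embed_drop(self.embed(tokens))
        outputs, _ = self.lstm(x)         # (batch, seq_len, n_w)
        last = outputs[:, -1, :]          # final LSTM output per sequence
        return self.decode(self.out_drop(last))  # (batch, n_c) scores

scores = BaselineClassifier()(torch.randint(0, 128, (4, 30)))
print(tuple(scores.shape))  # (4, 3008)
```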
      <p>This initial model was quickly tuned and tested as per the
methodology in Section 5.2. After initial results were shown to quickly plateau, nw
was doubled and the model rerun until this phenomenon stopped
at a width of 512.
</p>
    </sec>
    <sec id="sec-9">
      <title>Concat Pooling</title>
      <p>
        The idea of concat pooling, as proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], is to augment the last
LSTM output with average- and max-pooled aggregates over the
entire sequence. The idea is that the network can become excited
about certain categories at any point during the sequence, and the
average and maximum are able to capture this information more
easily than the final output. In our experiments, this was found to
be true, increasing validation precision, recall, and F-score by 0.01,
a substantial increment on the leaderboard. We note a few likely
advantages from the pooling layers:
(1) Short-circuiting the outputs from all time steps directly to
the loss function means direct feedback for the earlier time
steps, leading to faster and more reliable convergence.
(2) The same direct connection introduces a sort of data
augmentation for the RNN output layer. Instead of only solving
the translation problem for the last hidden state, it benefits
from successfully understanding the intermediate states as
well.
(3) The average pooling is similar in spirit to the ResNet from
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in which subsequent computations only need to make
additive adjustments, i.e. error corrections, on the output from
the previous layer. It is possible that this idea of sharing
responsibility for the overall prediction has nice generalization
properties that are shared across domains.
      </p>
    </sec>
    <sec id="sec-10">
      <title>Balanced Pooling View</title>
      <p>Since average- and max-pooling aggregations were able to help
the output layer extract more information from the recurrent layer,
it stands to reason that additional aggregates might also help.
Although not commonly mentioned, min-pooling is trivial to
implement as minpool(x) = −maxpool(−x), and therefore easy to test. In
addition, we believe that the use of both min- and max-pooling
might encourage the network to use the full range of possible
activations, and therefore become more expressive than with
max-pooling alone, which prefers large activations. We call this a
Balanced Pooling View (BPV) architecture, and find that it increased
the F-score by 0.006 for a single network trained for 40 epochs, as
compared to concat pooling without the min-pool addition. This
architecture and its precursors are outlined in Figure 6.3</p>
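The pooled view itself can be sketched as below, assuming batch-first LSTM outputs; the function name is ours, and the independent dropout that the paper applies across the views is omitted here:

```python
import torch

def balanced_pooling_view(outputs):
    """Balanced Pooling View over LSTM outputs of shape
    (batch, seq_len, width): concatenate the final output with
    average-, max-, and min-pooled aggregates over all time steps."""
    last = outputs[:, -1, :]
    avg = outputs.mean(dim=1)
    mx = outputs.max(dim=1).values
    mn = -((-outputs).max(dim=1).values)  # minpool(x) = -maxpool(-x)
    return torch.cat([last, avg, mx, mn], dim=1)  # (batch, 4 * width)
```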
      <p>It is worth noting that the widening of the pooling options also
expands the size of the final output weight matrix, and therefore
the capacity of the model. In addition, independent dropout across
the views means that most of the 512-wide recurrent outputs will
be at least partially observable most of the time.</p>
      <p>To understand the impact of pooling alone, we trained a separate
network with the same capacity as the concat pooled networks but
multiple copies of the final RNN output rather than the different
pooled views. This network did not surpass the performance of the
non-pooled baseline model,4 so we conclude that the performance
improvement comes from increasing the ability of the network
to extract signal from the recurrent outputs. See Figure 6 for a
visualization of the final BPV architecture.</p>
    </sec>
    <sec id="sec-11">
      <title>Ensembling</title>
      <p>
        Previous work has shown the benefits of bidirectional training
on sequence tagging ([
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) and text classification ([
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). Reversing
the input sequence acts as a sort of data augmentation, and
training a second independent model adds the known advantages of
ensembling as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Bidirectional training is more powerful than a simple 2-network
ensemble. Various ensembles of 2 networks in the same direction,
forward or backward, were never as good as bidirectional ensembles.
The effect of directionality is explored a bit more in Section 5.7.5.
This technique improved our model performance as well, increasing
the F1 score by 0.013 with pooling and 0.011 with BPV, and is
another significant increment on the leaderboard.</p>
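Conceptually, a bidirectional ensemble can be sketched as follows; score averaging is our assumption, as the paper does not spell out how member outputs are combined:

```python
def ensemble_scores(models, tokens):
    """Average category scores from forward models on the title and
    backward models on the reversed title. `models` pairs each scoring
    function with a flag saying whether it was trained on reversed input."""
    total = None
    for score_fn, reverse in models:
        inp = tokens[::-1] if reverse else tokens
        scores = score_fn(inp)
        total = scores if total is None else [a + b for a, b in zip(total, scores)]
    return [s / len(models) for s in total]
```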
      <p>
        As many additional models and training parameters were tested,
it was found that it was not uncommon for solutions to converge
to the current best single model result. Despite the lack of single
model gains, building an ensemble from several models was found
to improve overall performance, continuing to fall in line with
results from [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. An 8-network ensemble, 4 each forward and
backward, increased the F-score by another increment of nearly
0.01 on the validation set.
3This has been placed on the last page of this paper to avoid disruption to the layout.
4To make things worse, it also had trouble converging.
      </p>
    </sec>
    <sec id="sec-12">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-13">
      <title>Implementation Details</title>
      <p>
        All models were implemented in PyTorch ([
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) and trained on a
single Paperspace P5000 instance with a 16GB NVIDIA Quadro P5000
with 2560 CUDA cores. We used the cross-entropy loss function,
which implies a softmax over the final activations.
      </p>
    </sec>
    <sec id="sec-14">
      <title>Training Methodology</title>
      <p>
        Following the advice in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the batch size was increased until
GPU memory struggled to keep up, which it did at a batch size
of 256. Due to large variances in the text lengths leading to an
unnecessarily large and inefficient number of pad tokens per batch,
it was desirable to organize the batches to have similar lengths. In
order to avoid the same batch ordering every time, the sorting is
done in chunks of several batches at a time. Sorting 50 batches of
examples at a time cut training times by approximately half, and
it also appeared to have a regularizing effect. This is perhaps due
to an observed correlation between title length and categorization
difficulty, in which shorter titles are more difficult. Experiments
with a fully randomized training set not only lost the 2x speed gain,
but also resulted in poorer performance. Changing the width of the
sorting window to 10 or 250 batches did not appear to impact the
results.
      </p>
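The chunked sorting described above can be sketched as follows; function names and the shuffle placement are our assumptions, not the paper's code:

```python
import random

def chunked_length_batches(examples, batch_size=256, window=50):
    """Length-sorted batching with a shuffled chunk window: sorting
    `window` batches' worth of examples at a time keeps lengths similar
    within each batch (fewer pad tokens) while still varying batch
    composition across epochs."""
    examples = examples[:]           # titles as token lists
    random.shuffle(examples)         # new chunk contents each epoch
    chunk = batch_size * window
    batches = []
    for start in range(0, len(examples), chunk):
        block = sorted(examples[start:start + chunk], key=len)
        batches.extend(block[i:i + batch_size]
                       for i in range(0, len(block), batch_size))
    random.shuffle(batches)          # avoid a fixed short-to-long batch order
    return batches
```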
      <p>
        Models were optimized using SGD with momentum, using a
cross-entropy loss function. Adam was attempted a few times, but
it required learning rates that were two orders of magnitude smaller,
and still exhibited difficulty in converging. This is in line with
research from [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] on the weakness of adaptive gradient methods. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
provides further reason to believe that SGD leads to faster training
and better generalization error.
      </p>
      <p>
        In general we rejected training or architectural changes that
substantially reduced the learning rate. This follows the insight from
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that high learning rates have a regularizing effect, improving
generalization error while also reducing training time. Our rule of
thumb is a sort of corollary: if high learning rates are desirable, then
it is undesirable when modeling variants are surprisingly intolerant
to those same learning rates.
      </p>
    </sec>
    <sec id="sec-15">
      <title>Hyperparameter Tuning</title>
      <p>
        When training each new model architecture, we followed the
disciplined approach from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. First the peak learning rate was chosen
using a learning rate sweep as described in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The sweep starts
with a low learning rate, at which nothing much happens. As the
learning rate continues to increase, the training and validation loss
eventually drop, level out, and then diverge. An example of this is
shown in Figure 3.
      </p>
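A learning rate sweep of this kind can be sketched as below, with `train_step` standing in for one mini-batch update; it is a hypothetical callback for illustration, not an API from the paper:

```python
def lr_sweep(train_step, lr_min=1e-4, lr_max=10.0, steps=100):
    """Learning rate range sweep: the rate grows exponentially from
    lr_min to lr_max, one mini-batch per step, while the loss is
    recorded. `train_step(lr)` runs one batch and returns its loss;
    the (lr, loss) history is then plotted to find the divergence point."""
    ratio = (lr_max / lr_min) ** (1 / (steps - 1))
    history, lr = [], lr_min
    for _ in range(steps):
        history.append((lr, train_step(lr)))
        lr *= ratio
    return history
```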
      <p>Good peak rates were found to exist just short of the basin in
which loss has leveled out. Our methodology was to choose the first
learning rates to be on the high end of plausible, sometimes in the
basin, and then decrease the estimate until a short training run of 5
epochs trains without exhibiting signs of divergence or oscillation
at the peak rate. Peak learning rates found this way were near 1,
and sometimes even higher. For the highest performing models,
the learning rates were further tuned by hand within a small range
to squeeze out remaining performance gains. The majority of our
models, including those used for our final submission, were trained
with a learning rate of 0.8.</p>
      <p>
        Momentum was also chosen according to the approach in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
running a learning rate sweep for several options and comparing
the loss curves. Good momentums have nice convergence
properties like a quick descent, a low minimum loss, and a longer basin
before divergence. The best momentums for this challenge were
around 0.95, which is fairly high given the observation in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] that
momentum is not usually preferred for language modeling. While
our experiments with low momentum SGD were able to achieve
very high learning rates, as high as 10, the models were not able to
converge to low losses.
      </p>
      <p>Weight decay was chosen similarly, with early experiments
quickly showing that weight decay above 0 led to difficulty with
convergence. Due to time considerations, this parameter was left
at 0 for the remainder of the SGD runs rather than trying to tune it
further.</p>
    </sec>
    <sec id="sec-16">
      <title>Training Schedule</title>
      <p>
        We used the 1cycle learning rate policy from [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Learning rates
follow a low-high-low triangular schedule as in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], followed by a
few iterations at increasingly low learning rates. The typical
low rate used was 10-20 times smaller than the peak, and the final
annealing rate is a few orders of magnitude smaller still. See Figure 4 for
an example, with the parameters selected using the methodology in
Section 5.3. A few variations like the Slanted Triangular Learning
Rate (STLR) from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] were tested and not found to perform any
better or worse than the 1cycle policy. The results were found
similarly insensitive to the ratio of the peak learning rate to the
lower starting value.
      </p>
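The shape of such a schedule can be sketched as a function of the training step; the specific ratios below are illustrative stand-ins, not the paper's exact settings:

```python
def one_cycle_lr(step, total_steps, peak=0.8, div=15.0, final_div=1000.0,
                 anneal_frac=0.1):
    """1cycle-style schedule: linear warmup from a low rate to `peak`,
    linear descent back to the low rate, then a short final anneal
    toward a rate a few orders of magnitude smaller."""
    low = peak / div                 # typical low rate, 10-20x below peak
    floor = peak / final_div         # final annealing target
    anneal_start = int(total_steps * (1 - anneal_frac))
    half = anneal_start // 2
    if step < half:                  # warmup: low -> peak
        return low + (peak - low) * step / half
    if step < anneal_start:          # descent: peak -> low
        return peak - (peak - low) * (step - half) / (anneal_start - half)
    # final anneal: low -> floor
    return low - (low - floor) * (step - anneal_start) / (total_steps - anneal_start)
```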
      <p>In the early phases of the competition, networks were trained
for 20 epochs using a 1cycle policy with a peak learning rate
of 1,5 with the remaining parameters as specified in Figure 4. In
the later phases of the competition networks were trained for 40
epochs, with the peak learning rate reduced to 0.8. This is exactly
the parameterization shown in Figure 4.</p>
      <p>When testing new model architectures or training parameters,
we used short training runs of 5 epochs to filter for clear
improvements. This was enough to distinguish ideas, while keeping the
training time within 30 minutes. Each doubling of the schedule
length to 10, 20, and then 40 epochs was shown to improve results,
and so the full training schedules were again chosen based on
timing considerations. In the early phases of the challenge we used a
schedule of 20 epochs so that a competitive 2-network ensemble
could be trained in under 4 hours, enough to test a few quality
ideas per day. The final schedule of 40 epochs was chosen so our
tuned model, a bidirectional ensemble of 6 BPV networks, could be
trained in under 24 hours. There were no apparent gains from a 60
epoch training cycle, though we also did not take time to carefully
re-tune the hyperparameters.
5For the most common network architecture. When dropout was increased, learning
rate was decreased to balance the increased regularization from higher dropout.</p>
    </sec>
    <sec id="sec-17">
      <title>Model Variants</title>
      <p>This section outlines the most interesting of the various extensions
that we tried to add on to the basic model. None led to a performance
improvement, but several maintained the same performance.</p>
      <p>5.5.1 Larger Models. While decreasing the LSTM width by a
modest amount did decrease performance, no increases to the model
width or depth were able to increase model performance. Adding
additional layers before or after the LSTM led to overfitting or
otherwise poor convergence. Increasing LSTM depth nd to 3, increasing
the LSTM width nw by 50%, and increasing the embedding size ne
to 64 all led to the same performance as the baseline model.</p>
      <p>
        5.5.2 Regularization Tuning. We tested a variety of changes to
the dropout magnitude and mix, and none were found to improve
F1 score. Lowering the dropout in any of the layers was prone
to problems with overfitting. In particular, do = 0.1 is a common
choice, but did not add enough regularization and led to
overfitting in our experiments. This includes experiments in which we
increased to de = 0.4 and dr = 0.4 to match the tuning from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Increasing the dropout by a consistent factor across all layers
resulted in better cross-entropy loss, but did nothing to improve
discriminative accuracy. We also tested the introduction of a batch
normalization layer right after the LSTM but before dropout and
decoding. The resulting networks were sometimes competitive, but
prone to diverging during the fitting process.</p>
      <p>
        5.5.3 RNN Variants. Swapping out the LSTM for simpler models
like GRU led to degraded performance. The QRNN model from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
was found to be much faster to train as advertised, but also suffered
from slightly degraded performance compared to LSTM.
      </p>
      <p>
        5.5.4 Adaptive Softmax. Given the unbalanced nature of the
categories, it seems worthwhile to focus network bandwidth on
high-frequency categories rather than the rare ones. Adaptive softmax,
as proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], does this by partitioning the output categories
and then using increasingly small bottleneck layers in front of each
partition to limit the number of weights that need to be learned. No
variation of this technique managed to improve upon the
concat-pooled network, which apparently was not having trouble with
this number of categories as compared to the O(million) weights
to which adaptive softmax intends to scale.
      </p>
      <p>
        5.5.5 Training Variants. Noting the positive results that [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] had
with an auxiliary loss function, we experimented with accuracy as a
weighted component of the loss function. Even at high weightings
this did not appear to have a substantial effect on performance.
      </p>
    </sec>
    <sec id="sec-18">
      <title>Prediction Generation</title>
      <p>The model is trained according to a cross-entropy loss function,
but the challenge is scored according to a weighted F1-score. A
simple highest-probability-wins strategy is not optimal in this case,
and it is better to select a discriminative cutoff for each category to
maximize an estimate of the F1-score. This can be done directly from
the probabilities output by the network. Assuming well-calibrated
probability estimates...</p>
      <p>...the estimated number of true examples n_true for each category
is the sum of probabilities for that category.</p>
      <p>...the estimated number of true positives n_tp given a certain
discriminative cutoff is the sum of probabilities that make the cut.</p>
      <p>...the number of guessed positives n_guess given a cutoff is the
number of probabilities that make the cut.</p>
      <p>Given those estimates, it is possible to estimate the precision,
recall, and F1-score for each potential cutoff. Then the cutoff is
chosen as the one for which the estimated F1-score is maximized. It is
straightforward to compute this efficiently by sorting the
probabilities and using cumulative computations. See Algorithm 1 for
details. This strategy tends to increase precision at the expense of
recall, boosting the validation precision of bidirectional BPVs from
0.828 to 0.854 and the F1-score from 0.827 to 0.836, but at the cost of
dropping recall from 0.833 to 0.826.</p>
      <p>5.6.1 Probability Calibration. Since the final probability
estimates are not guaranteed to be well calibrated, we apply a simple
piecewise linear model to ensure that the probability estimates are
suited for the F1 tuning. To do so, we first computed the actual
accuracy for each 1% increment of predicted probability. A prediction
of 20% should be correct 20% of the time, and if it is only correct
10% of the time then the raw estimate needs to be halved to get a
true probability.</p>
      <p>Figure 5 shows these raw calibration factors, which are a bit
noisy. Rather than using these directly, we applied a simple moving
average to visualize the broader trend. Since this trend appeared
piecewise linear, we applied such a fit to the data. Figure 5 shows
these smoothed probability factors as well. Applying these
adjustments to true up the raw network probabilities had a small but
positive effect on the validation performance, indicating an
improved fit. However, to be conservative, these estimates were not
applied to our final submission.</p>
      <sec id="sec-18-1">
        <title>ALGORITHM 1: F1-Score Optimization</title>
        <p>Data: P, probability estimates for each observation, for a
single category.</p>
        <p>Result: p̂_cut, an F1-optimizing cutoff below which values in P
should be rejected as predictions.</p>
        <p>n_true ← sum(P)
sort P in descending order
n_tp,0 ← 0; p̂_cut ← 0; score_best ← 0
for i ← 1 to |P| do
  n_guess,i ← i
  n_tp,i ← n_tp,i−1 + p_i
  rec_i ← n_tp,i / n_true
  prec_i ← n_tp,i / n_guess,i
  score_i ← 2 · (rec_i · prec_i) / (rec_i + prec_i)
  if score_i &gt; score_best then
    p̂_cut ← p_i
    score_best ← score_i
  end
end</p>
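A direct Python rendering of Algorithm 1; the function name and plain-list input are our choice:

```python
def f1_optimal_cutoff(probs):
    """Find the probability cutoff that maximizes the estimated F1 score
    for one category. `probs` holds the predicted probabilities of this
    category across all observations."""
    n_true = sum(probs)              # estimated number of true examples
    p_sorted = sorted(probs, reverse=True)
    n_tp, p_cut, score_best = 0.0, 0.0, 0.0
    for i, p in enumerate(p_sorted, start=1):
        n_tp += p                    # estimated true positives at this cutoff
        rec = n_tp / n_true
        prec = n_tp / i              # i predictions made so far
        score = 2 * rec * prec / (rec + prec)
        if score > score_best:
            p_cut, score_best = p, score
    return p_cut
```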
        <p>5.7 Results</p>
        <p>5.7.1 Model Architecture. Table 1 shows the validation set
performance for each iterative improvement of the network
architecture, using the preliminary training schedule of 20 epochs and a
simple best-wins strategy for selecting the prediction. We leave
accuracy out of the tables, since it is the same as recall as shown
in Equation 2. The introduction of concat pooling increases
performance by 0.01 or so on all evaluation metrics, and BPV adds an
additional 0.003 on top of that. Using pairs of bidirectional networks
led to another boost of 0.01, but that also requires twice as long to
train. Section 5.7.5 provides a more apples-to-apples comparison.</p>
        <p>5.7.2 Training Schedule. In Table 2 we show the benefit of
increasing the training schedule from 20 to 40 epochs. The concat
pooled networks benefit from a modest scoring boost of 0.001 to
0.002 for a single network and 0.003 to 0.004 for a bidirectional
pair. The BPV architecture took better advantage of the increased
time, with scores growing 0.004 to 0.005 for a single network and
0.006 to 0.007 for the bidirectional network. In taking advantage
of the extra epochs, BPV was able to grow the performance gap to
alternatives and highlight the increased modeling capacity of the
new architecture.</p>
        <p>5.7.3 F1 Optimization. Table 3 shows the results after the F1
optimization procedure. The use of F1 optimization has a clear
positive impact on both precision and F-score. Precision is increased
by nearly 0.03 for all networks, while F-scores are all higher by 0.01
or so. This comes at a smaller cost to recall, which falls off by
nearly 0.01 for some model architectures. This approximately 3-to-1
tradeoff of recall for precision is a good one to make in terms of
F1 score, and a result of choosing to break any near-ties away from
large and precise categories.</p>
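        <p>The F1 optimization procedure itself is defined earlier in the paper; the general mechanism of breaking near-ties away from certain categories can be sketched as per-class multipliers applied to the probabilities before the argmax. The weights and data below are purely illustrative assumptions:</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(4), size=8)   # stand-in softmax outputs

# Hypothetical per-class weights: values below 1 push near-ties away
# from that class, trading a little recall for precision elsewhere.
weights = np.array([0.9, 1.0, 1.0, 1.1])

plain = probs.argmax(axis=1)
adjusted = (probs * weights).argmax(axis=1)

# Only near-tied examples change class; confident predictions survive.
n_changed = int((plain != adjusted).sum())
```
        </preformat>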
        <p>5.7.4 Ensembling. In Table 4 we show the performance
improvement for increasing ensemble sizes, using later stage networks
trained for 40 epochs. The 1-pair ensemble here is the same as
the bidirectional BPV results in Table 2. There is a big step when
increasing from 1 to 2 pairs, but results quickly level off after an
ensemble of even 4 pairs. Our final submission was generated by 3
pairs of networks trained with validation data included. A solution
with 4 pairs decreased the test scores slightly, and so we reverted
to a previous submission.</p>
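        <p>A common way to combine ensemble members, following the analysis in [11], is to average their class probabilities before taking the argmax. A minimal sketch under that assumption, with illustrative shapes:</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_samples, n_classes = 6, 10, 5   # e.g. 3 bidirectional pairs
member_probs = rng.dirichlet(np.ones(n_classes),
                             size=(n_models, n_samples))

# Average each member's softmax output, then pick the best class.
ensemble_probs = member_probs.mean(axis=0)
y_pred = ensemble_probs.argmax(axis=1)
```
        </preformat>
        <p>Averaging preserves the probability simplex, so the combined scores still sum to one per sample and remain usable for tie-break adjustments.</p>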
        <p>5.7.5 Bidirectionality. Finally we examine the impact of
directionality on validation set results. Table 5 shows the best-wins F1
scores for a few ensembles, while Table 6 shows the optimized
F1 scores. Interestingly, we found that reverse networks perform
slightly better than forward networks. Balanced bidirectional
ensembles outperformed equivalently sized one-way ensembles in
either direction.</p>
        <p>Compared to the results in Table 2, it is clear that the bulk of the
improvement comes from ensembling. With only forward networks
a single model reaches a best-wins F1 of 0.814, adding a second
network improves to 0.825, and doubling the ensemble to 4 networks
improves again to 0.830. By comparison, an F1 increase of 0.002 for
reversing every other network is more modest. It is nonetheless a
noticeable improvement and does not add to the training time, so
it is worth including as a part of the ensembling strategy.</p>
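        <p>The zero-cost nature of the reversal is easy to see: a "reverse" member differs only in the order it reads its input tokens, so it trains in the same time as a forward member. A minimal sketch with a hypothetical tokenized title:</p>
        <preformat>
```python
def reverse_tokens(token_ids):
    """Reverse a tokenized title for a backward-reading network."""
    return list(reversed(token_ids))

title = [12, 7, 99, 3]   # hypothetical character token ids

# A balanced bidirectional pair trains one network on `title` and one
# on `reverse_tokens(title)`, then combines their predictions.
assert reverse_tokens(title) == [3, 99, 7, 12]
```
        </preformat>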
      </sec>
    </sec>
    <sec id="sec-19">
      <title>FUTURE WORK</title>
      <p>Recalling the model refinements in section 5.5, we note that the
training process achieved remarkably similar results across a wide
variety of mutations to the model architecture. This was true of
both model changes as well as modifications to the learning rate
schedule, and is an encouraging result since it implies saturation of
the current model architecture and training methodology. It also
implies that further improvement is more likely to come from some
enhancement to either the input representation, or from additional
improvements to the extraction of signal from the recurrent layer
as in sections 4.2 and 4.3.</p>
      <p>
        The input representation at this point is quite simple and has not
yet been well explored. The cutoff of 10 occurrences for character
inclusion has not been challenged, and it is possible that more
tokens could increase the model's power. If that conjecture is true,
it might also be worthwhile to try a hybrid word-level tokenization
in which rare words are split into characters. Another alternative is
to try the hierarchical model proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], whose authors were working
on a short-phrase language identification task that bears some
similarity to this one. Their idea of avoiding sparsity problems by
computing word embeddings as a convolution over character
embeddings seems like it might apply well to this problem.
      </p>
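        <p>The hybrid tokenization idea can be sketched in a few lines: frequent words stay whole, rare words fall back to character tokens. The cutoff, vocabulary, and example title below are illustrative assumptions:</p>
        <preformat>
```python
from collections import Counter

def hybrid_tokenize(text, vocab, min_count=10):
    """Emit frequent words as single tokens, rare words as characters."""
    tokens = []
    for word in text.split():
        if vocab.get(word, 0) >= min_count:
            tokens.append(word)
        else:
            tokens.extend(word)   # fall back to character tokens
    return tokens

# Hypothetical corpus counts: "xl" is rare, so it splits into characters.
vocab = Counter({"red": 50, "shirt": 120, "xl": 3})
assert hybrid_tokenize("red xl shirt", vocab) == ["red", "x", "l", "shirt"]
```
        </preformat>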
      <p>
        It is also possible that additional pooling or aggregation
techniques could bolster the connection between the recurrent
and output layers. One option is an attention mechanism, which
could use convolutions as in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to weight the most interesting time
steps. Multiple such attention vectors could also be generated and
viewed together, and network capacity could be tuned by increasing
both the complexity and number of such signal extractors.
      </p>
    </sec>
    <sec id="sec-20">
      <title>ACKNOWLEDGMENTS</title>
      <p>The author would like to his colleagues at Duetto for supporting
and embracing this research, even though it was done of the clock.
Thank you as well to the fast.ai community for developing and
supporting an amazing MOOC and deep learning library. Finally,
thank you to the challenge organizers for fostering a friendly and
collaborative environment around this new dataset, providing
valu</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Mikolaj</given-names>
            <surname>Binkowski</surname>
          </string-name>
          , Gautier Marti, and
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Donnat</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Autoregressive Convolutional Neural Networks for Asynchronous Time Series</article-title>
          .
          <source>CoRR abs/1703.04122</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>James</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , Stephen Merity, Caiming Xiong, and Richard Socher.
          <year>2016</year>
          .
          <article-title>Quasi-Recurrent Neural Networks</article-title>
          .
          <source>CoRR abs/1611.01576</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Armand Joulin, Moustapha Cissé, David Grangier, and
          <string-name>
            <given-names>Hervé</given-names>
            <surname>Jégou</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Efficient softmax approximation for GPUs</article-title>
          . In ICML.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Moritz</given-names>
            <surname>Hardt</surname>
          </string-name>
          , Benjamin Recht, and
          <string-name>
            <given-names>Yoram</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Train faster, generalize better: Stability of stochastic gradient descent</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2016</year>
          ),
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Universal Language Model Finetuning for Text Classification</article-title>
          . arXiv:1801.06146
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Jaech</surname>
          </string-name>
          , George Mulcaire, Shobhit Hathi, Mari Ostendorf, and
          <string-name>
            <given-names>Noah A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Hierarchical Character-Word Models for Language Identification</article-title>
          . In SocialNLP@EMNLP.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>SGDR: Stochastic Gradient Descent with Restarts</article-title>
          .
          <source>CoRR abs/1608.03983</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Merity</surname>
          </string-name>
          , Nitish Shirish Keskar, and Richard Socher.
          <year>2017</year>
          .
          <article-title>Regularizing and Optimizing LSTM Language Models</article-title>
          .
          <source>arXiv:1708.02182</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Paszke</surname>
          </string-name>
          , Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
          <string-name>
            <given-names>Zachary</given-names>
            <surname>DeVito</surname>
          </string-name>
          , Zeming Lin, Alban Desmaison, Luca Antiga, and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lerer</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automatic differentiation in PyTorch</article-title>
          . In NIPS-W.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Michael P.</given-names>
            <surname>Perrone</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>When Networks Disagree: Ensemble Methods for Hybrid Neural Networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Matthew E.</given-names>
            <surname>Peters</surname>
          </string-name>
          , Waleed Ammar, Chandra Bhagavatula, and
          <string-name>
            <given-names>Russell</given-names>
            <surname>Power</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Semi-supervised sequence tagging with bidirectional language models</article-title>
          .
          <source>In ACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Leslie N.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A disciplined approach to neural network hyperparameters: Part 1 - learning rate, batch size, momentum, and weight decay</article-title>
          . arXiv:1803.09820
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Leslie N.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nicholay</given-names>
            <surname>Topin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Exploring loss function topology with cyclical learning rates</article-title>
          .
          <source>arXiv:1702.04283</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Leslie N.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nicholay</given-names>
            <surname>Topin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates</article-title>
          .
          <source>arXiv:1708.07120</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Ashia C.</given-names>
            <surname>Wilson</surname>
          </string-name>
          , Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Recht</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The Marginal Value of Adaptive Gradient Methods in Machine Learning</article-title>
          .
          <source>In NIPS.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>