=Paper= {{Paper |id=Vol-2319/ecom18DC_paper_9 |storemode=property |title=Product Categorization with LSTMs and Balanced Pooling Views |pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_9.pdf |volume=Vol-2319 |authors=Michael Skinner |dblpUrl=https://dblp.org/rec/conf/sigir/Skinner18 }} ==Product Categorization with LSTMs and Balanced Pooling Views== https://ceur-ws.org/Vol-2319/ecom18DC_paper_9.pdf
 Product Categorization with LSTMs and Balanced Pooling Views
                                                                                    Michael Skinner∗
                                                                                    Seattle, Washington

ABSTRACT
Recurrent Neural Networks (RNNs), and LSTMs in particular, have proven competitive in a wide variety of language modeling and classification tasks. We explore the application of such models to the Rakuten Data Challenge, a large-scale product classification task. We show that a straightforward network architecture and ensembling strategy can achieve state of the art results when trained effectively. We also demonstrate the positive impact of tightening the connections between recurrent and output layers through the use of pooling layers, and introduce the Balanced Pooling View architecture to take advantage of this. Our final solution, produced with a bidirectional ensemble of 6 such models, achieved a weighted F1 score of 0.8513 and won the challenge by a wide margin.

KEYWORDS
Rakuten Data Challenge; text classification; neural networks; deep learning

1 INTRODUCTION
The Rakuten Data Challenge. The goal of the challenge is to classify each product into one of 3008 categories given the title of the product. As an additional challenge, the categories are highly unbalanced and intrinsically noisy as sellers take responsibility for categorizing their own products.
   The dataset contains 1 million categorized products, of which 80% were available for training and the remaining 20% were reserved for a test set. The test set categories were not available during the challenge, except through a provisional leaderboard to score solutions on a subset of them.
   Solutions were scored using the category-weighted average of F1 scores, with the leaderboard also showing weighted precision and recall. That is to say, if there are some labeled documents D, with a set of documents D_c per category c in C, and a corresponding set of documents D̂_c per predicted category, then the weighted F1 score is:

$$
\begin{aligned}
\mathrm{recall}_c &= \frac{|D_c \cap \hat{D}_c|}{|D_c|} \\
\mathrm{precision}_c &= \frac{|D_c \cap \hat{D}_c|}{|\hat{D}_c|} \\
F1_c &= 2 \cdot \frac{\mathrm{precision}_c \cdot \mathrm{recall}_c}{\mathrm{precision}_c + \mathrm{recall}_c} \\
\text{weighted-}F1 &= \frac{1}{|D|} \sum_{c \in C} |D_c| \cdot F1_c
\end{aligned}
\tag{1}
$$

   It is worth noting that the weighted recall used for this challenge is equal to accuracy. With some small abuse of notation in the last line:¹

$$
\begin{aligned}
\text{weighted-recall} &= \frac{1}{|D|} \sum_{c \in C} |D_c| \cdot \mathrm{recall}_c \\
&= \frac{1}{|D|} \sum_{c \in C} |D_c| \cdot \frac{|D_c \cap \hat{D}_c|}{|D_c|} \\
&= \frac{1}{|D|} \sum_{c \in C} |D_c \cap \hat{D}_c| \\
&= \frac{|D \cap \hat{D}|}{|D|}
\end{aligned}
\tag{2}
$$

   Our contributions. Our contributions are as follows: 1) We achieve state-of-the-art results for this dataset, applying the methodology of [6, 9] by focusing on regularizing and training straightforward LSTM architectures effectively. 2) We extend the above methodology to the problem of sizing a configurable model architecture and evaluating potential design improvements. We note that the resulting model architectures are robust to changes to the training schedule, and vice versa. 3) We introduce the Balanced Pooling View architecture, an extension of concat pooling from [6], to extract signal from variable-length recurrent outputs, and show that this improves performance compared to existing architectures.
   Overall challenge results. Our approach won the challenge by a wide margin. Our final submission achieved an F1 score of 0.8513 on the test set, maintaining the recall of the next best solution while improving precision from 0.8425 to 0.8697, decreasing the false positive rate by 17.5% from 15.75% to 13%.² Our position at or near the top of the leaderboard was held for much of the contest, an advantage of the simple but methodical approach to model design, training, and iteration. In addition, our models do not take long to train. It takes less than 24 hours on a commodity GPU to train our final 6-network ensemble.

∗ Correspondence to: research@mcskinner.com
¹ The numerator D_c ∩ D̂_c is the number of true positives, i.e. correctly classified documents for category c. The sum of this over all categories C is the number of correctly classified examples in the entire dataset.
² While we can't technically avoid false positives, we can effectively "not guess" by picking a very rare category where they will not have a noteworthy impact on the weighted scores.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

2 RELATED WORK
Several recent papers have shown the broad applicability of straightforward LSTM architectures. [6] were able to use transfer learning on top of the same pre-trained LSTM architecture to achieve state-of-the-art results on several text classification tasks. That work builds on the work of [9], who used a single AWD-LSTM architecture to achieve state-of-the-art results on both character- and word-level language modeling benchmarks.




   There is also increasing interest in the topic of robust training schedules, often with aggressively high learning rates. [8] introduced the cosine annealing schedule, which follows the descending shape of a cosine wave. Training starts with a high learning rate, then drops to lower learning rates, before plateauing to finish the cycle. Results obtained after one or more iterations were found to have better generalization error than those from other learning rate schedules. A symmetrical low-high-low triangular learning rate schedule from [14], dubbed Cyclical Learning Rates (CLR), was shown to have similar performance and generalization properties. It was promptly refined into the 1cycle policy in [15], which added an additional annealing phase and suggested a single policy cycle rather than several as in [8, 14]. The results from [6] were obtained with a Slanted Triangular Learning Rate (STLR), an asymmetrical variant of CLR in which learning rate is increased over less time and then decreased for more.

3 DATA PREPARATION
3.1 Exploratory Analysis
With 800k examples available for training, this constitutes a fairly large dataset. With 3008 categories to pick from, it also constitutes a large-scale classification problem. For comparison, the text classification datasets in [6] ranged in size from 5.5k to 560k examples, but with only 2 to 14 categories. The categories are also highly unbalanced, although we see in Figure 1 that the tail is not quite as long as if it were Zipfian.
   Due to time constraints, the product titles were not examined closely. A brief overview is given in Figure 2, but only a few things were relevant to the modeling: 1) The maximum sequence length of 274 still allows for reasonably large batches to fit into GPU memory. 2) The wide distribution of lengths implies some potential for inefficiency if long titles are placed into batches next to short titles, which results in a large number of meaningless pad tokens to process.

Figure 1: The category frequencies are Zipfian for the most common categories, but without the long tail for rarer categories.

Figure 2: Product titles range in length from 1 character up to 274. The bulk of the data appears to be normally distributed around a mean just below 60, with a long tail. However there is an obvious artifact which peaks at 80 characters. This is presumed to be some systemic behavior, like a display limitation.

3.2 Tokenization
Due to the high incidence of symbols and alphanumeric product codes, we opted for character-level tokenization. Word-level tokenizers are not well suited for this task, and poor tokenization leads to an unnecessarily large and sparse parameter space to optimize over. To further alleviate sparsity concerns, any characters that occurred fewer than 10 times were replaced with an "unknown" token. Adding in typical tokens for padding, beginning, and end of string, the final vocabulary size was 128 tokens.
   Categories were treated as an opaque classification label. No attempts were made to leverage the existing taxonomy rather than letting the networks learn their own. Representation learning is quite powerful, and there is some risk of adversely biasing the results by imposing a hierarchy in advance.
   The int-encoding of both categories and characters was frequency ordered, such that tokens with higher frequencies had lower indices. This is a common practice, and allows for simple cutoff-based implementations for techniques like adaptive softmax from [3].
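As a concrete illustration of this tokenization scheme, the sketch below builds a frequency-ordered character vocabulary with a minimum count of 10 and the usual special tokens. It is only a reconstruction of the description above; the function names, the special-token strings, and their exact positions in the index are our own assumptions, not the paper's code.

```python
from collections import Counter

PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"

def build_char_vocab(titles, min_count=10):
    """Frequency-ordered character vocabulary: frequent characters get low
    indices, and characters seen fewer than `min_count` times fall back to
    the unknown token."""
    counts = Counter(ch for title in titles for ch in title)
    kept = [ch for ch, n in counts.most_common() if n >= min_count]
    itos = [PAD, UNK, BOS, EOS] + kept           # index-to-string table
    stoi = {ch: i for i, ch in enumerate(itos)}  # string-to-index table
    return stoi, itos

def encode(title, stoi):
    """Wrap a product title in begin/end markers and map rare characters to UNK."""
    unk = stoi[UNK]
    return [stoi[BOS]] + [stoi.get(ch, unk) for ch in title] + [stoi[EOS]]
```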


3.3 Training/Validation Split
We set aside 25% of the training data as a validation set, using stratified sampling to ensure good representation across categories. We chose this set to have 200k examples, like the test set, in order to ensure that all model changes could be accurately evaluated during the challenge. This reduced reliance on the public leaderboard and enabled a methodical process for new ideas from introduction to deployment.
   At the end of the challenge, with the model architecture finalized and training parameters dialed in, the validation data was used to train one last batch of models for the final submission. The 33% increase in training data improved the F1 score on the leaderboard by nearly 0.01, which was enough to change the final standings from a tossup to decisive.

4 MODEL ARCHITECTURE
4.1 Baseline Model
Following the lead of [6, 9], the baseline model uses a straightforward LSTM architecture. Character tokens are looked up in an embedding of size n_e, which feeds into a multi-layer LSTM of width n_w and depth n_d, which feeds into a linear layer of width n_c = |C| to produce a score for each category. In addition, there are dropout layers after the embedding, between the recurrent LSTM layers, and after the LSTM output. Those have respective dropout probabilities d_e, d_r, and d_o.
   The model sizing process began with a deliberately small architecture. Following a rule of thumb, n_e = max(|V|/2, 50) = 50 for our 128-wide vocabulary. The values n_w = 64 and n_d = 2 were chosen to be deliberately small, and because the latter is the PyTorch default. Dropouts were chosen based on an existing example, with d_e = 0.15, d_r = 0.25, and d_o = 0.35.
   This initial model was quickly tuned and tested as per the methodology in Section 5.2. After initial results were shown to quickly plateau, n_w was doubled and the model rerun until this phenomenon stopped at a width of 512.

4.2 Concat Pooling
The idea of concat pooling, as proposed in [6], is to augment the last LSTM output with average- and max-pooled aggregates over the entire sequence. The idea is that the network can become excited about certain categories at any point during the sequence, and the average and maximum are able to capture this information more easily than the final output. In our experiments, this was found to be true, increasing validation precision, recall, and F-score by 0.01, a substantial increment on the leaderboard. We note a few likely advantages from the pooling layers:
   (1) Short-circuiting the outputs from all time steps directly to the loss function means direct feedback for the earlier time steps, leading to faster and more reliable convergence.
   (2) The same direct connection introduces a sort of data augmentation for the RNN output layer. Instead of only solving the translation problem for the last hidden state, it benefits from successfully understanding the intermediate states as well.
   (3) The average pooling is similar in spirit to the ResNet from [5], in which subsequent computations only need to make additive adjustments, i.e. error corrections, on the output from the previous layer. It is possible that this idea of sharing responsibility for the overall prediction has nice generalization properties that are shared across domains.

4.3 Balanced Pooling View
Since average- and max-pooling aggregations were able to help the output layer extract more information from the recurrent layer, it stands to reason that additional aggregates might also help. Although not commonly mentioned, min-pooling is trivial to implement as minpool(x) = -maxpool(-x), and therefore easy to test. In addition, we believe that the use of both min- and max-pooling might encourage the network to use the full range of possible activations, and therefore become more expressive than when using max-pooling alone, which prefers large activations. We call this a Balanced Pooling View (BPV) architecture, and find that it increased the F-score by 0.006 for a single network, trained for 40 epochs, as compared to concat pooling without the min-pool addition. This architecture and its precursors are outlined in Figure 6.³
   It is worth noting that the widening of the pooling options also expands the size of the final output weight matrix, and therefore the capacity of the model. In addition, independent dropout across the views means that most of the 512-wide recurrent outputs will be at least partially observable most of the time.
   To understand the impact of pooling alone, we trained a separate network with the same capacity as the concat pooled networks but multiple copies of the final RNN output rather than the different pooled views. This network did not surpass the performance of the non-pooled baseline model,⁴ so we conclude that the performance improvement comes from increasing the ability of the network to extract signal from the recurrent outputs. See Figure 6 for a visualization of the final BPV architecture.

4.4 Ensembling
Previous work has shown the benefits of bidirectional training on sequence tagging ([12]) and text classification ([6]). Reversing the input sequence acts as a sort of data augmentation, and training a second independent model adds the known advantages of ensembling as in [11].
   Bidirectional training is more powerful than a simple 2-network ensemble. Various ensembles of 2 networks in the same direction, forward or backward, were never as good as bidirectional ensembles. The effect of directionality is explored a bit more in Section 5.7.5. This technique improved our model performance as well, increasing the F1 score by 0.013 with concat pooling and 0.011 with BPV, and is another significant increment on the leaderboard.
   As many additional models and training parameters were tested, it was found that it was not uncommon for solutions to converge to the current best single model result. Despite the lack of single model gains, building an ensemble from several models was found to improve overall performance, continuing to fall in line with results from [11]. An 8-network ensemble, 4 each forward and backward, increased the F-score by another increment of nearly 0.01 on the validation set.

³ This has been placed on the last page of this paper to avoid disruption to the layout.
⁴ To make things worse, it also had trouble converging.
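To make Sections 4.1 through 4.3 concrete, the following PyTorch sketch assembles the embedding-LSTM-linear baseline with a Balanced Pooling View head. The sizes and dropout probabilities follow the text (n_e = 50, n_w = 512, n_d = 2, d_e = 0.15, d_r = 0.25, d_o = 0.35), but the class name, the padding index, and the exact way dropout is applied to each pooled view are our reading of the description rather than the author's released code.

```python
import torch
import torch.nn as nn

class BPVClassifier(nn.Module):
    """Character embedding -> multi-layer LSTM -> Balanced Pooling View -> category scores."""

    def __init__(self, n_vocab=128, n_classes=3008, ne=50, nw=512, nd=2,
                 de=0.15, dr=0.25, do=0.35):
        super().__init__()
        self.embed = nn.Embedding(n_vocab, ne, padding_idx=0)
        self.embed_drop = nn.Dropout(de)                    # dropout after the embedding
        self.lstm = nn.LSTM(ne, nw, num_layers=nd,
                            dropout=dr, batch_first=True)   # dropout between LSTM layers
        self.view_drop = nn.Dropout(do)                     # dropout on the pooled views
        self.fc = nn.Linear(4 * nw, n_classes)              # last + avg + max + min views

    def forward(self, tokens):                              # tokens: (batch, seq_len)
        x = self.embed_drop(self.embed(tokens))             # (batch, seq_len, ne)
        outputs, _ = self.lstm(x)                           # (batch, seq_len, nw)
        last = outputs[:, -1]      # final time step; padded batches would index each true last token
        avg = outputs.mean(dim=1)                           # average pooling
        mx = outputs.max(dim=1).values                      # max pooling
        mn = -((-outputs).max(dim=1).values)                # minpool(x) = -maxpool(-x)
        # independent dropout masks for each view, then concatenate
        views = torch.cat([self.view_drop(v) for v in (last, avg, mx, mn)], dim=1)
        return self.fc(views)                               # raw score per category
```

Dropping the avg/max/min views and feeding only the last output into a narrower linear layer recovers the baseline of Section 4.1, and dropping only the min view recovers concat pooling.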


5 EXPERIMENTS
5.1 Implementation Details
All models were implemented in PyTorch ([10]) and trained on a single Paperspace P5000 instance with a 16GB NVIDIA Quadro P5000 with 2560 CUDA cores. We used the cross-entropy loss function, which implies a softmax over the final activations.
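As a minimal illustration of that point (the shapes below are chosen only for the example), PyTorch's CrossEntropyLoss consumes raw per-category scores and applies the softmax internally, so an explicit softmax is only needed when probabilities are wanted at prediction time.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()        # log-softmax + negative log-likelihood in one step
logits = torch.randn(4, 3008)            # raw scores: batch of 4 titles, 3008 categories
targets = torch.randint(0, 3008, (4,))   # integer category labels
loss = criterion(logits, targets)        # training objective
probs = torch.softmax(logits, dim=1)     # explicit softmax only needed for prediction
```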

5.2    Training Methodology
Following the advice in [13], the batch size was increased until GPU memory struggled to keep up, which it did at a batch size of 256. Due to large variances in the text lengths leading to an unnecessarily large and inefficient number of pad tokens per batch, it was desirable to organize the batches to have similar lengths. In order to avoid the same batch ordering every time, the sorting is done in chunks of several batches at a time. Sorting 50 batches of examples at a time cut training times by approximately half, and it also appeared to have a regularizing effect. This is perhaps due to an observed correlation between title length and categorization difficulty, in which shorter titles are more difficult. Experiments with a fully randomized training set not only lost the 2x speed gain, but also resulted in poorer performance. Changing the width of the sorting window to 10 or 250 batches did not appear to impact the results.
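The chunked sorting can be expressed as a small batch sampler. The sketch below is our simplified rendering of the idea (the function name and defaults are hypothetical): shuffle the examples, sort by length within windows of 50 batches, then cut each window into batches of 256.

```python
import random

def sortish_batches(lengths, batch_size=256, window=50, seed=0):
    """Build batches of similar-length examples.

    Examples are shuffled, then sorted by length within windows of
    window * batch_size examples, so padding per batch stays small while
    batch composition still varies between epochs."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    chunk = window * batch_size
    batches = []
    for start in range(0, len(idxs), chunk):
        block = sorted(idxs[start:start + chunk], key=lambda i: lengths[i])
        batches.extend(block[i:i + batch_size]
                       for i in range(0, len(block), batch_size))
    rng.shuffle(batches)  # avoid presenting the windows in a fixed order
    return batches
```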
   Models were optimized using SGD with momentum, using a cross-entropy loss function. Adam was attempted a few times, but it required learning rates that were two orders of magnitude smaller, and still exhibited difficulty in converging. This is in line with research from [16] on the weakness of adaptive gradient methods. [4] provides further reason to believe that SGD leads to faster training and better generalization error.
   In general we rejected training or architectural changes that substantially reduced the learning rate. This follows the insight from [13] that high learning rates have a regularizing effect, improving generalization error while also reducing training time. Our rule of thumb is a sort of corollary: if high learning rates are desirable, then it is undesirable when modeling variants are surprisingly intolerant to those same learning rates.

5.3    Hyperparameter Tuning
When training each new model architecture, we followed the disciplined approach from [13]. First the peak learning rate was chosen using a learning rate sweep as described in [14]. The sweep starts with a low learning rate, at which nothing much happens. As the learning rate continues to increase, the training and validation loss eventually drop, level out, and then diverge. An example of this is shown in Figure 3.

Figure 3: A typical learning rate sweep, run for 1 epoch with a learning rate that increases linearly to 3. The learning rate levels off substantially around 0.8, with one last drop around 1.3. Using a 40 epoch 1cycle policy, peak learning rates of 0.7, 0.8, and 1.0 were all found to perform well.

   Good peak rates were found to exist just short of the basin in which loss has leveled out. Our methodology was to choose the first learning rates to be on the high end of plausible, sometimes in the basin, and then decrease the estimate until a short training run of 5 epochs trains without exhibiting signs of divergence or oscillation at the peak rate. Peak learning rates found this way were near 1, and sometimes even higher. For the highest performing models, the learning rates were further tuned by hand within a small range to squeeze out remaining performance gains. The majority of our models, including those used for our final submission, were trained with a learning rate of 0.8.
   Momentum was also chosen according to the approach in [13], running a learning rate sweep for several options and comparing the loss curves. Good momentums have nice convergence properties like a quick descent, a low minimum loss, and a longer basin before divergence. The best momentums for this challenge were around 0.95, which is fairly high given the observation in [9] that momentum is not usually preferred for language modeling. While our experiments with low momentum SGD were able to achieve very high learning rates, as high as 10, the models were not able to converge to low losses.
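The sweep itself is straightforward to reproduce. The sketch below is our own simplification (it assumes a standard PyTorch loader yielding token and target tensors, not the exact tooling used for the paper): the learning rate is increased linearly over one epoch while the loss is recorded, stopping once the loss clearly diverges, which produces a curve like the one in Figure 3.

```python
import math

def lr_sweep(model, loader, optimizer, criterion, lr_start=1e-4, lr_end=3.0):
    """One-epoch learning rate sweep: ramp the learning rate linearly from
    lr_start to lr_end, recording (lr, loss) pairs until the loss diverges."""
    n_steps = len(loader)
    history = []
    for step, (tokens, targets) in enumerate(loader):
        lr = lr_start + (lr_end - lr_start) * step / max(n_steps - 1, 1)
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = criterion(model(tokens), targets)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        if not math.isfinite(loss.item()) or loss.item() > 4 * history[0][1]:
            break  # the useful range ends well before this point
    return history
```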




Figure 4: A typical 1cycle policy with a peak learning rate of 0.8, an initial learning rate 20x smaller, momentum ranging from
0.95-0.85, and with 10% of the iterations reserved for final annealing at very low learning rates. This example was used on a
training run of 40 epochs with 2344 batches per epoch.
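The schedule described in Figure 4 can be written down directly. The function below is our reimplementation of that description, not the author's code: a linear low-high-low triangle over roughly the first 90% of iterations, with momentum moving opposite to the learning rate, followed by a final annealing phase at very low rates. For the final models, total_steps would be 40 epochs × 2344 batches per epoch.

```python
def one_cycle(step, total_steps, max_lr=0.8, div=20.0,
              max_mom=0.95, min_mom=0.85, final_frac=0.10, final_div=1e4):
    """Return (learning_rate, momentum) for a single training iteration."""
    base_lr = max_lr / div
    tri_steps = int(total_steps * (1 - final_frac))  # triangular portion of the cycle
    half = tri_steps // 2
    if step < half:                                  # ramp up; momentum ramps down
        t = step / max(half, 1)
        return base_lr + t * (max_lr - base_lr), max_mom - t * (max_mom - min_mom)
    if step < tri_steps:                             # ramp down; momentum ramps back up
        t = (step - half) / max(tri_steps - half, 1)
        return max_lr - t * (max_lr - base_lr), min_mom + t * (max_mom - min_mom)
    t = (step - tri_steps) / max(total_steps - tri_steps, 1)
    return base_lr - t * (base_lr - max_lr / final_div), max_mom  # final annealing
```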


   Weight decay was chosen similarly, with early experiments quickly showing that weight decay above 0 led to difficulty with convergence. Due to time considerations, this parameter was left at 0 for the remainder of the SGD runs rather than trying to tune it further.

5.4 Training Schedule
We used the 1cycle learning rate policy from [15]. Learning rates follow a low-high-low triangular schedule as in [14], followed by a few iterations of increasingly very low learning rates. The typical low rate used was 10-20 times smaller than the peak, and the very low rate is a few orders of magnitude smaller still. See Figure 4 for an example, with the parameters selected using the methodology in Section 5.3. A few variations like the Slanted Triangular Learning Rate (STLR) from [6] were tested and not found to perform any better or worse than the 1cycle policy. The results were found similarly insensitive to the ratio of the peak learning rate to the lower starting value.
   In the early phases of the competition, networks were trained for 20 epochs using a 1cycle policy with a peak learning rate of 1,⁵ with the remaining parameters as specified in Figure 4. In the later phases of the competition networks were trained for 40 epochs, with the peak learning rate reduced to 0.8. This is exactly the parameterization shown in Figure 4.
   When testing new model architectures or training parameters, we used short training runs of 5 epochs to filter for clear improvements. This was enough to distinguish ideas, while keeping the training time within 30 minutes. Each doubling of the schedule length to 10, 20, and then 40 epochs was shown to improve results, and so the full training schedules were again chosen based on timing considerations. In the early phases of the challenge we used a schedule of 20 epochs so that a competitive 2 network ensemble could be trained in under 4 hours, enough to test a few quality ideas per day. The final schedule of 40 epochs was chosen so our tuned model, a bidirectional ensemble of 6 BPV networks, could be trained in under 24 hours. There were no apparent gains from a 60 epoch training cycle, though we also did not take time to carefully re-tune the hyperparameters.

5.5 Model Variants
This section outlines the most interesting of the various extensions that we tried to add on to the basic model. None led to a performance improvement, but several maintained the same performance.
   5.5.1 Larger Models. While decreasing the LSTM width by a modest amount did decrease performance, no increases to the model width or depth were able to increase model performance. Adding additional layers before or after the LSTM led to overfitting or otherwise poor convergence. Increasing LSTM depth n_d to 3, increasing the LSTM width n_w by 50%, and increasing the embedding size n_e to 64 all led to the same performance as the baseline model.
   5.5.2 Regularization Tuning. We tested a variety of changes to the dropout magnitude and mix, and none were found to improve F1 score. Lowering the dropout in any of the layers was prone to problems with overfitting. In particular, d_o = 0.1 is a common choice, but did not add enough regularization and led to overfitting in our experiments. This includes experiments in which we increased to d_e = 0.4 and d_r = 0.4 to match the tuning from [6].
   Increasing the dropout by a consistent factor across all layers resulted in better cross-entropy loss, but did nothing to improve discriminative accuracy. We also tested the introduction of a batch normalization layer right after the LSTM but before dropout and decoding. The resulting networks were sometimes competitive, but prone to diverging during the fitting process.

⁵ For the most common network architecture. When dropout was increased, learning rate was decreased to balance the increased regularization from higher dropout.


   5.5.3 RNN Variants. Swapping out the LSTM for simpler models like GRU led to degraded performance. The QRNN model from [2] was found to be much faster to train as advertised, but also suffered from slightly degraded performance compared to LSTM.
   5.5.4 Adaptive Softmax. Given the unbalanced nature of the categories, it seems worthwhile to focus network bandwidth on high-frequency categories rather than the rare ones. Adaptive softmax, as proposed in [3], does this by partitioning the output categories and then using increasingly small bottleneck layers in front of each partition to limit the number of weights that need to be learned. No variation of this technique managed to improve upon the concat-pooled network, which apparently was not having trouble with this number of categories as compared to the O(million) weights to which adaptive softmax intends to scale.
   5.5.5 Training Variants. Noting the positive results that [1] had with an auxiliary loss function, we experimented with accuracy as a weighted component of the loss function. Even at high weightings this did not appear to have a substantial effect on performance.

5.6 Prediction Generation
The model is trained according to a cross-entropy loss function, but the challenge is scored according to a weighted F1-score. A simple highest-probability-wins strategy is not optimal in this case, and it is better to select a discriminative cutoff for each category to maximize an estimate of the F1-score. This can be done directly from the probabilities output by the network. Assuming well-calibrated probability estimates...
   ...the estimated number of true examples n_true for each category is the sum of probabilities for that category.
   ...the estimated number of true positives n_tp given a certain discriminative cutoff is the sum of probabilities that make the cut.
   ...the number of guessed positives n_guess given a cutoff is the number of probabilities that make the cut.
   Given those estimates, it is possible to estimate the precision, recall, and F1-score for each potential cutoff. Then the cutoff is chosen by taking that for which the F1-score is maximized. It is straightforward to compute this efficiently by sorting the probabilities and using cumulative computations. See Algorithm 1 for details. This strategy tends to increase precision at the expense of recall, boosting the validation precision of bidirectional BPVs from 0.828 to 0.854 and the F1-score from 0.827 to 0.836, but at the cost of dropping recall from 0.833 to 0.826.

ALGORITHM 1: F1-Score Optimization
Data: P, probability estimates for each observation, for a single category.
Result: p̂_cut, an F1-optimizing cutoff below which values in P should be rejected as predictions.
  n_true ← sum(P)
  sort P in descending order
  n_tp,0 ← 0
  p̂_cut ← 0
  score_best ← 0
  for i ← 1 to |P| do
      n_guess,i ← i
      n_tp,i ← n_tp,i-1 + p_i
      rec_i ← n_tp,i / n_true
      prec_i ← n_tp,i / n_guess,i
      score_i ← 2 · (rec_i · prec_i) / (rec_i + prec_i)
      if score_i > score_best then
          p̂_cut ← p_i
          score_best ← score_i
  end

   5.6.1 Probability Calibration. Since the final probability estimates are not guaranteed to be well calibrated, we apply a simple piecewise linear model to ensure that the probability estimates are suited for the F1 tuning. To do so, we first computed the actual accuracy for each 1% increment of predicted probability. A prediction of 20% should be correct 20% of the time, and if it is only correct 10% of the time then the raw estimate needs to be halved to get a true probability.
   Figure 5 shows these raw calibration factors, which are a bit noisy. Rather than using these directly, we applied a simple moving average to visualize the broader trend. Since this trend appeared piecewise linear, we applied such a fit to the data. Figure 5 shows these smoothed probability factors as well. Applying these adjustments to true up the raw network probabilities had a small but positive effect on the validation performance, indicating an improved fit. However, to be conservative, these estimates were not applied to our final submission.

Figure 5: Raw probability estimates from the ensemble tend to be overconfident. Low probability estimates need a bit of a boost, while mediocre probabilities need to be deflated. A piecewise estimate is very close to the smoothed calibrations.
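Algorithm 1 also vectorizes naturally. The NumPy sketch below is our translation (variable names are ours): for a single category it returns the probability cutoff that maximizes the estimated F1 score, assuming the probability estimates are reasonably well calibrated.

```python
import numpy as np

def f1_cutoff(probs):
    """Per-category cutoff below which predictions should be rejected."""
    p = np.sort(np.asarray(probs, dtype=float))[::-1]  # descending probabilities
    n_true = p.sum()                    # estimated number of true examples
    n_tp = np.cumsum(p)                 # estimated true positives at each cutoff
    n_guess = np.arange(1, len(p) + 1)  # guessed positives at each cutoff
    recall = n_tp / n_true
    precision = n_tp / n_guess
    f1 = 2 * precision * recall / (precision + recall)
    best = int(np.argmax(f1))
    return p[best], f1[best]            # predict the category when prob >= cutoff
```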


5.7 Results
   5.7.1 Model Architecture. Table 1 shows the validation set performance for each iterative improvement of the network architecture, using the preliminary training schedule of 20 epochs and a simple best-wins strategy for selecting the prediction. We leave accuracy out of the tables, since it is the same as recall as shown in Equation 2. The introduction of concat pooling increases performance by 0.01 or so on all evaluation metrics, and BPV adds an additional 0.003 on top of that. Using pairs of bidirectional networks led to another boost of 0.01, but that also requires twice as long to train. Section 5.7.5 provides a more apples-to-apples comparison.

Table 1: Model Performance (Best-Wins, 20 Epochs)

  Model                    Precision   Recall   F-Score
  Baseline                     0.795    0.804     0.796
  Concat Pooling               0.806    0.813     0.806
  Balanced Pooling View        0.809    0.816     0.809
  Bidirectional (CP)           0.820    0.826     0.819
  Bidirectional (BPV)          0.821    0.827     0.820

   5.7.2 Training Schedule. In Table 2 we show the benefit of increasing the training schedule from 20 to 40 epochs. The concat pooled networks benefit from a modest scoring boost of 0.001 to 0.002 for a single network and 0.003 to 0.004 for a bidirectional pair. The BPV architecture took better advantage of the increased time, with scores growing 0.004 to 0.005 for a single network and 0.006 to 0.007 for the bidirectional network. In taking advantage of the extra epochs, BPV was able to grow the performance gap to alternatives and highlight the increased modeling capacity of the new architecture.

Table 2: Model Performance (Best-Wins, 40 Epochs)

  Model                    Precision   Recall   F-Score
  Concat Pooling               0.807    0.814     0.808
  Balanced Pooling View        0.814    0.820     0.814
  Bidirectional (CP)           0.823    0.829     0.823
  Bidirectional (BPV)          0.828    0.833     0.827

   5.7.3 F1 Optimization. Table 3 shows the results after the F1 optimization procedure. The use of F1 optimization has a clear positive impact on both precision and F-score. Precision is increased by nearly 0.03 for all networks, while F-scores are all higher by 0.01 or so. This comes at a smaller cost to recall, which falls off by up to nearly 0.01 for some model architectures. This approximately 3 to 1 tradeoff of recall for precision is a good one to make in terms of F1 score, and a result of choosing to break any near-ties away from large and precise categories.

Table 3: Model Performance (F1-Optimized, 40 Epochs)

  Model                    Precision   Recall   F-Score
  Concat Pooling               0.838    0.805     0.819
  Balanced Pooling View        0.842    0.812     0.824
  Bidirectional (CP)           0.852    0.821     0.833
  Bidirectional (BPV)          0.854    0.826     0.836

   5.7.4 Ensembling. In Table 4 we show the performance improvement for increasing ensemble sizes, using later stage networks trained for 40 epochs. The 1-pair ensemble here is the same as the bidirectional BPV results in Table 2. There is a big step when increasing from 1 to 2 pairs, but results quickly level off after an ensemble of even 4 pairs. Our final submission was generated by 3 pairs of networks trained with validation data included. A solution with 4 pairs decreased the test scores slightly, and so we reverted to a previous submission.

Table 4: Ensemble Performance (Best-Wins, 40 Epochs)

  Ensemble Size (in pairs)   Precision   Recall   F-Score
  1                              0.828    0.833     0.827
  2                              0.833    0.838     0.832
  3                              0.835    0.840     0.834
  4                              0.836    0.841     0.835
  5                              0.837    0.841     0.835

   5.7.5 Bidirectionality. Finally we examine the impact of directionality on validation set results. Table 5 shows the best-wins F1 scores for a few ensembles, while Table 6 shows the optimized F1 scores. Interestingly, we found that reverse networks perform slightly better than forward networks. Balanced bidirectional ensembles outperformed equivalently sized one-way ensembles in either direction.

Table 5: Ensemble F1 Scores (Best-Wins, 40 Epochs)

  Model           2 Networks   4 Networks
  Forward              0.825        0.830
  Reverse              0.827        0.831
  Bidirectional        0.827        0.832

Table 6: Ensemble F1 Scores (F1-Optimized, 40 Epochs)

  Model           2 Networks   4 Networks
  Forward              0.835        0.839
  Reverse              0.835        0.840
  Bidirectional        0.836        0.841

   Compared to the results in Table 2, it is clear that the bulk of the improvement comes from ensembling. With only forward networks a single model reaches a best-wins F1 of 0.814, adding a second network improves to 0.825, and doubling the ensemble to 4 networks improves again to 0.830. By comparison, an F1 increase of 0.002 for reversing every other network is more modest. It is nonetheless a noticeable improvement and does not add to the training time, so it is worth including as a part of the ensembling strategy.




           Figure 6: An illustration of the core RNN architecture, as well as Concat Pooling and BPV enhancements.


6 FUTURE WORK
Recalling the model refinements in Section 5.5, we note that the training process achieved remarkably similar results across a wide variety of mutations to the model architecture. This was true of both model changes as well as modifications to the learning rate schedule, and is an encouraging result since it implies saturation of the current model architecture and training methodology. It also implies that further improvement is more likely to come from some enhancement to either the input representation, or from additional improvements to the extraction of signal from the recurrent layer as in Sections 4.2 and 4.3.
   The input representation at this point is quite simple, and has not yet been well explored. The cutoff of 10 occurrences for character inclusion has not been challenged, and it is possible that more tokens could increase the model power. If that conjecture is true, then it might also be worthwhile to try a hybrid word-level tokenization in which rare words are split into characters. Another alternative is to try a hierarchical model as proposed in [7], who were working on a short-phrase language identification task which bears some similarity to this one. The idea of avoiding sparsity problems by computing the word embeddings as a convolution over character embeddings seems like it might apply well to this problem.
   It is also possible that there are additional pooling or aggregation techniques which could bolster the connection between recurrent and output layers. An attention mechanism is one option, which could use convolutions as in [1] to weight the most interesting time steps. Multiple such attention vectors could also be generated and viewed together, and network capacity could be tuned by increasing both the complexity and number of such signal extractors.

ACKNOWLEDGMENTS
The author would like to thank his colleagues at Duetto for supporting and embracing this research, even though it was done off the clock. Thank you as well to the fast.ai community for developing and supporting an amazing MOOC and deep learning library. Finally, thank you to the challenge organizers for fostering a friendly and collaborative environment around this new dataset, providing valuable feedback, and for answering the author's numerous questions throughout the process.

REFERENCES
[1] Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. 2017. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. CoRR abs/1703.04122 (2017).
[2] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-Recurrent Neural Networks. CoRR abs/1611.01576 (2016).
[3] Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017. Efficient softmax approximation for GPUs. In ICML.
[4] Moritz Hardt, Benjamin Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of stochastic gradient descent. In ICML.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 770–778.
[6] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. arXiv:1801.06146
[7] Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith. 2016. Hierarchical Character-Word Models for Language Identification. In SocialNLP@EMNLP.
[8] Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic Gradient Descent with Restarts. CoRR abs/1608.03983 (2016).
[9] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. arXiv:1708.02182
[10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[11] Michael P. Perrone. 1993. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks.
[12] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
[13] Leslie N. Smith. 2018. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv:1803.09820
[14] Leslie N. Smith and Nicholay Topin. 2017. Exploring loss function topology with cyclical learning rates. arXiv:1702.04283
[15] Leslie N. Smith and Nicholay Topin. 2017. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv:1708.07120
[16] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. 2017. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In NIPS.