=Paper=
{{Paper
|id=Vol-2319/ecom18DC_paper_9
|storemode=property
|title=Product Categorization with LSTMs and Balanced Pooling Views
|pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_9.pdf
|volume=Vol-2319
|authors=Michael Skinner
|dblpUrl=https://dblp.org/rec/conf/sigir/Skinner18
}}
==Product Categorization with LSTMs and Balanced Pooling Views==
Michael Skinner∗
Seattle, Washington
ABSTRACT
Recurrent Neural Networks (RNNs), and LSTMs in particular, have proven competitive in a wide variety of language modeling and classification tasks. We explore the application of such models to the Rakuten Data Challenge, a large-scale product classification task. We show that a straightforward network architecture and ensembling strategy can achieve state of the art results when trained effectively. We also demonstrate the positive impact of tightening the connections between recurrent and output layers through the use of pooling layers, and introduce the Balanced Pooling View architecture to take advantage of this. Our final solution, produced with a bidirectional ensemble of 6 such models, achieved a weighted F1 score of 0.8513 and won the challenge by a wide margin.
KEYWORDS
Rakuten Data Challenge; text classification; neural networks; deep learning

1 INTRODUCTION
The Rakuten Data Challenge. The goal of the challenge is to classify each product into one of 3008 categories given the title of the product. As an additional challenge, the categories are highly unbalanced and intrinsically noisy as sellers take responsibility for categorizing their own products.

The dataset contains 1 million categorized products, of which 80% were available for training and the remaining 20% were reserved for a test set. The test set categories were not available during the challenge, except through a provisional leaderboard to score solutions on a subset of them.

Solutions were scored using the category-weighted average of F1 scores, with the leaderboard also showing weighted precision and recall. That is to say, if there are some labeled documents D, with a set of documents D_c per category c in C, and a corresponding set of documents D̂_c per predicted category, then the weighted F1 score is:
$$
\begin{aligned}
\mathrm{recall}_c &= \frac{|D_c \cap \hat{D}_c|}{|D_c|} \\
\mathrm{precision}_c &= \frac{|D_c \cap \hat{D}_c|}{|\hat{D}_c|} \\
F1_c &= 2 \cdot \frac{\mathrm{precision}_c \cdot \mathrm{recall}_c}{\mathrm{precision}_c + \mathrm{recall}_c} \\
\textit{weighted-F1} &= \frac{1}{|D|} \sum_{c \in C} |D_c| \cdot F1_c
\end{aligned}
\tag{1}
$$

It is worth noting that the weighted recall used for this challenge is equal to accuracy. With some small abuse of notation in the last line:¹

$$
\begin{aligned}
\textit{weighted-recall} &= \frac{1}{|D|} \sum_{c \in C} |D_c| \cdot \mathrm{recall}_c \\
&= \frac{1}{|D|} \sum_{c \in C} |D_c| \cdot \frac{|D_c \cap \hat{D}_c|}{|D_c|} \\
&= \frac{1}{|D|} \sum_{c \in C} |D_c \cap \hat{D}_c| \\
&= \frac{|D \cap \hat{D}|}{|D|}
\end{aligned}
\tag{2}
$$
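As a concrete illustration of Equations 1 and 2, the following minimal Python sketch (not from the paper; the array names and the use of scikit-learn are assumptions for illustration) computes the category-weighted F1 and checks that the category-weighted recall reduces to plain accuracy:

```python
# Minimal sketch of the challenge metrics (Equations 1 and 2).
# Assumes y_true and y_pred are integer category ids, one per document.
import numpy as np
from sklearn.metrics import f1_score, recall_score, accuracy_score

y_true = np.array([0, 0, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 2, 0])

# Category-weighted F1: each category's F1 is weighted by |D_c| / |D|.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# Category-weighted recall collapses to accuracy, as in Equation 2.
weighted_recall = recall_score(y_true, y_pred, average="weighted")
assert np.isclose(weighted_recall, accuracy_score(y_true, y_pred))
```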
Our contributions. Our contributions are as follows: 1) We achieve state-of-the-art results for this dataset, applying the methodology of [6, 9] by focusing on regularizing and training straightforward LSTM architectures effectively. 2) We extend the above methodology to the problem of sizing a configurable model architecture and evaluating potential design improvements. We note that the resulting model architectures are robust to changes to the training schedule, and vice versa. 3) We introduce the Balanced Pooling View architecture, an extension of concat pooling from [6], to extract signal from variable-length recurrent outputs, and show that this improves performance compared to existing architectures.

Overall challenge results. Our approach won the challenge by a wide margin. Our final submission achieved an F1 score of 0.8513 on the test set, maintaining the recall of the next best solution while improving precision from 0.8425 to 0.8697, decreasing the false positive rate by 17.5% from 15.75% to 13%.² Our position at or near the top of the leaderboard was held for much of the contest, an advantage of the simple but methodical approach to model design, training, and iteration. In addition, our models do not take long to train: it takes less than 24 hours on a commodity GPU to train our final 6-network ensemble.

¹ The numerator |D_c ∩ D̂_c| is the number of true positives, i.e. correctly classified documents for category c. The sum of this over all categories C is the number of correctly classified examples in the entire dataset.
² While we can't technically avoid false positives, we can effectively "not guess" by picking a very rare category where they will not have a noteworthy impact on the weighted scores.

∗ Correspondence to: research@mcskinner.com
Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao
(eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA,
published at http://ceur-ws.org
2 RELATED WORK
Several recent papers have shown the broad applicability of straightforward LSTM architectures. [6] were able to use transfer learning on top of the same pre-trained LSTM architecture to achieve state-of-the-art results on several text classification tasks. That work builds on the work of [9], who used a single AWD-LSTM architecture to achieve state-of-the-art results on both character- and word-level language modeling benchmarks.

There is also increasing interest in the topic of robust training schedules, often with aggressively high learning rates. [8] introduced the cosine annealing schedule, which follows the descending shape of a cosine wave. Training starts with a high learning rate, then drops to lower learning rates, before plateauing to finish the cycle. Results obtained after one or more iterations were found to have better generalization error than those from other learning rate schedules. A symmetrical low-high-low triangular learning rate schedule from [14], dubbed Cyclical Learning Rates (CLR), was shown to have similar performance and generalization properties. It was promptly refined into the 1cycle policy in [15], which added an additional annealing phase and suggested a single policy cycle rather than several as in [8, 14]. The results from [6] were obtained with a Slanted Triangular Learning Rate (STLR), an asymmetrical variant of CLR in which the learning rate is increased over less time and then decreased over more.

3 DATA PREPARATION

3.1 Exploratory Analysis
With 800k examples available for training, this constitutes a fairly large dataset. With 3008 categories to pick from, it also constitutes a large-scale classification problem. For comparison, the text classification datasets in [6] ranged in size from 5.5k to 560k examples, but with only 2 to 14 categories. The categories are also highly unbalanced, although we see in Figure 1 that the tail is not quite as long as if it were Zipfian.

Figure 1: The category frequencies are Zipfian for the most common categories, but without the long tail for rarer categories.

Due to time constraints, the product titles were not examined closely. A brief overview is given in Figure 2, but only a few things were relevant to the modeling: 1) The maximum sequence length of 274 still allows for reasonably large batches to fit into GPU memory. 2) The wide distribution of lengths implies some potential for inefficiency if long titles are placed into batches next to short titles, which results in a large number of meaningless pad tokens to process.

Figure 2: Product titles range in length from 1 character up to 274. The bulk of the data appears to be normally distributed around a mean just below 60, with a long tail. However, there is an obvious artifact which peaks at 80 characters. This is presumed to be some systemic behavior, like a display limitation.
3.2 Tokenization
Due to the high incidence of symbols and alphanumeric product codes, we opted for character-level tokenization. Word-level tokenizers are not well suited for this task, and poor tokenization leads to an unnecessarily large and sparse parameter space to optimize over. To further alleviate sparsity concerns, any characters that occurred fewer than 10 times were replaced with an "unknown" token. Adding in typical tokens for padding, beginning, and end of string, the final vocabulary size was 128 tokens.

Categories were treated as an opaque classification label. No attempts were made to leverage the existing taxonomy rather than letting the networks learn their own. Representation learning is quite powerful, and there is some risk of adversely biasing the results by imposing a hierarchy in advance.

The int-encoding of both categories and characters was frequency ordered, such that tokens with higher frequencies had lower indices. This is a common practice, and allows for simple cutoff-based implementations for techniques like adaptive softmax from [3].
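A minimal sketch of this kind of character vocabulary construction is shown below. The function names, the counting approach, and the exact special tokens are assumptions for illustration; only the frequency cutoff of 10 and the frequency-ordered indexing come from the text.

```python
# Build a frequency-ordered character vocabulary with an "unknown" cutoff.
# Sketch only; variable names and special tokens are illustrative.
from collections import Counter

def build_vocab(titles, min_count=10):
    counts = Counter(ch for title in titles for ch in title)
    specials = ["<pad>", "<bos>", "<eos>", "<unk>"]
    # Frequency-ordered: more frequent characters get lower indices.
    chars = [ch for ch, n in counts.most_common() if n >= min_count]
    itos = specials + chars
    stoi = {ch: i for i, ch in enumerate(itos)}
    return stoi, itos

def encode(title, stoi):
    unk = stoi["<unk>"]
    return [stoi["<bos>"]] + [stoi.get(ch, unk) for ch in title] + [stoi["<eos>"]]
```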
3.3 Training/Validation Split
We set aside 25% of the training data as a validation set, using stratified sampling to ensure good representation across categories. We chose this set to have 200k examples, like the test set, in order to ensure that all model changes could be accurately evaluated during the challenge. This reduced reliance on the public leaderboard and enabled a methodical process for taking new ideas from introduction to deployment.
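A stratified split of this shape could be produced along the following lines (a sketch using scikit-learn; the dataframe and column names are assumptions):

```python
# Hold out 25% of the 800k training rows, stratified by category.
from sklearn.model_selection import train_test_split

train_df, valid_df = train_test_split(
    labeled_df,                       # assumed dataframe of titles and category ids
    test_size=0.25,                   # 200k validation examples out of 800k
    stratify=labeled_df["category"],  # keep category proportions in both splits
    random_state=42,
)
```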
At the end of the challenge, with the model architecture finalized and training parameters dialed in, the validation data was used to train one last batch of models for the final submission. The 33% increase in training data improved the F1 score on the leaderboard by nearly 0.01, which was enough to change the final standings from a tossup to decisive.

4 MODEL ARCHITECTURE

4.1 Baseline Model
Following the lead of [6, 9], the baseline model uses a straightforward LSTM architecture. Character tokens are looked up in an embedding of size n_e, which feeds into a multi-layer LSTM of width n_w and depth n_d, which feeds into a linear layer of width n_c = |C| to produce a score for each category. In addition, there are dropout layers after the embedding, between the recurrent LSTM layers, and after the LSTM output. Those have respective dropout probabilities d_e, d_r, and d_o.

The model sizing process began with a deliberately small architecture. Following a rule of thumb, n_e = min(|V|/2, 50) = 50 for our 128-wide vocabulary. The values n_w = 64 and n_d = 2 were chosen to be deliberately small, and because the latter is the PyTorch default. Dropouts were chosen based on an existing example, with d_e = 0.15, d_r = 0.25, and d_o = 0.35.

This initial model was quickly tuned and tested as per the methodology in Section 5.2. After initial results were shown to quickly plateau, n_w was doubled and the model rerun until this phenomenon stopped, at a width of 512.
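The baseline described above corresponds roughly to the following PyTorch sketch. This is a reconstruction from the description, not the author's code; the class name, the batch-first layout, and the use of the final time step as the sequence summary are assumptions.

```python
import torch
import torch.nn as nn

class BaselineClassifier(nn.Module):
    """Embedding -> multi-layer LSTM -> linear decoder, with the dropouts from Section 4.1."""

    def __init__(self, n_vocab=128, n_e=50, n_w=512, n_d=2, n_c=3008,
                 d_e=0.15, d_r=0.25, d_o=0.35):
        super().__init__()
        self.embed = nn.Embedding(n_vocab, n_e, padding_idx=0)
        self.emb_drop = nn.Dropout(d_e)
        self.lstm = nn.LSTM(n_e, n_w, num_layers=n_d,
                            dropout=d_r, batch_first=True)
        self.out_drop = nn.Dropout(d_o)
        self.decoder = nn.Linear(n_w, n_c)

    def forward(self, tokens):
        # tokens: (batch, seq_len) int64 character ids
        emb = self.emb_drop(self.embed(tokens))
        outputs, _ = self.lstm(emb)                # (batch, seq_len, n_w)
        last = outputs[:, -1, :]                   # final time step as the sequence summary
        return self.decoder(self.out_drop(last))   # (batch, n_c) category scores
```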
4.2 Concat Pooling
The idea of concat pooling, as proposed in [6], is to augment the last LSTM output with average- and max-pooled aggregates over the entire sequence. The idea is that the network can become excited about certain categories at any point during the sequence, and the average and maximum are able to capture this information more easily than the final output. In our experiments this was found to be true, increasing validation precision, recall, and F-score by 0.01, a substantial increment on the leaderboard. We note a few likely advantages from the pooling layers:
(1) Short-circuiting the outputs from all time steps directly to the loss function means direct feedback for the earlier time steps, leading to faster and more reliable convergence.
(2) The same direct connection introduces a sort of data augmentation for the RNN output layer. Instead of only solving the translation problem for the last hidden state, it benefits from successfully understanding the intermediate states as well.
(3) The average pooling is similar in spirit to the ResNet from [5], in which subsequent computations only need to make additive adjustments, i.e. error corrections, to the output from the previous layer. It is possible that this idea of sharing responsibility for the overall prediction has nice generalization properties that are shared across domains.

4.3 Balanced Pooling View
Since average- and max-pooling aggregations were able to help the output layer extract more information from the recurrent layer, it stands to reason that additional aggregates might also help. Although not commonly mentioned, min-pooling is trivial to implement as minpool(x) = −maxpool(−x), and therefore easy to test. In addition, we believe that the use of both min- and max-pooling might encourage the network to use the full range of possible activations, and therefore become more expressive than it would when using max-pooling alone, which prefers large activations. We call this a Balanced Pooling View (BPV) architecture, and find that it increased the F-score by 0.006 for a single network trained for 40 epochs, as compared to concat pooling without the min-pool addition. This architecture and its precursors are outlined in Figure 6.³

It is worth noting that the widening of the pooling options also expands the size of the final output weight matrix, and therefore the capacity of the model. In addition, independent dropout across the views means that most of the 512-wide recurrent outputs will be at least partially observable most of the time.

To understand the impact of pooling alone, we trained a separate network with the same capacity as the concat-pooled networks but with multiple copies of the final RNN output rather than the different pooled views. This network did not surpass the performance of the non-pooled baseline model,⁴ so we conclude that the performance improvement comes from increasing the ability of the network to extract signal from the recurrent outputs. See Figure 6 for a visualization of the final BPV architecture.
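The pooled views described in Sections 4.2 and 4.3 can be sketched as follows. This is an illustrative reconstruction rather than the author's implementation; the per-view dropout probability and the absence of padding masks are assumptions, and dropping the min view recovers plain concat pooling.

```python
import torch
import torch.nn as nn

class BalancedPoolingHead(nn.Module):
    """Concatenate last-step, mean-, max-, and min-pooled views of the LSTM outputs."""

    def __init__(self, n_w=512, n_c=3008, d_view=0.35):
        super().__init__()
        # Independent dropout per view, so each 512-wide recurrent output is
        # usually at least partially visible to the decoder.
        self.drops = nn.ModuleList(nn.Dropout(d_view) for _ in range(4))
        self.decoder = nn.Linear(4 * n_w, n_c)

    def forward(self, outputs):
        # outputs: (batch, seq_len, n_w) from the recurrent layer
        last = outputs[:, -1, :]
        avg = outputs.mean(dim=1)
        mx = outputs.max(dim=1).values
        mn = -(-outputs).max(dim=1).values   # minpool(x) = -maxpool(-x)
        views = [drop(v) for drop, v in zip(self.drops, (last, avg, mx, mn))]
        return self.decoder(torch.cat(views, dim=1))
```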
4.4 Ensembling
Previous work has shown the benefits of bidirectional training on sequence tagging ([12]) and text classification ([6]). Reversing the input sequence acts as a sort of data augmentation, and training a second independent model adds the known advantages of ensembling as in [11].

Bidirectional training is more powerful than a simple 2-network ensemble. Various ensembles of 2 networks in the same direction, forward or backward, were never as good as bidirectional ensembles. The effect of directionality is explored a bit more in Section 5.7.5. This technique improved our model performance as well, increasing the F1 score by 0.013 with concat pooling and 0.011 with BPV, and is another significant increment on the leaderboard.

As many additional models and training parameters were tested, it was found that it was not uncommon for solutions to converge to the current best single-model result. Despite the lack of single-model gains, building an ensemble from several models was found to improve overall performance, continuing to fall in line with results from [11]. An 8-network ensemble, 4 each forward and backward, increased the F-score by another increment of nearly 0.01 on the validation set.

³ This has been placed on the last page of this paper to avoid disruption to the layout.
⁴ To make things worse, it also had trouble converging.
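As a rough sketch of how such a bidirectional ensemble can be evaluated: the paper does not spell out how member predictions are combined, so the averaging of per-category probabilities below is an assumption (consistent with standard ensemble averaging as in [11]), as are the function name and the reversed-input handling.

```python
import torch

@torch.no_grad()
def ensemble_probs(forward_models, reverse_models, tokens):
    """Average per-category probabilities over a bidirectional ensemble.

    tokens: (batch, seq_len) character ids. Reverse-trained models are fed
    the reversed sequence; simple probability averaging is assumed here.
    """
    probs = []
    for model in forward_models:
        probs.append(torch.softmax(model(tokens), dim=1))
    flipped = torch.flip(tokens, dims=[1])      # reversed input for reverse-trained models
    for model in reverse_models:
        probs.append(torch.softmax(model(flipped), dim=1))
    return torch.stack(probs).mean(dim=0)       # (batch, n_categories)
```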
5 EXPERIMENTS
5.1 Implementation Details
All models were implemented in PyTorch ([10]) and trained on a single Paperspace P5000 instance with a 16GB NVIDIA Quadro P5000 with 2560 CUDA cores. We used the cross-entropy loss function, which implies a softmax over the final activations.
5.2 Training Methodology
Following the advice in [13], the batch size was increased until GPU memory struggled to keep up, which it did at a batch size of 256. Because large variances in the text lengths lead to an unnecessarily large and inefficient number of pad tokens per batch, it was desirable to organize the batches to have similar lengths. In order to avoid the same batch ordering every time, the sorting is done in chunks of several batches at a time. Sorting 50 batches of examples at a time cut training times by approximately half, and it also appeared to have a regularizing effect. This is perhaps due to an observed correlation between title length and categorization difficulty, in which shorter titles are more difficult. Experiments with a fully randomized training set not only lost the 2x speed gain, but also resulted in poorer performance. Changing the width of the sorting window to 10 or 250 batches did not appear to impact the results.
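A minimal sketch of this chunked length-sorting follows; the function name and the shuffling details are assumptions, with only the batch size of 256 and the 50-batch window taken from the text.

```python
import random

def chunked_length_sorted_batches(examples, batch_size=256, window_batches=50):
    """Yield batches of similar-length titles without fixing the batch order.

    examples: list of (title_tokens, category) pairs. Titles are sorted by
    length only within windows of `window_batches` batches, so each epoch
    still sees a different ordering after shuffling.
    """
    random.shuffle(examples)
    window = batch_size * window_batches
    for start in range(0, len(examples), window):
        chunk = sorted(examples[start:start + window], key=lambda ex: len(ex[0]))
        for b in range(0, len(chunk), batch_size):
            yield chunk[b:b + batch_size]
```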
Models were optimized using SGD with momentum, using a cross-entropy loss function. Adam was attempted a few times, but it required learning rates that were two orders of magnitude smaller, and it still exhibited difficulty in converging. This is in line with research from [16] on the weakness of adaptive gradient methods. [4] provides further reason to believe that SGD leads to faster training and better generalization error.

In general we rejected training or architectural changes that substantially reduced the learning rate. This follows the insight from [13] that high learning rates have a regularizing effect, improving generalization error while also reducing training time. Our rule of thumb is a sort of corollary: if high learning rates are desirable, then it is undesirable when modeling variants are surprisingly intolerant of those same learning rates.
5.3 Hyperparameter Tuning
When training each new model architecture, we followed the disciplined approach from [13]. First the peak learning rate was chosen using a learning rate sweep as described in [14]. The sweep starts with a low learning rate, at which nothing much happens. As the learning rate continues to increase, the training and validation loss eventually drop, level out, and then diverge. An example of this is shown in Figure 3.

Figure 3: A typical learning rate sweep, run for 1 epoch with a learning rate that increases linearly to 3. The loss levels off substantially around a learning rate of 0.8, with one last drop around 1.3. Using a 40 epoch 1cycle policy, peak learning rates of 0.7, 0.8, and 1.0 were all found to perform well.
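The sweep itself is simple to implement. The sketch below is a reconstruction, not the author's code; the linear ramp over one epoch follows the Figure 3 description, while the starting rate and the iterator interface are assumptions.

```python
import torch

def lr_sweep(model, loader, criterion, max_lr=3.0, momentum=0.95):
    """One-epoch learning rate sweep: ramp the LR linearly and log the loss vs. LR."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=momentum)
    n_steps = len(loader)
    history = []
    for step, (tokens, labels) in enumerate(loader):
        lr = max_lr * (step + 1) / n_steps            # linear ramp up to max_lr
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = criterion(model(tokens), labels)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))             # plot loss against lr afterwards
    return history
```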
Good peak rates were found to exist just short of the basin in which the loss has leveled out. Our methodology was to choose the first learning rates to be on the high end of plausible, sometimes in the basin, and then decrease the estimate until a short training run of 5 epochs trains without exhibiting signs of divergence or oscillation at the peak rate. Peak learning rates found this way were near 1, and sometimes even higher. For the highest performing models, the learning rates were further tuned by hand within a small range to squeeze out remaining performance gains. The majority of our models, including those used for our final submission, were trained with a learning rate of 0.8.

Momentum was also chosen according to the approach in [13], running a learning rate sweep for several options and comparing the loss curves. Good momentums have nice convergence properties like a quick descent, a low minimum loss, and a longer basin before divergence. The best momentums for this challenge were around 0.95, which is fairly high given the observation in [9] that momentum is not usually preferred for language modeling. While our experiments with low-momentum SGD were able to achieve very high learning rates, as high as 10, the models were not able to converge to low losses.
Weight decay was chosen similarly, with early experiments quickly showing that weight decay above 0 led to difficulty with convergence. Due to time considerations, this parameter was left at 0 for the remainder of the SGD runs rather than trying to tune it further.

5.4 Training Schedule
We used the 1cycle learning rate policy from [15]. Learning rates follow a low-high-low triangular schedule as in [14], followed by a few iterations of increasingly very low learning rates. The typical low rate used was 10-20 times smaller than the peak, and the very low rate is a few orders of magnitude smaller still. See Figure 4 for an example, with the parameters selected using the methodology in Section 5.3. A few variations like the Slanted Triangular Learning Rate (STLR) from [6] were tested and not found to perform any better or worse than the 1cycle policy. The results were found to be similarly insensitive to the ratio of the peak learning rate to the lower starting value.

Figure 4: A typical 1cycle policy with a peak learning rate of 0.8, an initial learning rate 20x smaller, momentum ranging from 0.95-0.85, and with 10% of the iterations reserved for final annealing at very low learning rates. This example was used on a training run of 40 epochs with 2344 batches per epoch.

In the early phases of the competition, networks were trained for 20 epochs using a 1cycle policy with a peak learning rate of 1,⁵ with the remaining parameters as specified in Figure 4. In the later phases of the competition, networks were trained for 40 epochs, with the peak learning rate reduced to 0.8. This is exactly the parameterization shown in Figure 4.

When testing new model architectures or training parameters, we used short training runs of 5 epochs to filter for clear improvements. This was enough to distinguish ideas, while keeping the training time within 30 minutes. Each doubling of the schedule length to 10, 20, and then 40 epochs was shown to improve results, and so the full training schedules were again chosen based on timing considerations. In the early phases of the challenge we used a schedule of 20 epochs so that a competitive 2-network ensemble could be trained in under 4 hours, enough to test a few quality ideas per day. The final schedule of 40 epochs was chosen so that our tuned model, a bidirectional ensemble of 6 BPV networks, could be trained in under 24 hours. There were no apparent gains from a 60-epoch training cycle, though we also did not take time to carefully re-tune the hyperparameters.

⁵ For the most common network architecture. When dropout was increased, the learning rate was decreased to balance the increased regularization from higher dropout.
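PyTorch ships a scheduler modeled on this policy. The sketch below is only an approximation of the Figure 4 schedule, not the paper's exact one: the phase split, the final divisor, and the linear annealing strategy are assumptions, while the 0.8 peak, the 20x smaller start, and the 0.95-0.85 momentum range come from the caption.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = BaselineClassifier()   # from the Section 4.1 sketch; any nn.Module works here
optimizer = torch.optim.SGD(model.parameters(), lr=0.8 / 20, momentum=0.95)

epochs, steps_per_epoch = 40, 2344
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.8,                  # peak learning rate from Figure 4
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    div_factor=20,               # start at a rate 20x below the peak
    pct_start=0.45,              # roughly triangular up/down phases (assumed split)
    anneal_strategy="linear",
    cycle_momentum=True,
    base_momentum=0.85,
    max_momentum=0.95,
    final_div_factor=1000,       # finish at a much lower rate (assumed value)
)
# scheduler.step() is then called once per batch inside the training loop.
```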
5.5 Model Variants
This section outlines the most interesting of the various extensions that we tried to add on to the basic model. None led to a performance improvement, but several maintained the same performance.

5.5.1 Larger Models. While decreasing the LSTM width by a modest amount did decrease performance, no increases to the model width or depth were able to increase model performance. Adding additional layers before or after the LSTM led to overfitting or otherwise poor convergence. Increasing the LSTM depth n_d to 3, increasing the LSTM width n_w by 50%, and increasing the embedding size n_e to 64 all led to the same performance as the baseline model.

5.5.2 Regularization Tuning. We tested a variety of changes to the dropout magnitude and mix, and none were found to improve the F1 score. Lowering the dropout in any of the layers was prone to problems with overfitting. In particular, d_o = 0.1 is a common choice, but it did not add enough regularization and led to overfitting in our experiments. This includes experiments in which we increased to d_e = 0.4 and d_r = 0.4 to match the tuning from [6]. Increasing the dropout by a consistent factor across all layers resulted in better cross-entropy loss, but did nothing to improve discriminative accuracy. We also tested the introduction of a batch normalization layer right after the LSTM but before dropout and decoding. The resulting networks were sometimes competitive, but prone to diverging during the fitting process.
5.5.3 RNN Variants. Swapping out the LSTM for simpler models like the GRU led to degraded performance. The QRNN model from [2] was found to be much faster to train, as advertised, but also suffered from slightly degraded performance compared to the LSTM.

5.5.4 Adaptive Softmax. Given the unbalanced nature of the categories, it seems worthwhile to focus network bandwidth on high-frequency categories rather than the rare ones. Adaptive softmax, as proposed in [3], does this by partitioning the output categories and then using increasingly small bottleneck layers in front of each partition to limit the number of weights that need to be learned. No variation of this technique managed to improve upon the concat-pooled network, which apparently was not having trouble with this number of categories as compared to the O(million) weights to which adaptive softmax intends to scale.

5.5.5 Training Variants. Noting the positive results that [1] had with an auxiliary loss function, we experimented with accuracy as a weighted component of the loss function. Even at high weightings this did not appear to have a substantial effect on performance.

5.6 Prediction Generation
The model is trained according to a cross-entropy loss function, but the challenge is scored according to a weighted F1 score. A simple highest-probability-wins strategy is not optimal in this case, and it is better to select a discriminative cutoff for each category to maximize an estimate of the F1 score. This can be done directly from the probabilities output by the network. Assuming well-calibrated probability estimates...
...the estimated number of true examples n_true for each category is the sum of probabilities for that category.
...the estimated number of true positives n_tp given a certain discriminative cutoff is the sum of probabilities that make the cut.
...the number of guessed positives n_guess given a cutoff is the number of probabilities that make the cut.

Given those estimates, it is possible to estimate the precision, recall, and F1 score for each potential cutoff. Then the cutoff is chosen by taking that for which the F1 score is maximized. It is straightforward to compute this efficiently by sorting the probabilities and using cumulative computations. See Algorithm 1 for details. This strategy tends to increase precision at the expense of recall, boosting the validation precision of bidirectional BPVs from 0.828 to 0.854 and the F1 score from 0.827 to 0.836, but at the cost of dropping recall from 0.833 to 0.826.

ALGORITHM 1: F1-Score Optimization
Data: P, probability estimates for each observation, for a single category.
Result: p̂_cut, an F1-optimizing cutoff below which values in P should be rejected as predictions.
  n_true ← sum(P)
  sort P in descending order
  n_tp,0 ← 0
  p̂_cut ← 0
  score_best ← 0
  for i ← 1 to |P| do
      n_guess,i ← i
      n_tp,i ← n_tp,i−1 + p_i
      rec_i ← n_tp,i / n_true
      prec_i ← n_tp,i / n_guess,i
      score_i ← 2 · rec_i · prec_i / (rec_i + prec_i)
      if score_i > score_best then
          p̂_cut ← p_i
          score_best ← score_i
      end
  end
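A direct Python translation of Algorithm 1 is sketched below; the variable names and the use of NumPy for the sorting and cumulative sums are illustrative choices rather than the paper's own code.

```python
import numpy as np

def f1_optimal_cutoff(probs):
    """Return the cutoff on a single category's probabilities that maximizes estimated F1.

    probs: predicted probabilities for this category, one per observation.
    """
    p = np.sort(np.asarray(probs, dtype=float))[::-1]   # descending order
    n_true = p.sum()                                     # estimated true examples
    n_tp = np.cumsum(p)                                  # estimated true positives per cutoff
    n_guess = np.arange(1, len(p) + 1)                   # guesses made at each cutoff
    recall = n_tp / n_true
    precision = n_tp / n_guess
    score = 2 * recall * precision / (recall + precision)
    best = int(np.argmax(score))
    return p[best]                                       # probabilities below this are rejected
```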
5.6.1 Probability Calibration. Since the final probability estimates are not guaranteed to be well calibrated, we apply a simple piecewise linear model to ensure that the probability estimates are suited for the F1 tuning. To do so, we first computed the actual accuracy for each 1% increment of predicted probability. A prediction of 20% should be correct 20% of the time, and if it is only correct 10% of the time then the raw estimate needs to be halved to get a true probability.

Figure 5 shows these raw calibration factors, which are a bit noisy. Rather than using these directly, we applied a simple moving average to visualize the broader trend. Since this trend appeared piecewise linear, we applied such a fit to the data. Figure 5 shows these smoothed probability factors as well. Applying these adjustments to true up the raw network probabilities had a small but positive effect on the validation performance, indicating an improved fit. However, to be conservative, these estimates were not applied to our final submission.

Figure 5: Raw probability estimates from the ensemble tend to be overconfident. Low probability estimates need a bit of a boost, while mediocre probabilities need to be deflated. A piecewise estimate is very close to the smoothed calibrations.
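The binning step can be sketched as follows. This is an illustration of the described calibration, not the paper's code; the moving-average window and the use of linear interpolation between bin centers (in place of the paper's piecewise linear fit) are assumptions.

```python
import numpy as np

def calibration_curve(pred_probs, was_correct, n_bins=100, window=5):
    """Per-bin accuracy for 1% increments of predicted probability, lightly smoothed."""
    bins = np.clip((pred_probs * n_bins).astype(int), 0, n_bins - 1)
    accuracy = np.array([
        was_correct[bins == b].mean() if np.any(bins == b) else np.nan
        for b in range(n_bins)
    ])
    kernel = np.ones(window) / window                    # simple moving average
    smoothed = np.convolve(np.nan_to_num(accuracy), kernel, mode="same")
    return accuracy, smoothed

def recalibrate(pred_probs, smoothed, n_bins=100):
    """Map raw probabilities onto the smoothed calibration curve by interpolation."""
    centers = (np.arange(n_bins) + 0.5) / n_bins
    return np.interp(pred_probs, centers, smoothed)
```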
5.7 Results

5.7.1 Model Architecture. Table 1 shows the validation set performance for each iterative improvement of the network architecture, using the preliminary training schedule of 20 epochs and a simple best-wins strategy for selecting the prediction. We leave accuracy out of the tables, since it is the same as recall, as shown in Equation 2. The introduction of concat pooling increases performance by 0.01 or so on all evaluation metrics, and BPV adds an additional 0.003 on top of that. Using pairs of bidirectional networks led to another boost of 0.01, but that also requires twice as long to train. Section 5.7.5 provides a more apples-to-apples comparison.

Table 1: Model Performance (Best-Wins, 20 Epochs)
  Model                   Precision  Recall  F-Score
  Baseline                0.795      0.804   0.796
  Concat Pooling          0.806      0.813   0.806
  Balanced Pooling View   0.809      0.816   0.809
  Bidirectional (CP)      0.820      0.826   0.819
  Bidirectional (BPV)     0.821      0.827   0.820

5.7.2 Training Schedule. In Table 2 we show the benefit of increasing the training schedule from 20 to 40 epochs. The concat-pooled networks benefit from a modest scoring boost of 0.001 to 0.002 for a single network and 0.003 to 0.004 for a bidirectional pair. The BPV architecture took better advantage of the increased time, with scores growing 0.004 to 0.005 for a single network and 0.006 to 0.007 for the bidirectional pair. In taking advantage of the extra epochs, BPV was able to grow the performance gap over the alternatives and highlight the increased modeling capacity of the new architecture.

Table 2: Model Performance (Best-Wins, 40 Epochs)
  Model                   Precision  Recall  F-Score
  Concat Pooling          0.807      0.814   0.808
  Balanced Pooling View   0.814      0.820   0.814
  Bidirectional (CP)      0.823      0.829   0.823
  Bidirectional (BPV)     0.828      0.833   0.827

5.7.3 F1 Optimization. Table 3 shows the results after the F1 optimization procedure. The use of F1 optimization has a clear positive impact on both precision and F-score. Precision is increased by nearly 0.03 for all networks, while F-scores are all higher by 0.01 or so. This comes at a smaller cost to recall, which falls off by up to nearly 0.01 for some model architectures. This approximately 3-to-1 tradeoff of recall for precision is a good one to make in terms of F1 score, and a result of choosing to break any near-ties away from large and precise categories.

Table 3: Model Performance (F1-Optimized, 40 Epochs)
  Model                   Precision  Recall  F-Score
  Concat Pooling          0.838      0.805   0.819
  Balanced Pooling View   0.842      0.812   0.824
  Bidirectional (CP)      0.852      0.821   0.833
  Bidirectional (BPV)     0.854      0.826   0.836

5.7.4 Ensembling. In Table 4 we show the performance improvement for increasing ensemble sizes, using later-stage networks trained for 40 epochs. The 1-pair ensemble here is the same as the bidirectional BPV results in Table 2. There is a big step when increasing from 1 to 2 pairs, but results quickly level off after an ensemble of even 4 pairs. Our final submission was generated by 3 pairs of networks trained with the validation data included. A solution with 4 pairs decreased the test scores slightly, and so we reverted to a previous submission.

Table 4: Ensemble Performance (Best-Wins, 40 Epochs)
  Ensemble Size (in pairs)  Precision  Recall  F-Score
  1                         0.828      0.833   0.827
  2                         0.833      0.838   0.832
  3                         0.835      0.840   0.834
  4                         0.836      0.841   0.835
  5                         0.837      0.841   0.835

5.7.5 Bidirectionality. Finally we examine the impact of directionality on validation set results. Table 5 shows the best-wins F1 scores for a few ensembles, while Table 6 shows the optimized F1 scores. Interestingly, we found that reverse networks perform slightly better than forward networks. Balanced bidirectional ensembles outperformed equivalently sized one-way ensembles in either direction.

Table 5: Ensemble F1 Scores (Best-Wins, 40 Epochs)
  Model          2 Networks  4 Networks
  Forward        0.825       0.830
  Reverse        0.827       0.831
  Bidirectional  0.827       0.832

Table 6: Ensemble F1 Scores (F1-Optimized, 40 Epochs)
  Model          2 Networks  4 Networks
  Forward        0.835       0.839
  Reverse        0.835       0.840
  Bidirectional  0.836       0.841

Compared to the results in Table 2, it is clear that the bulk of the improvement comes from ensembling. With only forward networks, a single model reaches a best-wins F1 of 0.814, adding a second network improves to 0.825, and doubling the ensemble to 4 networks improves again to 0.830. By comparison, an F1 increase of 0.002 for reversing every other network is more modest. It is nonetheless a noticeable improvement and does not add to the training time, so it is worth including as a part of the ensembling strategy.
Figure 6: An illustration of the core RNN architecture, as well as Concat Pooling and BPV enhancements.
6 FUTURE WORK
Recalling the model refinements in Section 5.5, we note that the training process achieved remarkably similar results across a wide variety of mutations to the model architecture. This was true of both model changes as well as modifications to the learning rate schedule, and is an encouraging result since it implies saturation of the current model architecture and training methodology. It also implies that further improvement is more likely to come from some enhancement to either the input representation, or from additional improvements to the extraction of signal from the recurrent layer as in Sections 4.2 and 4.3.

The input representation at this point is quite simple, and has not yet been well explored. The cutoff of 10 occurrences for character inclusion has not been challenged, and it is possible that more tokens could increase the model power. If that conjecture is true, then it might also be worthwhile to try a hybrid word-level tokenization in which rare words are split into characters. Another alternative is to try a hierarchical model as proposed in [7], who were working on a short-phrase language identification task which bears some similarity to this one. The idea of avoiding sparsity problems by computing the word embeddings as a convolution over character embeddings seems like it might apply well to this problem.

It is also possible that there are additional pooling or aggregation techniques which could bolster the connection between recurrent and output layers. An attention mechanism is one option, which could use convolutions as in [1] to weight the most interesting time steps. Multiple such attention vectors could also be generated and viewed together, and network capacity could be tuned by increasing both the complexity and number of such signal extractors.

ACKNOWLEDGMENTS
The author would like to thank his colleagues at Duetto for supporting and embracing this research, even though it was done off the clock. Thank you as well to the fast.ai community for developing and supporting an amazing MOOC and deep learning library. Finally, thank you to the challenge organizers for fostering a friendly and collaborative environment around this new dataset, providing valuable feedback, and for answering the author's numerous questions throughout the process.

REFERENCES
[1] Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. 2017. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. CoRR abs/1703.04122 (2017).
[2] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-Recurrent Neural Networks. CoRR abs/1611.01576 (2016).
[3] Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017. Efficient softmax approximation for GPUs. In ICML.
[4] Moritz Hardt, Benjamin Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of stochastic gradient descent. In ICML.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
[6] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. arXiv:1801.06146
[7] Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith. 2016. Hierarchical Character-Word Models for Language Identification. In SocialNLP@EMNLP.
[8] Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic Gradient Descent with Restarts. CoRR abs/1608.03983 (2016).
[9] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. arXiv:1708.02182
[10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[11] Michael P. Perrone. 1993. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks.
[12] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
[13] Leslie N. Smith. 2018. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv:1803.09820
[14] Leslie N. Smith and Nicholay Topin. 2017. Exploring loss function topology with cyclical learning rates. arXiv:1702.04283
[15] Leslie N. Smith and Nicholay Topin. 2017. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv:1708.07120
[16] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. 2017. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In NIPS.