Predicting Sales from the Language of Product Descriptions

Reid Pryzant (Stanford University, rpryzant@stanford.edu)
Young-joo Chung (Rakuten Institute of Technology, yjchung@acm.org)
Dan Jurafsky (Stanford University, jurafsky@stanford.edu)

ABSTRACT
What can a business say to attract customers? E-commerce vendors frequently sell the same items but use different marketing strategies to present their goods. Understanding consumer responses to this heterogeneous landscape of information is important both as business intelligence and, more broadly, a window into consumer attitudes. When studying consumer behavior, the existing literature is primarily concerned with product reviews. In this paper we posit that textual product descriptions are also important determinants of consumer choice. We mine 90,000+ product descriptions on the Japanese e-commerce marketplace Rakuten and identify actionable writing styles and word usages that are highly predictive of consumer purchasing behavior. In the process, we observe the inadequacies of traditional feature extraction algorithms, namely their inability to control for the implicit effects of confounds like brand loyalty and pricing strategies. To circumvent this problem, we propose a novel neural network architecture that leverages an adversarial objective to control for confounding factors, and attentional scores over its input to automatically elicit textual features as a domain-specific lexicon. We show that these textual features can predict the sales of each product, and investigate the narratives highlighted by these words. Our results suggest that appeals to authority, polite language, and mentions of informative and seasonal language win over the most customers.

CCS CONCEPTS
• Information systems → Content analysis and feature selection; • Computing methodologies → Information extraction; Neural networks;

KEYWORDS
e-commerce, feature selection, neural networks, adversarial learning, natural language processing

ACM Reference format:
Reid Pryzant, Young-joo Chung, and Dan Jurafsky. 2017. Predicting Sales from the Language of Product Descriptions. In Proceedings of SIGIR, Tokyo, Japan, August 2017 (SIGIR 2017 eCom), 10 pages.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org
1 INTRODUCTION
The internet has dramatically altered consumer shopping habits. Whereas customers of physical stores can physically manipulate, test, and evaluate products before making purchasing decisions, the remote nature of e-commerce renders such tactile evaluations obsolete.

In lieu of in-store evaluation, online shoppers increasingly rely on alternative sources of information. This includes "word-of-mouth" recommendations from outside sources [9] and local product reviews [13, 18, 20]. These factors, though well studied, are only indirectly controllable from a business perspective [25, 52]. Business owners have considerably stronger control over their own product descriptions. The same products may be sold by multiple vendors, with each item having a different textual description (note that we take product to mean a purchasable object, and item to mean an individual e-commerce listing). Studying consumers' reactions to these descriptions is valuable both as business intelligence and as a new window into consumer attitudes.

The hypothesis that business-generated product descriptions affect consumer behavior (manifested in sales) has received strong support in prior empirical studies [22, 26, 34, 37, 39]. However, these studies have only used summary statistics of these descriptions (i.e. readability, length, completeness). We propose that embedded in these product descriptions are narratives that affect shoppers, which can be studied by examining the words in each description.

Our hypothesis is that product descriptions are fundamentally a kind of social discourse, one whose linguistic contents have real control over consumer purchasing behavior. Business owners employ narratives to portray their products, and consumers react according to their beliefs and attitudes.

To test this hypothesis, we mine 93,591 product descriptions and sales records from the Japanese e-commerce website rakuten.co.jp ("Rakuten"). First, we build models that can explain how the textual content of product descriptions impacts sales. Second, we use these models to conduct an explanatory analysis, identifying which linguistic aspects of product descriptions are the most important determinants of success.

We seek to unearth actionable phrases that can help e-commerce vendors increase their sales regardless of what's being sold. Thus, we want to study the effect of language on sales in isolation, i.e. find textual features that are untangled from the effects of pricing strategies [15], brand loyalty [17, 48], and product identity. Choosing features for such a task is a challenging problem, because product descriptions are embedded in a larger e-commerce experience that leverages the shared power of these confounds to market a product. For a not-so-subtle example, product descriptions frequently boast "free shipping!", overtly pointing to a pricing strategy with known power over consumer choice [19].

We develop a new text feature selection algorithm to operate in this confound-controlled setting. This algorithm makes use of a novel neural network architecture. The network uses attentional scores over its input and an adversarial objective to select a lexicon that is simultaneously predictive of consumer behavior and controlled for confounds such as brand and price.

We evaluate our feature selection algorithm on two pools of feature candidates: morphemes obtained with the JUMAN tokenizer (Footnote 1), and sub-word units obtained via byte-pair encoding ("BPE") [47]. From these pools we select features with either (1) our proposed neural network, (2) odds ratios [10], (3) mutual information [41], or (4) the features with nonzero coefficients of an L1-regularized linear regression. Our results suggest that lexicons produced by the neural model are both less correlated with confounding factors and the most powerful predictors of sales.

Footnote 1: JUMAN (a User-Extensible Morphological Analyzer for Japanese), http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
In summary, our contributions are as follows:
• We demonstrate that the narratives embedded in e-commerce product descriptions influence sales.
• We propose a novel neural architecture to mine features for the task.
• We discover actionable writing styles and words that have especially high influence on these outcomes.

2 PREDICTING SALES FROM DESCRIPTIONS
Our task is to predict consumer demand (measured in log(sales)) from the narratives embedded in product descriptions. To do so, we mine features from these textual data and fit a statistical model. In this section, we review our feature-mining baselines, present our novel approach to feature mining, and outline our statistical technique for predicting sales from these features while accounting for confounding factors like brand loyalty and product identity.

2.1 Feature Mining Preliminaries
We approach the featurization problem by first segmenting product descriptions into sequences of tokens, then selecting tokens from the vocabulary of tokens that are predictive of high sales. We take subsets of these vocabularies (rather than one feature per vocabulary item) because (1) we need to be able to examine the linguistic contents of the resulting feature sets, and (2) we need models that are highly generalizable, and not too closely adapted to the peculiarities of these data's vocabulary distributions.

We select predictive subsets of the data's tokenized vocabularies in four ways. Three of these (Section 2.2) are traditional feature selection methods that serve as strong baselines for our proposed method (Section 2.3).

2.2 Traditional Feature Mining
Odds Ratio (OR) finds words that are over-represented in a particular corpus when compared to another (e.g. descriptions of high-selling items versus those of low-selling counterparts). Formally, this is

    \frac{p_i / (1 - p_i)}{p_j / (1 - p_j)}    (1)

where p_i is the probability of the word in corpus i (e.g. high-selling descriptions) and p_j is the probability of the word in corpus j (e.g. low-selling descriptions). Note that this method requires dichotomized targets, which we discuss further in Section 3.1.

Mutual information (MI) is a measurement of how informative the presence of a token is for making correct classification decisions. Formally, the mutual information MI(t, c) of a token t and binary class c is

    MI(t, c) = \sum_{I_t \in \{1,0\}} \sum_{I_c \in \{1,0\}} P(I_t, I_c) \log \frac{P(I_t, I_c)}{P(I_t)\, P(I_c)}    (2)

where I_t and I_c are indicators of term presence and class label for a given description. Like OR, this method requires dichotomized sales targets.

Lasso regularization (L1) can perform variable selection on a linear regression model [51] by adding a regularization term to the least squares objective. This term penalizes the L1 norm of the model parameters:

    \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_j \beta_j x_{ij} \Big)^2    (3)
    \text{subject to } \sum_j |\beta_j| \le \alpha    (4)

where y_i is the ith target, β_0 is an intercept, β_j is the jth coefficient, and x_ij is the jth predictor of the ith example. α is a pre-specified parameter that determines the amount of regularization; it can be obtained by minimizing the error in cross-validation.
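To make the two count-based baselines concrete, the sketch below scores every token of a dichotomized corpus with both statistics from equations (1) and (2). It is a minimal illustration under assumed inputs, not the authors' code: `high_docs` and `low_docs` are hypothetical stand-ins for the tokenized top- and bottom-selling description sets of Section 3.1, and the additive smoothing is an assumption added to keep the odds ratio finite.

```python
import math
from collections import Counter

def or_mi_scores(high_docs, low_docs):
    """Score each token by odds ratio (eq. 1) and mutual information (eq. 2).

    high_docs, low_docs: lists of tokenized descriptions (lists of tokens)
    for the high- and low-selling classes. Returns {token: (OR, MI)}.
    """
    n_hi, n_lo = len(high_docs), len(low_docs)
    n = n_hi + n_lo
    # Document frequency of each token in each class (presence, not raw counts).
    df_hi = Counter(t for d in high_docs for t in set(d))
    df_lo = Counter(t for d in low_docs for t in set(d))
    scores = {}
    for tok in set(df_hi) | set(df_lo):
        # Smoothed presence probabilities per class (smoothing is an added assumption).
        p_hi = (df_hi[tok] + 0.5) / (n_hi + 1.0)
        p_lo = (df_lo[tok] + 0.5) / (n_lo + 1.0)
        odds_ratio = (p_hi / (1 - p_hi)) / (p_lo / (1 - p_lo))
        # Mutual information between term presence I_t and class label I_c.
        mi = 0.0
        p_t1 = (df_hi[tok] + df_lo[tok]) / n
        for i_t in (1, 0):
            p_t = p_t1 if i_t == 1 else 1 - p_t1
            for df_c, n_c in ((df_lo, n_lo), (df_hi, n_hi)):
                joint = (df_c[tok] if i_t == 1 else n_c - df_c[tok]) / n
                p_c = n_c / n
                if joint > 0:
                    mi += joint * math.log(joint / (p_t * p_c))
        scores[tok] = (odds_ratio, mi)
    return scores
```

A feature set is then obtained by ranking tokens on either score and keeping the top k, mirroring how the OR and MI baselines are used in Section 3.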
2.3 Deep Adversarial Feature Mining
An important limitation of all the aforementioned feature selection methods is that they are incapable of selecting features that are decorrelated from confounds like brand and price. Recall from Section 1 the price-related example of "free shipping!". Consider the brand-related example of "the quality you know and love from Daison". Though effective marketing tools, these phrases leverage the power of pricing strategies and brand loyalty, factors with known power over consumers. We wish to study the impact of linguistic structures in product descriptions in isolation, beyond those indicators of price or branding. Thus, we consider brand, product, and price information as confounding factors that confuse the effect of language on consumers.

As a solution to this problem, we propose a novel feature-selecting neural network (RNN+/-GF), sketched in Figure 1. The model uses an attention mechanism to produce estimates for log(sales), brand, and price. We omit product because it is only present in our test data; see Section 3.1 for details. During training, the model uses an adversarial objective to discourage feature effectiveness with respect to two of these prediction targets: brand and price. That is, the model finds features that are good at predicting sales, and bad at predicting brand and price.

Figure 1: An illustration of the proposed RNN+GF model operating on an example product description with three timesteps. All operations and dimensionalities are explicitly shown. Vectors are depicted as rounded rectangles, matrix multiplications as squared rectangles, and scalars as circles. Trainable parameters are grey, while dynamically computed values are colored. Gradient reversal layers multiply gradients by -1 as they backpropagate from the prediction networks to the encoder. In this example, the model attends to the description's final token the most, so that token would be the most likely candidate for a generated lexicon.

Deep learning review. Before we describe the model, we review its primary building blocks.

Feedforward Neural Networks (FFNNs) are composed of a series of fully connected layers, where each layer takes on the form

    y = f(Wx + b).    (5)

Note that x ∈ R^n is a vector of inputs (e.g. from a previous layer), W ∈ R^{y×n} is a matrix of parameters, b ∈ R^y is a vector of biases, y ∈ R^y is an output vector, and f(·) is some nonlinear activation function, e.g. the ReLU: ReLU(x) = max{0, x}.

Recurrent Neural Networks (RNNs) are effective tools for learning structure from sequential data [14]. RNNs take a vector x_t at each timestep. They compute a hidden state vector h_t ∈ R^h at each timestep by applying nonlinear maps to the previous hidden state h_{t-1} and the current input x_t (note that h_0 is initialized to the zero vector):

    h_t = \sigma\big( W^{(hx)} x_t + W^{(hh)} h_{t-1} \big).    (6)

W^{(hx)} ∈ R^{h×n} and W^{(hh)} ∈ R^{h×h} are parameterized matrices. We use Long Short-Term Memory (LSTM) cells, a variant of the traditional RNN cell that can more effectively model long-term temporal dependencies [23].

Attention mechanisms allow neural models to focus on parts of the encoded input before producing predictions. We calculate Bahdanau-style attentional contexts [3] because these have been shown to perform well for other tasks like translation and language modeling [11, 31], and preliminary experiments suggested that this mechanism worked best for our problem setting.

Bahdanau-style attention computes the attentional context as a weighted average of hidden states. The weights are computed as follows: pass each hidden state h_i through a fully-connected neural network, then compute a dot product with a vector of parameters to produce an intermediary scalar â_i (eq. 7). Next, the â_i's are scaled by a softmax function so that they map to a distribution over hidden states (eq. 8). Finally, this distribution is used to compute a weighted average of hidden states c (eq. 9). Formally, this can be written as:

    \hat{a}_i = v_a^\top \tanh(W_a h_i)    (7)
    a = \mathrm{softmax}(\hat{a})    (8)
    c = \sum_j a_j h_j    (9)
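The attention step of equations (7)-(9) is small enough to write out directly. The numpy sketch below is an illustration rather than the paper's TensorFlow implementation: `hidden_states` stands in for the LSTM outputs of one description, and `W_a`, `v_a` are the trainable parameters of eq. (7) (random placeholders here).

```python
import numpy as np

def bahdanau_attention(hidden_states, W_a, v_a):
    """Compute the attentional summary of eqs. (7)-(9).

    hidden_states: (T, h) LSTM hidden states for one description.
    W_a: (d, h) projection matrix; v_a: (d,) parameter vector.
    Returns (c, a): the (h,) context vector and the (T,) attention weights.
    """
    # eq. (7): one unnormalized score per timestep.
    a_hat = np.tanh(hidden_states @ W_a.T) @ v_a       # shape (T,)
    # eq. (8): softmax over timesteps.
    e = np.exp(a_hat - a_hat.max())
    a = e / e.sum()
    # eq. (9): attention-weighted average of hidden states.
    c = a @ hidden_states                               # shape (h,)
    return c, a

# Toy usage with random placeholder parameters (T=5 timesteps, h=64, d=32).
rng = np.random.default_rng(0)
c, a = bahdanau_attention(rng.normal(size=(5, 64)),
                          rng.normal(size=(32, 64)),
                          rng.normal(size=(32,)))
```

The attention weights `a` are exactly the per-token scores that the lexicon induction stage (Section 2.3, below) later standardizes and merges.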
Our model. We continue by describing our adversarial feature mining model. The process of obtaining features from the model can be thought of as a three-stage algorithm: (1) a forward pass, where predictions are generated, (2) a backward pass, where parameters are updated, and, after repeated iterations of 1 and 2, (3) feature selection, where we use attentional scores to elicit lexicons.

The forward pass operates as follows:
(1) The segmented input is fed into an LSTM to produce hidden state encodings for each timestep.
(2) We compute an attentional summary of these hidden states to obtain a single vector encoding of the input.
(3) We feed this encoding into three FFNNs. One is a regression network that tries to minimize L = ||ŷ − y||^2, the squared loss between the predicted and true log(sales). The second and third are classification networks, which predict a likelihood distribution over all possible labels, and are trained to minimize L = −log p(y), the negative log probability of the correct class label. We attach classification networks for brand id and a dichotomization of price (see Section 3.1 for details). We dichotomized sales in this way to create a fair comparison between this method and the baselines: other feature selection algorithms (OR, MI) are not so flexible and require dichotomized targets.

The backward pass draws on prior work in leveraging adversarial objective functions to match feature distributions in different settings [40]. In particular, we draw from a line of research in the style of [16], [8], and [27]. This method involves passing gradients through a gradient reversal layer, which multiplies gradients by a negative constant, i.e. -1, as they propagate back through the network. Intuitively, this encourages parameters to update away from the optimization objective.

If L_sales, L_brand, and L_price are the regression and classification losses from each prediction network, then the final loss we are optimizing is L = L_sales + L_brand + L_price. However, when backpropagating from each prediction network to the encoder, we reverse the gradients of the networks that are predicting confounds. This means that the prediction networks still learn to predict brand and price, but the encoder is forced to learn brand- and price-invariant representations which are not useful to these downstream tasks. We hope that such representations encourage the model to attend to confound-decorrelated tokens.
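The paper's implementation is in TensorFlow (Section 3.2); as an illustration of the gradient reversal idea only, here is a minimal PyTorch-style sketch. The layer is the identity on the forward pass and negates (optionally scales) gradients on the backward pass, so the confound heads train normally while the encoder is pushed toward brand- and price-invariant representations. The head modules and targets are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda going backward."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back to the encoder.
        return -ctx.lam * grad_output, None

def adversarial_loss(encoding, sales_head, brand_head, price_head,
                     y_sales, y_brand, y_price):
    """L = L_sales + L_brand + L_price, with reversed gradients on the confound heads."""
    l_sales = F.mse_loss(sales_head(encoding).squeeze(-1), y_sales)
    reversed_enc = GradReverse.apply(encoding)
    l_brand = F.cross_entropy(brand_head(reversed_enc), y_brand)
    l_price = F.cross_entropy(price_head(reversed_enc), y_price)
    return l_sales + l_brand + l_price
```

The same effect can be obtained in TensorFlow by overriding the gradient of an identity op; the mechanism, not the framework, is what matters for the ablation in Section 3.3.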
The lexicon induction stage uses the trained model defined above to select textual features that are predictive of sales, but controlled for the influence of brand and price. This stage operates as follows:
(1) Generate predictions for each test item, but rather than saving those predictions, save the attentional distribution over each source sequence.
(2) Standardize these distributions. For each input i, standardize the distribution over timesteps p^{(i)} by computing

    z^{(i)} = \frac{p^{(i)} - \mu_{p^{(i)}}}{\sigma_{p^{(i)}}}    (10)

(3) Merge these standardized distributions over each input sequence. If there is a word collision (i.e. we observe the same token in multiple input sequences and the model assigned each observation a different z-score), take the max of those words' z-scores.
(4) Select the k tokens with the highest z-scores. This is our induced lexicon.

2.4 Using Features to Predict Sales
Once we have mined textual features from product descriptions, we need a statistical model that accounts for the effects of confounding variables like product identity and brand loyalty in predicting the sales of each item. We use a mixed-effects model, a type of hierarchical regression that assumes observations can be explained with two types of categorical variables: fixed effect variables and random effect variables [7].

We model textual features as fixed effects. We take the product that each item corresponds to and the brand selling each item as random effects. Thus, we force the model to assume that product and brand information is decorrelated from everything else, and we expect to observe the explanatory power of text features without the influence of brand or product. Note that the continuous nature of the "price" confound precludes our ability to model it (Section 3.1).

We proceed with a formal description of our mixed-effects model. Let y_ijk be the log(sales) of item i, which is product j and sold by brand k. The description for this item is written as x_ijk, and each x^{(h)}_ijk ∈ x_ijk is the hth feature of this description. With these definitions, we can write our mixed-effects model as

    y_{ijk} = \beta_0 + \sum_h \beta_h x^{(h)}_{ijk} + \gamma_j + \alpha_k + \epsilon_{ijk}    (11)
    \gamma_j \sim \mathcal{N}(0, \sigma_\gamma^2)    (12)
    \alpha_k \sim \mathcal{N}(0, \sigma_\alpha^2)    (13)
    \epsilon_{ijk} \sim \mathcal{N}(0, \sigma_\epsilon^2)    (14)

where γ_j and α_k are the random effects of product and brand, respectively, and ε_ijk is an item-specific effect, i.e. this item's deviation from the mean item sales.

Nakagawa and Schielzeth [44] introduced the marginal and conditional R^2 (R^2_m and R^2_c) as summary statistics of mixed-effects models. Marginal R^2_m is the R^2 of the textual effects only. It reports the proportion of variance in the model's predictions that can be explained with the fixed effects variables x^{(h)}_ijk. It is written as

    R^2_m = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_\gamma^2 + \sigma_\alpha^2 + \sigma_\epsilon^2},    (15)
    \sigma_f^2 = \mathrm{var}\Big( \sum_h \beta_h x^{(h)}_{ijk} \Big).    (16)

Conditional R^2_c is the R^2 of the entire model (text + product + brand). It conditions on the variances of the random factors we are controlling for (product and brand):

    R^2_c = \frac{\sigma_f^2 + \sigma_\gamma^2 + \sigma_\alpha^2}{\sigma_f^2 + \sigma_\gamma^2 + \sigma_\alpha^2 + \sigma_\epsilon^2}.    (17)

3 EXPERIMENTS
We now detail a series of experiments that were conducted to evaluate the effectiveness of each feature set and, more generally, to test the hypothesis that narratives embedded in product descriptions are indeed predictive of sales.

3.1 Product and Sales Data
We obtained data on e-commerce product descriptions, sales, vendors, and prices from a December 2012 snapshot of the Rakuten marketplace (Footnote 2). We focused on items belonging to two product categories: chocolate and health. These two categories are both popular on the marketplace, but their characteristics are different. There is more variability among chocolate products than health products; many vendors are boutiques that sell handmade goods. Health vendors, on the other hand, are often large providers of pharmaceutical goods, sometimes wholesale.

We segment product descriptions in two ways. First, we tokenize descriptions into morphological units (morphemes) with the JUMAN tokenizer (Footnote 3). Second, we break descriptions into frequently occurring sub-word units (Footnote 4). From here on we refer to the morpheme features as "morph" and the sub-word features as "BPE".

Footnote 2: Please refer to https://rit.rakuten.co.jp/opendata.html for details on data acquisition.
Footnote 3: Using JUMAN (a User-Extensible Morphological Analyzer for Japanese), http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
Footnote 4: Using https://github.com/google/sentencepiece
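Footnote 4 points to Google's sentencepiece toolkit. A sketch of how such a BPE segmentation could be produced is below; the file path, the vocabulary size of 16,000 (taken from Table 1), and the training options are illustrative assumptions, not the authors' exact configuration.

```python
import sentencepiece as spm

# Train a BPE model on the raw description text (one description per line).
spm.SentencePieceTrainer.train(
    input="descriptions.txt",      # hypothetical path to the raw descriptions
    model_prefix="rakuten_bpe",
    vocab_size=16000,              # matches the BPE vocabulary size in Table 1
    model_type="bpe",
)

# Segment a description into sub-word units.
sp = spm.SentencePieceProcessor(model_file="rakuten_bpe.model")
tokens = sp.encode("チョコレートバーにぎっしり詰め込みました。", out_type=str)
```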
Details of these data can be found in Table 1. Notably, the ratio of the size of the vocabulary (unique keywords) to the size of tokens (occurrences of keywords) in the chocolate category is twice as large as that of the health category, as listed in (%) in Table 1. This implies that product descriptions in the chocolate category are written with more diverse language.

Table 1: Characteristics of the Rakuten data. These data consist of 93,591 product descriptions, vendors, prices, and sales figures.

                      Chocolate         Health
# items               32,104            61,487
# vendors             1,373             1,533
# morph tokens        5,237,277         11,544,145
# BPE tokens          6,581,490         16,706,646
# morph vocab (%)     18,807 (0.36%)    20,669 (0.18%)
# BPE vocab (%)       16,000 (0.24%)    16,000 (0.10%)

Recall that some feature selection algorithms (OR, MI) require dichotomized prediction targets. Thus, we dichotomized the data on log(sales), taking the top-selling 30% and bottom-selling 30% as positive and negative examples, respectively. Our textual features were selected using these dichotomized data.

In order to evaluate mixed-effects regression models on these data, we consider the vendor selling an item as its "brand identifier" (vendors have unique branding on the Rakuten platform). We also need to know what product each item corresponds to, something not present in the data. Thus, we hand-labeled 2,131 items with product identifiers and separated these into a separate dataset for testing (Table 2). Our experimental results are reported on this test data set.

Table 2: Characteristics of the test data. Product identifiers were manually assigned to these data for evaluation.

                            Chocolate    Health
# items                     924          1207
# products                  186          50
# vendors                   201          384
avg. # items per product    4            9
(min, max)                  (2, 26)      (2, 134)

3.2 Experimental Protocol
All deep learning models were implemented using the Tensorflow framework [1]. In order to obtain features from the proposed RNN+GF model, we conducted a brief hyperparameter search on a held-out development set. This set consisted of 2,000 examples randomly drawn from the pool of training data. The final model used 32-dimensional word vectors, an LSTM with 64-dimensional hidden states, and 32-dimensional intermediate Bahdanau vectors, as described in Figure 1. Dropout at a rate of 0.2 was applied to the input of each LSTM cell. We optimized using Adam, a batch size of 128, and a learning rate of 0.0001 [30]. All models took approximately three hours to reach convergence on an Nvidia TITAN X GPU.

The L1 regularization parameter α was obtained with the scikit-learn library [45] by minimizing the error in four-fold cross validation on the training set.

In all of our experiments, we analyzed the log(sales) of an item as a function of textual description features. We used mixed-effects regression to model the relationship between these two entities. We included linguistic features obtained by the methods of Sections 2.2 and 2.3 as fixed effect variables, and the confounding product/vendor identifiers in the test set as random effect variables. We used the "lme4" package in the R software environment v. 3.3.3 to perform these analyses [6]. To evaluate feature effectiveness and goodness of fit, we obtained conditional and marginal R^2 values with the "MuMIn" R package [5]. We also performed t-tests to obtain significance measurements on the model's fitted parameters. For this we obtained degrees of freedom with Satterthwaite approximations [46] using the "lmerTest" R package [32].

In addition to keywords, we experimented with two additional types of features: description length in number of keywords, and part-of-speech tags obtained with JUMAN.
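The MuMIn statistics above follow equations (15)-(17) directly, so they can also be reproduced from the fitted variance components alone. A minimal sketch (the function and its example values are illustrative, not part of the original R analysis):

```python
def marginal_conditional_r2(var_fixed, var_product, var_brand, var_residual):
    """Marginal and conditional R^2 of eqs. (15)-(17).

    var_fixed:    variance of the fixed-effect (textual) predictions, sigma_f^2
    var_product:  random-effect variance of product, sigma_gamma^2
    var_brand:    random-effect variance of brand, sigma_alpha^2
    var_residual: residual item-level variance, sigma_epsilon^2
    """
    total = var_fixed + var_product + var_brand + var_residual
    r2_marginal = var_fixed / total
    r2_conditional = (var_fixed + var_product + var_brand) / total
    return r2_marginal, r2_conditional

# Hypothetical variance components, e.g. extracted from a fitted lme4 model:
# marginal_conditional_r2(0.9, 0.4, 0.3, 0.5)
```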
3.3 Experimental Results
Influence of narratives. Figure 2 depicts the performance of mixed-effects regression models fitted with the top 500 features from each approach. Overall, these results strongly support the hypothesis that narrative elements of product descriptions are predictive of consumer behavior. Adding text features to the model increased its explanatory power in all settings. The marginal R^2_m's of each approach are listed in Table 3. The RNN+GF method selected features superior in both marginal and conditional R^2. This implies that it can select features that perform well in both isolated and confound-combined settings.

Figure 2: Conditional R^2 of random-effects-only models (brand + product) and full models (brand + product + keywords + POS + BPE tokens) from Table 3. Including textual features in mixed-effects regressions improves predictive power regardless of dataset and feature selection method; RNN+GF features provide the largest gains. Morpheme tokens yielded similar results.

Table 3: The explanatory power of random effect confounds (brand, product), text (BPE features, description length, and POS tags), and the combination of confounds and text. Marginal and conditional R^2 are depicted where appropriate. The RNN+GF-selected features appear superior with and without confounds (R^2_c and R^2_m). Morpheme features yielded similar results.

Chocolate
Model features      R^2 type      L1     MI     OR     RNN+GF
confounds only      conditional   0.57   0.57   0.57   0.57
text only           marginal      0.58   0.53   0.49   0.60
confounds + text    conditional   0.78   0.73   0.71   0.81

Health
Model features      R^2 type      L1     MI     OR     RNN+GF
confounds only      conditional   0.44   0.44   0.44   0.44
text only           marginal      0.40   0.40   0.36   0.44
confounds + text    conditional   0.65   0.71   0.69   0.78

To investigate whether the high performance of RNN+GF features is simply a phenomenon of model capacity, we compared RNN+GF and one of the best-performing baselines, the lasso. We varied the number of features each algorithm is allowed to select and compared the resulting conditional R^2 values, finding that RNN+GF features are consistently on par with or outperform those of the lasso, regardless of feature count, as shown in Figure 3.

Figure 3: Conditional R^2 (R^2_c) of the model trained with varying numbers of morpheme/BPE features. Despite being decorrelated from the random effects of brand and price, RNN+GF features are competitive with those of the lasso regardless of token type and feature set size.

Effect of gradient reversal. To determine the role of gradient reversal in the efficacy of the RNN+GF features, we conducted an ablation test, toggling the gradient reversal layer of our model and observing the performance of the elicited features. From Table 4, it is apparent that the confound-invariant representations encouraged by gradient reversal lead to more effective features being selected. Apart from summary statistics, this observation can be seen in the features themselves. For example, one of the highest scoring morphemes without gradient reversal was 無料 ("free"). The RNN+GF features, on the other hand, are devoid of words relating to brand/vendor/price.

Table 4: Gradient reversal ablation and its impact on conditional R^2. The confound-invariance encouraged by the adversarial objective helps downstream regressions.

        Chocolate           Health
        BPE     morph       BPE     morph
+GF     0.81    0.81        0.78    0.75
-GF     0.76    0.75        0.64    0.69

Comparison of different feature mining strategies. To investigate whether the proposed method successfully discovered features that are simultaneously explanatory of sales and untangled from the confounding effects of product, brand, and price, we computed the correlations between BPE tokens selected by different methods and these non-linguistic confounds. For each feature set, the average per-feature Cramér's V was computed for product and brand, while the average per-feature point-biserial correlation coefficient was computed for price. Our results indicate that the RNN+GF features are less correlated with these confounds than any other method (Table 5).
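Both association measures in this comparison are standard statistics. A hedged sketch of how the per-feature associations could be computed with scipy follows; the data frame and column names are hypothetical, not the authors' analysis code.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pointbiserialr

def cramers_v(x, y):
    """Cramer's V between two categorical series (e.g. token presence vs. brand id)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table.values)[0]
    n = table.values.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# df: one row per item; 'has_token' is a 0/1 indicator for one selected feature.
# cramers_v(df['has_token'], df['brand_id'])           -> association with brand
# cramers_v(df['has_token'], df['product_id'])         -> association with product
# pointbiserialr(df['has_token'], df['log_price'])[0]  -> association with continuous price
```

Averaging these values over all features in a lexicon yields the per-method numbers reported in Table 5.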
Table 5: Average association strengths between each BPE token set and non-linguistic factors. The RNN+GF features are the least correlated with these confounding factors. Morpheme tokens yielded similar results.

           L1      MI      OR      RNN+GF
product    0.55    0.57    0.55    0.38
brand      0.58    0.54    0.57    0.42
price      0.08    0.08    0.08    0.07

Examining the keywords selected by different methods suggests the same story as Table 5. Morpheme features with high importance values are listed in Table 6. Note that the RNN+GF approach was the only method that did not select any keywords correlated with product, brand, or price. Additionally, every method except RNN+GF selected pecan (ピーカン・ペカン). Lalala's pecan chocolate is one of the most popular products on the marketplace. Although it is understandable that these tokens contribute to sales, they are product-specific and thus not generalizable. On the other hand, RNN+GF gave high scores to location-related words. Similar tendencies were observed in the health category. BPE tokens, though not listed, followed similar patterns.

3.4 Analysis
Influential words. To investigate the influence of keywords on sales, we performed t-tests on the coefficients of mixed-effects models trained with RNN+GF-selected features (both morphemes and BPE). We found that influential descriptions generally contained words in the following four categories:

• Informativeness. This includes informative appeals to logos with language other than raw product attributes (i.e. brand name, product name, ingredients, price, and shipping). Words like "family size" (ファミリーサイズ), "package design" (パッケージデザイン), "souvenir" (お土産), delimiters of structured information ("】【", "★", "●"), and indicators of detail ("x2", "70%", etc.) belong to this category.
• Authority. This includes appeals to authority, in the form of authoritative figures or long-standing tradition. Words such as "staff" (スタッフ), "old-standing shop" (老舗), and "doctor" (お医者様) belong to this category.
• Seasonality. These words suggest seasonal dependencies. Words such as "Christmas" (クリスマス), "Mother's day" (母の日), and "year-end gift" (歳暮) belong to this category. Note that words related to out-of-season events had low influence on sales.
• Politeness. These expressions show politeness, respectfulness, and humbleness. Honorific Japanese (special words and conjugations reserved for polite contexts) such as "ing" (しており), "will do" (致します), and "receive" (いただく) belong to this category.
The following are two differing descriptions of the exact same product. Words with high coefficients are shown in bold.

  Royce's chocolate has become a standard Hokkaido souvenir. They are packaged one by one so your hands won't get dirty! Also, our staff recommends this product!
  北海道のお土産で定番品となっているロイズ. 手が汚れないように1本ずつパッケージされているのもありがたい! 当店スタッフもおすすめするロイズの自信作です!

  Four types of nuts: almonds, cashews, pecans, macadamia, as well as cookie crunch and almond puff were packed carefully into each chocolate bar. This item is shipped with a refrigerated courier service during the summer.
  アーモンド、カシュー、ペカン、マカダミアの4種類のナッツとクッキークランチやアーモンドパフを一本のチョコレートバーにぎっしり詰め込みました。こちらは夏期クール便発送商品です。

The item with the former description was preferred by customers. It contains words suggestive of authority ("standard", "staff"), informativeness ("package", "souvenir"), and concern for the customer, while the latter description is primarily concerned with ingredients.

Influential part-of-speech tags. We found a large number of adjectives and adverbs in our influential word lists. This agrees with the influential word categories mentioned previously, because adjectives and adverbs can be indicative of informativeness. We found that adjectives were more frequently influential in the chocolate category while adverbs were more common in the health category. Adjectives describing additional information such as "loved" (大好きだ), "healthy" (健康だ), and "perfect for" (ぴったりだ) had high coefficients in the chocolate category. Adverbs describing symptoms or effects such as "irritated" (イライラ) and "vigorously" (ガンガン) appeared in the health category.

4 RELATED WORK
In using large-scale text mining to characterize the behavior of e-commerce consumers, we draw on a large body of prior work in the space. Our inspiration comes from research on (i) unearthing the drivers of purchasing behavior in e-commerce, (ii) modeling the relationship between product presentations and business outcomes, and (iii) text mining and feature discovery in a confound-controlled setting.

There is an extensive body of literature on the progenitors of e-commerce purchasing behavior. Classic work in psychology has shown that human judgment and behavior are influenced by persuasive rhetoric [12, 49]. When our notions of human behavior are narrowed to purchasing decisions on the internet, despite the extreme diversity of online shoppers [38], prior work suggests that vendor-disseminated information exhibits a strong persuasive influence. In fact, vendor-disseminated information affects purchase likelihood just as much as user-generated information like word-of-mouth reviews [9]. The work of [22] incorporated vendor-disseminated product information into a model of customer satisfaction, a precursor of purchasing behavior [4]. Similar work has shown that product presentation (which entails textual descriptions) has a significant impact on perceived convenience [26] and credibility [36].
Table 6: The highest-scoring morpheme tokens according to each feature selection algorithm. Tokens relating to confounds like brand, vendor or price are denoted with an asterisk. RNN+GF is the only method that avoided such tokens.

Chocolate
Lasso | Mutual Information | Odds-ratio | RNN+GF
*小川 (vendor address) | 高温 (hot) | ペカン (pecan) | 神戸 (kobe)
*商店 (vendor name) | 株式 (Co. Ltd) | 百貨店 (store dept.) | 説明 (description)
送信 (send) | 詳細だ (detailed) | ピーカン (pecan) | フランス (france)
さまざまだ (various) | *ロイズコンフェクト (name) | 新宿 (shinjuku) | オーストラリア (australia)
*有料 (charge) | *ロイズ (brand name) | 名人 (master) | タイ (thailand)
ショ糖 (sucrose) | 温度 (temperature) | 玉露 (gyokuro) | イタリア (italy)
同時に (simultaneous) | 以下 (under) | *ラララ (product name) | 老舗 (long-standing shop)
制限 (limit) | セット (set) | 伴う (come along) | ハワイ (hawaii)
*買い得 (bargain) | 常温 (room temp.) | 会議 (award name) | ミルキー (milky)
ピーカン (pecan) | 保存 (preserve) | 会頭 (award name) | 蒜山 (hiruzen)

Health
Lasso | Mutual Information | Odds-ratio | RNN+GF
倍数 (bulk unit) | 消費 (consumption) | *アウトレット (discount outlet) | ダイエット (weight loss)
ビック (big) | *爽快 (vendor name) | アラゴナイト (aragonite) | 確認 (confirmation)
*淀川 (vendor address) | 見る (see) | ソマチット (somatid) | オレンジ (orange)
*アウトレット (discount outlet) | ブラウザ (browser) | ダントツ (the very best) | 予告 (notice)
*爽快 (vendor name) | 相談 (consult) | *アース (brand name) | 商品 (product)
支店 (branch) | 形状 (shape) | *コリー (product name) | 注文 (order)
地区 (district) | 対応 (support) | 筋骨 (bones) | 入金 (payment)
鹿児島 (kagoshima) | ネット (internet) | ランナー (runner) | サプリ (supplement)
*スカルプ (product name) | 取り寄せる (stock) | *ガレノス (brand name) | 説明 (explanation)
くだもの (fruit) | 合す (mix) | 内外 (inside and outside) | ます (is (formal))

We also draw from prior research concerned with mining e-commerce information and predicting sales outcomes. Most of the work in this space is concerned with product reviews, not descriptions. [18] and [2] mined product reviews for textual features that are predictive of economic outcomes. This research used summary statistics of review text like length, Flesch-Kincaid readability scores [29], or, in the paradigm of [24], cluster membership in a semantic embedding space. Similar to us, [33] used product reviews to generate a domain-specific lexicon. However, this lexicon was used to predict sentiment, and then sales was predicted from sentiment. Some research has incorporated information from textual descriptions, but to the best of these authors' knowledge, the effect of descriptions alone has not been studied. [42] used human subjects to elicit preferences between descriptions and actual products, but did not compare between descriptions. [53] tagged product descriptions with sentiment information and used this alongside review information to predict sales. Similarly, [21] and [54] used description labellings and summary statistics alongside other features to predict purchasing intent. Importantly, none of the prior work in this space seeks to untangle the influence of confounding hidden variables (e.g. brand loyalty, pricing strategies) from mined features.

Another body of research we draw from is that concerned with text mining and lexicon discovery in a confound-controlled setting. Using odds ratios to select features and hierarchical regression to determine their importance is a canonical technique in the computational linguistics literature [19, 28]. In general, alternative feature mining methods for downstream regression or classification tasks are rarely explored. [50] began with a set of hand-compiled corpora, then ran t-tests to prune these corpora of insignificant keywords. [43] developed a neural architecture that picks out keywords from a passage. However, this group did not use an attention mechanism to pick these words, and the model was developed for summarization applications. In the e-commerce literature, alternatives to the odds ratio still rely on uncontrolled co-occurrence statistics [35].
5 CONCLUSION
In this paper, we discovered that seasonal, polite, authoritative and informative product descriptions led to the best business outcomes in Japanese e-commerce.

In making these observations, we presented a statistical method that infers consumer demand from e-commerce product descriptions. We showed for the first time that words in the embedded narratives of product descriptions are important determinants of sales, even when accounting for the influence of factors like brand loyalty and item identity.

In the process, we noted the inadequacies of traditional text feature-selection algorithms, namely their inability to select features that are decorrelated from these factors. To this end we presented a novel neural network feature selection method. The features generated by this model are both high-performance and confound-decorrelated.

There are many directions for future work. These include extending our feature selectors to the broader setting of generalized lexicon induction, and applying our statistical models to e-commerce markets in other consumer cultures.

ACKNOWLEDGMENTS
We are grateful to David Jurgens and Will Hamilton for their advice.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Nikolay Archak, Anindya Ghose, and Panagiotis G Ipeirotis. 2011. Deriving the pricing power of product features by mining consumer reviews. Management Science 57, 8 (2011), 1485–1509.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR) (2015).
[4] Billy Bai, Rob Law, and Ivan Wen. 2008. The impact of website quality on customer satisfaction and purchase intentions: Evidence from Chinese online visitors. International Journal of Hospitality Management 27, 3 (2008), 391–402.
[5] Kamil Bartoń. 2013. MuMIn: Multi-model inference. R package version 1.9.13. The Comprehensive R Archive Network (CRAN), Vienna, Austria (2013).
[6] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (2015), 1–48.
[7] Douglas M Bates. 2010. lme4: Mixed-effects modeling with R. (2010).
[8] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, and others. 2007. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems 19 (2007), 137.
[9] Barbara Bickart and Robert M Schindler. 2001. Internet forums as influential sources of consumer information. Journal of Interactive Marketing 15, 3 (2001), 31–40.
[10] J Martin Bland and Douglas G Altman. 2000. The odds ratio. BMJ 320, 7247 (2000), 1468.
[11] Denny Britz, Anna Goldie, Thang Luong, and Quoc Le. 2017. Massive Exploration of Neural Machine Translation Architectures. arXiv preprint arXiv:1703.03906 (2017).
[12] Shelly Chaiken, Mark P Zanna, James M Olson, and C Peter Herman. 1987. The heuristic model of persuasion. In Social Influence: The Ontario Symposium, Vol. 5. Hillsdale, NJ: Lawrence Erlbaum, 3–39.
[13] Judith A Chevalier and Dina Mayzlin. 2006. The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research 43, 3 (2006), 345–354.
[14] Jeffrey L Elman. 1990. Finding structure in time. Cognitive Science 14, 2 (1990), 179–211.
[15] Richard Friberg, Mattias Ganslandt, and Mikael Sandström. 2001. Pricing strategies in e-commerce: Bricks vs. clicks. Technical Report. IUI working paper.
[16] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, 59 (2016), 1–35.
[17] David Gefen. 2002. Customer loyalty in e-commerce. Journal of the Association for Information Systems 3, 1 (2002), 2.
[18] Anindya Ghose and Panagiotis G Ipeirotis. 2011. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 23, 10 (2011), 1498–1512.
[19] Anindya Ghose and Arun Sundararajan. 2006. Evaluating pricing strategy using e-commerce data: Evidence and estimation challenges. Statist. Sci. (2006), 131–142.
[20] David Godes and Dina Mayzlin. 2004. Using online conversations to study word-of-mouth communication. Marketing Science 23, 4 (2004), 545–560.
[21] Dennis Herhausen, Jochen Binder, Marcus Schoegel, and Andreas Herrmann. 2015. Integrating bricks with clicks: retailer-level and channel-level outcomes of online–offline channel integration. Journal of Retailing 91, 2 (2015), 309–325.
[22] Chin-Fu Ho and Wen-Hsiung Wu. 1999. Antecedents of customer satisfaction on the Internet: an empirical study of online shopping. In Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference on. IEEE, 9 pp.
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[24] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 168–177.
[25] Nan Hu, Jie Zhang, and Paul A Pavlou. 2009. Overcoming the J-shaped distribution of product reviews. Commun. ACM 52, 10 (2009), 144–147.
[26] Ling Jiang, Zhilin Yang, and Minjoon Jun. 2013. Measuring consumer perceptions of online shopping convenience. Journal of Service Management 24, 2 (2013), 191–214.
[27] Fredrik D. Johansson, Uri Shalit, and David Sontag. 2016. Learning Representations for Counterfactual Inference. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), Volume 48. JMLR.org, 3020–3029.
[28] Dan Jurafsky, Victor Chahuneau, Bryan R Routledge, and Noah A Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19, 4 (2014).
[29] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Technical Report. DTIC Document.
[30] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference for Learning Representations (2014).
[31] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014).
[32] Alexandra Kuznetsova, Per Bruun Brockhoff, and Rune Haubo Bojesen Christensen. 2015. Package 'lmerTest'. R package version 2 (2015).
[33] Raymond YK Lau, Wenping Zhang, Peter D Bruza, and Kam-Fai Wong. 2011. Learning domain-specific sentiment lexicons for predicting product sales. In e-Business Engineering (ICEBE), 2011 IEEE 8th International Conference on. IEEE, 131–138.
[34] Eun-Ju Lee and Soo Yun Shin. 2014. When do consumers buy online product reviews? Effects of review quality, product type, and reviewer's photo. Computers in Human Behavior 31 (2014), 356–366.
[35] Thomas Lee and Eric T Bradlow. 2007. Automatic construction of conjoint attributes and levels from online customer reviews. University of Pennsylvania, The Wharton School Working Paper (2007).
[36] Ziqi Liao and Michael Tow Cheung. 2001. Internet-based e-shopping and consumer attitudes: an empirical study. Information & Management 38, 5 (2001), 299–306.
[37] Moez Limayem, Mohamed Khalifa, and Anissa Frini. 2000. What makes consumers buy from Internet? A longitudinal study of online shopping. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 30, 4 (2000), 421–432.
[38] Ying Liu, Hong Li, Geng Peng, Benfu Lv, and Chong Zhang. 2015. Online purchaser segmentation and promotion strategy selection: evidence from Chinese E-commerce market. Annals of Operations Research 233, 1 (2015), 263–279.
[39] Gerald L Lohse and Peter Spiller. 1998. Quantifying the effect of user interface design features on cyberstore traffic and sales. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press/Addison-Wesley Publishing Co., 211–218.
[40] Daniel Lowd and Christopher Meek. 2005. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 641–647.
[41] Christopher D Manning, Hinrich Schütze, and others. 1999. Foundations of Statistical Natural Language Processing. Vol. 999. MIT Press.
[42] Deborah Brown McCabe and Stephen M Nowlis. 2003. The effect of examining actual products or product descriptions on consumer preference. Journal of Consumer Psychology 13, 4 (2003), 431–439.
[43] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep Keyphrase Generation. Annual Meeting of the Association for Computational Linguistics (2017).
[44] Shinichi Nakagawa and Holger Schielzeth. 2013. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution 4, 2 (2013), 133–142.
[45] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[46] Franklin E Satterthwaite. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2, 6 (1946), 110–114.
[47] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), August 7-12, 2016, Berlin, Germany.
[48] Srini S Srinivasan, Rolph Anderson, and Kishore Ponnavolu. 2002. Customer loyalty in e-commerce: an exploration of its antecedents and consequences. Journal of Retailing 78, 1 (2002), 41–50.
[49] Brian Sternthal, Ruby Dholakia, and Clark Leavitt. 1978. The persuasive effect of source credibility: Tests of cognitive response. Journal of Consumer Research 4, 4 (1978), 252–260.
[50] Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. ACL, Baltimore, Maryland, 175–185.
[51] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288.
[52] Lou W Turley and Ronald E Milliman. 2000. Atmospheric effects on shopping behavior: a review of the experimental evidence. Journal of Business Research 49, 2 (2000), 193–211.
[53] Hui Yuan, Wei Xu, Qian Li, and Raymond Lau. 2017. Topic sentiment mining for sales performance prediction in e-commerce. Annals of Operations Research (2017), 1–24.
[54] Cai-Nicolas Ziegler, Lars Schmidt-Thieme, and Georg Lausen. 2004. Exploiting semantic product descriptions for recommender systems. In Proceedings of the 2nd ACM SIGIR Semantic Web and Information Retrieval Workshop. 25–29.