Predicting Sales from the Language of Product Descriptions

Reid Pryzant (Stanford University, rpryzant@stanford.edu)
Young-joo Chung (Rakuten Institute of Technology, yjchung@acm.org)
Dan Jurafsky (Stanford University, jurafsky@stanford.edu)

ABSTRACT
What can a business say to attract customers? E-commerce vendors frequently sell the same items but use different marketing strategies to present their goods. Understanding consumer responses to this heterogeneous landscape of information is important both as business intelligence and, more broadly, a window into consumer attitudes. When studying consumer behavior, the existing literature is primarily concerned with product reviews. In this paper we posit that textual product descriptions are also important determinants of consumer choice. We mine 90,000+ product descriptions on the Japanese e-commerce marketplace Rakuten and identify actionable writing styles and word usages that are highly predictive of consumer purchasing behavior. In the process, we observe the inadequacies of traditional feature extraction algorithms, namely their inability to control for the implicit effects of confounds like brand loyalty and pricing strategies. To circumvent this problem, we propose a novel neural network architecture that leverages an adversarial objective to control for confounding factors, and attentional scores over its input to automatically elicit textual features as a domain-specific lexicon. We show that these textual features can predict the sales of each product, and investigate the narratives highlighted by these words. Our results suggest that appeals to authority, polite language, and mentions of informative and seasonal language win over the most customers.

CCS CONCEPTS
• Information systems → Content analysis and feature selection; • Computing methodologies → Information extraction; Neural networks;

KEYWORDS
e-commerce, feature selection, neural networks, adversarial learning, natural language processing

ACM Reference format:
Reid Pryzant, Young-joo Chung, and Dan Jurafsky. 2017. Predicting Sales from the Language of Product Descriptions. In Proceedings of SIGIR, Tokyo, Japan, August 2017 (SIGIR 2017 eCom), 10 pages.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org
1 INTRODUCTION
The internet has dramatically altered consumer shopping habits. Whereas customers of physical stores can physically manipulate, test, and evaluate products before making purchasing decisions, the remote nature of e-commerce renders such tactile evaluations obsolete.

In lieu of in-store evaluation, online shoppers increasingly rely on alternative sources of information. This includes "word-of-mouth" recommendations from outside sources [9] and local product reviews [13, 18, 20]. These factors, though well studied, are only indirectly controllable from a business perspective [25, 52]. Business owners have considerably stronger control over their own product descriptions. The same products may be sold by multiple vendors, with each item having a different textual description (note that we take product to mean a purchasable object, and item to mean an individual e-commerce listing). Studying consumers' reactions to these descriptions is valuable both as business intelligence and as a new window into consumer attitudes.

The hypothesis that business-generated product descriptions affect consumer behavior (manifested in sales) has received strong support in prior empirical studies [22, 26, 34, 37, 39]. However, these studies have only used summary statistics of these descriptions (i.e. readability, length, completeness). We propose that embedded in these product descriptions are narratives that affect shoppers, which can be studied by examining the words in each description.

Our hypothesis is that product descriptions are fundamentally a kind of social discourse, one whose linguistic contents have real control over consumer purchasing behavior. Business owners employ narratives to portray their products, and consumers react according to their beliefs and attitudes.

To test this hypothesis, we mine 93,591 product descriptions and sales records from the Japanese e-commerce website rakuten.co.jp ("Rakuten"). First, we build models that can explain how the textual content of product descriptions impacts sales. Second, we use these models to conduct an explanatory analysis, identifying which linguistic aspects of product descriptions are the most important determinants of success.

We seek to unearth actionable phrases that can help e-commerce vendors increase their sales regardless of what's being sold. Thus, we want to study the effect of language on sales in isolation, i.e. find textual features that are untangled from the effects of pricing strategies [15], brand loyalty [17, 48], and product identity. Choosing features for such a task is a challenging problem, because product descriptions are embedded in a larger e-commerce experience that leverages the shared power of these confounds to market a product. For a not-so-subtle example, product descriptions frequently boast "free shipping!", overtly pointing to a pricing strategy with known power over consumer choice [19].

We develop a new text feature selection algorithm to operate in this confound-controlled setting. This algorithm makes use of a novel neural network architecture. The network uses attentional scores over its input and an adversarial objective to select a lexicon that is simultaneously predictive of consumer behavior and controlled for confounds such as brand and price.

We evaluate our feature selection algorithm on two pools of feature candidates: morphemes obtained with the JUMAN tokenizer (Footnote 1), and sub-word units obtained via byte-pair encoding ("BPE") [47]. From these pools we select features with either (1) our proposed neural network, (2) odds ratios [10], (3) mutual information [41], or (4) the features with nonzero coefficients of an L1-regularized linear regression. Our results suggest that lexicons produced by the neural model are both less correlated with confounding factors and the most powerful predictors of sales.

Footnote 1: JUMAN (a User-Extensible Morphological Analyzer for Japanese), http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
In summary, our contributions are as follows:
• We demonstrate that the narratives embedded in e-commerce product descriptions influence sales.
• We propose a novel neural architecture to mine features for the task.
• We discover actionable writing styles and words that have especially high influence on these outcomes.

2 PREDICTING SALES FROM DESCRIPTIONS
Our task is to predict consumer demand (measured in log(sales)) from the narratives embedded in product descriptions. To do so, we mine features from these textual data and fit a statistical model. In this section, we review our feature-mining baselines, present our novel approach to feature mining, and outline our statistical technique for predicting sales from these features while accounting for confounding factors like brand loyalty and product identity.

2.1 Feature Mining Preliminaries
We approach the featurization problem by first segmenting product descriptions into sequences of tokens, then selecting tokens from the vocabulary of tokens that are predictive of high sales. We take subsets of these vocabularies (rather than one feature per vocabulary item) because (1) we need to be able to examine the linguistic contents of the resulting feature sets, and (2) we need models that are highly generalizable, and not too closely adapted to the peculiarities of these data's vocabulary distributions.

We select predictive subsets of the data's tokenized vocabularies in four ways. Three of these (Section 2.2) are traditional feature selection methods that serve as strong baselines for our proposed method (Section 2.3).

2.2 Traditional Feature Mining
Odds Ratio (OR) finds words that are over-represented in a particular corpus when compared to another (e.g. descriptions of high-selling items versus those of low-selling counterparts). Formally, this is

    \frac{p_i / (1 - p_i)}{p_j / (1 - p_j)}    (1)

where p_i is the probability of the word in corpus i (e.g. high-selling descriptions) and p_j is the probability of the word in corpus j (e.g. low-selling descriptions). Note that this method requires dichotomized targets, which we discuss further in Section 3.1.

Mutual information (MI) is a measurement of how informative the presence of a token is for making correct classification decisions. Formally, the mutual information MI(t, c) of a token t and binary class c is

    MI(t, c) = \sum_{I_t \in \{1,0\}} \sum_{I_c \in \{1,0\}} P(I_t, I_c) \log \frac{P(I_t, I_c)}{P(I_t)\, P(I_c)}    (2)

where I_t and I_c are indicators of term presence and class label for a given description. Like OR, this method requires dichotomized sales targets.

Lasso regularization (L1) can perform variable selection on a linear regression model [51] by adding a regularization term to the least squares objective. This term penalizes the L1 norm of the model parameters:

    \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_j \beta_j x_{ij} \Big)^2    (3)
    \text{subject to } \sum_j |\beta_j| \le \alpha    (4)

where y_i is the ith target, β_0 is an intercept, β_j is the jth coefficient, and x_ij is the jth predictor of the ith example. α is a pre-specified parameter that determines the amount of regularization; it can be obtained by minimizing the error in cross-validation.
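To make the two count-based baselines concrete, the sketch below scores every token of a dichotomized corpus with both statistics from equations (1) and (2). It is a minimal illustration under assumed inputs, not the authors' code: `high_docs` and `low_docs` are hypothetical stand-ins for the tokenized top- and bottom-selling description sets of Section 3.1, and the additive smoothing is an assumption added to keep the odds ratio finite.

```python
import math
from collections import Counter

def or_mi_scores(high_docs, low_docs):
    """Score each token by odds ratio (eq. 1) and mutual information (eq. 2).

    high_docs, low_docs: lists of tokenized descriptions (lists of tokens)
    for the high- and low-selling classes. Returns {token: (OR, MI)}.
    """
    n_hi, n_lo = len(high_docs), len(low_docs)
    n = n_hi + n_lo
    # Document frequency of each token in each class (presence, not raw counts).
    df_hi = Counter(t for d in high_docs for t in set(d))
    df_lo = Counter(t for d in low_docs for t in set(d))
    scores = {}
    for tok in set(df_hi) | set(df_lo):
        # Smoothed presence probabilities per class (smoothing is an added assumption).
        p_hi = (df_hi[tok] + 0.5) / (n_hi + 1.0)
        p_lo = (df_lo[tok] + 0.5) / (n_lo + 1.0)
        odds_ratio = (p_hi / (1 - p_hi)) / (p_lo / (1 - p_lo))
        # Mutual information between term presence I_t and class label I_c.
        mi = 0.0
        p_t1 = (df_hi[tok] + df_lo[tok]) / n
        for i_t in (1, 0):
            p_t = p_t1 if i_t == 1 else 1 - p_t1
            for df_c, n_c in ((df_lo, n_lo), (df_hi, n_hi)):
                joint = (df_c[tok] if i_t == 1 else n_c - df_c[tok]) / n
                p_c = n_c / n
                if joint > 0:
                    mi += joint * math.log(joint / (p_t * p_c))
        scores[tok] = (odds_ratio, mi)
    return scores
```

A feature set is then obtained by ranking tokens on either score and keeping the top k, mirroring how the OR and MI baselines are used in Section 3.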
2.3 Deep Adversarial Feature Mining
An important limitation of all the aforementioned feature selection methods is that they are incapable of selecting features that are decorrelated from confounds like brand and price. Recall from Section 1 the price-related example of "free shipping!". Consider the brand-related example of "the quality you know and love from Daison". Though effective marketing tools, these phrases leverage the power of pricing strategies and brand loyalty, factors with known power over consumers. We wish to study the impact of linguistic structures in product descriptions in isolation, beyond those indicators of price or branding. Thus, we consider brand, product, and price information as confounding factors that confuse the effect of language on consumers.

As a solution to this problem, we propose a novel feature-selecting neural network (RNN+/-GF), sketched in Figure 1. The model uses an attention mechanism to produce estimates for log(sales), brand, and price. We omit product because it is only present in our test data; see Section 3.1 for details. During training, the model uses an adversarial objective to discourage feature effectiveness with respect to two of these prediction targets: brand and price. That is, the model finds features that are good at predicting sales, and bad at predicting brand and price.

Figure 1: An illustration of the proposed RNN+GF model operating on an example product description with three timesteps. All operations and dimensionalities are explicitly shown. Vectors are depicted as rounded rectangles, matrix multiplications as squared rectangles, and scalars as circles. Trainable parameters are grey, while dynamically computed values are colored. Gradient reversal layers multiply gradients by -1 as they backpropagate from the prediction networks to the encoder. In this example, the model attends to the description's final token the most, so that token would be the most likely candidate for a generated lexicon.

Deep learning review. Before we describe the model, we review its primary building blocks.

Feedforward Neural Networks (FFNNs) are composed of a series of fully connected layers, where each layer takes on the form

    y = f(Wx + b).    (5)

Note that x ∈ R^n is a vector of inputs (e.g. from a previous layer), W ∈ R^{y×n} is a matrix of parameters, b ∈ R^y is a vector of biases, y ∈ R^y is an output vector, and f(·) is some nonlinear activation function, e.g. the ReLU: ReLU(x) = max{0, x}.

Recurrent Neural Networks (RNNs) are effective tools for learning structure from sequential data [14]. RNNs take a vector x_t at each timestep. They compute a hidden state vector h_t ∈ R^h at each timestep by applying nonlinear maps to the previous hidden state h_{t-1} and the current input x_t (note that h_0 is initialized to the zero vector):

    h_t = \sigma\big( W^{(hx)} x_t + W^{(hh)} h_{t-1} \big).    (6)

W^{(hx)} ∈ R^{h×n} and W^{(hh)} ∈ R^{h×h} are parameterized matrices. We use Long Short-Term Memory (LSTM) cells, a variant of the traditional RNN cell that can more effectively model long-term temporal dependencies [23].

Attention mechanisms allow neural models to focus on parts of the encoded input before producing predictions. We calculate Bahdanau-style attentional contexts [3] because these have been shown to perform well for other tasks like translation and language modeling [11, 31], and preliminary experiments suggested that this mechanism worked best for our problem setting.

Bahdanau-style attention computes the attentional context as a weighted average of hidden states. The weights are computed as follows: pass each hidden state h_i through a fully-connected neural network, then compute a dot product with a vector of parameters to produce an intermediary scalar â_i (eq. 7). Next, the â_i's are scaled by a softmax function so that they map to a distribution over hidden states (eq. 8). Finally, this distribution is used to compute a weighted average of hidden states c (eq. 9). Formally, this can be written as:

    \hat{a}_i = v_a^\top \tanh(W_a h_i)    (7)
    a = \mathrm{softmax}(\hat{a})    (8)
    c = \sum_j a_j h_j    (9)
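The attention step of equations (7)-(9) is small enough to write out directly. The numpy sketch below is an illustration rather than the paper's TensorFlow implementation: `hidden_states` stands in for the LSTM outputs of one description, and `W_a`, `v_a` are the trainable parameters of eq. (7) (random placeholders here).

```python
import numpy as np

def bahdanau_attention(hidden_states, W_a, v_a):
    """Compute the attentional summary of eqs. (7)-(9).

    hidden_states: (T, h) LSTM hidden states for one description.
    W_a: (d, h) projection matrix; v_a: (d,) parameter vector.
    Returns (c, a): the (h,) context vector and the (T,) attention weights.
    """
    # eq. (7): one unnormalized score per timestep.
    a_hat = np.tanh(hidden_states @ W_a.T) @ v_a       # shape (T,)
    # eq. (8): softmax over timesteps.
    e = np.exp(a_hat - a_hat.max())
    a = e / e.sum()
    # eq. (9): attention-weighted average of hidden states.
    c = a @ hidden_states                               # shape (h,)
    return c, a

# Toy usage with random placeholder parameters (T=5 timesteps, h=64, d=32).
rng = np.random.default_rng(0)
c, a = bahdanau_attention(rng.normal(size=(5, 64)),
                          rng.normal(size=(32, 64)),
                          rng.normal(size=(32,)))
```

The attention weights `a` are exactly the per-token scores that the lexicon induction stage (Section 2.3, below) later standardizes and merges.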
Our model. We continue by describing our adversarial feature mining model. The process of obtaining features from the model can be thought of as a three-stage algorithm: (1) a forward pass, where predictions are generated, (2) a backward pass, where parameters are updated, and, after repeated iterations of 1 and 2, (3) feature selection, where we use attentional scores to elicit lexicons.

The forward pass operates as follows:
(1) The segmented input is fed into an LSTM to produce hidden state encodings for each timestep.
(2) We compute an attentional summary of these hidden states to obtain a single vector encoding of the input.
(3) We feed this encoding into three FFNNs. One is a regression network that tries to minimize L = ||ŷ − y||^2, the squared loss between the predicted and true log(sales). The second and third are classification networks, which predict a likelihood distribution over all possible labels, and are trained to minimize L = −log p(y), the negative log probability of the correct class label. We attach classification networks for brand id and a dichotomization of price (see Section 3.1 for details). We dichotomized sales in this way to create a fair comparison between this method and the baselines: other feature selection algorithms (OR, MI) are not so flexible and require dichotomized targets.

The backward pass draws on prior work in leveraging adversarial objective functions to match feature distributions in different settings [40]. In particular, we draw from a line of research in the style of [16], [8], and [27]. This method involves passing gradients through a gradient reversal layer, which multiplies gradients by a negative constant, i.e. -1, as they propagate back through the network. Intuitively, this encourages parameters to update away from the optimization objective.

If L_sales, L_brand, and L_price are the regression and classification losses from each prediction network, then the final loss we are optimizing is L = L_sales + L_brand + L_price. However, when backpropagating from each prediction network to the encoder, we reverse the gradients of the networks that are predicting confounds. This means that the prediction networks still learn to predict brand and price, but the encoder is forced to learn brand- and price-invariant representations which are not useful to these downstream tasks. We hope that such representations encourage the model to attend to confound-decorrelated tokens.
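The paper's implementation is in TensorFlow (Section 3.2); as an illustration of the gradient reversal idea only, here is a minimal PyTorch-style sketch. The layer is the identity on the forward pass and negates (optionally scales) gradients on the backward pass, so the confound heads train normally while the encoder is pushed toward brand- and price-invariant representations. The head modules and targets are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda going backward."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back to the encoder.
        return -ctx.lam * grad_output, None

def adversarial_loss(encoding, sales_head, brand_head, price_head,
                     y_sales, y_brand, y_price):
    """L = L_sales + L_brand + L_price, with reversed gradients on the confound heads."""
    l_sales = F.mse_loss(sales_head(encoding).squeeze(-1), y_sales)
    reversed_enc = GradReverse.apply(encoding)
    l_brand = F.cross_entropy(brand_head(reversed_enc), y_brand)
    l_price = F.cross_entropy(price_head(reversed_enc), y_price)
    return l_sales + l_brand + l_price
```

The same effect can be obtained in TensorFlow by overriding the gradient of an identity op; the mechanism, not the framework, is what matters for the ablation in Section 3.3.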
The lexicon induction stage uses the trained model defined above to select textual features that are predictive of sales, but controlled for the influence of brand and price. This stage operates as follows:
(1) Generate predictions for each test item, but rather than saving those predictions, save the attentional distribution over each source sequence.
(2) Standardize these distributions. For each input i, standardize the distribution over timesteps p^{(i)} by computing

    z^{(i)} = \frac{p^{(i)} - \mu_{p^{(i)}}}{\sigma_{p^{(i)}}}    (10)

(3) Merge these standardized distributions over each input sequence. If there is a word collision (i.e. we observe the same token in multiple input sequences and the model assigned each observation a different z-score), take the max of those words' z-scores.
(4) Select the k tokens with the highest z-scores. This is our induced lexicon.

2.4 Using Features to Predict Sales
Once we have mined textual features from product descriptions, we need a statistical model that accounts for the effects of confounding variables like product identity and brand loyalty in predicting the sales of each item. We use a mixed-effects model, a type of hierarchical regression that assumes observations can be explained with two types of categorical variables: fixed effect variables and random effect variables [7].

We model textual features as fixed effects. We take the product that each item corresponds to and the brand selling each item as random effects. Thus, we force the model to assume that product and brand information is decorrelated from everything else, and we expect to observe the explanatory power of text features without the influence of brand or product. Note that the continuous nature of the "price" confound precludes our ability to model it (Section 3.1).

We proceed with a formal description of our mixed-effects model. Let y_ijk be the log(sales) of item i, which is product j and sold by brand k. The description for this item is written as x_ijk, and each x^{(h)}_ijk ∈ x_ijk is the hth feature of this description. With these definitions, we can write our mixed-effects model as

    y_{ijk} = \beta_0 + \sum_h \beta_h x^{(h)}_{ijk} + \gamma_j + \alpha_k + \epsilon_{ijk}    (11)
    \gamma_j \sim \mathcal{N}(0, \sigma_\gamma^2)    (12)
    \alpha_k \sim \mathcal{N}(0, \sigma_\alpha^2)    (13)
    \epsilon_{ijk} \sim \mathcal{N}(0, \sigma_\epsilon^2)    (14)

where γ_j and α_k are the random effects of product and brand, respectively, and ε_ijk is an item-specific effect, i.e. this item's deviation from the mean item sales.

Nakagawa and Schielzeth [44] introduced the marginal and conditional R^2 (R^2_m and R^2_c) as summary statistics of mixed-effects models. Marginal R^2_m is the R^2 of the textual effects only. It reports the proportion of variance in the model's predictions that can be explained with the fixed effects variables x^{(h)}_ijk. It is written as

    R^2_m = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_\gamma^2 + \sigma_\alpha^2 + \sigma_\epsilon^2},    (15)
    \sigma_f^2 = \mathrm{var}\Big( \sum_h \beta_h x^{(h)}_{ijk} \Big).    (16)

Conditional R^2_c is the R^2 of the entire model (text + product + brand). It conditions on the variances of the random factors we are controlling for (product and brand):

    R^2_c = \frac{\sigma_f^2 + \sigma_\gamma^2 + \sigma_\alpha^2}{\sigma_f^2 + \sigma_\gamma^2 + \sigma_\alpha^2 + \sigma_\epsilon^2}.    (17)

3 EXPERIMENTS
We now detail a series of experiments that were conducted to evaluate the effectiveness of each feature set and, more generally, to test the hypothesis that narratives embedded in product descriptions are indeed predictive of sales.

3.1 Product and Sales Data
We obtained data on e-commerce product descriptions, sales, vendors, and prices from a December 2012 snapshot of the Rakuten marketplace (Footnote 2). We focused on items belonging to two product categories: chocolate and health. These two categories are both popular on the marketplace, but their characteristics are different. There is more variability among chocolate products than health products; many vendors are boutiques that sell handmade goods. Health vendors, on the other hand, are often large providers of pharmaceutical goods, sometimes wholesale.

We segment product descriptions in two ways. First, we tokenize descriptions into morphological units (morphemes) with the JUMAN tokenizer (Footnote 3). Second, we break descriptions into frequently occurring sub-word units (Footnote 4). From here on we refer to the morpheme features as "morph" and the sub-word features as "BPE".

Footnote 2: Please refer to https://rit.rakuten.co.jp/opendata.html for details on data acquisition.
Footnote 3: Using JUMAN (a User-Extensible Morphological Analyzer for Japanese), http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
Footnote 4: Using https://github.com/google/sentencepiece
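Footnote 4 points to Google's sentencepiece toolkit. A sketch of how such a BPE segmentation could be produced is below; the file path, the vocabulary size of 16,000 (taken from Table 1), and the training options are illustrative assumptions, not the authors' exact configuration.

```python
import sentencepiece as spm

# Train a BPE model on the raw description text (one description per line).
spm.SentencePieceTrainer.train(
    input="descriptions.txt",      # hypothetical path to the raw descriptions
    model_prefix="rakuten_bpe",
    vocab_size=16000,              # matches the BPE vocabulary size in Table 1
    model_type="bpe",
)

# Segment a description into sub-word units.
sp = spm.SentencePieceProcessor(model_file="rakuten_bpe.model")
tokens = sp.encode("チョコレートバーにぎっしり詰め込みました。", out_type=str)
```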
Details of these data can be found in Table 1. Notably, the ratio of the size of the vocabulary (unique keywords) to the size of tokens (occurrences of keywords) in the chocolate category is twice as large as that of the health category, as listed in (%) in Table 1. This implies that product descriptions in the chocolate category are written with more diverse language.

Table 1: Characteristics of the Rakuten data. These data consist of 93,591 product descriptions, vendors, prices, and sales figures.

                      Chocolate         Health
# items               32,104            61,487
# vendors             1,373             1,533
# morph tokens        5,237,277         11,544,145
# BPE tokens          6,581,490         16,706,646
# morph vocab (%)     18,807 (0.36%)    20,669 (0.18%)
# BPE vocab (%)       16,000 (0.24%)    16,000 (0.10%)

Recall that some feature selection algorithms (OR, MI) require dichotomized prediction targets. Thus, we dichotomized the data on log(sales), taking the top-selling 30% and bottom-selling 30% as positive and negative examples, respectively. Our textual features were selected using these dichotomized data.

In order to evaluate mixed-effects regression models on these data, we consider the vendor selling an item as its "brand identifier" (vendors have unique branding on the Rakuten platform). We also need to know what product each item corresponds to, something not present in the data. Thus, we hand-labeled 2,131 items with product identifiers and separated these into a separate dataset for testing (Table 2). Our experimental results are reported on this test data set.

Table 2: Characteristics of the test data. Product identifiers were manually assigned to these data for evaluation.

                            Chocolate    Health
# items                     924          1207
# products                  186          50
# vendors                   201          384
avg. # items per product    4            9
(min, max)                  (2, 26)      (2, 134)

3.2 Experimental Protocol
All deep learning models were implemented using the Tensorflow framework [1]. In order to obtain features from the proposed RNN+GF model, we conducted a brief hyperparameter search on a held-out development set. This set consisted of 2,000 examples randomly drawn from the pool of training data. The final model used 32-dimensional word vectors, an LSTM with 64-dimensional hidden states, and 32-dimensional intermediate Bahdanau vectors, as described in Figure 1. Dropout at a rate of 0.2 was applied to the input of each LSTM cell. We optimized using Adam, a batch size of 128, and a learning rate of 0.0001 [30]. All models took approximately three hours to reach convergence on an Nvidia TITAN X GPU.

The L1 regularization parameter α was obtained with the scikit-learn library [45] by minimizing the error in four-fold cross validation on the training set.

In all of our experiments, we analyzed the log(sales) of an item as a function of textual description features. We used mixed-effects regression to model the relationship between these two entities. We included linguistic features obtained by the methods of Sections 2.2 and 2.3 as fixed effect variables, and the confounding product/vendor identifiers in the test set as random effect variables. We used the "lme4" package in the R software environment v. 3.3.3 to perform these analyses [6]. To evaluate feature effectiveness and goodness of fit, we obtained conditional and marginal R^2 values with the "MuMIn" R package [5]. We also performed t-tests to obtain significance measurements on the model's fitted parameters. For this we obtained degrees of freedom with Satterthwaite approximations [46] using the "lmerTest" R package [32].

In addition to keywords, we experimented with two additional types of features: description length in number of keywords, and part-of-speech tags obtained with JUMAN.
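The MuMIn statistics above follow equations (15)-(17) directly, so they can also be reproduced from the fitted variance components alone. A minimal sketch (the function and its example values are illustrative, not part of the original R analysis):

```python
def marginal_conditional_r2(var_fixed, var_product, var_brand, var_residual):
    """Marginal and conditional R^2 of eqs. (15)-(17).

    var_fixed:    variance of the fixed-effect (textual) predictions, sigma_f^2
    var_product:  random-effect variance of product, sigma_gamma^2
    var_brand:    random-effect variance of brand, sigma_alpha^2
    var_residual: residual item-level variance, sigma_epsilon^2
    """
    total = var_fixed + var_product + var_brand + var_residual
    r2_marginal = var_fixed / total
    r2_conditional = (var_fixed + var_product + var_brand) / total
    return r2_marginal, r2_conditional

# Hypothetical variance components, e.g. extracted from a fitted lme4 model:
# marginal_conditional_r2(0.9, 0.4, 0.3, 0.5)
```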
3.3 Experimental Results
Influence of narratives. Figure 2 depicts the performance of mixed-effects regression models fitted with the top 500 features from each approach. Overall, these results strongly support the hypothesis that narrative elements of product descriptions are predictive of consumer behavior. Adding text features to the model increased its explanatory power in all settings. The marginal R^2_m's of each approach are listed in Table 3. The RNN+GF method selected features superior in both marginal and conditional R^2. This implies that it can select features that perform well in both isolated and confound-combined settings.

Figure 2: Conditional R^2 of random-effects-only models (brand + product) and full models (brand + product + keywords + POS + BPE tokens) from Table 3. Including textual features in mixed-effects regressions improves predictive power regardless of dataset and feature selection method; RNN+GF features provide the largest gains. Morpheme tokens yielded similar results.

Table 3: The explanatory power of random effect confounds (brand, product), text (BPE features, description length, and POS tags), and the combination of confounds and text. Marginal and conditional R^2 are depicted where appropriate. The RNN+GF-selected features appear superior with and without confounds (R^2_c and R^2_m). Morpheme features yielded similar results.

Chocolate
Model features      R^2 type      L1     MI     OR     RNN+GF
confounds only      conditional   0.57   0.57   0.57   0.57
text only           marginal      0.58   0.53   0.49   0.60
confounds + text    conditional   0.78   0.73   0.71   0.81

Health
Model features      R^2 type      L1     MI     OR     RNN+GF
confounds only      conditional   0.44   0.44   0.44   0.44
text only           marginal      0.40   0.40   0.36   0.44
confounds + text    conditional   0.65   0.71   0.69   0.78

To investigate whether the high performance of RNN+GF features is simply a phenomenon of model capacity, we compared RNN+GF and one of the best-performing baselines, the lasso. We varied the number of features each algorithm is allowed to select and compared the resulting conditional R^2 values, finding that RNN+GF features are consistently on par with or outperform those of the lasso, regardless of feature count, as shown in Figure 3.

Figure 3: Conditional R^2 (R^2_c) of the model trained with varying numbers of morpheme/BPE features. Despite being decorrelated from the random effects of brand and price, RNN+GF features are competitive with those of the lasso regardless of token type and feature set size.

Effect of gradient reversal. To determine the role of gradient reversal in the efficacy of the RNN+GF features, we conducted an ablation test, toggling the gradient reversal layer of our model and observing the performance of the elicited features. From Table 4, it is apparent that the confound-invariant representations encouraged by gradient reversal lead to more effective features being selected. Apart from summary statistics, this observation can be seen in the features themselves. For example, one of the highest scoring morphemes without gradient reversal was 無料 ("free"). The RNN+GF features, on the other hand, are devoid of words relating to brand/vendor/price.

Table 4: Gradient reversal ablation and its impact on conditional R^2. The confound-invariance encouraged by the adversarial objective helps downstream regressions.

        Chocolate           Health
        BPE     morph       BPE     morph
+GF     0.81    0.81        0.78    0.75
-GF     0.76    0.75        0.64    0.69

Comparison of different feature mining strategies. To investigate whether the proposed method successfully discovered features that are simultaneously explanatory of sales and untangled from the confounding effects of product, brand, and price, we computed the correlations between BPE tokens selected by different methods and these non-linguistic confounds. For each feature set, the average per-feature Cramér's V was computed for product and brand, while the average per-feature point-biserial correlation coefficient was computed for price. Our results indicate that the RNN+GF features are less correlated with these confounds than any other method (Table 5).
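Both association measures in this comparison are standard statistics. A hedged sketch of how the per-feature associations could be computed with scipy follows; the data frame and column names are hypothetical, not the authors' analysis code.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pointbiserialr

def cramers_v(x, y):
    """Cramer's V between two categorical series (e.g. token presence vs. brand id)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table.values)[0]
    n = table.values.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# df: one row per item; 'has_token' is a 0/1 indicator for one selected feature.
# cramers_v(df['has_token'], df['brand_id'])           -> association with brand
# cramers_v(df['has_token'], df['product_id'])         -> association with product
# pointbiserialr(df['has_token'], df['log_price'])[0]  -> association with continuous price
```

Averaging these values over all features in a lexicon yields the per-method numbers reported in Table 5.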
Table 5: Average association strengths between each BPE token set and non-linguistic factors. The RNN+GF features are the least correlated with these confounding factors. Morpheme tokens yielded similar results.

           L1      MI      OR      RNN+GF
product    0.55    0.57    0.55    0.38
brand      0.58    0.54    0.57    0.42
price      0.08    0.08    0.08    0.07

Examining the keywords selected by different methods suggests the same story as Table 5. Morpheme features with high importance values are listed in Table 6. Note that the RNN+GF approach was the only method that did not select any keywords correlated with product, brand, or price. Additionally, every method except RNN+GF selected pecan (ピーカン・ペカン). Lalala's pecan chocolate is one of the most popular products on the marketplace. Although it is understandable that these tokens contribute to sales, they are product-specific and thus not generalizable. On the other hand, RNN+GF gave high scores to location-related words. Similar tendencies were observed in the health category. BPE tokens, though not listed, followed similar patterns.

3.4 Analysis
Influential words. To investigate the influence of keywords on sales, we performed t-tests on the coefficients of mixed-effects models trained with RNN+GF-selected features (both morphemes and BPE). We found that influential descriptions generally contained words in the following four categories:

• Informativeness. This includes informative appeals to logos with language other than raw product attributes (i.e. brand name, product name, ingredients, price, and shipping). Words like "family size" (ファミリーサイズ), "package design" (パッケージデザイン), "souvenir" (お土産), delimiters of structured information ("】【", "★", "●"), and indicators of detail ("x2", "70%", etc.) belong to this category.
• Authority. This includes appeals to authority, in the form of authoritative figures or long-standing tradition. Words such as "staff" (スタッフ), "old-standing shop" (老舗), and "doctor" (お医者様) belong to this category.
• Seasonality. These words suggest seasonal dependencies. Words such as "Christmas" (クリスマス), "Mother's day" (母の日), and "year-end gift" (歳暮) belong to this category. Note that words related to out-of-season events had low influence on sales.
• Politeness. These expressions show politeness, respectfulness, and humbleness. Honorific Japanese (special words and conjugations reserved for polite contexts) such as "ing" (しており), "will do" (致します), and "receive" (いただく) belong to this category.
The following are two differing descriptions of the exact same product. Words with high coefficients are shown in bold.

  Royce's chocolate has become a standard Hokkaido souvenir. They are packaged one by one so your hands won't get dirty! Also, our staff recommends this product!
  北海道のお土産で定番品となっているロイズ. 手が汚れないように1本ずつパッケージされているのもありがたい! 当店スタッフもおすすめするロイズの自信作です!

  Four types of nuts: almonds, cashews, pecans, macadamia, as well as cookie crunch and almond puff were packed carefully into each chocolate bar. This item is shipped with a refrigerated courier service during the summer.
  アーモンド、カシュー、ペカン、マカダミアの4種類のナッツとクッキークランチやアーモンドパフを一本のチョコレートバーにぎっしり詰め込みました。こちらは夏期クール便発送商品です。

The item with the former description was preferred by customers. It contains words suggestive of authority ("standard", "staff"), informativeness ("package", "souvenir"), and concern for the customer, while the latter description is primarily concerned with ingredients.

Influential part-of-speech tags. We found a large number of adjectives and adverbs in our influential word lists. This agrees with the influential word categories mentioned previously, because adjectives and adverbs can be indicative of informativeness. We found that adjectives were more frequently influential in the chocolate category while adverbs were more common in the health category. Adjectives describing additional information such as "loved" (大好きだ), "healthy" (健康だ), and "perfect for" (ぴったりだ) had high coefficients in the chocolate category. Adverbs describing symptoms or effects such as "irritated" (イライラ) and "vigorously" (ガンガン) appeared in the health category.

4 RELATED WORK
In using large-scale text mining to characterize the behavior of e-commerce consumers, we draw on a large body of prior work in the space. Our inspiration comes from research on (i) unearthing the drivers of purchasing behavior in e-commerce, (ii) modeling the relationship between product presentations and business outcomes, and (iii) text mining and feature discovery in a confound-controlled setting.

There is an extensive body of literature on the progenitors of e-commerce purchasing behavior. Classic work in psychology has shown that human judgment and behavior are influenced by persuasive rhetoric [12, 49]. When our notions of human behavior are narrowed to purchasing decisions on the internet, despite the extreme diversity of online shoppers [38], prior work suggests that vendor-disseminated information exhibits a strong persuasive influence. In fact, vendor-disseminated information affects purchase likelihood just as much as user-generated information like word-of-mouth reviews [9]. The work of [22] incorporated vendor-disseminated product information into a model of customer satisfaction, a precursor of purchasing behavior [4]. Similar work has shown that product presentation (which entails textual descriptions) has a significant impact on perceived convenience [26] and credibility [36].
Table 6: The highest-scoring morpheme tokens according to each feature selection algorithm. Tokens relating to confounds like brand, vendor or price are denoted with an asterisk. RNN+GF is the only method that avoided such tokens.

Chocolate
Lasso | Mutual Information | Odds-ratio | RNN+GF
*小川 (vendor address) | 高温 (hot) | ペカン (pecan) | 神戸 (kobe)
*商店 (vendor name) | 株式 (Co. Ltd) | 百貨店 (store dept.) | 説明 (description)
送信 (send) | 詳細だ (detailed) | ピーカン (pecan) | フランス (france)
さまざまだ (various) | *ロイズコンフェクト (name) | 新宿 (shinjuku) | オーストラリア (australia)
*有料 (charge) | *ロイズ (brand name) | 名人 (master) | タイ (thailand)
ショ糖 (sucrose) | 温度 (temperature) | 玉露 (gyokuro) | イタリア (italy)
同時に (simultaneous) | 以下 (under) | *ラララ (product name) | 老舗 (long-standing shop)
制限 (limit) | セット (set) | 伴う (come along) | ハワイ (hawaii)
*買い得 (bargain) | 常温 (room temp.) | 会議 (award name) | ミルキー (milky)
ピーカン (pecan) | 保存 (preserve) | 会頭 (award name) | 蒜山 (hiruzen)

Health
Lasso | Mutual Information | Odds-ratio | RNN+GF
倍数 (bulk unit) | 消費 (consumption) | *アウトレット (discount outlet) | ダイエット (weight loss)
ビック (big) | *爽快 (vendor name) | アラゴナイト (aragonite) | 確認 (confirmation)
*淀川 (vendor address) | 見る (see) | ソマチット (somatid) | オレンジ (orange)
*アウトレット (discount outlet) | ブラウザ (browser) | ダントツ (the very best) | 予告 (notice)
*爽快 (vendor name) | 相談 (consult) | *アース (brand name) | 商品 (product)
支店 (branch) | 形状 (shape) | *コリー (product name) | 注文 (order)
地区 (district) | 対応 (support) | 筋骨 (bones) | 入金 (payment)
鹿児島 (kagoshima) | ネット (internet) | ランナー (runner) | サプリ (supplement)
*スカルプ (product name) | 取り寄せる (stock) | *ガレノス (brand name) | 説明 (explanation)
くだもの (fruit) | 合す (mix) | 内外 (inside and outside) | ます (is (formal))

We also draw from prior research concerned with mining e-commerce information and predicting sales outcomes. Most of the work in this space is concerned with product reviews, not descriptions. [18] and [2] mined product reviews for textual features that are predictive of economic outcomes. This research used summary statistics of review text like length, Flesch-Kincaid readability scores [29], or, in the paradigm of [24], cluster membership in a semantic embedding space. Similar to us, [33] used product reviews to generate a domain-specific lexicon. However, this lexicon was used to predict sentiment, and then sales was predicted from sentiment. Some research has incorporated information from textual descriptions, but to the best of these authors' knowledge, the effect of descriptions alone has not been studied. [42] used human subjects to elicit preferences between descriptions and actual products, but did not compare between descriptions. [53] tagged product descriptions with sentiment information and used this alongside review information to predict sales. Similarly, [21] and [54] used description labellings and summary statistics alongside other features to predict purchasing intent. Importantly, none of the prior work in this space seeks to untangle the influence of confounding hidden variables (e.g. brand loyalty, pricing strategies) from mined features.

Another body of research we draw from is that concerned with text mining and lexicon discovery in a confound-controlled setting. Using odds ratios to select features and hierarchical regression to determine their importance is a canonical technique in the computational linguistics literature [19, 28]. In general, alternative feature mining methods for downstream regression or classification tasks are rarely explored. [50] began with a set of hand-compiled corpora, then ran t-tests to prune these corpora of insignificant keywords. [43] developed a neural architecture that picks out keywords from a passage. However, this group did not use an attention mechanism to pick these words, and the model was developed for summarization applications. In the e-commerce literature, alternatives to the odds ratio still rely on uncontrolled co-occurrence statistics [35].
5 CONCLUSION
In this paper, we discovered that seasonal, polite, authoritative and informative product descriptions led to the best business outcomes in Japanese e-commerce.

In making these observations, we presented a statistical method that infers consumer demand from e-commerce product descriptions. We showed for the first time that words in the embedded narratives of product descriptions are important determinants of sales, even when accounting for the influence of factors like brand loyalty and item identity.

In the process, we noted the inadequacies of traditional text feature-selection algorithms, namely their inability to select features that are decorrelated from these factors. To this end we presented a novel neural network feature selection method. The features generated by this model are both high-performance and confound-decorrelated.

There are many directions for future work. These include extending our feature selectors to the broader setting of generalized lexicon induction, and applying our statistical models to e-commerce markets in other consumer cultures.

ACKNOWLEDGMENTS
We are grateful to David Jurgens and Will Hamilton for their advice.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Nikolay Archak, Anindya Ghose, and Panagiotis G Ipeirotis. 2011. Deriving the pricing power of product features by mining consumer reviews. Management Science 57, 8 (2011), 1485–1509.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR) (2015).
[4] Billy Bai, Rob Law, and Ivan Wen. 2008. The impact of website quality on customer satisfaction and purchase intentions: Evidence from Chinese online visitors. International Journal of Hospitality Management 27, 3 (2008), 391–402.
[5] Kamil Bartoń. 2013. MuMIn: Multi-model inference. R package version 1.9.13. The Comprehensive R Archive Network (CRAN), Vienna, Austria (2013).
[6] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (2015), 1–48.
[7] Douglas M Bates. 2010. lme4: Mixed-effects modeling with R. (2010).
[8] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, and others. 2007. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems 19 (2007), 137.
[9] Barbara Bickart and Robert M Schindler. 2001. Internet forums as influential sources of consumer information. Journal of Interactive Marketing 15, 3 (2001), 31–40.
[10] J Martin Bland and Douglas G Altman. 2000. The odds ratio. BMJ 320, 7247 (2000), 1468.
[11] Denny Britz, Anna Goldie, Thang Luong, and Quoc Le. 2017. Massive Exploration of Neural Machine Translation Architectures. arXiv preprint arXiv:1703.03906 (2017).
[12] Shelly Chaiken, Mark P Zanna, James M Olson, and C Peter Herman. 1987. The heuristic model of persuasion. In Social Influence: The Ontario Symposium, Vol. 5. Hillsdale, NJ: Lawrence Erlbaum, 3–39.
[13] Judith A Chevalier and Dina Mayzlin. 2006. The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research 43, 3 (2006), 345–354.
[14] Jeffrey L Elman. 1990. Finding structure in time. Cognitive Science 14, 2 (1990), 179–211.
[15] Richard Friberg, Mattias Ganslandt, and Mikael Sandström. 2001. Pricing strategies in e-commerce: Bricks vs. clicks. Technical Report. IUI working paper.
[16] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, 59 (2016), 1–35.
[17] David Gefen. 2002. Customer loyalty in e-commerce. Journal of the Association for Information Systems 3, 1 (2002), 2.
[18] Anindya Ghose and Panagiotis G Ipeirotis. 2011. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 23, 10 (2011), 1498–1512.
[19] Anindya Ghose and Arun Sundararajan. 2006. Evaluating pricing strategy using e-commerce data: Evidence and estimation challenges. Statist. Sci. (2006), 131–142.
[20] David Godes and Dina Mayzlin. 2004. Using online conversations to study word-of-mouth communication. Marketing Science 23, 4 (2004), 545–560.
[21] Dennis Herhausen, Jochen Binder, Marcus Schoegel, and Andreas Herrmann. 2015. Integrating bricks with clicks: retailer-level and channel-level outcomes of online–offline channel integration. Journal of Retailing 91, 2 (2015), 309–325.
[22] Chin-Fu Ho and Wen-Hsiung Wu. 1999. Antecedents of customer satisfaction on the Internet: an empirical study of online shopping. In Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference on. IEEE, 9 pp.
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[24] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 168–177.
[25] Nan Hu, Jie Zhang, and Paul A Pavlou. 2009. Overcoming the J-shaped distribution of product reviews. Commun. ACM 52, 10 (2009), 144–147.
[26] Ling Jiang, Zhilin Yang, and Minjoon Jun. 2013. Measuring consumer perceptions of online shopping convenience. Journal of Service Management 24, 2 (2013), 191–214.
[27] Fredrik D. Johansson, Uri Shalit, and David Sontag. 2016. Learning Representations for Counterfactual Inference. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), Volume 48. JMLR.org, 3020–3029.
[28] Dan Jurafsky, Victor Chahuneau, Bryan R Routledge, and Noah A Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19, 4 (2014).
[29] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Technical Report. DTIC Document.
[30] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference for Learning Representations (2014).
[31] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014).
[32] Alexandra Kuznetsova, Per Bruun Brockhoff, and Rune Haubo Bojesen Christensen. 2015. Package 'lmerTest'. R package version 2 (2015).
[33] Raymond YK Lau, Wenping Zhang, Peter D Bruza, and Kam-Fai Wong. 2011. Learning domain-specific sentiment lexicons for predicting product sales. In e-Business Engineering (ICEBE), 2011 IEEE 8th International Conference on. IEEE, 131–138.
[34] Eun-Ju Lee and Soo Yun Shin. 2014. When do consumers buy online product reviews? Effects of review quality, product type, and reviewer's photo. Computers in Human Behavior 31 (2014), 356–366.
[35] Thomas Lee and Eric T Bradlow. 2007. Automatic construction of conjoint attributes and levels from online customer reviews. University of Pennsylvania, The Wharton School Working Paper (2007).
[36] Ziqi Liao and Michael Tow Cheung. 2001. Internet-based e-shopping and consumer attitudes: an empirical study. Information & Management 38, 5 (2001), 299–306.
[37] Moez Limayem, Mohamed Khalifa, and Anissa Frini. 2000. What makes consumers buy from Internet? A longitudinal study of online shopping. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 30, 4 (2000), 421–432.
[38] Ying Liu, Hong Li, Geng Peng, Benfu Lv, and Chong Zhang. 2015. Online purchaser segmentation and promotion strategy selection: evidence from Chinese E-commerce market. Annals of Operations Research 233, 1 (2015), 263–279.
[39] Gerald L Lohse and Peter Spiller. 1998. Quantifying the effect of user interface design features on cyberstore traffic and sales. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press/Addison-Wesley Publishing Co., 211–218.
[40] Daniel Lowd and Christopher Meek. 2005. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 641–647.
[41] Christopher D Manning, Hinrich Schütze, and others. 1999. Foundations of Statistical Natural Language Processing. Vol. 999. MIT Press.
[42] Deborah Brown McCabe and Stephen M Nowlis. 2003. The effect of examining actual products or product descriptions on consumer preference. Journal of Consumer Psychology 13, 4 (2003), 431–439.
[43] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep Keyphrase Generation. Annual Meeting of the Association for Computational Linguistics (2017).
[44] Shinichi Nakagawa and Holger Schielzeth. 2013. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution 4, 2 (2013), 133–142.
[45] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[46] Franklin E Satterthwaite. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2, 6 (1946), 110–114.
[47] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), August 7-12, 2016, Berlin, Germany.
[48] Srini S Srinivasan, Rolph Anderson, and Kishore Ponnavolu. 2002. Customer loyalty in e-commerce: an exploration of its antecedents and consequences. Journal of Retailing 78, 1 (2002), 41–50.
[49] Brian Sternthal, Ruby Dholakia, and Clark Leavitt. 1978. The persuasive effect of source credibility: Tests of cognitive response. Journal of Consumer Research 4, 4 (1978), 252–260.
[50] Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. ACL, Baltimore, Maryland, 175–185.
[51] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288.
[52] Lou W Turley and Ronald E Milliman. 2000. Atmospheric effects on shopping behavior: a review of the experimental evidence. Journal of Business Research 49, 2 (2000), 193–211.
[53] Hui Yuan, Wei Xu, Qian Li, and Raymond Lau. 2017. Topic sentiment mining for sales performance prediction in e-commerce. Annals of Operations Research (2017), 1–24.
[54] Cai-Nicolas Ziegler, Lars Schmidt-Thieme, and Georg Lausen. 2004. Exploiting semantic product descriptions for recommender systems. In Proceedings of the 2nd ACM SIGIR Semantic Web and Information Retrieval Workshop. 25–29.