Improving Semantic Matching via Multi-Task Learning in E-Commerce

Hongchun Zhang (hongchun.zhc@alibaba-inc.com), Tianyi Wang (joshua.wty@alibaba-inc.com), Xiaonan Meng* (xiaonan.mengxn@alibaba-inc.com), Yi Hu (erwin.huy@alibaba-inc.com), Hao Wang (longran.wh@alibaba-inc.com)
Alibaba Group, Hangzhou, China

*Corresponding author.

ABSTRACT
Semantic matching plays a critical role in an e-commerce search engine, and one of its biggest challenges is the language gap between queries and products. Traditionally, auxiliary functions such as category navigation are designed to help buyers clarify their intent. Recently, advances in deep learning have provided new opportunities to bridge this gap; however, these techniques suffer from the data sparseness problem. To address this issue, in addition to the click-through data from buyers, we exploit other types of semantic knowledge from the product category taxonomy and sellers' behavior. We investigate the correlation between query intent classification and semantic textual similarity, and propose a multi-task framework to boost their performance simultaneously. Moreover, we design a Progressively Hierarchical Classification (PHC) network architecture over the taxonomy to solve the category imbalance problem. We conduct extensive offline and online A/B experiments on a real-world e-commerce platform, and the results show that the proposed method significantly outperforms the baseline and achieves higher commercial value.

CCS CONCEPTS
• Information systems → Retrieval models and ranking; Computational advertising; Information retrieval; • Computing methodologies → Natural language processing; Machine learning.

KEYWORDS
E-Commerce, Multi-Task Learning, Semantic Matching

ACM Reference Format:
Hongchun Zhang, Tianyi Wang, Xiaonan Meng, Yi Hu, and Hao Wang. 2019. Improving Semantic Matching via Multi-Task Learning in E-Commerce. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 7 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

Figure 1: Overview of the e-commerce business ecosystem. (The figure sketches how sellers deliver products classified under the platform category taxonomy tree, how buyers interact with the search engine and ads through the interface, and how buyer behavior logs and bid-word suggestions feed back into the platform and advertisers' bidding.)

1 INTRODUCTION
Nowadays, buyers increasingly rely on the e-commerce search engine to find their desired products. As in web search, one of the biggest challenges in retrieving relevant products for a query is the language gap problem: buyers and sellers use different vocabularies to express the same meaning. Traditionally, auxiliary functions such as category navigation are designed to help buyers clarify their intent in more detail. For example, when a buyer inputs the query q = "car light", the two products p1 = "auto halogen bulb" and p2 = "led lamp for toy car", which are both relevant, may first be difficult to recall by classical text matching; second, most buyers who want the first product need another action to filter out the second by choosing an intent category. So, predicting the query's intent category and addressing the language discrepancy between queries and products are both crucial to improving matching efficiency.

Recently, many deep neural networks have been successfully applied to classification [6, 11, 23, 26, 28], and they also provide new opportunities to learn better distributed representations of words and sentences to bridge the language gap.
However, training a state-of-the-art deep neural network model usually requires a large amount of labeled data, which is not always readily available. In a commercial web search engine, it is popular to use click-through data as an implicit feedback label [8, 16, 21]. Unfortunately, the data in e-commerce is biased and noisy, because the buyer's click behavior is influenced not only by the recall rate of the online algorithm but also by the product snapshot [24].

Actually, as shown in Figure 1, in addition to the buyer's demand-side behavior, there are many other types of semantic knowledge hidden in the platform and the seller's supply side:

(1) Product and category: the platform builds a large-scale taxonomy, and products delivered by sellers are classified into a suitable leaf category. The similarity between products within the same category is much higher than that across categories. However, as shown in Figure 2, the distribution of category data is extremely imbalanced.
(2) Query and category: when many buyers search the same query and click the same category's products many times, that category has a very high probability of satisfying the query intent.
(3) Product and bid-word: the candidate bid-words are usually historical queries with a high page view (PV) or conversion rate. Advertisers pay for clicks on bid-words to get more exposure; meanwhile, this cost guarantees their similarity.
(4) Category taxonomy tree: the path from root to leaf is a process of layer-by-layer subdivision, and the tree distance between two category nodes is positively related to their similarity.

Figure 2: Distribution of the number of products per leaf category. The distribution is extremely imbalanced.

Inspired by these observations, in this paper we propose a multi-task learning framework for semantic matching with multi-type knowledge from the e-commerce ecosystem. We first generate a distributed representation for each input text with TextCNN [28], and then apply two learning tasks. One is a classification task using data of types (1) and (2), where the category plays a bridging role in intent similarity between queries and products; moreover, we design a Progressively Hierarchical Classification (PHC) network architecture to exploit the similarity of (4). The other is a pair-wise semantic textual similarity task. Specifically, we make use of (3) and click-through data as weakly supervised labels, and generate comparison training pairs between titles and their positive/negative queries.
Our contributions can be summarized as follows:

• We propose a multi-task learning framework of query intent classification and semantic textual similarity to improve semantic matching efficiency, and make use of multi-type knowledge from the e-commerce ecosystem to address the data sparseness problem.
• We design a PHC network architecture to solve the category imbalance problem and, simultaneously, enrich the similarity between taxonomy tree nodes.
• We conduct extensive offline and online experiments on an e-commerce search engine. The results demonstrate the effectiveness of our framework.

2 RELATED WORKS
In recent years, there have been many works studying deep learning for semantic matching. Depending on the stage at which signals are matched, these methods can be divided into two categories: interaction-based and representation-based. The former constructs basic low-level matching signals and then aggregates matching patterns. For instance, ARC-II [7], MatchPyramid [17], and Match-SRNN [22] are based on a word-level similarity matrix, over which different network architectures are applied, such as 2-D CNNs [7, 17] and RNNs [22]. KNRM [25] and Conv-KNRM [3] compute interactions between every n-gram pair from two pieces of text and employ a kernel pooling layer. The latter, such as DSSM [8], CDSSM [21], ARC-I [7], and CNTN [19], generates a distributed representation for each input text separately and then applies a classifier to compute the matching score. Although recent works show that interaction-based methods perform better on multiple text matching tasks, they suffer from expensive online computational complexity.

Moreover, training a deep model needs a large amount of labeled data, which is expensive to obtain. To mitigate this issue, many unsupervised learning models seek to exploit the implicit internal structure of the corpus. For example, various methods for learning distributed word representations, such as word2vec [14] and GloVe [18], and sentence representations, such as paragraph vectors [12] and skip-thought [10], have been shown to be very useful for NLP tasks such as sentence classification and sentiment analysis. But it is almost impossible to learn a good representation by unsupervised methods for a complex task. Traditionally, weakly supervised learning methods are more popular in industry. DSSM [8], CDSSM [21], and LSTM-RNN [16] are trained in a weakly supervised manner with click-through data. Dehghani et al. [4] used the output of an unsupervised ranking model as a weak supervision signal. Xiao et al. [24] proposed a co-training framework to make use of unlabeled data. Unfortunately, the click-through data in e-commerce is biased and noisy. Meanwhile, many models address the problem by implicitly performing limited transfer learning through embeddings pre-trained with unsupervised methods. Cer et al. [2] presented a model to learn a universal sentence encoder, which specifically targets transfer learning to several NLP tasks. Subramanian et al. [20] explored a multi-task learning technique with different training objectives to learn general sentence representations. These works demonstrate that sharing a single sentence representation across related tasks leads to consistent improvements.

In an e-commerce scenario, in addition to the buyers' behavior, the platform and sellers also provide different types of labeled data. Inspired by the weakly supervised and multi-task learning methods, in this paper we investigate query intent classification and semantic textual similarity as two related tasks for semantic matching. Moreover, to solve the category imbalance problem, we design a PHC architecture with the taxonomy category path, which is different from algorithm-level [13] and data-level [1] methods.
Table 1: Structure of the taxonomy tree of our site

  Level 1 | Level 2 | Level 3 | Level 4
  30+     | 200+    | 500+    | 5000+

3 APPROACHES
In this paper, we present a novel multi-task framework that leverages the product category taxonomy tree to incorporate the correlation between semantic textual similarity and query intent classification, using a weakly supervised method to generate training data. We first define the two tasks formally, and then present our data generation method and modeling paradigm.

3.1 Semantic Similarity
3.1.1 Semantic Similarity (SS). SS is a core problem in many NLP tasks. In an e-commerce scenario, we formulate an SS problem as measuring the similarity between a query and all products' titles, in order to retrieve those products that are semantically consistent with the query. We call this task Query-Title Similarity (QTS).

Traditionally, in the QTS problem, the similarity between each query and its best-matched titles would be calculated. In this work, we implement QTS in a dual way to utilize knowledge from the sellers' side, and instead calculate the similarity between each title and its best-matched queries.

Given a product title t and its candidate query set Q = {q_i | 1 ≤ i ≤ n}, the category labels of t and Q are denoted as c_t and C_Q = {c_i | 1 ≤ i ≤ n}, s.t. c_i ∈ [1, |N_c|], where |N_c| is the number of categories. In the QTS task, our framework needs to calculate the similarity between t and each q_i in Q, defined as F_s(t, q_i; \theta_s) \to \mathbb{R}^+, where F_s is a score function and \theta_s is its parameter. Suppose t's best-matched queries are denoted as Q_t^+ \subseteq Q, the others as Q_t^-, and Q = Q_t^+ \cup Q_t^-. Consequently, the objective of the QTS task can be designed as minimizing

  -\log \frac{\sum_{q_j \in Q_t^+} F_s(t, q_j; \theta_s)}{\sum_{q_k \in Q} F_s(t, q_k; \theta_s)}    (1)

But this formulation is impractical, because the computational cost is proportional to the number of samples in Q^+ and Q, which is often very large in our e-commerce scenario (more than 10^8). We utilize an alternative method to solve it; refer to 3.1.2 for details.

3.1.2 Negative Sampling. An alternative way to optimize (1) is Noise Contrastive Estimation (NCE), which was applied to language modeling by [5, 15]. This strategy is similar to a hinge loss, which is also trained by telling positive data apart from noise samples. To simplify (1), we select only one query q^* from Q_t^+ and limit the number of negative queries. NCE has a noise distribution P_n(\cdot) as a free parameter. Inspired by the implementation in [14], we randomly choose samples according to the category frequency distribution U(c)^{3/4}/Z. This reduces training time because it concentrates sampling on high-frequency categories, while the 0.75 power makes low-frequency categories be sampled more often than under the raw (power 1) frequency distribution.

Suppose there are K negative queries for each title t; then (1) can be rewritten as

  L_s^t = -\log F_s(t, q^*; \theta_s) - \sum_{k=1}^{K} \mathbb{E}_{q_k \sim P_n(c)} \big[ \log(1 - F_s(t, q_k; \theta_s)) \big], \quad \text{s.t. } q^* \in Q_t^+    (2)

Thus, the final loss of the QTS task is defined as

  L_s = \sum_{t \in T} L_s^t    (3)

which can now be trained via standard gradient descent. We describe our method for generating Q in Sec. 3.3.1.
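For concreteness, the sampled loss in Eqs. (2)-(3) can be written in a few lines. The following is a minimal PyTorch sketch under our assumptions: F_s already returns scores in (0, 1) (see Eq. (9)), positives and negatives are scored in batch, and the helper name qts_loss is hypothetical rather than from the paper.

```python
import torch

def qts_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor,
             eps: float = 1e-8) -> torch.Tensor:
    """Sampled QTS loss of Eqs. (2)-(3).

    pos_score:  (B,)   F_s(t, q*) for each title in the batch, in (0, 1).
    neg_scores: (B, K) F_s(t, q_k) for K negatives drawn from P_n(c).
    """
    pos_term = torch.log(pos_score + eps)                 # pull q* toward t
    neg_term = torch.log(1.0 - neg_scores + eps).sum(1)   # push noise queries away
    return -(pos_term + neg_term).mean()                  # L_s averaged over titles
```

Averaging over the batch replaces the sum over T in Eq. (3) up to a constant factor, which does not change the optimum.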
3.2 Query Taxonomy
3.2.1 Text Classification. In an e-commerce scenario, query classification (QC) is important for understanding the buyer's intent and retrieving more related products. In addition to the QTS task, we therefore introduce a classification problem to infer c_q, which is equivalent to maximizing the posterior probability P(c_q \mid q; \theta_c); thus the training objective of QC can be written as

  -\log P(c_q \mid q; \theta_c)    (4)

However, this method assigns only one category to each query, while products are all arranged in a taxonomy tree (TT) and have several levels of categories. As shown in Table 1, on alibaba.com products are arranged into four levels, from broad field to specific. For instance, "balance scooter" falls under the category path 'Sports & Entertainment → Outdoor Sports → Scooters → Self-balancing Electric Scooters'. Consequently, we need the taxonomy tree to define a query's categories as well.

3.2.2 Progressively Hierarchical Classification. To take all levels of the category path into account, we design a hierarchical softmax structure named the Progressively Hierarchical Classification (PHC) network, which leverages the semantic information from root to leaf progressively. Our proposed structure is different from conventional hierarchical softmax methods, such as that of Mikolov et al. [14], where hierarchical softmax is used as a speedup technique and the binary Huffman tree is constructed from sample frequency, so it can hardly represent the correlation between different leaf nodes. We call the query taxonomy problem the Query Taxonomy Classification (QTC) task.

Suppose the product taxonomy is composed of L layers, each level l has its fitting parameters \theta_c^l, l \in [1, L], and level l's category is c^l. We implement an unsupervised method to build a large number of title-category and query-category pairs, denoted jointly as \tilde{Q} = Q \cup T together with their labels C_Q and C_T; refer to 3.3.1 for details. As shown in Figure 4, we design a recursive structure that takes into account all levels of categories before a specific layer l, together with the original first layer, and outputs a category at each level:

  P(c_q^l \mid q; \theta_c^l) = \sigma(F_l[\Phi_l(q)])    (5)

where

  \Phi_l(q) = \begin{cases} f_1(q), & l = 1, \\ f_l(f_{l-1}(f_{l-2}(\cdots f_1(q)))) + f_1(q), & 2 \le l \le L, \end{cases}    (6)

\sigma is the softmax function, f_l(\cdot) is a non-linear feature mapping at level l that outputs an intermediate representation, and F_l projects the representation onto the level-l category ids. For brevity, we omit the parameters in f_l(\cdot). So the final objective function of QTC is

  L_c = -\sum_{q \in \tilde{Q}} \sum_{l=1}^{L} \log P(c_q^l \mid q; \theta_c^l)    (7)

Figure 4: PHC structure.

This may look similar to a recurrent decoder generating the TT path, because the category distribution at layer l always depends on information from the previous layer. However, instead of using a recurrent neural network to model the category path, we incorporate more parameters {\theta_1, ..., \theta_L}, which can store more information about the correlation and difference between category nodes. What's more, field knowledge from the taxonomy tree is preserved better when we add f_1(q) to later levels as a residual than otherwise.
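A minimal PyTorch sketch of this head is given below, assuming the 128-d features and L = 4 levels from Sec. 4.2; the exact per-level category counts and the tanh non-linearity are illustrative assumptions (Table 1 only gives orders of magnitude).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PHC(nn.Module):
    """Progressively Hierarchical Classification head, Eqs. (5)-(7)."""

    def __init__(self, dim: int = 128, n_cats=(30, 200, 500, 5000)):
        super().__init__()
        # f_l: non-linear feature mapping at each level l
        self.f = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in n_cats)
        # F_l: projection onto the level-l category ids
        self.proj = nn.ModuleList(nn.Linear(dim, n) for n in n_cats)

    def forward(self, phi: torch.Tensor):
        """phi: (B, dim) shared text representation; returns per-level logits."""
        h1 = self.f[0](phi)                  # Phi_1(q) = f_1(q)
        logits = [self.proj[0](h1)]
        h = h1
        for l in range(1, len(self.proj)):
            h = self.f[l](h) + h1            # Eq. (6): recursion plus f_1(q) residual
            logits.append(self.proj[l](h))
        return logits

def qtc_loss(logits, labels):
    """Eq. (7): softmax + negative log-likelihood, summed over the L levels."""
    return sum(F.cross_entropy(z, y) for z, y in zip(logits, labels))
```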
3.3 Multi-Task Learning for Semantic Similarity
In Sections 3.1 and 3.2 we presented the formulation of the two tasks. In this part, we illustrate our system architecture and give more details on data generation, the multi-task strategy, and model establishment.

3.3.1 Unlabelled Data Generation. As mentioned in 3.1, in order to generate enough data for the QTS and QTC tasks, we implement an unsupervised method to build a large number of title-category pairs (T, C_T), query-category pairs (Q, C_Q), and also Q_t^+ for each t. A sketch of this pipeline follows the list below.

• Title-category pairs. We use the category path submitted by the seller as the product title's category path. We define our product taxonomy as four levels: top level, second level, third level, and leaf category. Without loss of generality, for products with fewer than four levels we copy the last available category node to fill the absent ones. Table 1 shows the taxonomy size at each level; it is a typically imbalanced distribution, which results from commercial discipline. Therefore, re-sampling from the original data and sample duplication are important for ensuring model quality.

• Query-category pairs. Unlike title-category pairs, queries originally have no category path. Instead of asking humans to evaluate or annotate millions of queries, we use click-through data as implicit feedback for generating queries' categories [8, 21, 24]. From the click-through data, we first filter out high-frequency queries and assign to each query the product category with the highest click-through rate. We also use bid-words as queries and set their bound product's category as the training label.

• Q_t^+ for each title. Similar to what we do with query-category pairs, we first screen out queries q and their corresponding products whose click-through rate is higher than a threshold, to form part of the <q^+, t> pairs. We also extract bid-words for each title to form the other part of the <q^+, t> pairs. Second, we randomly take K samples following the distribution P_n(c) as negative queries.
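To make the pipeline concrete, here is a small Python sketch of triplet generation under stated assumptions: the click log is a list of (query, title, ctr) rows, the CTR threshold value is hypothetical, K = 10 follows Sec. 4.1.1, and build_noise_sampler implements the U(c)^{3/4}/Z distribution from Sec. 3.1.2.

```python
import random

def build_noise_sampler(category_of_query):
    """P_n(c) proportional to U(c)^(3/4) (Sec. 3.1.2): pick a category by
    smoothed frequency, then a query within it."""
    by_cat = {}
    for q, c in category_of_query.items():
        by_cat.setdefault(c, []).append(q)
    cats = list(by_cat)
    weights = [len(by_cat[c]) ** 0.75 for c in cats]  # the normalizer Z is implicit
    def sample(k):
        return [random.choice(by_cat[c])
                for c in random.choices(cats, weights=weights, k=k)]
    return sample

def build_triplets(click_log, category_of_query, k_neg=10, ctr_threshold=0.1):
    """Assemble (q+, Q-, t) triplets; ctr_threshold is an assumed cutoff."""
    sample_negatives = build_noise_sampler(category_of_query)
    triplets = []
    for query, title, ctr in click_log:
        if ctr >= ctr_threshold:               # keep only confident <q+, t> pairs
            triplets.append((query, sample_negatives(k_neg), title))
    return triplets
```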
(ii) TextCNN∗ + QTS Our model would embed the query and title firstly, as shown (iii) TextCNN∗ + QTS + QTC w/o PHC in Figure 3, and then multi-filter-size convolution blocks would (iv) TextCNN∗ + QTC with PHC extract specific features at different granularities, from uni-gram (v) TextCNN∗ + QTC w/o PHC to 4-gram, to cover more types of correlation between words and (vi) TextCNN∗ + QTS + QTC with PHC phrases. Mean-pooling is applied after each convolution block and outputs a sentence level representation. The representations of all TextCNN∗ represents a classic TextCNN model which is initialized convolution blocks are concatenated together, forming a compre- via a word2vec [14] embedding. QTC w/o PHC means that this hensive vector, then two MLP layers are appended behind to project configuration implements a QTC task but only uses the leaf category this vector to semantic representation. Because we do not focus on in TT and drops the PHC structure. If the configuration has no QTS model structure designing in this paper, so for brevity, we define task, then it is only a taxonomy classification model. We use (ii) as Fs to represent all operations above as our baseline. In these experiments, we use L=4 to construct a PHC structure with four levels. The TextCNN’s embedding size is 80∗V , Fs (t, q j ; θ s ) = NormDistcosine (Φ1 (t), Φ1 (q j )) (8) where V = 900,000 is the vocabulary size. After that, there have 2 × 4 convolution layers behind, four sets of kernels with lengths from in which NormDistcosine (·) is scaled cosine function, 1 to 4 respectively and we applied max/mean-pooling after each convolution layer at each set. There are also two fully-connected vT1 v 2 NormDistcosine (v 1 , v 2 ) = 0.5 ∗ + 0.5 (9) layers behind with size of [128, 128]. fl (·) → R128 , s.t .1 ≤ l ≤ L. |v 1 ||v 2 | where Φ(∗) → Rd , Φ(·) represents the whole non-linear mapping 4.3 Results Analysis part of embedding model, and our proposed PHC takes Φ(t), Φ(q j ) 4.3.1 Performance of Multi-Task Training. We evaluate AUC on the as input and outputs {c tl }lL=1 , {cql j }lL=1 . Now the training objective annotated dataset, and test all levels of accuracy too. From Table 2, of our multi-task framework could be denoted as we can see that TextCNN + QTS has the lowest score on AUC. Since TextCNN∗ improves a lot on AUC, word2vec embedding initializa- L = λs Ls + λc Lc (10) tion is significant. If we take a comparison between results from (ii) (iii) (vi) and (iv) (vi), in which exp. (vi) increases relatively 5.10% and where λs and λc are hyper-parameters. We set λs as 1, and λc as 11.27% at AUC comparing to (iii) and (ii) respectively, at the same 0.1 in experiments. time, exp. (iii) and (vi) all perform better than other single-task solutions. At Acc4, (vi) outperform (iv) by 0.87% and (iii) outstrip 4 EXPERIMENTS (v) 4.77%. So it is easy for us to conclude that QTS and QTC are col- 4.1 Dataset and Metric laboratively optimized and jointly-training them could remarkably 4.1.1 Dataset. As we introduced in Sec. 3.3.1, our multi-tasks enhance each other. As mentioned in the introduction, it could force framework needs (T , CT ), (Q, CQ ) and Q t+ to train QTS and QTC the category information to flow back into semantic representation tasks jointly. Each sample is composed of a triplet of (q + , Q − , t), and promote its ability to restore more knowledge about categories, where Q − = {q − }. 
4 EXPERIMENTS
4.1 Dataset and Metric
4.1.1 Dataset. As introduced in Sec. 3.3.1, our multi-task framework needs (T, C_T), (Q, C_Q), and Q_t^+ to train the QTS and QTC tasks jointly. Each sample is a triplet (q^+, Q^-, t), where Q^- = \{q^-\}. We build the unlabeled triplet dataset D = \{(q^+, Q^-, t)\} by first sampling search queries and click data from one year of logs and then generating 10 candidates q^- for every (q^+, t). In total, we get an unlabeled dataset of about 5×10^8 (q^+, Q^-, t) triplets. To evaluate the semantic representation performance on the QTS task, 33,188 pairs were annotated into two categories, correlated and uncorrelated, by human effort or via users' click-through data. The queries of these 30,000+ pairs were also assigned category paths by humans for the QTC task.

4.1.2 Metric. Our multi-task semantic similarity is composed of two tasks, QTS and QTC. In QTS, pairs whose similarity score is higher than a threshold are assigned 1, the others 0, so we use the classical AUC score to measure effectiveness. In QTC, we use the accuracy of the classifier at each level.

4.2 Implementation Details
To show that improvements on semantic similarity and query classification can be achieved simultaneously and that the two tasks mutually boost each other, we design a joint-training experiment along with other ablations. We compare the following methods:

(i) TextCNN [9] + QTS
(ii) TextCNN* + QTS
(iii) TextCNN* + QTS + QTC w/o PHC
(iv) TextCNN* + QTC with PHC
(v) TextCNN* + QTC w/o PHC
(vi) TextCNN* + QTS + QTC with PHC

TextCNN* denotes a classic TextCNN model initialized with word2vec [14] embeddings. "QTC w/o PHC" means the configuration implements a QTC task but uses only the leaf categories of the TT and drops the PHC structure. If a configuration has no QTS task, it is only a taxonomy classification model. We use (ii) as our baseline.

In these experiments, we use L = 4 to construct a PHC structure with four levels. The TextCNN uses an 80-dimensional embedding over a vocabulary of V = 900,000 words (an 80 × V embedding matrix). It is followed by 2 × 4 convolution layers: four sets of kernels with lengths from 1 to 4, with max/mean-pooling applied after the convolution layer of each set. Two fully-connected layers of sizes [128, 128] follow, and f_l(\cdot) \to \mathbb{R}^{128} for 1 \le l \le L.
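Read together with Sec. 3.3.2, this configuration yields the shared encoder sketched below in PyTorch; the channel width, ReLU activation, and padding are our assumptions (the paper does not specify them), with both max- and mean-pooling per kernel size to match the 2 × 4 layer count.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Shared query/title encoder per Secs. 3.3.2 and 4.2: 80-d embeddings,
    kernel sizes 1-4 with max/mean pooling, then two 128-d dense layers."""

    def __init__(self, vocab_size=900_000, emb_dim=80, channels=128, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, channels, kernel_size=k, padding=k - 1)
            for k in (1, 2, 3, 4))
        self.mlp = nn.Sequential(
            nn.Linear(8 * channels, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)           # (B, emb_dim, M)
        pooled = []
        for conv in self.convs:
            h = torch.relu(conv(x))                         # n-gram features
            pooled += [h.max(dim=2).values, h.mean(dim=2)]  # max and mean pooling
        return self.mlp(torch.cat(pooled, dim=1))           # semantic representation
```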
4.3 Results Analysis

Table 2: Metric scores in QTS and QTC

  Model                                | AUC    | Acc1/Acc2/Acc3/Acc4 (%)
  (i)   TextCNN + QTS                  | 0.5557 | -
  (ii)  TextCNN* + QTS                 | 0.6300 | -
  (iii) TextCNN* + QTS + QTC w/o PHC   | 0.6670 | -/-/-/44.39
  (iv)  TextCNN* + QTC with PHC        | 0.6444 | 72.44/65.26/58.70/53.23
  (v)   TextCNN* + QTC w/o PHC         | 0.6193 | -/-/-/40.39
  (vi)  TextCNN* + QTS + QTC with PHC  | 0.7010 | 75.78/65.29/59.26/54.10

4.3.1 Performance of Multi-Task Training. We evaluate AUC on the annotated dataset and also test accuracy at all levels. From Table 2, we can see that TextCNN + QTS has the lowest AUC. Since TextCNN* improves AUC considerably, word2vec embedding initialization is significant. Comparing (ii), (iii), and (vi), experiment (vi) increases AUC by a relative 5.10% and 11.27% over (iii) and (ii) respectively; at the same time, (iii) and (vi) both outperform the single-task solutions. On Acc4, (vi) outperforms (iv) by 0.87%, and (iii) outstrips (v) by 4.77%. So we can conclude that QTS and QTC are collaboratively optimized, and jointly training them remarkably enhances both. As mentioned in the introduction, joint training forces category information to flow back into the semantic representation and promotes its ability to store more knowledge about categories, which matches our observation of the data structure on alibaba.com's e-commerce platform. Furthermore, for all methods we test their text embeddings and plot the ROC curves to depict the differences in effectiveness directly. In Figure 5, our proposed method with the QTS + QTC + PHC configuration achieves the highest AUC score, surpassing every configuration without multi-task learning.

Figure 5: ROC curves of the different methods.

4.3.2 Gain from the PHC Structure. From the results of (iv) vs. (v) and (iii) vs. (vi), Acc4 gains improvements of 2.51% and 3.40% from (v) to (iv) and from (iii) to (vi). These improvements come from the application of the taxonomy tree, in which more information from the non-leaf levels is stored and basic field knowledge is strengthened. The additional performance differences between (vi) and (iii), and between (vi) and (ii), show that with the PHC structure the QTS task is boosted further. We attribute these improvements to modeling the relationship between query words and all levels of category information rather than only the leaf nodes. From Figure 5, we can also see a 4.05% gain from TM* + QTC to TM* + QTC + PHC and a 5.09% gain from TM* + QTC + QTS to TM* + QTC + QTS + PHC on AUC.

4.4 Analysis on Online Evaluation
4.4.1 Common Search Experiment. Following our motivation for semantic matching, we wish to understand users' intent more precisely, retrieve more relevant products, and indirectly enhance the Click-Through Rate (CTR) and Feedback Rate (FBR) per page view (PV). In an e-commerce scenario, queries without enough retrieval results are called "null and low search" (NLS) queries. Frequent NLS harms the user experience and further undermines the benefit of the platform and sellers. So we conduct online experiments in a real-world e-commerce scenario, www.alibaba.com: we run an A/B test of our multi-task approach inside our online search engine and calculate the NLS rate on tail queries to evaluate the online performance of our methods.

Table 3: Online common search evaluation

  SE performance | PV-CTR (%) | FBR (%) | NLS (%)
  100% traffic   | +1.5       | +4.2    | -68.7

  (SE stands for Search Engine.)

In Table 3, the NLS rate drops by 68.7%, which is significant, while PV-CTR still increases by 1.5%. We can therefore conclude that our multi-task learning recalls more products, and the improved PV-CTR shows that the additional products are relevant; uncorrelated ones would have lowered PV-CTR instead. FBR also gains 4.2%, which means the precision of matching buyers' intent is increased as well.

4.4.2 E-commerce Advertisement Experiment. The paradigm used in common search can also be extended to the e-commerce computational advertising scenario (ADS), where advertisers want the products in their advertising campaigns to be exposed to buyers with implicit interest, to obtain orders or feedback. To show that our strategy also enhances performance in ADS, we ran another online A/B test on our ads engine. We use the ads coverage rate per PV (ADS-COV), the exposure ads CTR (ADS-CTR), and the revenue per mille (RPM) of the platform to evaluate matching ability. ADS-CTR and ADS-COV are defined as

  \mathrm{ADS\text{-}CTR} = \frac{N_{\mathrm{ADS\text{-}click}}}{N_{\mathrm{Exposed\text{-}ADS}}},    (11)

  \mathrm{ADS\text{-}COV} = \frac{N_{\mathrm{ADS\text{-}PV}}}{N_{\mathrm{PV}}},    (12)

where N_* denotes the number of *. (A small counting sketch of these metrics follows at the end of this subsection.) In Figure 6, we can see that after the Ads SE deployed our proposed method, ADS-CTR, ADS-COV, and RPM all improved significantly. The results of a continuous five-day online experiment are given in Table 4: RPM gains 13.4% in the 100% search-traffic configuration, and ADS-CTR improves by 6.6%, which means the additional 4.8% of ads exposed by our strategy are also relatively well correlated with buyers' intents.

Table 4: Online ads search evaluation

  Ads SE performance | ADS-COV (%) | ADS-CTR (%) | RPM (%)
  30% traffic        | +1.4        | +4.6        | +6.7
  100% traffic       | +4.8        | +6.6        | +13.4

Figure 6: Five-day experiment on the Ads SE: (a) ADS-COV, (b) ADS-CTR, (c) RPM.
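As referenced above, a small Python sketch of these counters; the click counter name is our reading of the garbled formulas, and the RPM definition (revenue per thousand PVs) is an assumption.

```python
def ads_metrics(n_ads_click, n_exposed_ads, n_ads_pv, n_pv, revenue):
    """Eqs. (11)-(12) plus RPM, computed from raw daily counters."""
    ads_ctr = n_ads_click / n_exposed_ads   # clicks per exposed ad
    ads_cov = n_ads_pv / n_pv               # share of page views showing ads
    rpm = 1000.0 * revenue / n_pv           # revenue per mille page views (assumed)
    return ads_ctr, ads_cov, rpm
```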
5 CONCLUSION
In this paper, we propose a multi-task method to jointly train query intent classification and semantic textual similarity, together with a novel recursive way to add the taxonomy tree into this framework. Experiments show that our proposed strategy achieves higher accuracy and AUC on the classification and similarity problems respectively, which justifies our assumptions that there is positive interaction between the two tasks and that using the taxonomy tree can also improve the semantic representation of queries.

Future work will proceed in two directions: first, we will add more information about products beyond titles to improve matching precision against users' queries; second, there are many other advanced frameworks for textual representation, and we will incorporate them into our tasks to obtain further business improvements.

REFERENCES
[1] Lida Abdi and Sattar Hashemi. 2015. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering 28, 1 (2015), 238–251.
[2] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018).
[3] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 126–134.
[4] Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 65–74.
[5] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13, Feb (2012), 307–361.
[6] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
[7] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems. 2042–2050.
[8] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2333–2338.
[9] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[10] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
[11] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[12] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
[13] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[15] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426 (2012).
[16] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward. 2015. Deep sentence embedding using the long short-term memory network: analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 4 (2015), 694–707.
[17] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence.
[18] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[19] Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[20] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI technical report (2018). https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf
[21] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information & Knowledge Management. ACM, 101–110.
[22] Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016. Match-SRNN: Modeling the recursive matching structure with spatial RNN. arXiv preprint arXiv:1604.04378 (2016).
[23] Jing Wang and Min-Ling Zhang. 2018. Towards mitigating the class-imbalance problem for partial label learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2427–2436.
[24] Rong Xiao, Jianhui Ji, Baoliang Cui, Haihong Tang, Wenwu Ou, Yanghua Xiao, Jiwei Tan, and Xuan Ju. 2019. Weakly supervised co-training of query rewriting and semantic matching for e-commerce. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 402–410.
[25] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 55–64.
[26] Honglun Zhang, Liqiang Xiao, Yongkun Wang, and Yaohui Jin. 2017. A generalized recurrent neural architecture for text classification with multi-task learning. arXiv preprint arXiv:1707.02892 (2017).
[27] Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, and Rong Jin. 2018. Visual search at Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 993–1001.
[28] Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015).