Improving Semantic Matching via Multi-Task Learning in E-Commerce

Hongchun Zhang (hongchun.zhc@alibaba-inc.com), Tianyi Wang (joshua.wty@alibaba-inc.com), Xiaonan Meng* (xiaonan.mengxn@alibaba-inc.com), Yi Hu (erwin.huy@alibaba-inc.com), Hao Wang (longran.wh@alibaba-inc.com)
Alibaba Group, Hangzhou, China

*Corresponding author.

ABSTRACT
Semantic matching plays a critical role in an e-commerce search engine, and one of its biggest challenges is the language gap between queries and products. Traditionally, auxiliary functions such as category navigation are designed to help buyers clarify their intent. Recently, advances in deep learning have provided new opportunities to bridge this gap; however, these techniques suffer from the data sparseness problem. To address this issue, in addition to the click-through data from buyers, we exploit other types of semantic knowledge from the product category taxonomy and sellers' behavior. We investigate the correlation between query intent classification and semantic textual similarity, and propose a multi-task framework to boost their performance simultaneously. Moreover, we design a Progressively Hierarchical Classification (PHC) network architecture over the taxonomy to solve the category imbalance problem. We conduct extensive offline and online A/B experiments on a real-world e-commerce platform, and the results show that the proposed method significantly outperforms the baseline and achieves higher commercial value.

CCS CONCEPTS
• Information systems → Retrieval models and ranking; Computational advertising; Information retrieval; • Computing methodologies → Natural language processing; Machine learning.

KEYWORDS
E-Commerce, Multi-Task Learning, Semantic Matching

ACM Reference Format:
Hongchun Zhang, Tianyi Wang, Xiaonan Meng, Yi Hu, and Hao Wang. 2019. Improving Semantic Matching via Multi-Task Learning in E-Commerce. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 7 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

Figure 1: Overview of the e-commerce business ecosystem. (The figure sketches how sellers deliver products classified under the platform category taxonomy tree, how buyers interact with the search engine and ads through the interface, and how buyer behavior logs and bid-word suggestions feed back into the platform and advertisers' bidding.)

1 INTRODUCTION
Nowadays, buyers increasingly rely on the e-commerce search engine to find their desired products. As in web search, one of the biggest challenges in retrieving relevant products for a query is the language gap problem: buyers and sellers use different vocabularies to express the same meaning. Traditionally, auxiliary functions such as category navigation are designed to help buyers clarify their intent in more detail. For example, when a buyer inputs the query q = "car light", the two products p1 = "auto halogen bulb" and p2 = "led lamp for toy car", which are both relevant, may first be difficult to recall by classical text matching; second, most buyers who want the first product need another action to filter out the second by choosing an intent category. So, predicting the query's intent category and addressing the language discrepancy between queries and products are both crucial to improving matching efficiency.

Recently, many deep neural networks have been successfully applied to classification [6, 11, 23, 26, 28], and they also provide new opportunities to learn better distributed representations of words and sentences to bridge the language gap.
However, training a state-of-the-art deep neural network model usually requires a large amount of labeled data, which is not always readily available. In a commercial web search engine, it is popular to use click-through data as an implicit feedback label [8, 16, 21]. Unfortunately, the data in e-commerce is biased and noisy, because the buyer's click behavior is influenced not only by the recall rate of the online algorithm but also by the product snapshot [24].

Actually, as shown in Figure 1, in addition to the buyer's demand-side behavior, there are many other types of semantic knowledge hidden in the platform and the seller's supply side:

(1) Product and category: the platform builds a large-scale taxonomy, and products delivered by sellers are classified into a suitable leaf category. The similarity between products within the same category is much higher than that across categories. However, as shown in Figure 2, the distribution of category data is extremely imbalanced.
(2) Query and category: when many buyers search the same query and click the same category's products many times, that category has a very high probability of satisfying the query intent.
(3) Product and bid-word: the candidate bid-words are usually historical queries with a high page view (PV) or conversion rate. Advertisers pay for clicks on bid-words to get more exposure; meanwhile, this cost guarantees their similarity.
(4) Category taxonomy tree: the path from root to leaf is a process of layer-by-layer subdivision, and the tree distance between two category nodes is positively related to their similarity.

Figure 2: Distribution of the number of products per leaf category. The distribution is extremely imbalanced.

Inspired by these observations, in this paper we propose a multi-task learning framework for semantic matching with multi-type knowledge from the e-commerce ecosystem. We first generate a distributed representation for each input text with TextCNN [28], and then apply two learning tasks. One is a classification task using data of types (1) and (2), where the category plays a bridging role in intent similarity between queries and products; moreover, we design a Progressively Hierarchical Classification (PHC) network architecture to exploit the similarity of (4). The other is a pair-wise semantic textual similarity task. Specifically, we make use of (3) and click-through data as weakly supervised labels, and generate comparison training pairs between titles and their positive/negative queries.
Our contributions can be summarized as follows:

• We propose a multi-task learning framework of query intent classification and semantic textual similarity to improve semantic matching efficiency, and make use of multi-type knowledge from the e-commerce ecosystem to address the data sparseness problem.
• We design a PHC network architecture to solve the category imbalance problem and, simultaneously, enrich the similarity between taxonomy tree nodes.
• We conduct extensive offline and online experiments on an e-commerce search engine. The results demonstrate the effectiveness of our framework.

2 RELATED WORKS
In recent years, there have been many works studying deep learning for semantic matching. Depending on the stage at which signals are matched, these methods can be divided into two categories: interaction-based and representation-based. The former constructs basic low-level matching signals and then aggregates matching patterns. For instance, ARC-II [7], MatchPyramid [17], and Match-SRNN [22] are based on a word-level similarity matrix, over which different network architectures are applied, such as 2-D CNNs [7, 17] and RNNs [22]. KNRM [25] and Conv-KNRM [3] compute interactions between every n-gram pair from two pieces of text and employ a kernel pooling layer. The latter, such as DSSM [8], CDSSM [21], ARC-I [7], and CNTN [19], generates a distributed representation for each input text separately and then applies a classifier to compute the matching score. Although recent works show that interaction-based methods perform better on multiple text matching tasks, they suffer from expensive online computational complexity.

Moreover, training a deep model needs a large amount of labeled data, which is expensive to obtain. To mitigate this issue, many unsupervised learning models seek to exploit the implicit internal structure of the corpus. For example, various methods for learning distributed word representations, such as word2vec [14] and GloVe [18], and sentence representations, such as paragraph vectors [12] and skip-thought [10], have been shown to be very useful for NLP tasks such as sentence classification and sentiment analysis. But it is almost impossible to learn a good representation by unsupervised methods for a complex task. Traditionally, weakly supervised learning methods are more popular in industry. DSSM [8], CDSSM [21], and LSTM-RNN [16] are trained in a weakly supervised manner with click-through data. Dehghani et al. [4] used the output of an unsupervised ranking model as a weak supervision signal. Xiao et al. [24] proposed a co-training framework to make use of unlabeled data. Unfortunately, the click-through data in e-commerce is biased and noisy. Meanwhile, many models address the problem by implicitly performing limited transfer learning through embeddings pre-trained with unsupervised methods. Cer et al. [2] presented a model to learn a universal sentence encoder, which specifically targets transfer learning to several NLP tasks. Subramanian et al. [20] explored a multi-task learning technique with different training objectives to learn general sentence representations. These works demonstrate that sharing a single sentence representation across related tasks leads to consistent improvements.

In an e-commerce scenario, in addition to the buyers' behavior, the platform and sellers also provide different types of labeled data. Inspired by the weakly supervised and multi-task learning methods, in this paper we investigate query intent classification and semantic textual similarity as two related tasks for semantic matching. Moreover, to solve the category imbalance problem, we design a PHC architecture with the taxonomy category path, which is different from algorithm-level [13] and data-level [1] methods.
Table 1: Structure of the taxonomy tree of our site

  Level 1 | Level 2 | Level 3 | Level 4
  30+     | 200+    | 500+    | 5000+

3 APPROACHES
In this paper, we present a novel multi-task framework that leverages the product category taxonomy tree to incorporate the correlation between semantic textual similarity and query intent classification, using a weakly supervised method to generate training data. We first define the two tasks formally, and then present our data generation method and modeling paradigm.

3.1 Semantic Similarity
3.1.1 Semantic Similarity (SS). SS is a core problem in many NLP tasks. In an e-commerce scenario, we formulate an SS problem as measuring the similarity between a query and all products' titles, in order to retrieve those products that are semantically consistent with the query. We call this task Query-Title Similarity (QTS).

Traditionally, in the QTS problem, the similarity between each query and its best-matched titles would be calculated. In this work, we implement QTS in a dual way to utilize knowledge from the sellers' side, and instead calculate the similarity between each title and its best-matched queries.

Given a product title t and its candidate query set Q = {q_i | 1 ≤ i ≤ n}, the category labels of t and Q are denoted as c_t and C_Q = {c_i | 1 ≤ i ≤ n}, s.t. c_i ∈ [1, |N_c|], where |N_c| is the number of categories. In the QTS task, our framework needs to calculate the similarity between t and each q_i in Q, defined as F_s(t, q_i; \theta_s) \to \mathbb{R}^+, where F_s is a score function and \theta_s is its parameter. Suppose t's best-matched queries are denoted as Q_t^+ \subseteq Q, the others as Q_t^-, and Q = Q_t^+ \cup Q_t^-. Consequently, the objective of the QTS task can be designed as minimizing

  -\log \frac{\sum_{q_j \in Q_t^+} F_s(t, q_j; \theta_s)}{\sum_{q_k \in Q} F_s(t, q_k; \theta_s)}    (1)

But this formulation is impractical, because the computational cost is proportional to the number of samples in Q^+ and Q, which is often very large in our e-commerce scenario (more than 10^8). We utilize an alternative method to solve it; refer to 3.1.2 for details.

3.1.2 Negative Sampling. An alternative way to optimize (1) is Noise Contrastive Estimation (NCE), which was applied to language modeling by [5, 15]. This strategy is similar to a hinge loss, which is also trained by telling positive data apart from noise samples. To simplify (1), we select only one query q^* from Q_t^+ and limit the number of negative queries. NCE has a noise distribution P_n(\cdot) as a free parameter. Inspired by the implementation in [14], we randomly choose samples according to the category frequency distribution U(c)^{3/4}/Z. This reduces training time because it concentrates sampling on high-frequency categories, while the 0.75 power makes low-frequency categories be sampled more often than under the raw (power 1) frequency distribution.

Suppose there are K negative queries for each title t; then (1) can be rewritten as

  L_s^t = -\log F_s(t, q^*; \theta_s) - \sum_{k=1}^{K} \mathbb{E}_{q_k \sim P_n(c)} \big[ \log(1 - F_s(t, q_k; \theta_s)) \big], \quad \text{s.t. } q^* \in Q_t^+    (2)

Thus, the final loss of the QTS task is defined as

  L_s = \sum_{t \in T} L_s^t    (3)

which can now be trained via standard gradient descent. We describe our method for generating Q in Sec. 3.3.1.
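For concreteness, the sampled loss in Eqs. (2)-(3) can be written in a few lines. The following is a minimal PyTorch sketch under our assumptions: F_s already returns scores in (0, 1) (see Eq. (9)), positives and negatives are scored in batch, and the helper name qts_loss is hypothetical rather than from the paper.

```python
import torch

def qts_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor,
             eps: float = 1e-8) -> torch.Tensor:
    """Sampled QTS loss of Eqs. (2)-(3).

    pos_score:  (B,)   F_s(t, q*) for each title in the batch, in (0, 1).
    neg_scores: (B, K) F_s(t, q_k) for K negatives drawn from P_n(c).
    """
    pos_term = torch.log(pos_score + eps)                 # pull q* toward t
    neg_term = torch.log(1.0 - neg_scores + eps).sum(1)   # push noise queries away
    return -(pos_term + neg_term).mean()                  # L_s averaged over titles
```

Averaging over the batch replaces the sum over T in Eq. (3) up to a constant factor, which does not change the optimum.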
3.2 Query Taxonomy
3.2.1 Text Classification. In an e-commerce scenario, query classification (QC) is important for understanding the buyer's intent and retrieving more related products. In addition to the QTS task, we therefore introduce a classification problem to infer c_q, which is equivalent to maximizing the posterior probability P(c_q \mid q; \theta_c); thus the training objective of QC can be written as

  -\log P(c_q \mid q; \theta_c)    (4)

However, this method assigns only one category to each query, while products are all arranged in a taxonomy tree (TT) and have several levels of categories. As shown in Table 1, on alibaba.com products are arranged into four levels, from broad field to specific. For instance, "balance scooter" falls under the category path 'Sports & Entertainment → Outdoor Sports → Scooters → Self-balancing Electric Scooters'. Consequently, we need the taxonomy tree to define a query's categories as well.

3.2.2 Progressively Hierarchical Classification. To take all levels of the category path into account, we design a hierarchical softmax structure named the Progressively Hierarchical Classification (PHC) network, which leverages the semantic information from root to leaf progressively. Our proposed structure is different from conventional hierarchical softmax methods, such as that of Mikolov et al. [14], where hierarchical softmax is used as a speedup technique and the binary Huffman tree is constructed from sample frequency, so it can hardly represent the correlation between different leaf nodes. We call the query taxonomy problem the Query Taxonomy Classification (QTC) task.

Suppose the product taxonomy is composed of L layers, each level l has its fitting parameters \theta_c^l, l \in [1, L], and level l's category is c^l. We implement an unsupervised method to build a large number of title-category and query-category pairs, denoted jointly as \tilde{Q} = Q \cup T together with their labels C_Q and C_T; refer to 3.3.1 for details. As shown in Figure 4, we design a recursive structure that takes into account all levels of categories before a specific layer l, together with the original first layer, and outputs a category at each level:

  P(c_q^l \mid q; \theta_c^l) = \sigma(F_l[\Phi_l(q)])    (5)

where

  \Phi_l(q) = \begin{cases} f_1(q), & l = 1, \\ f_l(f_{l-1}(f_{l-2}(\cdots f_1(q)))) + f_1(q), & 2 \le l \le L, \end{cases}    (6)

\sigma is the softmax function, f_l(\cdot) is a non-linear feature mapping at level l that outputs an intermediate representation, and F_l projects the representation onto the level-l category ids. For brevity, we omit the parameters in f_l(\cdot). So the final objective function of QTC is

  L_c = -\sum_{q \in \tilde{Q}} \sum_{l=1}^{L} \log P(c_q^l \mid q; \theta_c^l)    (7)

Figure 4: PHC structure.

This may look similar to a recurrent decoder generating the TT path, because the category distribution at layer l always depends on information from the previous layer. However, instead of using a recurrent neural network to model the category path, we incorporate more parameters {\theta_1, ..., \theta_L}, which can store more information about the correlation and difference between category nodes. What's more, field knowledge from the taxonomy tree is preserved better when we add f_1(q) to later levels as a residual than otherwise.
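A minimal PyTorch sketch of this head is given below, assuming the 128-d features and L = 4 levels from Sec. 4.2; the exact per-level category counts and the tanh non-linearity are illustrative assumptions (Table 1 only gives orders of magnitude).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PHC(nn.Module):
    """Progressively Hierarchical Classification head, Eqs. (5)-(7)."""

    def __init__(self, dim: int = 128, n_cats=(30, 200, 500, 5000)):
        super().__init__()
        # f_l: non-linear feature mapping at each level l
        self.f = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in n_cats)
        # F_l: projection onto the level-l category ids
        self.proj = nn.ModuleList(nn.Linear(dim, n) for n in n_cats)

    def forward(self, phi: torch.Tensor):
        """phi: (B, dim) shared text representation; returns per-level logits."""
        h1 = self.f[0](phi)                  # Phi_1(q) = f_1(q)
        logits = [self.proj[0](h1)]
        h = h1
        for l in range(1, len(self.proj)):
            h = self.f[l](h) + h1            # Eq. (6): recursion plus f_1(q) residual
            logits.append(self.proj[l](h))
        return logits

def qtc_loss(logits, labels):
    """Eq. (7): softmax + negative log-likelihood, summed over the L levels."""
    return sum(F.cross_entropy(z, y) for z, y in zip(logits, labels))
```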
3.3 Multi-Task Learning for Semantic Similarity
In Sections 3.1 and 3.2 we presented the formulation of the two tasks. In this part, we illustrate our system architecture and give more details on data generation, the multi-task strategy, and model establishment.

3.3.1 Unlabelled Data Generation. As mentioned in 3.1, in order to generate enough data for the QTS and QTC tasks, we implement an unsupervised method to build a large number of title-category pairs (T, C_T), query-category pairs (Q, C_Q), and also Q_t^+ for each t. A sketch of this pipeline follows the list below.

• Title-category pairs. We use the category path submitted by the seller as the product title's category path. We define our product taxonomy as four levels: top level, second level, third level, and leaf category. Without loss of generality, for products with fewer than four levels we copy the last available category node to fill the absent ones. Table 1 shows the taxonomy size at each level; it is a typically imbalanced distribution, which results from commercial discipline. Therefore, re-sampling from the original data and sample duplication are important for ensuring model quality.

• Query-category pairs. Unlike title-category pairs, queries originally have no category path. Instead of asking humans to evaluate or annotate millions of queries, we use click-through data as implicit feedback for generating queries' categories [8, 21, 24]. From the click-through data, we first filter out high-frequency queries and assign to each query the product category with the highest click-through rate. We also use bid-words as queries and set their bound product's category as the training label.

• Q_t^+ for each title. Similar to what we do with query-category pairs, we first screen out queries q and their corresponding products whose click-through rate is higher than a threshold, to form part of the <q^+, t> pairs. We also extract bid-words for each title to form the other part of the <q^+, t> pairs. Second, we randomly take K samples following the distribution P_n(c) as negative queries.
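To make the pipeline concrete, here is a small Python sketch of triplet generation under stated assumptions: the click log is a list of (query, title, ctr) rows, the CTR threshold value is hypothetical, K = 10 follows Sec. 4.1.1, and build_noise_sampler implements the U(c)^{3/4}/Z distribution from Sec. 3.1.2.

```python
import random

def build_noise_sampler(category_of_query):
    """P_n(c) proportional to U(c)^(3/4) (Sec. 3.1.2): pick a category by
    smoothed frequency, then a query within it."""
    by_cat = {}
    for q, c in category_of_query.items():
        by_cat.setdefault(c, []).append(q)
    cats = list(by_cat)
    weights = [len(by_cat[c]) ** 0.75 for c in cats]  # the normalizer Z is implicit
    def sample(k):
        return [random.choice(by_cat[c])
                for c in random.choices(cats, weights=weights, k=k)]
    return sample

def build_triplets(click_log, category_of_query, k_neg=10, ctr_threshold=0.1):
    """Assemble (q+, Q-, t) triplets; ctr_threshold is an assumed cutoff."""
    sample_negatives = build_noise_sampler(category_of_query)
    triplets = []
    for query, title, ctr in click_log:
        if ctr >= ctr_threshold:               # keep only confident <q+, t> pairs
            triplets.append((query, sample_negatives(k_neg), title))
    return triplets
```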
(ii) TextCNN∗ + QTS Our model would embed the query and title firstly, as shown (iii) TextCNN∗ + QTS + QTC w/o PHC in Figure 3, and then multi-filter-size convolution blocks would (iv) TextCNN∗ + QTC with PHC extract specific features at different granularities, from uni-gram (v) TextCNN∗ + QTC w/o PHC to 4-gram, to cover more types of correlation between words and (vi) TextCNN∗ + QTS + QTC with PHC phrases. Mean-pooling is applied after each convolution block and outputs a sentence level representation. The representations of all TextCNN∗ represents a classic TextCNN model which is initialized convolution blocks are concatenated together, forming a compre- via a word2vec [14] embedding. QTC w/o PHC means that this hensive vector, then two MLP layers are appended behind to project configuration implements a QTC task but only uses the leaf category this vector to semantic representation. Because we do not focus on in TT and drops the PHC structure. If the configuration has no QTS model structure designing in this paper, so for brevity, we define task, then it is only a taxonomy classification model. We use (ii) as Fs to represent all operations above as our baseline. In these experiments, we use L=4 to construct a PHC structure with four levels. The TextCNN’s embedding size is 80∗V , Fs (t, q j ; θ s ) = NormDistcosine (Φ1 (t), Φ1 (q j )) (8) where V = 900,000 is the vocabulary size. After that, there have 2 × 4 convolution layers behind, four sets of kernels with lengths from in which NormDistcosine (·) is scaled cosine function, 1 to 4 respectively and we applied max/mean-pooling after each convolution layer at each set. There are also two fully-connected vT1 v 2 NormDistcosine (v 1 , v 2 ) = 0.5 ∗ + 0.5 (9) layers behind with size of [128, 128]. fl (·) → R128 , s.t .1 ≤ l ≤ L. |v 1 ||v 2 | where Φ(∗) → Rd , Φ(·) represents the whole non-linear mapping 4.3 Results Analysis part of embedding model, and our proposed PHC takes Φ(t), Φ(q j ) 4.3.1 Performance of Multi-Task Training. We evaluate AUC on the as input and outputs {c tl }lL=1 , {cql j }lL=1 . Now the training objective annotated dataset, and test all levels of accuracy too. From Table 2, of our multi-task framework could be denoted as we can see that TextCNN + QTS has the lowest score on AUC. Since TextCNN∗ improves a lot on AUC, word2vec embedding initializa- L = λs Ls + λc Lc (10) tion is significant. If we take a comparison between results from (ii) (iii) (vi) and (iv) (vi), in which exp. (vi) increases relatively 5.10% and where λs and λc are hyper-parameters. We set λs as 1, and λc as 11.27% at AUC comparing to (iii) and (ii) respectively, at the same 0.1 in experiments. time, exp. (iii) and (vi) all perform better than other single-task solutions. At Acc4, (vi) outperform (iv) by 0.87% and (iii) outstrip 4 EXPERIMENTS (v) 4.77%. So it is easy for us to conclude that QTS and QTC are col- 4.1 Dataset and Metric laboratively optimized and jointly-training them could remarkably 4.1.1 Dataset. As we introduced in Sec. 3.3.1, our multi-tasks enhance each other. As mentioned in the introduction, it could force framework needs (T , CT ), (Q, CQ ) and Q t+ to train QTS and QTC the category information to flow back into semantic representation tasks jointly. Each sample is composed of a triplet of (q + , Q − , t), and promote its ability to restore more knowledge about categories, where Q − = {q − }. 
4 EXPERIMENTS
4.1 Dataset and Metric
4.1.1 Dataset. As introduced in Sec. 3.3.1, our multi-task framework needs (T, C_T), (Q, C_Q), and Q_t^+ to train the QTS and QTC tasks jointly. Each sample is a triplet (q^+, Q^-, t), where Q^- = \{q^-\}. We build the unlabeled triplet dataset D = \{(q^+, Q^-, t)\} by first sampling search queries and click data from one year of logs and then generating 10 candidates q^- for every (q^+, t). In total, we get an unlabeled dataset of about 5×10^8 (q^+, Q^-, t) triplets. To evaluate the semantic representation performance on the QTS task, 33,188 pairs were annotated into two categories, correlated and uncorrelated, by human effort or via users' click-through data. The queries of these 30,000+ pairs were also assigned category paths by humans for the QTC task.

4.1.2 Metric. Our multi-task semantic similarity is composed of two tasks, QTS and QTC. In QTS, pairs whose similarity score is higher than a threshold are assigned 1, the others 0, so we use the classical AUC score to measure effectiveness. In QTC, we use the accuracy of the classifier at each level.

4.2 Implementation Details
To show that improvements on semantic similarity and query classification can be achieved simultaneously and that the two tasks mutually boost each other, we design a joint-training experiment along with other ablations. We compare the following methods:

(i) TextCNN [9] + QTS
(ii) TextCNN* + QTS
(iii) TextCNN* + QTS + QTC w/o PHC
(iv) TextCNN* + QTC with PHC
(v) TextCNN* + QTC w/o PHC
(vi) TextCNN* + QTS + QTC with PHC

TextCNN* denotes a classic TextCNN model initialized with word2vec [14] embeddings. "QTC w/o PHC" means the configuration implements a QTC task but uses only the leaf categories of the TT and drops the PHC structure. If a configuration has no QTS task, it is only a taxonomy classification model. We use (ii) as our baseline.

In these experiments, we use L = 4 to construct a PHC structure with four levels. The TextCNN uses an 80-dimensional embedding over a vocabulary of V = 900,000 words (an 80 × V embedding matrix). It is followed by 2 × 4 convolution layers: four sets of kernels with lengths from 1 to 4, with max/mean-pooling applied after the convolution layer of each set. Two fully-connected layers of sizes [128, 128] follow, and f_l(\cdot) \to \mathbb{R}^{128} for 1 \le l \le L.
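Read together with Sec. 3.3.2, this configuration yields the shared encoder sketched below in PyTorch; the channel width, ReLU activation, and padding are our assumptions (the paper does not specify them), with both max- and mean-pooling per kernel size to match the 2 × 4 layer count.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Shared query/title encoder per Secs. 3.3.2 and 4.2: 80-d embeddings,
    kernel sizes 1-4 with max/mean pooling, then two 128-d dense layers."""

    def __init__(self, vocab_size=900_000, emb_dim=80, channels=128, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, channels, kernel_size=k, padding=k - 1)
            for k in (1, 2, 3, 4))
        self.mlp = nn.Sequential(
            nn.Linear(8 * channels, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)           # (B, emb_dim, M)
        pooled = []
        for conv in self.convs:
            h = torch.relu(conv(x))                         # n-gram features
            pooled += [h.max(dim=2).values, h.mean(dim=2)]  # max and mean pooling
        return self.mlp(torch.cat(pooled, dim=1))           # semantic representation
```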
4.3 Results Analysis

Table 2: Metric scores in QTS and QTC

  Model                                | AUC    | Acc1/Acc2/Acc3/Acc4 (%)
  (i)   TextCNN + QTS                  | 0.5557 | -
  (ii)  TextCNN* + QTS                 | 0.6300 | -
  (iii) TextCNN* + QTS + QTC w/o PHC   | 0.6670 | -/-/-/44.39
  (iv)  TextCNN* + QTC with PHC        | 0.6444 | 72.44/65.26/58.70/53.23
  (v)   TextCNN* + QTC w/o PHC         | 0.6193 | -/-/-/40.39
  (vi)  TextCNN* + QTS + QTC with PHC  | 0.7010 | 75.78/65.29/59.26/54.10

4.3.1 Performance of Multi-Task Training. We evaluate AUC on the annotated dataset and also test accuracy at all levels. From Table 2, we can see that TextCNN + QTS has the lowest AUC. Since TextCNN* improves AUC considerably, word2vec embedding initialization is significant. Comparing (ii), (iii), and (vi), experiment (vi) increases AUC by a relative 5.10% and 11.27% over (iii) and (ii) respectively; at the same time, (iii) and (vi) both outperform the single-task solutions. On Acc4, (vi) outperforms (iv) by 0.87%, and (iii) outstrips (v) by 4.77%. So we can conclude that QTS and QTC are collaboratively optimized, and jointly training them remarkably enhances both. As mentioned in the introduction, joint training forces category information to flow back into the semantic representation and promotes its ability to store more knowledge about categories, which matches our observation of the data structure on alibaba.com's e-commerce platform. Furthermore, for all methods we test their text embeddings and plot the ROC curves to depict the differences in effectiveness directly. In Figure 5, our proposed method with the QTS + QTC + PHC configuration achieves the highest AUC score, surpassing every configuration without multi-task learning.

Figure 5: ROC curves of the different methods.

4.3.2 Gain from the PHC Structure. From the results of (iv) vs. (v) and (iii) vs. (vi), Acc4 gains improvements of 2.51% and 3.40% from (v) to (iv) and from (iii) to (vi). These improvements come from the application of the taxonomy tree, in which more information from the non-leaf levels is stored and basic field knowledge is strengthened. The additional performance differences between (vi) and (iii), and between (vi) and (ii), show that with the PHC structure the QTS task is boosted further. We attribute these improvements to modeling the relationship between query words and all levels of category information rather than only the leaf nodes. From Figure 5, we can also see a 4.05% gain from TM* + QTC to TM* + QTC + PHC and a 5.09% gain from TM* + QTC + QTS to TM* + QTC + QTS + PHC on AUC.

4.4 Analysis on Online Evaluation
4.4.1 Common Search Experiment. Following our motivation for semantic matching, we wish to understand users' intent more precisely, retrieve more relevant products, and indirectly enhance the Click-Through Rate (CTR) and Feedback Rate (FBR) per page view (PV). In an e-commerce scenario, queries without enough retrieval results are called "null and low search" (NLS) queries. Frequent NLS harms the user experience and further undermines the benefit of the platform and sellers. So we conduct online experiments in a real-world e-commerce scenario, www.alibaba.com: we run an A/B test of our multi-task approach inside our online search engine and calculate the NLS rate on tail queries to evaluate the online performance of our methods.

Table 3: Online common search evaluation

  SE performance | PV-CTR (%) | FBR (%) | NLS (%)
  100% traffic   | +1.5       | +4.2    | -68.7

  (SE stands for Search Engine.)

In Table 3, the NLS rate drops by 68.7%, which is significant, while PV-CTR still increases by 1.5%. We can therefore conclude that our multi-task learning recalls more products, and the improved PV-CTR shows that the additional products are relevant; uncorrelated ones would have lowered PV-CTR instead. FBR also gains 4.2%, which means the precision of matching buyers' intent is increased as well.

4.4.2 E-commerce Advertisement Experiment. The paradigm used in common search can also be extended to the e-commerce computational advertising scenario (ADS), where advertisers want the products in their advertising campaigns to be exposed to buyers with implicit interest, to obtain orders or feedback. To show that our strategy also enhances performance in ADS, we ran another online A/B test on our ads engine. We use the ads coverage rate per PV (ADS-COV), the exposure ads CTR (ADS-CTR), and the revenue per mille (RPM) of the platform to evaluate matching ability. ADS-CTR and ADS-COV are defined as

  \mathrm{ADS\text{-}CTR} = \frac{N_{\mathrm{ADS\text{-}click}}}{N_{\mathrm{Exposed\text{-}ADS}}},    (11)

  \mathrm{ADS\text{-}COV} = \frac{N_{\mathrm{ADS\text{-}PV}}}{N_{\mathrm{PV}}},    (12)

where N_* denotes the number of *. (A small counting sketch of these metrics follows at the end of this subsection.) In Figure 6, we can see that after the Ads SE deployed our proposed method, ADS-CTR, ADS-COV, and RPM all improved significantly. The results of a continuous five-day online experiment are given in Table 4: RPM gains 13.4% in the 100% search-traffic configuration, and ADS-CTR improves by 6.6%, which means the additional 4.8% of ads exposed by our strategy are also relatively well correlated with buyers' intents.

Table 4: Online ads search evaluation

  Ads SE performance | ADS-COV (%) | ADS-CTR (%) | RPM (%)
  30% traffic        | +1.4        | +4.6        | +6.7
  100% traffic       | +4.8        | +6.6        | +13.4

Figure 6: Five-day experiment on the Ads SE: (a) ADS-COV, (b) ADS-CTR, (c) RPM.
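As referenced above, a small Python sketch of these counters; the click counter name is our reading of the garbled formulas, and the RPM definition (revenue per thousand PVs) is an assumption.

```python
def ads_metrics(n_ads_click, n_exposed_ads, n_ads_pv, n_pv, revenue):
    """Eqs. (11)-(12) plus RPM, computed from raw daily counters."""
    ads_ctr = n_ads_click / n_exposed_ads   # clicks per exposed ad
    ads_cov = n_ads_pv / n_pv               # share of page views showing ads
    rpm = 1000.0 * revenue / n_pv           # revenue per mille page views (assumed)
    return ads_ctr, ads_cov, rpm
```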
5 CONCLUSION
In this paper, we propose a multi-task method to jointly train query intent classification and semantic textual similarity, together with a novel recursive way to add the taxonomy tree into this framework. Experiments show that our proposed strategy achieves higher accuracy and AUC on the classification and similarity problems respectively, which justifies our assumptions that there is positive interaction between the two tasks and that using the taxonomy tree can also improve the semantic representation of queries.

Future work will proceed in two directions: first, we will add more information about products beyond titles to improve matching precision against users' queries; second, there are many other advanced frameworks for textual representation, and we will incorporate them into our tasks to obtain further business improvements.

REFERENCES
[1] Lida Abdi and Sattar Hashemi. 2015. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering 28, 1 (2015), 238–251.
[2] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018).
[3] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 126–134.
[4] Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 65–74.
[5] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13, Feb (2012), 307–361.
[6] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
[7] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems. 2042–2050.
[8] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2333–2338.
[9] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[10] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
[11] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[12] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
[13] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[15] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426 (2012).
[16] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward. 2015. Deep sentence embedding using the long short-term memory network: analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24, 4 (2015), 694–707.
[17] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence.
[18] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[19] Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[20] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI technical report (2018). https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf
[21] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information & Knowledge Management. ACM, 101–110.
[22] Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016. Match-SRNN: Modeling the recursive matching structure with spatial RNN. arXiv preprint arXiv:1604.04378 (2016).
[23] Jing Wang and Min-Ling Zhang. 2018. Towards mitigating the class-imbalance problem for partial label learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2427–2436.
[24] Rong Xiao, Jianhui Ji, Baoliang Cui, Haihong Tang, Wenwu Ou, Yanghua Xiao, Jiwei Tan, and Xuan Ju. 2019. Weakly supervised co-training of query rewriting and semantic matching for e-commerce. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 402–410.
[25] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 55–64.
[26] Honglun Zhang, Liqiang Xiao, Yongkun Wang, and Yaohui Jin. 2017. A generalized recurrent neural architecture for text classification with multi-task learning. arXiv preprint arXiv:1707.02892 (2017).
[27] Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, and Rong Jin. 2018. Visual search at Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 993–1001.
[28] Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015).