A Two-Phased Approach to Training Data Generation for Shopping Query Intent Prediction

Gautam Kumar¹,†, Chikara Hashimoto¹,†

¹ Rakuten Institute of Technology (RIT), Rakuten Group Inc., 1-chōme-14 Tamagawa, Setagaya City, Tokyo, Japan 158-0094

Abstract
Shopping Query Intent Prediction (SQIP) is the task of predicting, given an online shopping user's search query, e.g., "lv bag", the user's intents, e.g., Brand: Louis Vuitton. SQIP is an extreme multi-label classification task for which many excellent algorithms have been developed. However, little attention has been paid to how to create training data for SQIP. Previous studies used pseudo-labeled data derived from query-click logs for training and suffered from the noise in the logs. Although more sophisticated training data generation methods exist, they cannot be directly applied to SQIP. In this paper, we propose a novel training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" that checks a large number of pairs comprising an intent and a query to generate training data for SQIP. We show that such a model can be trained without manual supervision by utilizing a huge amount of online shopping data. We demonstrate that the SQIP model trained with data generated by our labeling model outperforms a model trained with query-click logs only and a model trained with data created by a competitive data-programming-based method.

Keywords
training data generation, data-centric AI, shopping query intent, text classification, query attribute value extraction, online shopping, e-commerce query intent

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA.
† These authors contributed equally.
gautam.kumar@rakuten.com (G. Kumar); chikara.hashimoto@rakuten.com (C. Hashimoto)
https://chikarahashimoto.wixsite.com/home (C. Hashimoto)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Online shoppers use search queries to search for products, and most queries have search intents that indicate what products the shoppers want. For example, the query "lv bag zebra" has Brand: Louis Vuitton and Pattern: Zebra as its intents, as shown in Table 1.¹

Table 1: Examples of queries and their intents

Query | Intents
"lv bag zebra" | Brand: Louis Vuitton; Pattern: Zebra
"100% orange juice" | Fruit taste: Orange
"cologne orange blossom" | Scent: Orange
"sneaker mens orange" | Color: Orange
"wheel 19inch" | Tire size: 18 - 19.9inch
"nicole down jacket" | Brand: Nicole; Filling: Feather

In this study, we assume that queries' intents are represented with attribute values of products defined in an online shopping service. Note that simple string matching between queries and intents would not work, since queries are written in natural language: they can contain abbreviations, e.g., "lv" for "Louis Vuitton", and ambiguous words, e.g., "orange", as indicated in Table 1. Moreover, intents are not always explicitly written in queries, as the last example in the table illustrates.

These intents, once correctly predicted, can be utilized by a search system to retrieve relevant products, since most products sold at an online shopping service have attribute values such as Brand: Louis Vuitton. Aggregated in bulk, the intents are also useful for understanding trends in different attributes, e.g., which brand and color of shoes users wanted most in the last month, and for understanding overall market demand, which could help merchants and manufacturers.

¹ Intents are represented in the form Attribute-name: Attribute-value in this paper. We represent attribute values of products in the same way.
Shopping query intent prediction (SQIP), given a query, predicts its intents by selecting the most relevant subset of attribute values from the attribute value inventory defined in an online shopping service. In other words, SQIP gives a natural language query a structure that facilitates the retrieval of products.

In brief, our proposed method has the following two phases:

1. Building the labeling model: Our labeling model is a binary classification model that predicts whether a given (query, intent) pair is valid. We generate good-quality training data for it and train a BERT sequence classification model. For the data generation, we follow these steps:
   a) Create the base SQIP model, trained on product catalog data with product titles as input (a title can be considered a long pseudo shopping query) and attribute values as output (they can be considered the pseudo query's intents).
   b) Generate (query, intent) pairs by predicting intents with the base SQIP model for the queries in the query-click logs, and take the intersection with the (query, intent) pairs extracted from the query-click logs themselves.

2. Training data generation for SQIP: From raw queries, obtain intents using the base SQIP model, and filter these intents using the labeling model.

The contributions of this paper are as follows:

1. We present a novel two-phased approach to training data generation for SQIP that requires no manual supervision.
2. We present how to build the labeling model, the key module of our two-phased approach, by combining weak supervision signals readily available in online shopping services.
3. We empirically demonstrate that our two-phased approach is effective through large-scale experiments.

1.1. Background

SQIP is an extreme multi-label text classification task for which many excellent algorithms have been developed recently [1, 2, 3, 4, 5, 6, 7, 8]. These classification algorithms can be used for SQIP once high-quality training data is available.

However, obtaining high-quality training data for SQIP is not straightforward. First of all, manually creating a sufficient volume of training data would be infeasible, because there are tens of thousands of predefined intents, and understanding shopping query intents requires deep knowledge of a large number of product domains. Accordingly, previous studies [9, 10] used query-click logs to automatically generate training data, assuming that if a product has an attribute value like Brand: Louis Vuitton and the product's page is clicked by a user who issued a query like "lv bag zebra," then an intent of the query is Brand: Louis Vuitton. This heuristic suffers from the inherent noise in query-click logs caused by, for instance, the inconsistent click behavior of fickle users or erroneous retrieval results. Besides, it cannot utilize the many queries that are absent from query-click logs.

Despite the notable difficulty of obtaining high-quality training data, little attention has been paid to this problem in previous SQIP studies. Due to the success of pre-trained models [11], transfer learning has also become popular recently [12], where pre-trained models can be seen as providing weak supervision. With this approach, one fine-tunes a model that has been trained on a relevant task for the target task, using a reasonable amount of quality training data, which we cannot expect in SQIP.

There have also been many studies on combining weak supervision signals to dispense with manually annotated training data [13, 14, 15, 16, 17, 18], which would be useful if we could devise more than one kind of weak supervision signal for a given task. For SQIP, however, it would be infeasible to assume that labeling functions [14, 15, 17] or keywords [16, 18] for target classes can frequently be applied to or matched against queries, since queries are usually very short and diverse. It would also be infeasible to prepare labeling functions or keywords for each class, since the number of classes in SQIP amounts to tens of thousands and the classes can change over time.

Automatically correcting corrupted labels has also gained much attention recently [19, 20, 21].
These methods learn label corruption matrices, which would be prohibitively large in SQIP since it has to deal with tens of thousands of classes.

1.2. Preview of the Proposed Method

What makes training data generation for SQIP difficult? We think it is the large number of classes; considering many classes for a query at once tends to be difficult. We therefore propose to decompose the task into two phases. In the first phase, we build a labeling model that checks whether an intent is valid for a query. In the second phase, we use this labeling model to verify, on a large scale, each pair comprising a query and an intent. Here, the labeling model can be seen as an annotator who is asked to create training data for SQIP. Refer to Figures 1 and 2 for more details.

How can we build the labeling model? We propose to utilize catalog data and query-click logs, since they are readily available in online shopping services and provide weak but different supervision signals that reinforce each other, as we will demonstrate in Section 4.

The base SQIP model is a weak SQIP model that takes queries as input and predicts their intents, from which we generate a set of (query, intent) pairs. The base SQIP model is trained with catalog data, the database of products sold at an online shopping service, in which various information about products, such as product titles and their attribute values, is registered. Product titles are usually a set of words that describe the features of products, such as "Louis Vuitton Shoulder bag Leather Zebra print," and can be seen as lengthy, detailed, merchant-made pseudo queries about the products. Since these titles (i.e., pseudo queries) are associated with attribute values of products (i.e., intents), we can use the catalog data to train the base SQIP model without manual annotation.

Query-click logs indicate the association between queries and the attribute values (i.e., intents) of clicked products. We generate another set of (query, intent) pairs based on this association.

Catalog data provides direct evidence of the association between product titles and attribute values (intents), but the titles are not real queries. In contrast, click logs show the association between real queries and intents, but the association is only indicated indirectly and tends to be noisy. In tandem, however, these two data sources can generate reliable training data for the labeling model.

Figure 1: Overview of our training data generation method.
In the first phase, we build the labeling model, which is depicted in detail in Figure 2. In the second phase, we generate candidate training data from unlabeled queries by using the base SQIP model; the labeling model then filters out invalid (query, intent) pairs to generate the final training data.

Figure 2: A closer look at the labeling model builder. Training data for the labeling model is the intersection of two sets of pairs comprising a query and an intent. Each set is generated by one of two weak generators: the base SQIP model and the query-click logs.

In summary, our proposed method creates a "machine annotator," namely the labeling model, from a huge amount of online shopping data, and uses it to generate training data for SQIP on a large scale without requiring any manual labor.

Through large-scale SQIP experiments, we demonstrate that the model trained with data generated by our proposed method outperforms a model trained with query-click logs only and a model trained with data created by a competitive training data generation method based on data programming [14].

All the data used in this study were obtained from an online shopping service, Rakuten, and are written in Japanese. However, the ideas and methods in this paper are independent of particular languages, and the examples in this paper are written in English for ease of explanation.
2. Related Work

2.1. Shopping Query Intent Prediction

Previous methods for SQIP can be categorized into classification-based methods [9, 10] and sequence-labeling-based methods [22].

In this study, our proposed method generates training data for the classification-based methods, for the following two reasons. First, with sequence-labeling-based methods it would be more difficult to deal with tens of thousands of classes, whereas for classification-based methods there have recently been many excellent extreme classification algorithms that can handle a huge number of classes. Second, sequence-labeling-based methods deal only with intents that are explicitly written in queries. However, valid intents are not always explicit in queries; e.g., "nicole down jacket" has Filling: Feather as a valid intent.

Our study differs from previous ones because we focus on how to obtain a huge volume of high-quality training data for SQIP, rather than on how to classify queries. Previous studies simply used query-click logs to obtain pseudo-labeled data [9, 10], which tends to be noisy and unreliable. We will demonstrate that our proposed method can generate better training data in Section 4.

2.2. Learning with Weak Supervision

Our study can be seen as answering the research question of how to train supervised models without relying on manual annotation, so studies on learning with weak supervision are quite relevant. As we discussed in Section 1, most of the previous weak-supervision methods are not appropriate for SQIP, since they require external knowledge bases [23, 24], a reasonable amount of quality training data [12], labeling functions or keywords for target classes [14, 15, 16, 17, 18], or label corruption matrices to be learned [19, 20, 21].

Shen et al. proposed learning classifiers with only class names [25]. However, their method assumes that classes are organized in a hierarchy, so we cannot use it for SQIP, where classes (intents) are not organized in a hierarchy. Karamanolakis et al. [26] proposed a method that works with weak supervision such as lexicons, regular expressions, and knowledge bases of the target domain. However, such weak supervision would become obsolete quickly in SQIP, as discussed in Section 1. Zhang et al. [27] proposed a teacher-student network method that utilizes weakly labeled behavior data for SQIP. However, they do use strongly labeled data in their training methodology to train the teacher network.

2.3. Extreme Multi-Label Classification

SQIP is an extreme multi-label classification (XML) task; XML, which tags a data point with the most relevant subset of labels from an extremely large label set, has gained much attention recently [1, 2, 3, 7, 8]. While many classification algorithms have been proposed, training data generation for XML has not been well studied. Zhang et al. [28] addressed data augmentation for XML, which assumes the existence of training data and thus cannot be applied to our setting. Our study therefore differs from previous XML studies in that we directly tackle the task of training data generation, though our method is specifically designed for SQIP.

For a more comprehensive overview of classification algorithms and data sets for XML, refer to http://manikvarma.org/downloads/XC/XMLRepository.html.

3. Proposed Method

In this section, we describe each component of our method as illustrated in Figures 1 and 2: catalog data, the base SQIP model, query-click logs, the labeling model, unlabeled queries, candidate training data, and the final training data.
3.1. Catalog Data

Catalog data contains various information about the products sold at the shopping service, including product titles, descriptions, prices, and various attribute values such as brands, sizes, and colors. We use product titles and attribute values to train the base SQIP model, since product titles are usually a set of words that indicate the features of products and can consequently be seen as lengthy, detailed queries about the products.

Table 2: Examples of product titles and attribute values

Product title | Attribute values
"[Next-day delivery] Nike Women's Zoom Vaper 9.5 Tour 631475-602 Lady's Shoes" | Brand: Nike; Color: Red
"TIFFANY&CO. tiffany envelope charm [NEW] SILVER 270000487012x" | Brand: Tiffany & Co.; Color: Silver
"[Kids clothes/STUSSY] Classic Logo Strapback Cap black a118a" | Clothing fabric: Cotton
"Fitty Closely-attached mask Pleated type Slightly small size Five-pack" | Mask shape: Pleated
"[Unused] iQOS 2.4PLUS IQOS White Electric cigarette Main body 58KK0100180" | Color: White
"NIKE AIR MAX 90 ESSENTIAL Sneaker Men's 537384-090 Black [In-stock, May 15]" | Shoe upper material: Leather; Brand: Nike

Table 2 shows examples of product titles and their attribute values in our catalog data, and it indicates the differences between product titles and real queries. First, product titles sometimes contain tokens that would not usually appear in queries, such as "[Unused]" and "[In-stock, May 15]." Second, real queries are usually much shorter than product titles. Third, attribute values do not always correspond to intents; for example, Color: Red would not be an intent if we regarded the product title in the first row of Table 2 as a shopping query. Catalog data is thus a useful data source for training a SQIP model but is not sufficiently reliable by itself, due to these differences.

To train the base SQIP model, we used 117 million product titles and their associated attribute values. The number of different attribute values was 19,416.

3.2. Base SQIP Model

The base SQIP model takes unlabeled queries such as "lv bag zebra" as input and predicts their intents, such as Brand: Louis Vuitton and Pattern: Zebra. We had to deal with hundreds of millions of training instances in our experiments (Section 4) and chose extremeText [29]. It was the only extreme multi-label classification method we experimented with that could handle all the training instances in our environment. Other extreme multi-label classification methods we experimented with include Parabel [1], Bonsai [2], LightXML [7], XR-Linear [30], and XR-Transformer [8].

The classification algorithm of extremeText is based on probabilistic label trees (PLT) [31], in which the leaf nodes represent the target labels and the other nodes are logistic regression classifiers. PLT guides data points from the root node into their appropriate leaf nodes (labels) with the logistic regression classifiers. For training the model, we did not conduct extensive hyper-parameter tuning; we used the default hyper-parameters, except that we chose PLT as the loss function and used TF-IDF weights for words.
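To make the data layout concrete, the following sketch (our illustration, not the authors' code) serializes a catalog record into the fastText-style supervised format that extremeText, a fastText derivative, consumes. The label-normalization scheme, which flattens an attribute value such as Brand: Louis Vuitton into a single whitespace-free token, is our assumption.

```python
# Minimal sketch: turn one catalog record (title + attribute values) into a
# fastText/extremeText supervised training line: "__label__..." tokens
# followed by the title text.

def to_label(attribute: str, value: str) -> str:
    # Labels must be single tokens, so whitespace is replaced (our convention).
    return "__label__" + f"{attribute}:{value}".replace(" ", "_")

def to_training_line(title: str, attribute_values) -> str:
    labels = " ".join(to_label(a, v) for a, v in attribute_values)
    return f"{labels} {title}"

with open("catalog_train.txt", "w", encoding="utf-8") as f:
    f.write(to_training_line(
        "Louis Vuitton Shoulder bag Leather Zebra print",
        [("Brand", "Louis Vuitton"), ("Pattern", "Zebra")]) + "\n")

# Training would then use extremeText's supervised mode with the PLT loss,
# roughly (flags other than -loss plt are assumptions on our part):
#   extremetext supervised -input catalog_train.txt -output base_sqip -loss plt
```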
3.3. Query-Click Logs

We used one year of query-click logs, which contained 72 million unique queries. As illustrated in Figure 2, the query-click logs are used to generate (query, intent) pairs as part of the training data for the labeling model. We simply enumerated all possible (query, intent) pairs such that a query is associated with an intent (attribute value) via a click relation in the logs.

3.4. Labeling Model

The labeling model takes a pair comprising a query (e.g., "lv bag zebra") and an intent (e.g., Brand: Louis Vuitton) as input and predicts whether the intent is valid for the query.

3.4.1. Model Architecture

BERT [11]-based models have been very promising for text pair classification and regression tasks, such as natural language inference (NLI) [32] and semantic textual similarity (STS) [33]. Since the task of the labeling model is binary classification, we used BertForSequenceClassification² with a pre-trained BERT model for Japanese³.

We intentionally adopted a very simple approach so that we could demonstrate the effectiveness of our method.

² https://huggingface.co/transformers/model_doc/bert.html
³ https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
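The following sketch shows how such a labeling model could be queried with the HuggingFace transformers API. The checkpoint name follows footnote 3; the sentence-pair encoding of (query, intent) and the assignment of class index 1 to "valid" are illustrative assumptions rather than specified details.

```python
# Minimal sketch of the labeling model's interface: encode (query, intent) as
# a BERT sentence pair, as in NLI/STS-style tasks, and read off P(valid).
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

NAME = "cl-tohoku/bert-base-japanese-whole-word-masking"  # needs fugashi/ipadic
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = BertForSequenceClassification.from_pretrained(NAME, num_labels=2)
model.eval()

def validity_probability(query: str, intent: str) -> float:
    """Return the classifier's probability that `intent` is valid for `query`
    (class index 1 = "valid" is an assumed label order)."""
    inputs = tokenizer(query, intent, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(validity_probability("lv bag zebra", "Brand: Louis Vuitton"))
```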
3.4.2. Training Data

As shown in Figure 2, we automatically generate the training data for the labeling model as the intersection of two sets of (query, intent) pairs: one set is generated with the base SQIP model⁴ and the other comes from the query-click logs. Although each of these two kinds of supervision signals is weak by itself, we can accurately obtain a large number of valid (query, intent) pairs by combining them.

To be specific, we obtained (query, intent) pairs such that the query is associated with the intent in the query-click logs and, given the query as input, the base SQIP model predicted the intent with probability 1.0. As a result, we generated 5.3 million (query, intent) pairs as positive examples for training the labeling model. We then generated 5.3 million (query, intent) pairs by randomly pairing queries and intents, which we used as negative examples.

⁴ The input to the base SQIP model is the queries in the query-click logs.
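The construction can be sketched as follows, under simple assumptions about the data structures: the click-log pairs as a set, and the base SQIP model as a callable returning a probability per intent.

```python
# Sketch of Section 3.4.2: positives are click-log pairs that the base SQIP
# model also predicts with probability 1.0; negatives are random re-pairings.
import random

def build_labeling_data(click_pairs, base_predict):
    """click_pairs: set of (query, intent); base_predict(query) -> {intent: prob}."""
    positives = {(q, i) for (q, i) in click_pairs
                 if base_predict(q).get(i, 0.0) >= 1.0}
    queries = [q for q, _ in positives]
    intents = [i for _, i in positives]
    negatives = set()
    while len(negatives) < len(positives):  # paper: equal numbers of each
        pair = (random.choice(queries), random.choice(intents))
        # Skipping pairs seen in the click logs is a safeguard we add here;
        # the paper only says queries and intents were paired randomly.
        if pair not in click_pairs:
            negatives.add(pair)
    return positives, negatives
```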
3.4.3. Training Detail

The labeling model was built with the training data and the model architecture described above. Training was done for one epoch with a batch size of 32, using the AdamW [34] optimizer.

3.5. Unlabeled Queries, Candidate Training Data, and Final Training Data

The second phase starts with predicting intents for unlabeled queries from the query logs with the base SQIP model to generate candidate training data. We then filter out erroneous intents with the labeling model to generate the final training data (a code sketch of this phase follows Table 3).

Unlabeled queries were obtained from seven years of query logs, which contained more than 1.5 billion unique queries.

Candidate training data were generated under the following conditions: k=5, meaning that the base SQIP model predicted at most the five most probable intents for a query, and threshold=1.0, i.e., only those intents whose probability was 1.0 were output. As a result, we obtained 377 million (query, intent) pairs. The number of unique queries was 264 million.

The final training data were those (query, intent) pairs whose probability given by the labeling model was at least 0.99. Consequently, we obtained 169 million (query, intent) pairs. The number of unique queries was 145 million. We trained and evaluated the SQIP model with this final training data, as reported in Section 4. Table 3 shows examples of the final training data.

Table 3: Examples of the final training data

Query | Intents
"alpha ma-1" | Brand: Alpha Industries
"orange t-shirt" | Color: Orange
"tropicana orange" | Fruit taste: Orange; Series: Tropicana
"gres perfume orange" | Scent: Orange; Brand: Gres
"washbowl 750" | Capacity: 600 - 899ml
"original message carnation" | Event/Holiday: Mother's Day
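The second phase can be summarized with the sketch below, in which the base SQIP model and the labeling model are passed in as callables; the function names and data shapes are illustrative assumptions.

```python
# Sketch of Section 3.5: candidate generation (top k=5 intents kept only at
# probability 1.0) followed by labeling-model filtering at threshold 0.99.

def candidate_pairs(queries, base_predict, k=5, threshold=1.0):
    """base_predict(query) -> {intent: prob}; yields candidate (query, intent)."""
    for q in queries:
        top = sorted(base_predict(q).items(), key=lambda kv: -kv[1])[:k]
        for intent, prob in top:
            if prob >= threshold:
                yield q, intent

def final_training_data(queries, base_predict, labeling_score):
    """labeling_score(query, intent) -> P(valid), e.g. the BERT labeling model."""
    return [(q, i) for q, i in candidate_pairs(queries, base_predict)
            if labeling_score(q, i) >= 0.99]
```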
4. Experiments

In this section, through large-scale SQIP experiments in which one predicts the intents of a given query, we claim the following:

1. Simply using query-click logs for training SQIP models delivers poor performance.
2. Using catalog data for training leads to better performance than simply using query-click logs but is still unsatisfactory.
3. Our proposed method, which exploits both catalog data and query-click logs, can generate even better training data.
4. Without the labeling model, the performance of our method degrades, indicating the effectiveness of the labeling model.
5. Our proposed method outperforms the competitive training data generation method based on data programming called Snorkel [14].

4.1. Experimental Conditions

In our experiments, we compared our proposed method with the four baseline methods described in Section 4.2. All the compared methods differ only in how they obtain training data. For classification, they use the same architecture, extremeText; specifically, all the methods trained their SQIP models with the PLT loss function and TF-IDF weights for words, and the other hyper-parameters were set to the default values.

Test data was manually created by a human annotator (who is not an author). The annotator was asked to check (query, intent) pairs that had been automatically generated by pairing a query with an intent such that at least one token in the query was semantically similar or relevant to the intent; this was done to exclude obviously erroneous (query, intent) pairs from all possible pairs in advance of the manual annotation.⁵ As a result, 5,615 different queries with at least one intent were obtained as test data, and a query was given 2.57 intents on average.

Evaluation was based on precision and recall, which were calculated with extremeText's test command. Precision and recall were calculated for the top k outputs (i.e., intents) with k being 1, 3, and 5, and we drew precision-recall curves for the compared methods for each k, with the probability threshold of extremeText varying from 0.0 to 1.0 at intervals of 0.01.

⁵ The semantic similarity was measured by the cosine similarity between sentence embeddings. We used fastText embeddings [35], which had been learned from the query logs. The threshold for the cosine similarity was set to 0.8.

4.2. Compared Methods

We compared the following five methods:

4.2.1. QueryClick

The simplest baseline is QueryClick, which uses query-click logs to generate training data in a similar way to the previous methods [9, 10]. Specifically, we used seven years of query-click logs and obtained (query, intent) pairs in which product pages that had the intent (i.e., attribute value) were clicked through the query at least ten times in the logs. The purpose of this filtering was to reduce the inherent noise in the query-click logs. As a result, we obtained more than 670 million (query, intent) pairs. The number of unique queries was 7,962,605, which indicates that each query was given as many as 84 intents on average. This number is obviously too large given that most queries consist of fewer than ten tokens, and it supports our claim that simply using query-click logs as training data is inadequate.

4.2.2. Base

This is the base SQIP model, which uses only product titles and their associated attribute values for training.

4.2.3. Proposed

This is a SQIP model trained with the final training data generated by our proposed method, as described in Section 3.

4.2.4. Proposed-LM

This is the same as Proposed except that it does not use the labeling model. Proposed-LM is thus trained with the candidate training data from the second phase; its training process is similar to self-training. Note that the difference in performance between Proposed-LM and Proposed can be seen as indicating the effectiveness of the labeling model.

4.2.5. Snorkel

This baseline is the same as Proposed, except that the labeling model is replaced with Snorkel [14], a training data generation method based on data programming [13]. Like Proposed's labeling model, Snorkel's labeling model can be learned without manual supervision. However, Snorkel requires labeling functions that implement a variety of domain knowledge, heuristics, and any other kind of weak supervision that would be useful for the given task. Each labeling function takes unlabeled data points as input and predicts their class labels. Snorkel then uses these weakly-labeled data points to train a generative labeling model that is supposed to label each data point more accurately than the labeling functions. Snorkel has influenced subsequent studies on training data generation [17] and has also been adopted by the world's leading organizations, as described at https://www.snorkel.org/. We therefore think that comparing with a Snorkel-based baseline effectively shows Proposed's performance.

Our Snorkel baseline, to be specific, was implemented in the following way. The input and output of Snorkel's labeling model are the same as those of Proposed's labeling model: the input (query, intent) pairs are generated with the base SQIP model, and the output is whether the given (query, intent) pairs are valid or not. We defined the following labeling functions, which utilize the same two kinds of weak supervision as Proposed, i.e., the query-click logs and the base SQIP model (a code sketch follows at the end of this subsection):

1. If the given intent is associated with the given query in the query-click logs, return valid; otherwise return invalid.
2. If the output probability of the base SQIP model for the given (query, intent) pair is 1.0, return valid; otherwise abstain.
3. Return invalid if the output probability is not greater than 0.995; otherwise abstain.

Snorkel's labeling model was trained with 11 million (query, intent) pairs that had been weakly labeled with these three labeling functions. Proposed's labeling model was trained with 10.6 million (query, intent) pairs, as described in Section 3.4.2.
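The three labeling functions can be expressed with the snorkel library roughly as follows; the toy data and the column layout (a precomputed base-model probability per pair) are illustrative assumptions, not the exact implementation.

```python
# Sketch of the three labeling functions and the generative label model,
# using snorkel's labeling API over a pandas DataFrame of candidate pairs.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

VALID, INVALID, ABSTAIN = 1, 0, -1

# Toy stand-ins for the paper's data sources (assumed for illustration).
CLICK_LOG_PAIRS = {("lv bag zebra", "Brand: Louis Vuitton")}
df_train = pd.DataFrame([
    {"query": "lv bag zebra", "intent": "Brand: Louis Vuitton", "base_prob": 1.0},
    {"query": "lv bag zebra", "intent": "Nation: Latvia", "base_prob": 0.31},
])

@labeling_function()
def lf_click_log(x):
    # LF1: valid iff the pair occurs in the query-click logs.
    return VALID if (x.query, x.intent) in CLICK_LOG_PAIRS else INVALID

@labeling_function()
def lf_base_high(x):
    # LF2: valid if the base SQIP model predicts the intent with probability 1.0.
    return VALID if x.base_prob >= 1.0 else ABSTAIN

@labeling_function()
def lf_base_low(x):
    # LF3: invalid if the base model's probability is not greater than 0.995.
    return INVALID if x.base_prob <= 0.995 else ABSTAIN

L_train = PandasLFApplier([lf_click_log, lf_base_high, lf_base_low]).apply(df_train)
label_model = LabelModel(cardinality=2)    # generative model over the LF votes
label_model.fit(L_train)
probs = label_model.predict_proba(L_train)  # P(valid) per (query, intent) pair
```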
4.3. Results

Figure 3: Precision-recall curves for all experiments.

Figure 3 shows the precision-recall curves for the compared methods, from which we make the following observations:

1. QueryClick's precision decreases sharply as we try to increase recall.
2. Base generally outperforms QueryClick, though its performance is still unsatisfactory.
3. Proposed outperforms all the other methods. Table 4 shows Proposed's best F1 scores and the corresponding precision, recall, and threshold values for each k.
4. Proposed-LM's performance is worse than that of Proposed.
5. Snorkel can deliver good performance but cannot outperform Proposed.

Table 4: Proposed's best F1 scores

k | F1 | Precision | Recall | Threshold
1 | 0.537 | 0.678 | 0.444 | 0.21
3 | 0.535 | 0.620 | 0.470 | 0.26
5 | 0.531 | 0.608 | 0.471 | 0.26

The relatively low performance of QueryClick and Base and the relatively high performance of Proposed and Snorkel indicate that query-click logs and catalog data alone provide only weak supervision, but that combining them can lead to higher performance.

Comparing Proposed with Snorkel shows the superiority of our labeling model over Snorkel. We think this is because the labeling functions of Snorkel, and learning methods with weak heuristic rules in general, are known to suffer from low coverage [26]; rules tend to apply to only a small subset of instances. In fact, the first labeling function for Snorkel covered only 1.94% of the training instances; the second and third labeling functions covered 60.61% and 12.73%, respectively. The labeling model of Proposed, on the other hand, is learned from natural language words and phrases, which BERT makes maximum use of; that is to say, the labeling model of Proposed does not waste the training data.

4.4. Error Analysis

Table 5: Examples of wrong predictions made by Proposed

Query | True Intents | Predicted Intents
"glen prince" | Brand: Glen Prince | Brand: Prince
"red wing engineer boots us 7.5" | Brand: Red Wing | Color: Red
"pc 3 12800 ddr 3 sdram" | Memory Standards: DDR3 | −

Table 5 shows examples of wrong predictions made by Proposed (k=1, threshold=0.21). Most of the errors were due to class imbalance in the training data; i.e., the distribution of training instances across the intents is biased or skewed, and intents with few or no instances tend to be difficult to predict [36]. Regarding the first example in Table 5, "prince" can be Brand: Prince and can also be part of Brand: Glen Prince. However, the frequency of the former intent in the final training data was 119,972, whereas that of the latter was only 33, which caused the SQIP model to choose the former for the query. Regarding the second example, the frequency of Color: Red was 1,486,315, while that of Brand: Red Wing was 15,592. For the last example, there was no training instance for Memory Standards: DDR3 in our final training data, and thus the SQIP model could not predict it.

4.5. Effect of Training Data Size

Figure 4: Changes in F1 due to different training data sizes for Proposed.

Figure 4 shows the F1 scores of Proposed built with final training data of different sizes ('K' and 'M' stand for 'thousand' and 'million'). The k and the threshold of extremeText were set to 1 and 0.21, respectively. The graph indicates that increasing the data size leads to better performance and that our final training data is effective for SQIP. Although the improvement from 10M to 145M is small, it is noteworthy that additional data could still improve a model already trained with as many as 10M instances.

5. Future Direction

For training data generation, one possible direction is to use product genre/category information. If we could create a query-to-product-genre mapping of reliable quality, we could filter the (query, intent) pairs further and create higher-quality training data. We could also utilize neighbor signals, since similar queries should have more labels in common, to remove further noise from the dataset.

For the classification model, one possibility is to use label (i.e., intent) context information to create the embedding vector of the input text (i.e., the shopping query). Similar previous work is by Chen et al. [37], who use LGuidedLearn [38] for product item category classification. Another possible method is the label-specific document representation for multi-label text classification by Xiao et al. [39]. Also, Cai et al. [40] propose a hybrid neural network model that simultaneously takes advantage of both label semantics and fine-grained text information. Further possibilities are contrastive learning and kNN-based methods [41, 42].

Another direction is to extend our proposed method to other domains. If we can find a way to exploit the weak supervision signals readily available in a domain for building the labeling model, we can easily apply our approach to that domain. In the case of text classification into Wikipedia categories [43], for instance, not only the category information in Wikipedia articles but also the links among corresponding articles in different languages and the class hierarchy in Wikidata [44] can be exploited.

As we saw in Section 4.4, data imbalance is an issue; we aim to address it in future work.

6. Conclusion

In this paper, we proposed a novel two-phased training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" that checks a large number of pairs comprising an intent and a query to generate training data for SQIP. We presented how to train such a model without manual supervision by utilizing a huge amount of online shopping data. Through a series of large-scale experiments with data from a real online shopping service, we demonstrated the effectiveness of our proposed method.

Acknowledgments

We thank our annotator, Saki Hiraga-san, for helping us create the evaluation dataset. We also thank all the researchers in RIT for their support of this project.
References

[1] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, M. Varma, Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising, in: Proceedings of the 2018 World Wide Web Conference, WWW '18, 2018, pp. 993–1002.
[2] S. Khandagale, H. Xiao, R. Babbar, Bonsai – diverse and shallow trees for extreme multi-label classification, 2019. arXiv:1904.08249.
[3] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. S. Dhillon, Taming pretrained transformers for extreme multi-label text classification, in: KDD '20, 2020, pp. 3163–3171.
[4] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, M. Varma, DeepXML: A deep extreme multi-label learning framework applied to short text documents, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, 2021, pp. 31–39.
[5] A. Mittal, K. Dahiya, S. Agrawal, D. Saini, S. Agarwal, P. Kar, M. Varma, DECAF: Deep extreme classification with label features, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, 2021, pp. 49–57.
[6] A. Mittal, N. Sachdeva, S. Agrawal, S. Agarwal, P. Kar, M. Varma, ECLARE: Extreme classification with label graph correlations, in: Proceedings of the Web Conference 2021, WWW '21, 2021, pp. 3721–3732.
[7] T. Jiang, D. Wang, L. Sun, H. Yang, Z. Zhao, F. Zhuang, LightXML: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, 2021, pp. 7987–7994.
[8] J. Zhang, W.-C. Chang, H.-F. Yu, I. S. Dhillon, Fast multi-resolution transformer fine-tuning for extreme multi-label text classification, in: 35th Conference on Neural Information Processing Systems, NeurIPS 2021, 2021.
[9] C. Wu, A. Ahmed, G. R. Kumar, R. Datta, Predicting latent structured intents from shopping queries, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, 2017, pp. 1133–1141.
[10] J. Zhao, H. Chen, D. Yin, A dynamic product-aware learning model for e-commerce query intent understanding, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1843–1852.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '19, 2019, pp. 4171–4186.
[12] M. Ben Noach, Y. Goldberg, Transfer learning between related tasks using expected label proportions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP '19, 2019, pp. 31–42.
[13] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29 of NeurIPS '16, 2016.
[14] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11 (2017) 269–282.
[15] B. Hancock, M. Bringmann, P. Varma, P. Liang, S. Wang, C. Ré, Training classifiers with natural language explanations, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 1884–1895.
[16] Y. Meng, J. Shen, C. Zhang, J. Han, Weakly-supervised neural text classification, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, 2018, pp. 983–992.
[17] A. Awasthi, S. Ghosh, R. Goyal, S. Sarawagi, Learning from rules generalizing labeled exemplars, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeuexBtDr.
[18] Y. Meng, Y. Zhang, J. Huang, C. Xiong, H. Ji, C. Zhang, J. Han, Text classification using label names only: A language model self-training approach, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP '20, 2020, pp. 9006–9017.
[19] G. Patrini, A. Rozza, A. K. Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '17, 2017, pp. 2233–2241.
[20] D. Hendrycks, M. Mazeika, D. Wilson, K. Gimpel, Using trusted data to train deep networks on labels corrupted by severe noise, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, 2018, pp. 10477–10486.
[21] G. Zheng, A. H. Awadallah, S. Dumais, Meta label correction for noisy label learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35 of AAAI '21, 2021.
[22] X. Li, Y.-Y. Wang, A. Acero, Extracting structured information from user queries with semi-supervised conditional random fields, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, 2009, pp. 572–579.
[23] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, 2009, pp. 1003–1011.
[24] F. Brahman, V. Shwartz, R. Rudinger, Y. Choi, Learning to rationalize for nonmonotonic reasoning with distant supervision, in: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI '21, AAAI Press, 2021, pp. 12592–12601.
[25] J. Shen, W. Qiu, Y. Meng, J. Shang, X. Ren, J. Han, TaxoClass: Hierarchical multi-label text classification using only class names, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '21, 2021, pp. 4239–4249.
[26] G. Karamanolakis, S. Mukherjee, G. Zheng, A. H. Awadallah, Self-training with weak supervision, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL '21, 2021, pp. 845–863.
[27] D. Zhang, Z. Li, T. Cao, C. Luo, T. Wu, H. Lu, Y. Song, B. Yin, T. Zhao, Q. Yang, QUEACO: Borrowing treasures from weakly-labeled behavior data for query attribute value extraction, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 4362–4372. URL: https://doi.org/10.1145/3459637.3481946. doi:10.1145/3459637.3481946.
[28] D. Zhang, T. Li, H. Zhang, B. Yin, On data augmentation for extreme multi-label classification, CoRR abs/2009.10778 (2020). URL: https://arxiv.org/abs/2009.10778.
[29] M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński, A no-regret generalization of hierarchical softmax to extreme multi-label classification, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NeurIPS '18, 2018, pp. 6358–6368.
[30] H.-F. Yu, K. Zhong, I. S. Dhillon, PECOS: Prediction for enormous and correlated output spaces, arXiv preprint arXiv:2010.05878 (2020).
[31] K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, E. Hullermeier, Extreme F-measure maximization using sparse probability estimates, in: M. F. Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 2016, pp. 1435–1444.
[32] Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, X. Zhou, Semantics-aware BERT for language understanding, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[33] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017, pp. 1–14.
[34] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: 7th International Conference on Learning Representations, ICLR '19, 2019.
[35] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).
[36] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
[37] L. Chen, H. Miyake, Label-guided learning for item categorization in e-commerce, in: NAACL, 2021.
[38] X. Liu, S. Wang, X. Zhang, X. You, J. Wu, D. Dou, Label-guided learning for text classification, 2020. URL: https://arxiv.org/abs/2002.10772. doi:10.48550/ARXIV.2002.10772.
[39] L. Xiao, X. Huang, B. Chen, L. Jing, Label-specific document representation for multi-label text classification, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 466–475. URL: https://aclanthology.org/D19-1044. doi:10.18653/v1/D19-1044.
[40] L. Cai, Y. Song, T. Liu, K. Zhang, A hybrid BERT model that incorporates label semantics via adjustive attention for multi-label text classification, IEEE Access 8 (2020) 152183–152192. doi:10.1109/ACCESS.2020.3017382.
[41] L. Zhu, H. Chen, C. Wei, W. Zhang, Enhanced representation with contrastive loss for long-tail query classification in e-commerce, in: Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 141–150. URL: https://aclanthology.org/2022.ecnlp-1.17. doi:10.18653/v1/2022.ecnlp-1.17.
[42] X. Su, R. Wang, X. Dai, Contrastive learning-enhanced nearest neighbor mechanism for multi-label text classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 672–679. URL: https://aclanthology.org/2022.acl-short.75. doi:10.18653/v1/2022.acl-short.75.
[43] O. Dekel, O. Shamir, Multiclass-multilabel classification with more classes than examples, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, 2010, pp. 137–144.
[44] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledge base, Communications of the ACM 57 (2014) 78–85. URL: http://cacm.acm.org/magazines/2014/10/178785-wikidata/fulltext.