<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Artificial Intelli</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/ARXIV.2002.10772</article-id>
      <title-group>
        <article-title>A Two-Phased Approach to Training Data Generation for Shopping Query Intent Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gautam Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chikara Hashimoto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rakuten Institute of Technology (RIT), Rakuten Group Inc.</institution>
          ,
          <addr-line>1-chōme-14 Tamagawa, Setagaya City, Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
          <addr-line>158-0094</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4239</fpage>
      <lpage>4249</lpage>
      <abstract>
        <p>Shopping Query Intent Prediction (SQIP) is, given an online shopping user's search query, e.g., “lv bag”, to predict their intents, e.g., Brand: Louis Vuitton. SQIP is an extreme multi-label classification task for which many excellent algorithms have been developed. However, little attention has been paid to how to create training data for SQIP. Previous studies used pseudo-labeled data derived from query-click logs for training and suffered from the noise in the logs. Although there are more sophisticated training data generation methods, they cannot be directly applied to SQIP. In this paper, we propose a novel training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" who checks a number of pairs comprising an intent and a query to generate training data for SQIP. We show that such a model can be trained without manual supervision by utilizing a huge amount of online shopping data. We demonstrate that the SQIP model trained with data generated by our labeling model outperforms a model trained with query-click logs only and a model trained with data created by a competitive data-programming-based method.</p>
      </abstract>
      <kwd-group>
        <kwd>training data generation</kwd>
        <kwd>data-centric ai</kwd>
        <kwd>shopping query intent</kwd>
        <kwd>text classification</kwd>
        <kwd>query attribute value extraction</kwd>
        <kwd>online shopping</kwd>
        <kwd>e-commerce query intent</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Online shoppers use search queries to search for products,</title>
        <p>and most queries have search intents that indicate what products shoppers want. For example, the query “lv bag zebra” has Brand: Louis Vuitton and Pattern: Zebra as its intents, as shown in Table 1.1</p>
        <p>Table 1: Example queries and their intents. “lv bag zebra”: Brand: Louis Vuitton, Pattern: Zebra; “100% orange juice”: Fruit taste: Orange; “cologne orange blossom”: Scent: Orange; “sneaker mens orange”: Color: Orange; “wheel 19inch”: Tire size: 18 - 19.9inch; “nicole down jacket”: Brand: Nicole, Filling: Feather.</p>
        <p>In this study, we assume that queries’ intents are represented with attribute values of products defined in an online shopping service. Notice that simple string matching between queries and intents would not work, since queries are written in natural languages; they can be represented with abbreviations, e.g., “lv” for “Louis Vuitton”, and ambiguous words, e.g., “orange”, as indicated in Table 1. Moreover, intents might not always be explicitly written in queries, as the last example in the table illustrates.</p>
        <p>These intents, once correctly predicted, would be utilized by a search system to retrieve relevant products, since most products sold at an online shopping service have attribute values such as Brand: Louis Vuitton. If we aggregate these intents in bulk, they will be very useful in understanding the trends of different attributes, e.g., shoes of which brand and color the users wanted the most in the last month. They will also be very helpful in understanding the overall market demand, which could help the merchants and the manufacturing companies.</p>
        <p>Shopping query intent prediction (SQIP), given a query, predicts its intents by selecting the most relevant subset of attribute values from the attribute-value inventory defined in an online shopping service. In other words, SQIP gives a natural language query a structure to facilitate the retrieval of products.</p>
        <p>In brief, our proposed method has the following two phases:
1. Making of the Labeling Model: Our labeling model is a binary classification model which predicts whether a given (query, intent) pair is valid or not. For this, we generate good-quality training data and train a BERT sequence classification model. For the data generation, we follow these steps:
a) Create the base SQIP model, trained on product catalog data with the input “product title” (which could be considered a long pseudo shopping query) and the output “attribute values” (which could be considered the pseudo shopping query’s intents).
b) Generate (query, intent) pairs by getting intents of queries from the query-click logs using the base SQIP model, and take the intersection with the (query, intent) pairs from the query-click logs.
2. Training Data Generation for SQIP: From raw queries, get intents using the base SQIP model and filter these intents using the labeling model.</p>
        <p>The contributions of this paper are the following:
1. We present a novel two-phased approach to training data generation for SQIP that requires no manual supervision.
2. We present how to build the labeling model, the key module of our two-phased approach, by combining weak supervision signals readily available in online shopping services.
3. We empirically demonstrate that our two-phased approach is effective through large-scale experiments.</p>
        <p>DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. † These authors contributed equally. gautam.kumar@rakuten.com (G. Kumar); chikara.hashimoto@rakuten.com (C. Hashimoto); https://chikarahashimoto.wixsite.com/home (C. Hashimoto). 1 Intents are represented in the form Attribute-name: Attribute-value in this paper. We also represent attribute values of products in a similar way.</p>
        <sec id="sec-1-1-1">
          <title>1.1. Background</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>SQIP is an extreme multi-label text classification task for</title>
        <p>
          which many excellent algorithms have been developed recently [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5, 6, 7, 8</xref>
          ]. These classification algorithms can be used for SQIP once high-quality training data is available.
        </p>
        <p>However, obtaining high-quality training data for SQIP is not straightforward. First of all, manual creation of a sufficient volume of training data would be infeasible, because there are tens of thousands of predefined intents and understanding shopping query intents would require deep knowledge of a large number of product domains. Accordingly, previous studies [9, 10] used query-click logs to automatically generate training data by assuming that if a product has an attribute value like Brand: Louis Vuitton and the page of the product is clicked by a user who issued a query like “lv bag zebra,” an intent of the query is Brand: Louis Vuitton. This heuristic suffers from the inherent noise in query-click logs due to, for instance, inconsistent click behaviors of fickle users or erroneous retrieval results. Besides, it cannot utilize a number of queries that are absent in query-click logs. Despite the notable difficulty of obtaining high-quality training data, little attention has been paid to the problem in previous SQIP studies.</p>
        <p>Due to the success of pre-trained models [11], transfer learning has also been popular recently [12], where pre-trained models can be seen as providing weak supervision. With this approach, one fine-tunes a model that has been trained on a relevant task for the purpose of the target task, using a reasonable amount of quality training data, which we cannot expect in SQIP.</p>
        <p>There have also been many studies on combining weak supervision signals to dispense with manually annotated training data [13, 14, 15, 16, 17, 18], which would be useful if we may devise more than one kind of weak supervision signal for a given task. For SQIP, however, it would be infeasible to assume that labeling functions [14, 15, 17] or keywords [16, 18] for target classes can be frequently applied to or matched against queries, since queries are usually very short and diverse. It would also be infeasible to prepare labeling functions or keywords for each class, since the number of classes in SQIP amounts to tens of thousands and the classes can change over time.</p>
        <p>Automatically correcting corrupted labels has also gained much attention recently [19, 20, 21]. These methods learn label corruption matrices, which would be prohibitively large in SQIP since it has to deal with tens of thousands of classes.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.2. Preview of the Proposed Method</title>
        <p>What makes training data generation for SQIP difficult? We think it is the large number of classes; considering many classes for a query at once tends to be difficult. We therefore propose to decompose the task into two phases.</p>
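        <p>The two-phase decomposition can be sketched in code as follows. This is a minimal illustration; the function and helper names are ours, not taken from the paper's implementation:</p>
```python
# Phase 2 of the proposed method, sketched with hypothetical helpers:
# base_sqip_predict(query) returns candidate intents (the phase-1 base model),
# labeling_prob(query, intent) is the labeling model's validity probability.
def generate_training_data(queries, base_sqip_predict, labeling_prob, threshold=0.99):
    final = []
    for query in queries:
        for intent in base_sqip_predict(query):            # candidate (query, intent) pairs
            if labeling_prob(query, intent) >= threshold:  # keep only pairs judged valid
                final.append((query, intent))
    return final
```
        <p>Here the labeling model acts as the machine annotator: each candidate pair is accepted or rejected independently, so the full set of classes never has to be considered at once.</p>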
        <p>In the first phase, we build a labeling model that checks
whether an intent is valid for a query. In the second phase,
we use this labeling model to verify each pair comprising
a query and an intent on a large scale. Here, the labeling
model can be seen as an annotator who is asked to create
training data for SQIP. Refer to Figures 1 and 2 for more details.</p>
        <p>How can we build the labeling model? We propose to utilize catalog data and query-click logs, since they are readily available in online shopping services and provide weak but different supervision signals, so that they would reinforce each other, as we will demonstrate in Section 4.</p>
        <p>The base SQIP model is a weak SQIP model that takes queries as input and predicts their intents, from which we generate a set of (query, intent) pairs. The base SQIP model is trained with catalog data, the database of products sold at an online shopping service, where various information about products, such as product titles and their attribute values, is registered. Product titles are usually a set of words that describe the features of products, such as “Louis Vuitton Shoulder bag Leather Zebra print,” which can be seen as lengthy, detailed, merchant-made pseudo queries about the products. Since these titles (i.e., pseudo queries) are associated with attribute values of products (i.e., intents), we can use the catalog data to train the base SQIP model without manual annotation.</p>
        <p>Figures 1 and 2 (diagrams omitted) depict the two phases. In Phase 1, the base SQIP model and the query-click logs together produce training data for the labeling model. In Phase 2, the base SQIP model proposes candidate intents for unlabeled queries, e.g., “lv bag zebra” with candidates Brand: Louis Vuitton, Nation: Latvia, and Pattern: Zebra; the labeling model then judges each (query, intent) pair as valid or invalid, so that only the valid pairs, e.g., (“lv bag zebra”, Brand: Louis Vuitton) and (“lv bag zebra”, Pattern: Zebra), enter the final training data.</p>
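        <p>Deriving training instances for the base SQIP model from catalog records can be sketched as follows (an illustrative sketch; the record layout is assumed, not taken from the paper):</p>
```python
# Each catalog record yields one multi-label instance: the product title is
# treated as a lengthy pseudo query, and its attribute values as the intents.
def catalog_to_instances(catalog_records):
    instances = []
    for rec in catalog_records:
        labels = [f"{name}: {value}" for name, value in rec["attributes"]]
        instances.append((rec["title"], labels))
    return instances
```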
        <p>Query-click logs indicate the association between queries and clicked items’ attribute values (i.e., intents). We generate another set of (query, intent) pairs based on this association.</p>
        <p>Catalog data provides the direct evidence of the association between product titles and attribute values (intents), but the titles are not real queries. In contrast, click logs show the association between real queries and intents, but it is only indicated indirectly and tends to be noisy. However, these two data sources can generate reliable training data for the labeling model in tandem.</p>
        <p>In summary, our proposed method creates a "machine annotator", namely the labeling model, using a huge amount of online shopping data to generate training data for SQIP on a large scale without requiring any manual labor.</p>
        <p>Through large-scale SQIP experiments, we demonstrate that the model trained with data generated by our proposed method outperforms a model trained with query-click logs only and a model trained with data created by a competitive training data generation method based on data programming [14].</p>
        <p>All the data used in this study were obtained from an online shopping service, Rakuten, and written in Japanese. However, the ideas and methods in this paper are independent of particular languages, and the examples in this paper are written in English for ease of explanation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Shopping Query Intent Prediction</title>
        <sec id="sec-2-1-1">
          <title>Previous methods for SQIP can be categorized into</title>
          <p>classification-based methods [9, 10] and sequence-labeling-based methods [22].</p>
          <p>In this study, our proposed method generates training data for the classification-based methods, for the following two reasons. First, with sequence-labeling-based methods, it would be more difficult to deal with tens of thousands of classes, while, for classification-based methods, there have recently been many excellent extreme classification algorithms that can handle a huge number of classes. Second, sequence-labeling-based methods deal with only intents that are explicitly written in queries. However, valid intents are not always explicit in queries; e.g., “nicole down jacket” has Filling: Feather as its valid intent.</p>
          <p>Our study is different from previous ones because we focus on how to obtain a huge volume of high-quality training data for SQIP, rather than how to classify queries. Previous studies simply used query-click logs to obtain pseudo-labeled data [9, 10], which tends to be noisy and unreliable. We will demonstrate that our proposed method can generate better training data in Section 4.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Learning with Weak Supervision</title>
        <sec id="sec-2-2-1">
          <title>Our study can be seen as answering the research question of how to train supervised models without relying on manual annotation, and therefore studies on learning with weak supervision are quite relevant.</title>
          <p>As we discussed in Section 1, most of the previous weak supervision methods are not appropriate for SQIP, since they require external knowledge bases [23, 24], a reasonable amount of quality training data [12], labeling functions or keywords for target classes [14, 15, 16, 17, 18], or label corruption matrices to be learned [19, 20, 21]. Shen et al. proposed learning classifiers with only class names [25]. However, their method assumes that classes are organized in a hierarchy, so we cannot use their method for SQIP, where classes (intents) are not organized in a hierarchy. Karamanolakis et al. [26] proposed a method that works with weak supervision such as lexicons, regular expressions, and knowledge bases of the target domain. However, such weak supervision would become obsolete quickly in SQIP, as discussed in Section 1. Zhang et al. [27] proposed a teacher-student network method which utilizes weakly labeled behaviour data for SQIP. However, they do use strongly labeled data in their training methodology to train the teacher network.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Extreme Multi-Label Classification</title>
        <p>
          SQIP is an instance of extreme multi-label classification (XML), which tags a data point with the most relevant subset of labels from an extremely large label set and has gained much attention recently [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3, 7, 8</xref>
          ]. While many classification algorithms have been proposed, training data generation for XML has not been well studied. Zhang et al. [28] addressed data augmentation for XML, which assumed the existence of training data and thus cannot be applied to our setting. Our study therefore differs from previous XML studies, since we directly tackle the task of training data generation, though our method is specifically designed for SQIP.
        </p>
        <p>For a more comprehensive overview of classification algorithms and data sets for XML, refer to http://manikvarma.org/downloads/XC/XMLRepository.html.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <sec id="sec-3-1">
        <title>In this section, we describe each component of our</title>
        <p>method, as illustrated in Figures 1 and 2: catalog data, the base SQIP model, query-click logs, the labeling model, unlabeled queries, candidate training data, and the final training data.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Catalog Data</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Catalog data contains various information of products</title>
        <p>sold at the shopping service, including product titles,
descriptions, prices, various attribute values such as brands,
sizes, and colors, among others. We use product titles
and attribute values to train the base SQIP model, since
product titles are usually a set of words that indicate
the features of products and can consequently be seen
as lengthy, detailed queries about the products. Table 2
shows examples of product titles and their attribute
values in our catalog data, and indicates diferences between
product titles and real queries. First, product titles
sometimes contain tokens that would not appear in queries
usually, such as “[Unused]” and “[In-stock, May 15].”
Second, real queries are usually much shorter than product
titles. Third, attribute values might not always mean
intent. For example, color: red is not intent if we consider
product title as shopping query in first example of table
2. Catalog data is a useful data source for training a SQIP
model but is not suficiently reliable by itself due to these
diferences.</p>
        <p>To train the base SQIP model, we used 117 million
product titles and their associated attribute values. The
number of diferent attribute values was 19,416.</p>
        <sec id="sec-3-2-1">
          <title>3.3. Query-Click Logs</title>
          <p>We used one year of query-click logs, which contained
72 million unique queries. As illustrated in Figure 2, the
query-click logs are used to generate (query, intent) pairs
as part of training data for the labeling model. We simply
enumerated all possible (query, intent) pairs such that a
query is associated with an intent (attribute-value) via
click relations in the logs.</p>
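          <p>The enumeration can be sketched as follows (an illustrative data layout is assumed: click records as (query, product id) pairs and a mapping from product ids to attribute values; neither structure is from the paper):</p>
```python
def pairs_from_click_logs(clicks, product_intents):
    # clicks: iterable of (query, product_id) click events.
    # product_intents: dict mapping product_id to its attribute values (intents).
    pairs = set()
    for query, product_id in clicks:
        for intent in product_intents.get(product_id, []):
            pairs.add((query, intent))  # the query is linked to the intent via a click
    return pairs
```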
        </sec>
        <sec id="sec-3-2-2">
          <title>3.4. Labeling Model</title>
          <p>The labeling model takes a pair comprising a query (e.g., “lv bag zebra”) and an intent (e.g., Brand: Louis Vuitton) as input and predicts whether the intent is valid for the query.</p>
          <p>3.2. Base SQIP Model</p>
          <p>
            The base SQIP model takes unlabeled queries, such as “lv bag zebra,” as input and predicts their intents, such as Brand: Louis Vuitton and Pattern: Zebra. We had to deal with hundreds of millions of training instances in our experiments (Section 4) and chose extremeText [29]. It was the only extreme multi-label classification method that we experimented with that could handle all training instances in our environment. Other extreme multi-label classification methods we experimented with include Parabel [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], Bonsai [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], LightXML [7], XR-Linear [30], and XR-Transformer [8].
          </p>
          <p>The classification algorithm of extremeText is based on probabilistic label trees (PLT) [31], in which leaf nodes represent the target labels and the other nodes are logistic regression classifiers. PLT guides data points from the root node into their appropriate leaf nodes (labels) with the logistic regression classifiers. For training the model, we did not conduct extensive hyper-parameter tuning; we used its default hyper-parameters, except that we chose PLT as the loss function and used the TF-IDF weights for words.</p>
          <p>3.4.1. Model Architecture</p>
          <p>BERT[11]-based models have been very promising for text pair classification and regression tasks, such as natural language inference (NLI) [32] and semantic textual similarity (STS) [33]. Since the task of the labeling model is binary classification, we used BertForSequenceClassification2, where we use a pretrained BERT model for Japanese3. We intentionally adopted a very simple approach so that we could demonstrate the effectiveness of our method.</p>
          <p>3.4.2. Training Data</p>
          <p>As shown in Figure 2, we automatically generate training data for the labeling model as the intersection of two sets of (query, intent) pairs; one set is generated with the base SQIP model4 and the other comes from the query-click logs. Although each of these two kinds of supervision signals is weak by itself, we can accurately obtain a number of valid (query, intent) pairs by combining them.</p>
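          <p>The intersection of the two weak sources, plus random negatives, can be sketched as follows. This is a simplified illustration with names of our own choosing; in the paper, 5.3 million positive and 5.3 million negative pairs are produced this way:</p>
```python
import random

def build_labeling_training_set(base_model_pairs, click_log_pairs, seed=0):
    # Positives: (query, intent) pairs supported by BOTH weak sources.
    positives = sorted(set(base_model_pairs).intersection(click_log_pairs))
    # Negatives: random re-pairings of the same queries and intents,
    # excluding anything that is already a positive example.
    rng = random.Random(seed)
    intents = [intent for _, intent in positives]
    negatives = []
    for query, _ in positives:
        candidate = (query, rng.choice(intents))
        if candidate not in positives:
            negatives.append(candidate)
    return positives, negatives
```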
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>2https://huggingface.co/transformers/model_doc/bert.html.</title>
        <p>3https://huggingface.co/cl-tohoku/bert-base-japanese-whole-wordmasking
4The input to the base SQIP model is the queries in the query click
logs.</p>
        <p>Intents
Brand: Alpha Industries
Color: Orange
Fruit taste: Orange
Series: Tropicana
Scent: Orange, Brand: Gres
Capacity: 600 - 899ml</p>
        <p>Event/Holiday: Mother’s Day
3. Our proposed method that exploits both catalog
data and query-click logs can generate even better
training data.
4. Without the labeling model, the performance of
our method degrades, indicating the efectiveness
of the labeling model.
5. Our proposed method outperforms the
competitive training data generation method based on
data programming called Snorkel [14].
3.4.3. Training Detail</p>
      </sec>
      <sec id="sec-3-4">
        <title>The labeling model has been built with the training data and the model architecture, as described above. Training is done for one epoch with batch size 32 using AdamW [34] optimizer.</title>
        <sec id="sec-3-4-1">
          <title>3.5. Unlabeled Queries, Candidate Training Data, and Final Training Data</title>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>The second phase starts with predicting intents for unla</title>
        <p>beled queries from query logs with the base SQIP model
to generate candidate training data. We then filter out
erroneous intents with the labeling model to generate 4.1. Experimental Conditions
the final training data. In our experiments, we compared our proposed method</p>
        <p>Unlabeled queries were obtained from seven years of with four baseline methods described in Section 4.2. All
query logs, which contained more than 1.5 billion unique the compared methods difer only in how they obtain
queries. training data. For classification, they use the same
archi</p>
        <p>Candidate training data were generated under the fol- tecture, extremeText; specifically, all the methods trained
lowing condition: =5, meaning that the base SQIP model their SQIP model with the PLT loss function and the
TFpredicted the most probable five intents for a query at IDF weights for words; the other hyper-parameters were
most, and threshold=1.0, i.e., only those intents whose set to the default values.
probability was 1.0 were outputted. As a result, we ob- Test data has been manually created by a human
antained 377 million (query, intent) pairs. The number of notator (who is not an author). The annotator was asked
unique queries was 264 million. to check (query, intent) pairs that were automatically</p>
        <p>The final training data were those (query, intent) pairs generated by pairing a query and an intent, such that
whose probability given by the labeling model was at at least one token in the query was semantically similar
least 0.99. Consequently, we obtained 169 million (query, or relevant to the intent in order to exclude obviously
intent) pairs. The number of unique queries was 145 erroneous (query, intent) pairs from all possible pairs in
million. We trained and evaluated the SQIP model with advance of manual annotation.5 As a result, 5,615
diferthis final training data, as reported in Section 4. Table 3 ent queries with at least one intent were obtained as test
shows examples of the final training data. data, and 2.57 intents were given to a query, on average.
Evaluation was based on precision and recall, which
4. Experiments were calculated with extremeText’s test command.
Precision and recall were calculated for top  outputs (i.e.,
In this section, through large-scale SQIP experiments in intents) with  being 1, 3, and 5, and we drew
precisionwhich one predicts intents of a given query, we claim the recall curves for the compared methods for each  with
following: the probability threshold of extremeText changing from
0.0 to 1.0 with the interval of 0.01.</p>
      </sec>
      <sec id="sec-3-6">
        <title>To be specific, we obtained (query, intent) pairs such Table 3</title>
        <p>that the query is associated with the intent in the query- Examples of the final training data
click logs and also, given the query as input, the base
SQIP model predicted the intent with probability 1.0. As Query
a result, we generated 5.3 million (query, intent) pairs “alpha ma-1”
as positive examples for training of the labeling model. “orange t-shirt”
We then generated 5.3 million (query, intent) pairs by “tropicana orange”
randomly pairing queries and intents, which we used as
negative examples.
“gres perfume orange”
“washbowl 750”
“original message carnation”
1. Simply using query-click logs for training SQIP</p>
        <p>models delivers poor performance.
2. Using catalog data for training leads to better
performance than simply using query-click logs
but is still unsatisfactory.
5The semantic similarity was measured by the cosine similarity
between their sentence embeddings. We use fastText embeddings [35],
which had been learned from the query logs. The threshold for the
cosine similarity was set to 0.8.
years of query-click logs and obtained (query, intent)
pairs in which product pages that had the intent (i.e.,
attribute value) were clicked through the query at least
ten times in the logs. The purpose of this was to reduce
the inherent noise in the query-click logs. As a result, we
obtained more than 670 million (query, intent) pairs. The
number of unique queries was 7,962,605, which indicated
that each query was given as many as 84 intents on
average. This number is obviously too large given that most
queries consist of less than ten tokens and supports our
claim that simply using query-click logs as training data
would be inadequate.
4.2.2. Base</p>
      </sec>
      <sec id="sec-3-7">
        <title>This is the base SQIP model, which uses only product titles and their associated attribute values for training.</title>
        <p>4.2.3. Proposed</p>
      </sec>
      <sec id="sec-3-8">
        <title>This is a SQIP model trained with the final training data generated with our proposed method, as described in Section 3.</title>
        <p>4.2.4. Proposed-LM</p>
      </sec>
      <sec id="sec-3-9">
        <title>This is the same as Proposed except that it does not</title>
        <p>use the labeling model. Proposed-LM is then trained
with the candidate training data in the second phase; its
training process is similar to self-training. Note that the
diference in performances between Proposed-LM and
Proposed can be seen as indicating the efectiveness of
the labeling model.
4.2.5. Snorkel</p>
      </sec>
      <sec id="sec-3-10">
        <title>This baseline is the same as Proposed, except that the</title>
        <p>labeling model is replaced with Snorkel [14], a
training data generation method based on data programming
[13]. Like Proposed’s labeling model, Snorkel’s
labeling model can be learned without manual supervision.</p>
        <p>However, Snorkel requires labeling functions that
implement a variety of domain knowledge, heuristics, and
any kind of weak supervision that would be useful for a
given task. Each labeling function takes unlabeled data points as input and predicts their class labels. Snorkel then uses these weakly-labeled data points to train a generative labeling model that is supposed to label each data point more accurately than the labeling functions do. Snorkel has influenced subsequent studies on training data generation [17] and has also been adopted by the world's leading organizations, as described at https://www.snorkel.org/. We therefore think that comparing with the Snorkel-based baseline effectively shows Proposed's performance.</p>
        <p>Figure 3: Precision-recall curves for all experiments.</p>
        <p>4.2. Compared Methods</p>
        <p>We compared the following five methods.</p>
        <p>4.2.1. QueryClick</p>
        <p>The simplest baseline is QueryClick, which uses query-click logs to generate training data in a similar way to the previous methods [9, 10]. Specifically, we used seven</p>
        <p>Snorkel's labeling model was trained with 11 million (query, intent) pairs that had been weakly labeled with the following three labeling functions:</p>
        <p>1. If the given intent is associated with the given query in the query-click logs, return valid; otherwise return invalid.</p>
        <p>2. If the output probability of the base SQIP model for the given (query, intent) pair is 1.0, return valid; otherwise abstain.</p>
        <p>3. If the output probability is not greater than 0.995, return invalid; otherwise abstain.</p>
        <p>Proposed's labeling model was trained with 10.6 million (query, intent) pairs as described in Section 3.4.2.</p>
        <p>4.3. Results</p>
        <p>Table 5 illustrates examples of wrong predictions made by Proposed (k=1, threshold=0.21). Most of the errors were due to the class imbalance in the training data; i.e., the distribution of training instances across the intents is biased or skewed, and intents for which we have few or no instances tend to be difficult to predict [36]. Regarding the first example in Table 5, “prince” can be Brand: Prince and can also be part of Brand: Glen Prince. However, the frequency of the former intent in the final training data was 119,972, whereas that of the latter was only 33, which caused the SQIP model to choose the former for the query. Regarding the second example, the frequency of Color: Red was 1,486,315, while that of Brand: Red Wing was 15,592. For the last one, there was no training instance for Memory Standards: DDR3 in our final training data, and thus the SQIP model could not predict it.</p>
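The three labeling functions used for the Snorkel baseline in Section 4.2 can be sketched as plain Python predicates. This is a minimal illustration, not the paper's code: the click-log dictionary and the base-model probability function are hypothetical stand-ins, and the integer labels follow Snorkel's usual convention of -1 for abstention.

```python
# Snorkel-style integer labels; -1 conventionally means "abstain".
VALID, INVALID, ABSTAIN = 1, 0, -1

def lf_query_click(query, intent, click_log):
    """LF1: valid iff the intent co-occurs with the query in the logs."""
    return VALID if intent in click_log.get(query, set()) else INVALID

def lf_high_confidence(query, intent, base_model_prob):
    """LF2: valid iff the base SQIP model is fully confident (p == 1.0)."""
    return VALID if base_model_prob(query, intent) == 1.0 else ABSTAIN

def lf_low_confidence(query, intent, base_model_prob):
    """LF3: invalid unless the base model's probability exceeds 0.995."""
    return INVALID if base_model_prob(query, intent) <= 0.995 else ABSTAIN
```

In the actual baseline, the weak labels emitted by these three functions over 11 million (query, intent) pairs are then reconciled by Snorkel's generative labeling model.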
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Future Direction</title>
      <sec id="sec-4-1">
        <title>Training data generation</title>
        <p>For training data generation, one possible direction is to use product genre/category information.</p>
      </sec>
      <sec id="sec-4-2">
        <title>The relatively low performance of QueryClick and Base and the relatively high performance of Proposed and Snorkel indicate that query-click logs and catalog</title>
        <p>1. QueryClick's precision decreases sharply as we try to increase recall.</p>
        <p>2. Base generally outperforms QueryClick, though its performance is still unsatisfactory.</p>
        <p>3. Proposed outperforms all the other methods. Table 4 shows Proposed's best F1 scores and their corresponding precision, recall, and threshold values for each k.</p>
        <p>4. Proposed-LM's performance is worse than that of Proposed.</p>
        <p>5. Snorkel can deliver good performances but cannot outperform Proposed.</p>
        <p>4.5. Effect of Training Data Size</p>
        <p>Figure 4 shows the F1 scores of Proposed built with final training data of different sizes ('K' and 'M' stand for 'thousand' and 'million'). The k and the threshold of extremeText were set to 1 and 0.21 uniformly. The graph indicates that increasing the data size leads to better performance and that our final training data is effective for SQIP. Although the improvement from 10M to 145M is small, it is noteworthy that additional data could improve a model already trained with as many as 10M instances.</p>
        <p>Creating a query-to-product-genre mapping of reliable quality would let us filter (query, intent) pairs further and create higher-quality training data. Also, we could utilize neighbor signals, since similar queries should have more labels in common, to remove noise from the dataset further.</p>
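As a rough illustration of the neighbor-signal idea, the sketch below drops an intent from a query when too few sufficiently similar queries also carry it. The Jaccard token overlap, the similarity threshold, and the support count are assumptions for illustration, not the paper's method.

```python
def jaccard(a, b):
    """Token-overlap similarity between two query strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def denoise(pairs, min_support=1, sim_threshold=0.5):
    """pairs: dict mapping query -> set of intents; returns a filtered copy."""
    kept = {}
    for q, intents in pairs.items():
        neighbors = [p for p in pairs
                     if p != q and jaccard(p, q) >= sim_threshold]
        if not neighbors:              # no similar queries to vote: keep all
            kept[q] = set(intents)
            continue
        # keep an intent only if enough neighbors also carry it
        kept[q] = {i for i in intents
                   if sum(i in pairs[n] for n in neighbors) >= min_support}
    return kept
```

A production version would replace the token-overlap similarity with embedding-based query similarity and tune the thresholds against held-out data.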
        <p>For the classification model, one possibility is to use label (i.e., intent) context information to create the embedding vector of the input text (i.e., the shopping query). Similar previous work is by Chen et al. [37], who use LGuidedLearn [38] for product item category classification.</p>
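A toy sketch of label-guided encoding in the spirit of this direction (not LGuidedLearn itself): the query vector is augmented with an attention-weighted mixture of label (intent) embeddings. All vectors here are stand-ins; a real model would learn them jointly.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def label_guided_encoding(query_vec, label_vecs):
    """Concatenate the query vector with an attention-weighted
    mixture of label embeddings, so label context informs the input
    representation."""
    weights = softmax([dot(query_vec, lv) for lv in label_vecs])
    dim = len(query_vec)
    mixed = [sum(w * lv[d] for w, lv in zip(weights, label_vecs))
             for d in range(dim)]
    return query_vec + mixed
```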
        <p>Another possible method is the label-specific document representation for multi-label text classification of Xiao et al. [39]. Also, Cai et al. [40] propose a hybrid neural network model that simultaneously takes advantage of both label semantics and fine-grained text information. Other possibilities are contrastive learning and kNN-based methods [41, 42].</p>
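A minimal sketch of the kNN direction, under the assumption of a simple token-overlap similarity: each intent is scored by how often it appears among the k most similar training queries. This is an illustrative baseline, not the paper's implemented model.

```python
def knn_intents(query, train, k=2):
    """train: list of (query, set_of_intents) pairs.
    Returns a dict scoring each intent by its count among the
    k nearest training queries."""
    def sim(a, b):
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    ranked = sorted(train, key=lambda qi: sim(query, qi[0]),
                    reverse=True)[:k]
    scores = {}
    for _, intents in ranked:
        for intent in intents:
            scores[intent] = scores.get(intent, 0) + 1
    return scores
```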
        <p>Another direction is to extend our proposed method to other domains. If we could find a way to exploit weak supervision signals readily available in a domain for building the labeling model, we could easily apply our approach to that domain. In the case of text classification into Wikipedia categories [43], for instance, not only the category information in Wikipedia articles but also the links among corresponding articles in different languages and the class hierarchy in Wikidata [44] can be exploited.</p>
        <p>As we saw in Section 4.4, data imbalance is an issue; we aim to address it in future work.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>In this paper, we proposed a novel two-phased training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" who checks a number of pairs, each comprising an intent and a query, to generate training data for SQIP. We presented how to train such a model without manual supervision by utilizing a huge amount of online shopping data. Through a series of large-scale experiments with data from a real online shopping service, we demonstrated the effectiveness of our proposed method.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>We thank our annotator Saki Hiraga-san for helping us create the evaluation dataset. We thank all the researchers in RIT for their support of this project.</title>
        <p>fication with label features, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, 2021, pp. 49–57.</p>
        <p>[6] A. Mittal, N. Sachdeva, S. Agrawal, S. Agarwal, P. Kar, M. Varma, Eclare: Extreme classification with label graph correlations, in: Proceedings of the Web Conference 2021, WWW '21, 2021, pp. 3721–3732.</p>
        <p>[7] T. Jiang, D. Wang, L. Sun, H. Yang, Z. Zhao, F. Zhuang, Lightxml: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, 2021, pp. 7987–7994.</p>
        <p>[8] J. Zhang, W.-C. Chang, H.-F. Yu, I. S. Dhillon, Fast multi-resolution transformer fine-tuning for extreme multi-label text classification, in: 35th Conference on Neural Information Processing Systems, NeurIPS 2021, 2021.</p>
        <p>[9] C. Wu, A. Ahmed, G. R. Kumar, R. Datta, Predicting latent structured intents from shopping queries, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, 2017, pp. 1133–1141.</p>
        <p>[10] J. Zhao, H. Chen, D. Yin, A dynamic product-aware learning model for e-commerce query intent understanding, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1843–1852.</p>
        <p>[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '19, 2019, pp. 4171–4186.</p>
        <p>[12] M. Ben Noach, Y. Goldberg, Transfer learning between related tasks using expected label proportions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP '19, 2019, pp. 31–42.</p>
        <p>[13] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29 of NeurIPS '16, 2016.</p>
        <p>[14] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11 (2017) 269–282.</p>
        <p>[15] B. Hancock, M. Bringmann, P. Varma, P. Liang, S. Wang, C. Ré, Training classifiers with natural language explanations, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (2018) 1884–1895.</p>
        <p>[16] Y. Meng, J. Shen, C. Zhang, J. Han, Weakly-supervised neural text classification, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, 2018, pp. 983–992.</p>
        <p>[17] A. Awasthi, S. Ghosh, R. Goyal, S. Sarawagi, Learning from rules generalizing labeled exemplars, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeuexBtDr.</p>
        <p>[18] Y. Meng, Y. Zhang, J. Huang, C. Xiong, H. Ji, C. Zhang, J. Han, Text classification using label names only: A language model self-training approach, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP '20, 2020, pp. 9006–9017.</p>
        <p>[19] G. Patrini, A. Rozza, A. K. Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '17, 2017, pp. 2233–2241.</p>
        <p>[20] D. Hendrycks, M. Mazeika, D. Wilson, K. Gimpel, Using trusted data to train deep networks on labels corrupted by severe noise, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, 2018, pp. 10477–10486.</p>
        <p>[21] G. Zheng, A. H. Awadallah, S. Dumais, Meta label correction for noisy label learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35 of AAAI '21, 2021.</p>
        <p>[22] X. Li, Y.-Y. Wang, A. Acero, Extracting structured information from user queries with semi-supervised conditional random fields, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, 2009, pp. 572–579.</p>
        <p>[23] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, 2009, pp. 1003–1011.</p>
        <p>[24] F. Brahman, V. Shwartz, R. Rudinger, Y. Choi, Learning to rationalize for nonmonotonic reasoning with distant supervision, in: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI '21, AAAI Press, 2021, pp. 12592–12601.</p>
        <p>[25] J. Shen, W. Qiu, Y. Meng, J. Shang, X. Ren, J. Han, TaxoClass: Hierarchical multi-label text classifica-</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harsola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          , Parabel:
          <article-title>Partitioned label trees for extreme classification with application to dynamic search advertising</article-title>
          ,
          <source>in: Proceedings of the 2018 World Wide Web Conference, WWW '18</source>
          ,
          <year>2018</year>
          , p.
          <fpage>993</fpage>
          -
          <lpage>1002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khandagale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Babbar</surname>
          </string-name>
          ,
          <article-title>Bonsai - diverse and shallow trees for extreme multi-label classification</article-title>
          ,
          <year>2019</year>
          . arXiv:1904.08249.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <article-title>Taming Pretrained Transformers for Extreme Multi-Label Text Classification</article-title>
          ,
          <source>KDD '20</source>
          ,
          <year>2020</year>
          , p.
          <fpage>3163</fpage>
          -
          <lpage>3171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dahiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Saini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Deepxml: A deep extreme multi-label learning framework applied to short text documents</article-title>
          ,
          <source>in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21</source>
          ,
          <year>2021</year>
          , p.
          <fpage>31</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dahiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Saini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          , Decaf: Deep extreme classi-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>