A Two-Phased Approach to Training Data Generation for Shopping Query Intent Prediction

Gautam Kumar¹,†, Chikara Hashimoto¹,†

¹ Rakuten Institute of Technology (RIT), Rakuten Group Inc., 1-chōme-14 Tamagawa, Setagaya City, Tokyo, Japan 158-0094

Abstract
Shopping Query Intent Prediction (SQIP) is the task of predicting, given an online shopping user's search query, e.g., "lv bag", the user's intents, e.g., Brand: Louis Vuitton. SQIP is an extreme multi-label classification task for which many excellent algorithms have been developed. However, little attention has been paid to how to create training data for SQIP. Previous studies used pseudo-labeled data derived from query-click logs for training and suffered from the noise in the logs. Although more sophisticated training data generation methods exist, they cannot be directly applied to SQIP. In this paper, we propose a novel training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" that checks a large number of pairs comprising an intent and a query to generate training data for SQIP. We show that such a model can be trained without manual supervision by utilizing a huge amount of online shopping data. We demonstrate that the SQIP model trained with data generated by our labeling model outperforms a model trained with query-click logs only and a model trained with data created by a competitive data-programming-based method.

Keywords
training data generation, data-centric AI, shopping query intent, text classification, query attribute value extraction, online shopping, e-commerce query intent

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA.
† These authors contributed equally.
gautam.kumar@rakuten.com (G. Kumar); chikara.hashimoto@rakuten.com (C. Hashimoto)
https://chikarahashimoto.wixsite.com/home (C. Hashimoto)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Online shoppers use search queries to search for products, and most queries have search intents that indicate what products the shoppers want. For example, the query "lv bag zebra" has Brand: Louis Vuitton and Pattern: Zebra as its intents, as shown in Table 1.¹

Table 1: Examples of queries and their intents

Query | Intents
"lv bag zebra" | Brand: Louis Vuitton; Pattern: Zebra
"100% orange juice" | Fruit taste: Orange
"cologne orange blossom" | Scent: Orange
"sneaker mens orange" | Color: Orange
"wheel 19inch" | Tire size: 18 - 19.9inch
"nicole down jacket" | Brand: Nicole; Filling: Feather

In this study, we assume that queries' intents are represented with attribute values of products defined in an online shopping service. Note that simple string matching between queries and intents would not work, since queries are written in natural language: they can contain abbreviations, e.g., "lv" for "Louis Vuitton", and ambiguous words, e.g., "orange", as indicated in Table 1. Moreover, intents are not always explicitly written in queries, as the last example in the table illustrates.

These intents, once correctly predicted, can be utilized by a search system to retrieve relevant products, since most products sold at an online shopping service have attribute values such as Brand: Louis Vuitton. Aggregated in bulk, the intents are also useful for understanding trends in different attributes, e.g., which brand and color of shoes users wanted most in the last month, and for understanding overall market demand, which could help merchants and manufacturers.

¹ Intents are represented in the form Attribute-name: Attribute-value in this paper. We represent attribute values of products in the same way.
Shopping query intent prediction (SQIP), given a query, predicts its intents by selecting the most relevant subset of attribute values from the attribute value inventory defined in an online shopping service. In other words, SQIP gives a natural language query a structure that facilitates the retrieval of products.

In brief, our proposed method has the following two phases:

1. Building the labeling model: Our labeling model is a binary classification model that predicts whether a given (query, intent) pair is valid. We generate good-quality training data for it and train a BERT sequence classification model. For the data generation, we follow these steps:
   a) Create the base SQIP model, trained on product catalog data with product titles as input (a title can be considered a long pseudo shopping query) and attribute values as output (they can be considered the pseudo query's intents).
   b) Generate (query, intent) pairs by predicting intents with the base SQIP model for the queries in the query-click logs, and take the intersection with the (query, intent) pairs extracted from the query-click logs themselves.

2. Training data generation for SQIP: From raw queries, obtain intents using the base SQIP model, and filter these intents using the labeling model.

The contributions of this paper are as follows:

1. We present a novel two-phased approach to training data generation for SQIP that requires no manual supervision.
2. We present how to build the labeling model, the key module of our two-phased approach, by combining weak supervision signals readily available in online shopping services.
3. We empirically demonstrate that our two-phased approach is effective through large-scale experiments.

1.1. Background

SQIP is an extreme multi-label text classification task for which many excellent algorithms have been developed recently [1, 2, 3, 4, 5, 6, 7, 8]. These classification algorithms can be used for SQIP once high-quality training data is available.

However, obtaining high-quality training data for SQIP is not straightforward. First of all, manually creating a sufficient volume of training data would be infeasible, because there are tens of thousands of predefined intents, and understanding shopping query intents requires deep knowledge of a large number of product domains. Accordingly, previous studies [9, 10] used query-click logs to automatically generate training data, assuming that if a product has an attribute value like Brand: Louis Vuitton and the product's page is clicked by a user who issued a query like "lv bag zebra," then an intent of the query is Brand: Louis Vuitton. This heuristic suffers from the inherent noise in query-click logs caused by, for instance, the inconsistent click behavior of fickle users or erroneous retrieval results. Besides, it cannot utilize the many queries that are absent from query-click logs.

Despite the notable difficulty of obtaining high-quality training data, little attention has been paid to this problem in previous SQIP studies. Due to the success of pre-trained models [11], transfer learning has also become popular recently [12], where pre-trained models can be seen as providing weak supervision. With this approach, one fine-tunes a model that has been trained on a relevant task for the target task, using a reasonable amount of quality training data, which we cannot expect in SQIP.

There have also been many studies on combining weak supervision signals to dispense with manually annotated training data [13, 14, 15, 16, 17, 18], which would be useful if we could devise more than one kind of weak supervision signal for a given task. For SQIP, however, it would be infeasible to assume that labeling functions [14, 15, 17] or keywords [16, 18] for target classes can frequently be applied to or matched against queries, since queries are usually very short and diverse. It would also be infeasible to prepare labeling functions or keywords for each class, since the number of classes in SQIP amounts to tens of thousands and the classes can change over time.

Automatically correcting corrupted labels has also gained much attention recently [19, 20, 21].
These methods learn label corruption matrices, which would be prohibitively large in SQIP since it has to deal with tens of thousands of classes.

1.2. Preview of the Proposed Method

What makes training data generation for SQIP difficult? We think it is the large number of classes; considering many classes for a query at once tends to be difficult. We therefore propose to decompose the task into two phases. In the first phase, we build a labeling model that checks whether an intent is valid for a query. In the second phase, we use this labeling model to verify, on a large scale, each pair comprising a query and an intent. Here, the labeling model can be seen as an annotator who is asked to create training data for SQIP. Refer to Figures 1 and 2 for more details.

How can we build the labeling model? We propose to utilize catalog data and query-click logs, since they are readily available in online shopping services and provide weak but different supervision signals that reinforce each other, as we will demonstrate in Section 4.

The base SQIP model is a weak SQIP model that takes queries as input and predicts their intents, from which we generate a set of (query, intent) pairs. The base SQIP model is trained with catalog data, the database of products sold at an online shopping service, in which various information about products, such as product titles and their attribute values, is registered. Product titles are usually a set of words that describe the features of products, such as "Louis Vuitton Shoulder bag Leather Zebra print," and can be seen as lengthy, detailed, merchant-made pseudo queries about the products. Since these titles (i.e., pseudo queries) are associated with attribute values of products (i.e., intents), we can use the catalog data to train the base SQIP model without manual annotation.

Query-click logs indicate the association between queries and the attribute values (i.e., intents) of clicked products. We generate another set of (query, intent) pairs based on this association.

Catalog data provides direct evidence of the association between product titles and attribute values (intents), but the titles are not real queries. In contrast, click logs show the association between real queries and intents, but the association is only indicated indirectly and tends to be noisy. In tandem, however, these two data sources can generate reliable training data for the labeling model.

Figure 1: Overview of our training data generation method.
In the first phase, we build the labeling model, which is depicted in detail in Figure 2. In the second phase, we generate candidate training data from unlabeled queries by using the base SQIP model; the labeling model then filters out invalid (query, intent) pairs to generate the final training data.

Figure 2: A closer look at the labeling model builder. Training data for the labeling model is the intersection of two sets of pairs comprising a query and an intent. Each set is generated by one of two weak generators: the base SQIP model and the query-click logs.

In summary, our proposed method creates a "machine annotator," namely the labeling model, from a huge amount of online shopping data, and uses it to generate training data for SQIP on a large scale without requiring any manual labor.

Through large-scale SQIP experiments, we demonstrate that the model trained with data generated by our proposed method outperforms a model trained with query-click logs only and a model trained with data created by a competitive training data generation method based on data programming [14].

All the data used in this study were obtained from an online shopping service, Rakuten, and are written in Japanese. However, the ideas and methods in this paper are independent of particular languages, and the examples in this paper are written in English for ease of explanation.
2. Related Work

2.1. Shopping Query Intent Prediction

Previous methods for SQIP can be categorized into classification-based methods [9, 10] and sequence-labeling-based methods [22].

In this study, our proposed method generates training data for the classification-based methods, for the following two reasons. First, with sequence-labeling-based methods it would be more difficult to deal with tens of thousands of classes, whereas for classification-based methods there have recently been many excellent extreme classification algorithms that can handle a huge number of classes. Second, sequence-labeling-based methods deal only with intents that are explicitly written in queries. However, valid intents are not always explicit in queries; e.g., "nicole down jacket" has Filling: Feather as a valid intent.

Our study differs from previous ones because we focus on how to obtain a huge volume of high-quality training data for SQIP, rather than on how to classify queries. Previous studies simply used query-click logs to obtain pseudo-labeled data [9, 10], which tends to be noisy and unreliable. We will demonstrate that our proposed method can generate better training data in Section 4.

2.2. Learning with Weak Supervision

Our study can be seen as answering the research question of how to train supervised models without relying on manual annotation, so studies on learning with weak supervision are quite relevant. As we discussed in Section 1, most of the previous weak-supervision methods are not appropriate for SQIP, since they require external knowledge bases [23, 24], a reasonable amount of quality training data [12], labeling functions or keywords for target classes [14, 15, 16, 17, 18], or label corruption matrices to be learned [19, 20, 21].

Shen et al. proposed learning classifiers with only class names [25]. However, their method assumes that classes are organized in a hierarchy, so we cannot use it for SQIP, where classes (intents) are not organized in a hierarchy. Karamanolakis et al. [26] proposed a method that works with weak supervision such as lexicons, regular expressions, and knowledge bases of the target domain. However, such weak supervision would become obsolete quickly in SQIP, as discussed in Section 1. Zhang et al. [27] proposed a teacher-student network method that utilizes weakly labeled behavior data for SQIP. However, they do use strongly labeled data in their training methodology to train the teacher network.

2.3. Extreme Multi-Label Classification

SQIP is an extreme multi-label classification (XML) task; XML, which tags a data point with the most relevant subset of labels from an extremely large label set, has gained much attention recently [1, 2, 3, 7, 8]. While many classification algorithms have been proposed, training data generation for XML has not been well studied. Zhang et al. [28] addressed data augmentation for XML, which assumes the existence of training data and thus cannot be applied to our setting. Our study therefore differs from previous XML studies in that we directly tackle the task of training data generation, though our method is specifically designed for SQIP.

For a more comprehensive overview of classification algorithms and data sets for XML, refer to http://manikvarma.org/downloads/XC/XMLRepository.html.

3. Proposed Method

In this section, we describe each component of our method as illustrated in Figures 1 and 2: catalog data, the base SQIP model, query-click logs, the labeling model, unlabeled queries, candidate training data, and the final training data.
3.1. Catalog Data

Catalog data contains various information about the products sold at the shopping service, including product titles, descriptions, prices, and various attribute values such as brands, sizes, and colors. We use product titles and attribute values to train the base SQIP model, since product titles are usually a set of words that indicate the features of products and can consequently be seen as lengthy, detailed queries about the products.

Table 2: Examples of product titles and attribute values

Product title | Attribute values
"[Next-day delivery] Nike Women's Zoom Vaper 9.5 Tour 631475-602 Lady's Shoes" | Brand: Nike; Color: Red
"TIFFANY&CO. tiffany envelope charm [NEW] SILVER 270000487012x" | Brand: Tiffany & Co.; Color: Silver
"[Kids clothes/STUSSY] Classic Logo Strapback Cap black a118a" | Clothing fabric: Cotton
"Fitty Closely-attached mask Pleated type Slightly small size Five-pack" | Mask shape: Pleated
"[Unused] iQOS 2.4PLUS IQOS White Electric cigarette Main body 58KK0100180" | Color: White
"NIKE AIR MAX 90 ESSENTIAL Sneaker Men's 537384-090 Black [In-stock, May 15]" | Shoe upper material: Leather; Brand: Nike

Table 2 shows examples of product titles and their attribute values in our catalog data, and it indicates the differences between product titles and real queries. First, product titles sometimes contain tokens that would not usually appear in queries, such as "[Unused]" and "[In-stock, May 15]." Second, real queries are usually much shorter than product titles. Third, attribute values do not always correspond to intents; for example, Color: Red would not be an intent if we regarded the product title in the first row of Table 2 as a shopping query. Catalog data is thus a useful data source for training a SQIP model but is not sufficiently reliable by itself, due to these differences.

To train the base SQIP model, we used 117 million product titles and their associated attribute values. The number of different attribute values was 19,416.

3.2. Base SQIP Model

The base SQIP model takes unlabeled queries such as "lv bag zebra" as input and predicts their intents, such as Brand: Louis Vuitton and Pattern: Zebra. We had to deal with hundreds of millions of training instances in our experiments (Section 4) and chose extremeText [29]. It was the only extreme multi-label classification method we experimented with that could handle all the training instances in our environment. Other extreme multi-label classification methods we experimented with include Parabel [1], Bonsai [2], LightXML [7], XR-Linear [30], and XR-Transformer [8].

The classification algorithm of extremeText is based on probabilistic label trees (PLT) [31], in which the leaf nodes represent the target labels and the other nodes are logistic regression classifiers. PLT guides data points from the root node into their appropriate leaf nodes (labels) with the logistic regression classifiers. For training the model, we did not conduct extensive hyper-parameter tuning; we used the default hyper-parameters, except that we chose PLT as the loss function and used TF-IDF weights for words.
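To make the data layout concrete, the following sketch (our illustration, not the authors' code) serializes a catalog record into the fastText-style supervised format that extremeText, a fastText derivative, consumes. The label-normalization scheme, which flattens an attribute value such as Brand: Louis Vuitton into a single whitespace-free token, is our assumption.

```python
# Minimal sketch: turn one catalog record (title + attribute values) into a
# fastText/extremeText supervised training line: "__label__..." tokens
# followed by the title text.

def to_label(attribute: str, value: str) -> str:
    # Labels must be single tokens, so whitespace is replaced (our convention).
    return "__label__" + f"{attribute}:{value}".replace(" ", "_")

def to_training_line(title: str, attribute_values) -> str:
    labels = " ".join(to_label(a, v) for a, v in attribute_values)
    return f"{labels} {title}"

with open("catalog_train.txt", "w", encoding="utf-8") as f:
    f.write(to_training_line(
        "Louis Vuitton Shoulder bag Leather Zebra print",
        [("Brand", "Louis Vuitton"), ("Pattern", "Zebra")]) + "\n")

# Training would then use extremeText's supervised mode with the PLT loss,
# roughly (flags other than -loss plt are assumptions on our part):
#   extremetext supervised -input catalog_train.txt -output base_sqip -loss plt
```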
3.3. Query-Click Logs

We used one year of query-click logs, which contained 72 million unique queries. As illustrated in Figure 2, the query-click logs are used to generate (query, intent) pairs as part of the training data for the labeling model. We simply enumerated all possible (query, intent) pairs such that a query is associated with an intent (attribute value) via a click relation in the logs.

3.4. Labeling Model

The labeling model takes a pair comprising a query (e.g., "lv bag zebra") and an intent (e.g., Brand: Louis Vuitton) as input and predicts whether the intent is valid for the query.

3.4.1. Model Architecture

BERT [11]-based models have been very promising for text pair classification and regression tasks, such as natural language inference (NLI) [32] and semantic textual similarity (STS) [33]. Since the task of the labeling model is binary classification, we used BertForSequenceClassification² with a pre-trained BERT model for Japanese³.

We intentionally adopted a very simple approach so that we could demonstrate the effectiveness of our method.

² https://huggingface.co/transformers/model_doc/bert.html
³ https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
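The following sketch shows how such a labeling model could be queried with the HuggingFace transformers API. The checkpoint name follows footnote 3; the sentence-pair encoding of (query, intent) and the assignment of class index 1 to "valid" are illustrative assumptions rather than specified details.

```python
# Minimal sketch of the labeling model's interface: encode (query, intent) as
# a BERT sentence pair, as in NLI/STS-style tasks, and read off P(valid).
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

NAME = "cl-tohoku/bert-base-japanese-whole-word-masking"  # needs fugashi/ipadic
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = BertForSequenceClassification.from_pretrained(NAME, num_labels=2)
model.eval()

def validity_probability(query: str, intent: str) -> float:
    """Return the classifier's probability that `intent` is valid for `query`
    (class index 1 = "valid" is an assumed label order)."""
    inputs = tokenizer(query, intent, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(validity_probability("lv bag zebra", "Brand: Louis Vuitton"))
```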
3.4.2. Training Data

As shown in Figure 2, we automatically generate the training data for the labeling model as the intersection of two sets of (query, intent) pairs: one set is generated with the base SQIP model⁴ and the other comes from the query-click logs. Although each of these two kinds of supervision signals is weak by itself, we can accurately obtain a large number of valid (query, intent) pairs by combining them.

To be specific, we obtained (query, intent) pairs such that the query is associated with the intent in the query-click logs and, given the query as input, the base SQIP model predicted the intent with probability 1.0. As a result, we generated 5.3 million (query, intent) pairs as positive examples for training the labeling model. We then generated 5.3 million (query, intent) pairs by randomly pairing queries and intents, which we used as negative examples.

⁴ The input to the base SQIP model is the queries in the query-click logs.
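The construction can be sketched as follows, under simple assumptions about the data structures: the click-log pairs as a set, and the base SQIP model as a callable returning a probability per intent.

```python
# Sketch of Section 3.4.2: positives are click-log pairs that the base SQIP
# model also predicts with probability 1.0; negatives are random re-pairings.
import random

def build_labeling_data(click_pairs, base_predict):
    """click_pairs: set of (query, intent); base_predict(query) -> {intent: prob}."""
    positives = {(q, i) for (q, i) in click_pairs
                 if base_predict(q).get(i, 0.0) >= 1.0}
    queries = [q for q, _ in positives]
    intents = [i for _, i in positives]
    negatives = set()
    while len(negatives) < len(positives):  # paper: equal numbers of each
        pair = (random.choice(queries), random.choice(intents))
        # Skipping pairs seen in the click logs is a safeguard we add here;
        # the paper only says queries and intents were paired randomly.
        if pair not in click_pairs:
            negatives.add(pair)
    return positives, negatives
```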
3.4.3. Training Detail

The labeling model was built with the training data and the model architecture described above. Training was done for one epoch with a batch size of 32, using the AdamW [34] optimizer.

3.5. Unlabeled Queries, Candidate Training Data, and Final Training Data

The second phase starts with predicting intents for unlabeled queries from the query logs with the base SQIP model to generate candidate training data. We then filter out erroneous intents with the labeling model to generate the final training data (a code sketch of this phase follows Table 3).

Unlabeled queries were obtained from seven years of query logs, which contained more than 1.5 billion unique queries.

Candidate training data were generated under the following conditions: k=5, meaning that the base SQIP model predicted at most the five most probable intents for a query, and threshold=1.0, i.e., only those intents whose probability was 1.0 were output. As a result, we obtained 377 million (query, intent) pairs. The number of unique queries was 264 million.

The final training data were those (query, intent) pairs whose probability given by the labeling model was at least 0.99. Consequently, we obtained 169 million (query, intent) pairs. The number of unique queries was 145 million. We trained and evaluated the SQIP model with this final training data, as reported in Section 4. Table 3 shows examples of the final training data.

Table 3: Examples of the final training data

Query | Intents
"alpha ma-1" | Brand: Alpha Industries
"orange t-shirt" | Color: Orange
"tropicana orange" | Fruit taste: Orange; Series: Tropicana
"gres perfume orange" | Scent: Orange; Brand: Gres
"washbowl 750" | Capacity: 600 - 899ml
"original message carnation" | Event/Holiday: Mother's Day
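The second phase can be summarized with the sketch below, in which the base SQIP model and the labeling model are passed in as callables; the function names and data shapes are illustrative assumptions.

```python
# Sketch of Section 3.5: candidate generation (top k=5 intents kept only at
# probability 1.0) followed by labeling-model filtering at threshold 0.99.

def candidate_pairs(queries, base_predict, k=5, threshold=1.0):
    """base_predict(query) -> {intent: prob}; yields candidate (query, intent)."""
    for q in queries:
        top = sorted(base_predict(q).items(), key=lambda kv: -kv[1])[:k]
        for intent, prob in top:
            if prob >= threshold:
                yield q, intent

def final_training_data(queries, base_predict, labeling_score):
    """labeling_score(query, intent) -> P(valid), e.g. the BERT labeling model."""
    return [(q, i) for q, i in candidate_pairs(queries, base_predict)
            if labeling_score(q, i) >= 0.99]
```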
4. Experiments

In this section, through large-scale SQIP experiments in which one predicts the intents of a given query, we claim the following:

1. Simply using query-click logs for training SQIP models delivers poor performance.
2. Using catalog data for training leads to better performance than simply using query-click logs but is still unsatisfactory.
3. Our proposed method, which exploits both catalog data and query-click logs, can generate even better training data.
4. Without the labeling model, the performance of our method degrades, indicating the effectiveness of the labeling model.
5. Our proposed method outperforms the competitive training data generation method based on data programming called Snorkel [14].

4.1. Experimental Conditions

In our experiments, we compared our proposed method with the four baseline methods described in Section 4.2. All the compared methods differ only in how they obtain training data. For classification, they use the same architecture, extremeText; specifically, all the methods trained their SQIP models with the PLT loss function and TF-IDF weights for words, and the other hyper-parameters were set to the default values.

Test data was manually created by a human annotator (who is not an author). The annotator was asked to check (query, intent) pairs that had been automatically generated by pairing a query with an intent such that at least one token in the query was semantically similar or relevant to the intent; this was done to exclude obviously erroneous (query, intent) pairs from all possible pairs in advance of the manual annotation.⁵ As a result, 5,615 different queries with at least one intent were obtained as test data, and a query was given 2.57 intents on average.

Evaluation was based on precision and recall, which were calculated with extremeText's test command. Precision and recall were calculated for the top k outputs (i.e., intents) with k being 1, 3, and 5, and we drew precision-recall curves for the compared methods for each k, with the probability threshold of extremeText varying from 0.0 to 1.0 at intervals of 0.01.

⁵ The semantic similarity was measured by the cosine similarity between sentence embeddings. We used fastText embeddings [35], which had been learned from the query logs. The threshold for the cosine similarity was set to 0.8.

4.2. Compared Methods

We compared the following five methods:

4.2.1. QueryClick

The simplest baseline is QueryClick, which uses query-click logs to generate training data in a similar way to the previous methods [9, 10]. Specifically, we used seven years of query-click logs and obtained (query, intent) pairs in which product pages that had the intent (i.e., attribute value) were clicked through the query at least ten times in the logs. The purpose of this filtering was to reduce the inherent noise in the query-click logs. As a result, we obtained more than 670 million (query, intent) pairs. The number of unique queries was 7,962,605, which indicates that each query was given as many as 84 intents on average. This number is obviously too large given that most queries consist of fewer than ten tokens, and it supports our claim that simply using query-click logs as training data is inadequate.

4.2.2. Base

This is the base SQIP model, which uses only product titles and their associated attribute values for training.

4.2.3. Proposed

This is a SQIP model trained with the final training data generated by our proposed method, as described in Section 3.

4.2.4. Proposed-LM

This is the same as Proposed except that it does not use the labeling model. Proposed-LM is thus trained with the candidate training data from the second phase; its training process is similar to self-training. Note that the difference in performance between Proposed-LM and Proposed can be seen as indicating the effectiveness of the labeling model.

4.2.5. Snorkel

This baseline is the same as Proposed, except that the labeling model is replaced with Snorkel [14], a training data generation method based on data programming [13]. Like Proposed's labeling model, Snorkel's labeling model can be learned without manual supervision. However, Snorkel requires labeling functions that implement a variety of domain knowledge, heuristics, and any other kind of weak supervision that would be useful for the given task. Each labeling function takes unlabeled data points as input and predicts their class labels. Snorkel then uses these weakly-labeled data points to train a generative labeling model that is supposed to label each data point more accurately than the labeling functions. Snorkel has influenced subsequent studies on training data generation [17] and has also been adopted by the world's leading organizations, as described at https://www.snorkel.org/. We therefore think that comparing with a Snorkel-based baseline effectively shows Proposed's performance.

Our Snorkel baseline, to be specific, was implemented in the following way. The input and output of Snorkel's labeling model are the same as those of Proposed's labeling model: the input (query, intent) pairs are generated with the base SQIP model, and the output is whether the given (query, intent) pairs are valid or not. We defined the following labeling functions, which utilize the same two kinds of weak supervision as Proposed, i.e., the query-click logs and the base SQIP model (a code sketch follows at the end of this subsection):

1. If the given intent is associated with the given query in the query-click logs, return valid; otherwise return invalid.
2. If the output probability of the base SQIP model for the given (query, intent) pair is 1.0, return valid; otherwise abstain.
3. Return invalid if the output probability is not greater than 0.995; otherwise abstain.

Snorkel's labeling model was trained with 11 million (query, intent) pairs that had been weakly labeled with these three labeling functions. Proposed's labeling model was trained with 10.6 million (query, intent) pairs, as described in Section 3.4.2.
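The three labeling functions can be expressed with the snorkel library roughly as follows; the toy data and the column layout (a precomputed base-model probability per pair) are illustrative assumptions, not the exact implementation.

```python
# Sketch of the three labeling functions and the generative label model,
# using snorkel's labeling API over a pandas DataFrame of candidate pairs.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

VALID, INVALID, ABSTAIN = 1, 0, -1

# Toy stand-ins for the paper's data sources (assumed for illustration).
CLICK_LOG_PAIRS = {("lv bag zebra", "Brand: Louis Vuitton")}
df_train = pd.DataFrame([
    {"query": "lv bag zebra", "intent": "Brand: Louis Vuitton", "base_prob": 1.0},
    {"query": "lv bag zebra", "intent": "Nation: Latvia", "base_prob": 0.31},
])

@labeling_function()
def lf_click_log(x):
    # LF1: valid iff the pair occurs in the query-click logs.
    return VALID if (x.query, x.intent) in CLICK_LOG_PAIRS else INVALID

@labeling_function()
def lf_base_high(x):
    # LF2: valid if the base SQIP model predicts the intent with probability 1.0.
    return VALID if x.base_prob >= 1.0 else ABSTAIN

@labeling_function()
def lf_base_low(x):
    # LF3: invalid if the base model's probability is not greater than 0.995.
    return INVALID if x.base_prob <= 0.995 else ABSTAIN

L_train = PandasLFApplier([lf_click_log, lf_base_high, lf_base_low]).apply(df_train)
label_model = LabelModel(cardinality=2)    # generative model over the LF votes
label_model.fit(L_train)
probs = label_model.predict_proba(L_train)  # P(valid) per (query, intent) pair
```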
4.3. Results

Figure 3: Precision-recall curves for all experiments.

Figure 3 shows the precision-recall curves for the compared methods, from which we make the following observations:

1. QueryClick's precision decreases sharply as we try to increase recall.
2. Base generally outperforms QueryClick, though its performance is still unsatisfactory.
3. Proposed outperforms all the other methods. Table 4 shows Proposed's best F1 scores and the corresponding precision, recall, and threshold values for each k.
4. Proposed-LM's performance is worse than that of Proposed.
5. Snorkel can deliver good performance but cannot outperform Proposed.

Table 4: Proposed's best F1 scores

k | F1 | Precision | Recall | Threshold
1 | 0.537 | 0.678 | 0.444 | 0.21
3 | 0.535 | 0.620 | 0.470 | 0.26
5 | 0.531 | 0.608 | 0.471 | 0.26

The relatively low performance of QueryClick and Base and the relatively high performance of Proposed and Snorkel indicate that query-click logs and catalog data alone provide only weak supervision, but that combining them can lead to higher performance.

Comparing Proposed with Snorkel shows the superiority of our labeling model over Snorkel. We think this is because the labeling functions of Snorkel, and learning methods with weak heuristic rules in general, are known to suffer from low coverage [26]; rules tend to apply to only a small subset of instances. In fact, the first labeling function for Snorkel covered only 1.94% of the training instances; the second and third labeling functions covered 60.61% and 12.73%, respectively. The labeling model of Proposed, on the other hand, is learned from natural language words and phrases, which BERT makes maximum use of; that is to say, the labeling model of Proposed does not waste the training data.

4.4. Error Analysis

Table 5: Examples of wrong predictions made by Proposed

Query | True Intents | Predicted Intents
"glen prince" | Brand: Glen Prince | Brand: Prince
"red wing engineer boots us 7.5" | Brand: Red Wing | Color: Red
"pc 3 12800 ddr 3 sdram" | Memory Standards: DDR3 | −

Table 5 shows examples of wrong predictions made by Proposed (k=1, threshold=0.21). Most of the errors were due to class imbalance in the training data; i.e., the distribution of training instances across the intents is biased or skewed, and intents with few or no instances tend to be difficult to predict [36]. Regarding the first example in Table 5, "prince" can be Brand: Prince and can also be part of Brand: Glen Prince. However, the frequency of the former intent in the final training data was 119,972, whereas that of the latter was only 33, which caused the SQIP model to choose the former for the query. Regarding the second example, the frequency of Color: Red was 1,486,315, while that of Brand: Red Wing was 15,592. For the last example, there was no training instance for Memory Standards: DDR3 in our final training data, and thus the SQIP model could not predict it.

4.5. Effect of Training Data Size

Figure 4: Changes in F1 due to different training data sizes for Proposed.

Figure 4 shows the F1 scores of Proposed built with final training data of different sizes ('K' and 'M' stand for 'thousand' and 'million'). The k and the threshold of extremeText were set to 1 and 0.21, respectively. The graph indicates that increasing the data size leads to better performance and that our final training data is effective for SQIP. Although the improvement from 10M to 145M is small, it is noteworthy that additional data could still improve a model already trained with as many as 10M instances.

5. Future Direction

For training data generation, one possible direction is to use product genre/category information. If we could create a query-to-product-genre mapping of reliable quality, we could filter the (query, intent) pairs further and create higher-quality training data. We could also utilize neighbor signals, since similar queries should have more labels in common, to remove further noise from the dataset.

For the classification model, one possibility is to use label (i.e., intent) context information to create the embedding vector of the input text (i.e., the shopping query). Similar previous work is by Chen et al. [37], who use LGuidedLearn [38] for product item category classification. Another possible method is the label-specific document representation for multi-label text classification by Xiao et al. [39]. Also, Cai et al. [40] propose a hybrid neural network model that simultaneously takes advantage of both label semantics and fine-grained text information. Further possibilities are contrastive learning and kNN-based methods [41, 42].

Another direction is to extend our proposed method to other domains. If we can find a way to exploit the weak supervision signals readily available in a domain for building the labeling model, we can easily apply our approach to that domain. In the case of text classification into Wikipedia categories [43], for instance, not only the category information in Wikipedia articles but also the links among corresponding articles in different languages and the class hierarchy in Wikidata [44] can be exploited.

As we saw in Section 4.4, data imbalance is an issue; we aim to address it in future work.

6. Conclusion

In this paper, we proposed a novel two-phased training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" that checks a large number of pairs comprising an intent and a query to generate training data for SQIP. We presented how to train such a model without manual supervision by utilizing a huge amount of online shopping data. Through a series of large-scale experiments with data from a real online shopping service, we demonstrated the effectiveness of our proposed method.

Acknowledgments

We thank our annotator, Saki Hiraga-san, for helping us create the evaluation dataset. We also thank all the researchers in RIT for their support of this project.
References

[1] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, M. Varma, Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising, in: Proceedings of the 2018 World Wide Web Conference, WWW '18, 2018, pp. 993–1002.
[2] S. Khandagale, H. Xiao, R. Babbar, Bonsai – diverse and shallow trees for extreme multi-label classification, 2019. arXiv:1904.08249.
[3] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. S. Dhillon, Taming pretrained transformers for extreme multi-label text classification, in: KDD '20, 2020, pp. 3163–3171.
[4] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, M. Varma, DeepXML: A deep extreme multi-label learning framework applied to short text documents, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, 2021, pp. 31–39.
[5] A. Mittal, K. Dahiya, S. Agrawal, D. Saini, S. Agarwal, P. Kar, M. Varma, DECAF: Deep extreme classification with label features, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, 2021, pp. 49–57.
[6] A. Mittal, N. Sachdeva, S. Agrawal, S. Agarwal, P. Kar, M. Varma, ECLARE: Extreme classification with label graph correlations, in: Proceedings of the Web Conference 2021, WWW '21, 2021, pp. 3721–3732.
[7] T. Jiang, D. Wang, L. Sun, H. Yang, Z. Zhao, F. Zhuang, LightXML: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, 2021, pp. 7987–7994.
[8] J. Zhang, W.-C. Chang, H.-F. Yu, I. S. Dhillon, Fast multi-resolution transformer fine-tuning for extreme multi-label text classification, in: 35th Conference on Neural Information Processing Systems, NeurIPS 2021, 2021.
[9] C. Wu, A. Ahmed, G. R. Kumar, R. Datta, Predicting latent structured intents from shopping queries, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, 2017, pp. 1133–1141.
[10] J. Zhao, H. Chen, D. Yin, A dynamic product-aware learning model for e-commerce query intent understanding, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1843–1852.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '19, 2019, pp. 4171–4186.
[12] M. Ben Noach, Y. Goldberg, Transfer learning between related tasks using expected label proportions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP '19, 2019, pp. 31–42.
[13] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29 of NeurIPS '16, 2016.
[14] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11 (2017) 269–282.
[15] B. Hancock, M. Bringmann, P. Varma, P. Liang, S. Wang, C. Ré, Training classifiers with natural language explanations, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 1884–1895.
[16] Y. Meng, J. Shen, C. Zhang, J. Han, Weakly-supervised neural text classification, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, 2018, pp. 983–992.
[17] A. Awasthi, S. Ghosh, R. Goyal, S. Sarawagi, Learning from rules generalizing labeled exemplars, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeuexBtDr.
[18] Y. Meng, Y. Zhang, J. Huang, C. Xiong, H. Ji, C. Zhang, J. Han, Text classification using label names only: A language model self-training approach, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP '20, 2020, pp. 9006–9017.
[19] G. Patrini, A. Rozza, A. K. Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '17, 2017, pp. 2233–2241.
[20] D. Hendrycks, M. Mazeika, D. Wilson, K. Gimpel, Using trusted data to train deep networks on labels corrupted by severe noise, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, 2018, pp. 10477–10486.
[21] G. Zheng, A. H. Awadallah, S. Dumais, Meta label correction for noisy label learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35 of AAAI '21, 2021.
[22] X. Li, Y.-Y. Wang, A. Acero, Extracting structured information from user queries with semi-supervised conditional random fields, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, 2009, pp. 572–579.
[23] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, 2009, pp. 1003–1011.
[24] F. Brahman, V. Shwartz, R. Rudinger, Y. Choi, Learning to rationalize for nonmonotonic reasoning with distant supervision, in: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI '21, AAAI Press, 2021, pp. 12592–12601.
[25] J. Shen, W. Qiu, Y. Meng, J. Shang, X. Ren, J. Han, TaxoClass: Hierarchical multi-label text classification using only class names, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '21, 2021, pp. 4239–4249.
[26] G. Karamanolakis, S. Mukherjee, G. Zheng, A. H. Awadallah, Self-training with weak supervision, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL '21, 2021, pp. 845–863.
[27] D. Zhang, Z. Li, T. Cao, C. Luo, T. Wu, H. Lu, Y. Song, B. Yin, T. Zhao, Q. Yang, QUEACO: Borrowing treasures from weakly-labeled behavior data for query attribute value extraction, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 4362–4372. URL: https://doi.org/10.1145/3459637.3481946. doi:10.1145/3459637.3481946.
[28] D. Zhang, T. Li, H. Zhang, B. Yin, On data augmentation for extreme multi-label classification, CoRR abs/2009.10778 (2020). URL: https://arxiv.org/abs/2009.10778.
[29] M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński, A no-regret generalization of hierarchical softmax to extreme multi-label classification, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NeurIPS '18, 2018, pp. 6358–6368.
[30] H.-F. Yu, K. Zhong, I. S. Dhillon, PECOS: Prediction for enormous and correlated output spaces, arXiv preprint arXiv:2010.05878 (2020).
[31] K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, E. Hullermeier, Extreme F-measure maximization using sparse probability estimates, in: M. F. Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 2016, pp. 1435–1444.
[32] Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, X. Zhou, Semantics-aware BERT for language understanding, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[33] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017, pp. 1–14.
[34] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: 7th International Conference on Learning Representations, ICLR '19, 2019.
[35] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).
[36] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
[37] L. Chen, H. Miyake, Label-guided learning for item categorization in e-commerce, in: NAACL, 2021.
[38] X. Liu, S. Wang, X. Zhang, X. You, J. Wu, D. Dou, Label-guided learning for text classification, 2020. URL: https://arxiv.org/abs/2002.10772. doi:10.48550/ARXIV.2002.10772.
[39] L. Xiao, X. Huang, B. Chen, L. Jing, Label-specific document representation for multi-label text classification, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 466–475. URL: https://aclanthology.org/D19-1044. doi:10.18653/v1/D19-1044.
[40] L. Cai, Y. Song, T. Liu, K. Zhang, A hybrid BERT model that incorporates label semantics via adjustive attention for multi-label text classification, IEEE Access 8 (2020) 152183–152192. doi:10.1109/ACCESS.2020.3017382.
[41] L. Zhu, H. Chen, C. Wei, W. Zhang, Enhanced representation with contrastive loss for long-tail query classification in e-commerce, in: Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 141–150. URL: https://aclanthology.org/2022.ecnlp-1.17. doi:10.18653/v1/2022.ecnlp-1.17.
[42] X. Su, R. Wang, X. Dai, Contrastive learning-enhanced nearest neighbor mechanism for multi-label text classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 672–679. URL: https://aclanthology.org/2022.acl-short.75. doi:10.18653/v1/2022.acl-short.75.
[43] O. Dekel, O. Shamir, Multiclass-multilabel classification with more classes than examples, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, 2010, pp. 137–144.
[44] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledge base, Communications of the ACM 57 (2014) 78–85. URL: http://cacm.acm.org/magazines/2014/10/178785-wikidata/fulltext.