<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Artificial Intelli</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/ARXIV.2002.10772</article-id>
      <title-group>
        <article-title>A Two-Phased Approach to Training Data Generation for Shopping Query Intent Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gautam Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chikara Hashimoto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rakuten Institute of Technology (RIT), Rakuten Group Inc.</institution>
          ,
          <addr-line>1-chōme-14 Tamagawa, Setagaya City, Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
          <addr-line>158-0094</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4239</fpage>
      <lpage>4249</lpage>
      <abstract>
        <p>Shopping Query Intent Prediction (SQIP) is, given an online shopping user's search query, e.g., “lv bag”, to predict their intents, e.g., Brand: Louis Vuitton. SQIP is an extreme multi-label classification task for which many excellent algorithms have been developed. However, little attention has been paid to how to create training data for SQIP. Previous studies used pseudo-labeled data derived from query-click logs for training and suffered from the noise in the logs. Although there are more sophisticated training data generation methods, they cannot be directly applied to SQIP. In this paper, we propose a novel training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" who checks a number of pairs comprising an intent and a query to generate training data for SQIP. We show that such a model can be trained without manual supervision by utilizing a huge amount of online shopping data. We demonstrate that the SQIP model trained with data generated by our labeling model outperforms a model trained with query-click logs only and a model trained with data created by a competitive data-programming-based method.</p>
      </abstract>
      <kwd-group>
        <kwd>training data generation</kwd>
        <kwd>data-centric ai</kwd>
        <kwd>shopping query intent</kwd>
        <kwd>text classification</kwd>
        <kwd>query attribute value extraction</kwd>
        <kwd>online shopping</kwd>
        <kwd>e-commerce query intent</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Online shoppers use search queries to search for products,</title>
        <p>and most queries have search intents that indicate what products shoppers want. For example, the query “lv bag zebra” has Brand: Louis Vuitton and Pattern: Zebra as its intents, as shown in Table 1.1</p>
        <p>Table 1: Example queries and their intents. “lv bag zebra”: Brand: Louis Vuitton, Pattern: Zebra; “100% orange juice”: Fruit taste: Orange; “cologne orange blossom”: Scent: Orange; “sneaker mens orange”: Color: Orange; “wheel 19inch”: Tire size: 18 - 19.9inch; “nicole down jacket”: Brand: Nicole, Filling: Feather.</p>
        <p>In this study, we assume that queries’ intents are represented with attribute values of products defined in an online shopping service. Notice that simple string matching between queries and intents would not work, since queries are written in natural languages; they can be represented with abbreviations, e.g., “lv” for “Louis Vuitton”, and ambiguous words, e.g., “orange”, as indicated in Table 1. Moreover, intents might not always be explicitly written in queries, as the last example in the table illustrates.</p>
        <p>These intents, once correctly predicted, would be utilized by a search system to retrieve relevant products, since most products sold at an online shopping service have attribute values such as Brand: Louis Vuitton. If we aggregate these intents in bulk, they will be very useful in understanding the trends of different attributes, e.g., shoes of which brand and color the users wanted the most in the last month. They will also be very helpful in understanding the overall market demand, which could help the merchants and the manufacturing companies.</p>
        <p>Shopping query intent prediction (SQIP), given a query, predicts its intents by selecting the most relevant subset of attribute values from the attribute-value inventory defined in an online shopping service. In other words, SQIP gives a natural language query a structure to facilitate the retrieval of products.</p>
        <p>In brief, our proposed method has the following two phases:
1. Making of the Labeling Model: Our labeling model is a binary classification model which predicts whether a given (query, intent) pair is valid or not. For this, we generate good-quality training data and train a BERT sequence classification model. For the data generation, we follow these steps:
a) Create the base SQIP model, trained on product catalog data with the input “product title” (which could be considered a long pseudo shopping query) and the output “attribute values” (which could be considered the pseudo shopping query’s intents).
b) Generate (query, intent) pairs by getting intents of queries from the query-click logs using the base SQIP model, and take the intersection with the (query, intent) pairs from the query-click logs.
2. Training Data Generation for SQIP: From raw queries, get intents using the base SQIP model and filter these intents using the labeling model.</p>
        <p>The contributions of this paper are the following:
1. We present a novel two-phased approach to training data generation for SQIP that requires no manual supervision.
2. We present how to build the labeling model, the key module of our two-phased approach, by combining weak supervision signals readily available in online shopping services.
3. We empirically demonstrate that our two-phased approach is effective through large-scale experiments.</p>
        <p>DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. † These authors contributed equally. gautam.kumar@rakuten.com (G. Kumar); chikara.hashimoto@rakuten.com (C. Hashimoto); https://chikarahashimoto.wixsite.com/home (C. Hashimoto). 1 Intents are represented in the form Attribute-name: Attribute-value in this paper. We also represent attribute values of products in a similar way.</p>
        <sec id="sec-1-1-1">
          <title>1.1. Background</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>SQIP is an extreme multi-label text classification task for</title>
        <p>
          which many excellent algorithms have been developed recently [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5, 6, 7, 8</xref>
          ]. These classification algorithms can be used for SQIP once high-quality training data is available.
        </p>
        <p>However, obtaining high-quality training data for SQIP is not straightforward. First of all, manual creation of a sufficient volume of training data would be infeasible, because there are tens of thousands of predefined intents and understanding shopping query intents would require deep knowledge of a large number of product domains. Accordingly, previous studies [9, 10] used query-click logs to automatically generate training data by assuming that if a product has an attribute value like Brand: Louis Vuitton and the page of the product is clicked by a user who issued a query like “lv bag zebra,” an intent of the query is Brand: Louis Vuitton. This heuristic suffers from the inherent noise in query-click logs due to, for instance, inconsistent click behaviors of fickle users or erroneous retrieval results. Besides, it cannot utilize a number of queries that are absent in query-click logs. Despite the notable difficulty of obtaining high-quality training data, little attention has been paid to the problem in previous SQIP studies.</p>
        <p>Due to the success of pre-trained models [11], transfer learning has also been popular recently [12], where pre-trained models can be seen as providing weak supervision. With this approach, one fine-tunes a model that has been trained on a relevant task for the purpose of the target task, using a reasonable amount of quality training data, which we cannot expect in SQIP.</p>
        <p>There have also been many studies on combining weak supervision signals to dispense with manually annotated training data [13, 14, 15, 16, 17, 18], which would be useful if we may devise more than one kind of weak supervision signal for a given task. For SQIP, however, it would be infeasible to assume that labeling functions [14, 15, 17] or keywords [16, 18] for target classes can be frequently applied to or matched against queries, since queries are usually very short and diverse. It would also be infeasible to prepare labeling functions or keywords for each class, since the number of classes in SQIP amounts to tens of thousands and the classes can change over time.</p>
        <p>Automatically correcting corrupted labels has also gained much attention recently [19, 20, 21]. These methods learn label corruption matrices, which would be prohibitively large in SQIP since it has to deal with tens of thousands of classes.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.2. Preview of the Proposed Method</title>
        <p>What makes training data generation for SQIP difficult? We think it is the large number of classes; considering many classes for a query at once tends to be difficult. We therefore propose to decompose the task into two phases.</p>
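        <p>The two-phase decomposition can be sketched in code as follows. This is a minimal illustration; the function and helper names are ours, not taken from the paper's implementation:</p>
```python
# Phase 2 of the proposed method, sketched with hypothetical helpers:
# base_sqip_predict(query) returns candidate intents (the phase-1 base model),
# labeling_prob(query, intent) is the labeling model's validity probability.
def generate_training_data(queries, base_sqip_predict, labeling_prob, threshold=0.99):
    final = []
    for query in queries:
        for intent in base_sqip_predict(query):            # candidate (query, intent) pairs
            if labeling_prob(query, intent) >= threshold:  # keep only pairs judged valid
                final.append((query, intent))
    return final
```
        <p>Here the labeling model acts as the machine annotator: each candidate pair is accepted or rejected independently, so the full set of classes never has to be considered at once.</p>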
        <p>In the first phase, we build a labeling model that checks
whether an intent is valid for a query. In the second phase,
we use this labeling model to verify each pair comprising
a query and an intent on a large scale. Here, the labeling
model can be seen as an annotator who is asked to create
training data for SQIP. Refer to Figures 1 and 2 for more details.</p>
        <p>How can we build the labeling model? We propose to utilize catalog data and query-click logs, since they are readily available in online shopping services and provide weak but different supervision signals, so that they would reinforce each other, as we will demonstrate in Section 4.</p>
        <p>The base SQIP model is a weak SQIP model that takes queries as input and predicts their intents, from which we generate a set of (query, intent) pairs. The base SQIP model is trained with catalog data, the database of products sold at an online shopping service, where various information about products, such as product titles and their attribute values, is registered. Product titles are usually a set of words that describe the features of products, such as “Louis Vuitton Shoulder bag Leather Zebra print,” which can be seen as lengthy, detailed, merchant-made pseudo queries about the products. Since these titles (i.e., pseudo queries) are associated with attribute values of products (i.e., intents), we can use the catalog data to train the base SQIP model without manual annotation.</p>
        <p>Figures 1 and 2 (diagrams omitted) depict the two phases. In Phase 1, the base SQIP model and the query-click logs together produce training data for the labeling model. In Phase 2, the base SQIP model proposes candidate intents for unlabeled queries, e.g., “lv bag zebra” with candidates Brand: Louis Vuitton, Nation: Latvia, and Pattern: Zebra; the labeling model then judges each (query, intent) pair as valid or invalid, so that only the valid pairs, e.g., (“lv bag zebra”, Brand: Louis Vuitton) and (“lv bag zebra”, Pattern: Zebra), enter the final training data.</p>
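        <p>Deriving training instances for the base SQIP model from catalog records can be sketched as follows (an illustrative sketch; the record layout is assumed, not taken from the paper):</p>
```python
# Each catalog record yields one multi-label instance: the product title is
# treated as a lengthy pseudo query, and its attribute values as the intents.
def catalog_to_instances(catalog_records):
    instances = []
    for rec in catalog_records:
        labels = [f"{name}: {value}" for name, value in rec["attributes"]]
        instances.append((rec["title"], labels))
    return instances
```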
        <p>Query-click logs indicate the association between queries and clicked items’ attribute values (i.e., intents). We generate another set of (query, intent) pairs based on this association.</p>
        <p>Catalog data provides the direct evidence of the association between product titles and attribute values (intents), but the titles are not real queries. In contrast, click logs show the association between real queries and intents, but it is only indicated indirectly and tends to be noisy. However, these two data sources can generate reliable training data for the labeling model in tandem.</p>
        <p>In summary, our proposed method creates a "machine annotator", namely the labeling model, using a huge amount of online shopping data to generate training data for SQIP on a large scale without requiring any manual labor.</p>
        <p>Through large-scale SQIP experiments, we demonstrate that the model trained with data generated by our proposed method outperforms a model trained with query-click logs only and a model trained with data created by a competitive training data generation method based on data programming [14].</p>
        <p>All the data used in this study were obtained from an online shopping service, Rakuten, and written in Japanese. However, the ideas and methods in this paper are independent of particular languages, and the examples in this paper are written in English for ease of explanation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Shopping Query Intent Prediction</title>
        <sec id="sec-2-1-1">
          <title>Previous methods for SQIP can be categorized into</title>
          <p>classification-based methods [9, 10] and sequence-labeling-based methods [22].</p>
          <p>In this study, our proposed method generates training data for the classification-based methods, for the following two reasons. First, with sequence-labeling-based methods, it would be more difficult to deal with tens of thousands of classes, while, for classification-based methods, there have recently been many excellent extreme classification algorithms that can handle a huge number of classes. Second, sequence-labeling-based methods deal with only intents that are explicitly written in queries. However, valid intents are not always explicit in queries; e.g., “nicole down jacket” has Filling: Feather as its valid intent.</p>
          <p>Our study is different from previous ones because we focus on how to obtain a huge volume of high-quality training data for SQIP, rather than how to classify queries. Previous studies simply used query-click logs to obtain pseudo-labeled data [9, 10], which tends to be noisy and unreliable. We will demonstrate that our proposed method can generate better training data in Section 4.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Learning with Weak Supervision</title>
        <sec id="sec-2-2-1">
          <title>Our study can be seen as answering the research question of how to train supervised models without relying on manual annotation, and therefore studies on learning with weak supervision are quite relevant.</title>
          <p>As we discussed in Section 1, most of the previous weak supervision methods are not appropriate for SQIP, since they require external knowledge bases [23, 24], a reasonable amount of quality training data [12], labeling functions or keywords for target classes [14, 15, 16, 17, 18], or label corruption matrices to be learned [19, 20, 21]. Shen et al. proposed learning classifiers with only class names [25]. However, their method assumes that classes are organized in a hierarchy, so we cannot use their method for SQIP, where classes (intents) are not organized in a hierarchy. Karamanolakis et al. [26] proposed a method that works with weak supervision such as lexicons, regular expressions, and knowledge bases of the target domain. However, such weak supervision would become obsolete quickly in SQIP, as discussed in Section 1. Zhang et al. [27] proposed a teacher-student network method which utilizes weakly labeled behaviour data for SQIP. However, they do use strongly labeled data in their training methodology to train the teacher network.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Extreme Multi-Label Classification</title>
        <p>
          SQIP is an instance of extreme multi-label classification (XML), which tags a data point with the most relevant subset of labels from an extremely large label set and has gained much attention recently [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3, 7, 8</xref>
          ]. While many classification algorithms have been proposed, training data generation for XML has not been well studied. Zhang et al. [28] addressed data augmentation for XML, which assumed the existence of training data and thus cannot be applied to our setting. Our study therefore differs from previous XML studies, since we directly tackle the task of training data generation, though our method is specifically designed for SQIP.
        </p>
        <p>For a more comprehensive overview of classification algorithms and data sets for XML, refer to http://manikvarma.org/downloads/XC/XMLRepository.html.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <sec id="sec-3-1">
        <title>In this section, we describe each component of our</title>
        <p>method, as illustrated in Figures 1 and 2: catalog data, the base SQIP model, query-click logs, the labeling model, unlabeled queries, candidate training data, and the final training data.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Catalog Data</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Catalog data contains various information of products</title>
        <p>sold at the shopping service, including product titles,
descriptions, prices, various attribute values such as brands,
sizes, and colors, among others. We use product titles
and attribute values to train the base SQIP model, since
product titles are usually a set of words that indicate
the features of products and can consequently be seen
as lengthy, detailed queries about the products. Table 2
shows examples of product titles and their attribute
values in our catalog data, and indicates diferences between
product titles and real queries. First, product titles
sometimes contain tokens that would not appear in queries
usually, such as “[Unused]” and “[In-stock, May 15].”
Second, real queries are usually much shorter than product
titles. Third, attribute values might not always mean
intent. For example, color: red is not intent if we consider
product title as shopping query in first example of table
2. Catalog data is a useful data source for training a SQIP
model but is not suficiently reliable by itself due to these
diferences.</p>
        <p>To train the base SQIP model, we used 117 million
product titles and their associated attribute values. The
number of diferent attribute values was 19,416.</p>
        <sec id="sec-3-2-1">
          <title>3.3. Query-Click Logs</title>
          <p>We used one year of query-click logs, which contained
72 million unique queries. As illustrated in Figure 2, the
query-click logs are used to generate (query, intent) pairs
as part of training data for the labeling model. We simply
enumerated all possible (query, intent) pairs such that a
query is associated with an intent (attribute-value) via
click relations in the logs.</p>
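          <p>The enumeration can be sketched as follows (an illustrative data layout is assumed: click records as (query, product id) pairs and a mapping from product ids to attribute values; neither structure is from the paper):</p>
```python
def pairs_from_click_logs(clicks, product_intents):
    # clicks: iterable of (query, product_id) click events.
    # product_intents: dict mapping product_id to its attribute values (intents).
    pairs = set()
    for query, product_id in clicks:
        for intent in product_intents.get(product_id, []):
            pairs.add((query, intent))  # the query is linked to the intent via a click
    return pairs
```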
        </sec>
        <sec id="sec-3-2-2">
          <title>3.4. Labeling Model</title>
          <p>The labeling model takes a pair comprising a query (e.g., “lv bag zebra”) and an intent (e.g., Brand: Louis Vuitton) as input and predicts whether the intent is valid for the query.</p>
          <p>3.2. Base SQIP Model</p>
          <p>
            The base SQIP model takes unlabeled queries, such as “lv bag zebra,” as input and predicts their intents, such as Brand: Louis Vuitton and Pattern: Zebra. We had to deal with hundreds of millions of training instances in our experiments (Section 4) and chose extremeText [29]. It was the only extreme multi-label classification method that we experimented with that could handle all training instances in our environment. Other extreme multi-label classification methods we experimented with include Parabel [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], Bonsai [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], LightXML [7], XR-Linear [30], and XR-Transformer [8].
          </p>
          <p>The classification algorithm of extremeText is based on probabilistic label trees (PLT) [31], in which leaf nodes represent the target labels and the other nodes are logistic regression classifiers. PLT guides data points from the root node into their appropriate leaf nodes (labels) with the logistic regression classifiers. For training the model, we did not conduct extensive hyper-parameter tuning; we used its default hyper-parameters, except that we chose PLT as the loss function and used the TF-IDF weights for words.</p>
          <p>3.4.1. Model Architecture</p>
          <p>BERT[11]-based models have been very promising for text pair classification and regression tasks, such as natural language inference (NLI) [32] and semantic textual similarity (STS) [33]. Since the task of the labeling model is binary classification, we used BertForSequenceClassification2, where we use a pretrained BERT model for Japanese3. We intentionally adopted a very simple approach so that we could demonstrate the effectiveness of our method.</p>
          <p>3.4.2. Training Data</p>
          <p>As shown in Figure 2, we automatically generate training data for the labeling model as the intersection of two sets of (query, intent) pairs; one set is generated with the base SQIP model4 and the other comes from the query-click logs. Although each of these two kinds of supervision signals is weak by itself, we can accurately obtain a number of valid (query, intent) pairs by combining them.</p>
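          <p>The intersection of the two weak sources, plus random negatives, can be sketched as follows. This is a simplified illustration with names of our own choosing; in the paper, 5.3 million positive and 5.3 million negative pairs are produced this way:</p>
```python
import random

def build_labeling_training_set(base_model_pairs, click_log_pairs, seed=0):
    # Positives: (query, intent) pairs supported by BOTH weak sources.
    positives = sorted(set(base_model_pairs).intersection(click_log_pairs))
    # Negatives: random re-pairings of the same queries and intents,
    # excluding anything that is already a positive example.
    rng = random.Random(seed)
    intents = [intent for _, intent in positives]
    negatives = []
    for query, _ in positives:
        candidate = (query, rng.choice(intents))
        if candidate not in positives:
            negatives.append(candidate)
    return positives, negatives
```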
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>2https://huggingface.co/transformers/model_doc/bert.html.</title>
        <p>3https://huggingface.co/cl-tohoku/bert-base-japanese-whole-wordmasking
4The input to the base SQIP model is the queries in the query click
logs.</p>
        <p>Intents
Brand: Alpha Industries
Color: Orange
Fruit taste: Orange
Series: Tropicana
Scent: Orange, Brand: Gres
Capacity: 600 - 899ml</p>
        <p>Event/Holiday: Mother’s Day
3. Our proposed method that exploits both catalog
data and query-click logs can generate even better
training data.
4. Without the labeling model, the performance of
our method degrades, indicating the efectiveness
of the labeling model.
5. Our proposed method outperforms the
competitive training data generation method based on
data programming called Snorkel [14].
3.4.3. Training Detail</p>
      </sec>
      <sec id="sec-3-4">
        <title>The labeling model has been built with the training data and the model architecture, as described above. Training is done for one epoch with batch size 32 using AdamW [34] optimizer.</title>
        <sec id="sec-3-4-1">
          <title>3.5. Unlabeled Queries, Candidate Training Data, and Final Training Data</title>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>The second phase starts with predicting intents for unla</title>
        <p>beled queries from query logs with the base SQIP model
to generate candidate training data. We then filter out
erroneous intents with the labeling model to generate 4.1. Experimental Conditions
the final training data. In our experiments, we compared our proposed method</p>
        <p>Unlabeled queries were obtained from seven years of with four baseline methods described in Section 4.2. All
query logs, which contained more than 1.5 billion unique the compared methods difer only in how they obtain
queries. training data. For classification, they use the same
archi</p>
        <p>Candidate training data were generated under the fol- tecture, extremeText; specifically, all the methods trained
lowing condition: =5, meaning that the base SQIP model their SQIP model with the PLT loss function and the
TFpredicted the most probable five intents for a query at IDF weights for words; the other hyper-parameters were
most, and threshold=1.0, i.e., only those intents whose set to the default values.
probability was 1.0 were outputted. As a result, we ob- Test data has been manually created by a human
antained 377 million (query, intent) pairs. The number of notator (who is not an author). The annotator was asked
unique queries was 264 million. to check (query, intent) pairs that were automatically</p>
        <p>The final training data were those (query, intent) pairs generated by pairing a query and an intent, such that
whose probability given by the labeling model was at at least one token in the query was semantically similar
least 0.99. Consequently, we obtained 169 million (query, or relevant to the intent in order to exclude obviously
intent) pairs. The number of unique queries was 145 erroneous (query, intent) pairs from all possible pairs in
million. We trained and evaluated the SQIP model with advance of manual annotation.5 As a result, 5,615
diferthis final training data, as reported in Section 4. Table 3 ent queries with at least one intent were obtained as test
shows examples of the final training data. data, and 2.57 intents were given to a query, on average.
Evaluation was based on precision and recall, which
4. Experiments were calculated with extremeText’s test command.
Precision and recall were calculated for top  outputs (i.e.,
In this section, through large-scale SQIP experiments in intents) with  being 1, 3, and 5, and we drew
precisionwhich one predicts intents of a given query, we claim the recall curves for the compared methods for each  with
following: the probability threshold of extremeText changing from
0.0 to 1.0 with the interval of 0.01.</p>
      </sec>
      <sec id="sec-3-6">
        <title>To be specific, we obtained (query, intent) pairs such Table 3</title>
        <p>that the query is associated with the intent in the query- Examples of the final training data
click logs and also, given the query as input, the base
SQIP model predicted the intent with probability 1.0. As Query
a result, we generated 5.3 million (query, intent) pairs “alpha ma-1”
as positive examples for training of the labeling model. “orange t-shirt”
We then generated 5.3 million (query, intent) pairs by “tropicana orange”
randomly pairing queries and intents, which we used as
negative examples.
“gres perfume orange”
“washbowl 750”
“original message carnation”
1. Simply using query-click logs for training SQIP</p>
        <p>models delivers poor performance.
2. Using catalog data for training leads to better
performance than simply using query-click logs
but is still unsatisfactory.
5The semantic similarity was measured by the cosine similarity
between their sentence embeddings. We use fastText embeddings [35],
which had been learned from the query logs. The threshold for the
cosine similarity was set to 0.8.
years of query-click logs and obtained (query, intent)
pairs in which product pages that had the intent (i.e.,
attribute value) were clicked through the query at least
ten times in the logs. The purpose of this was to reduce
the inherent noise in the query-click logs. As a result, we
obtained more than 670 million (query, intent) pairs. The
number of unique queries was 7,962,605, which indicated
that each query was given as many as 84 intents on
average. This number is obviously too large given that most
queries consist of less than ten tokens and supports our
claim that simply using query-click logs as training data
would be inadequate.
4.2.2. Base</p>
      </sec>
      <sec id="sec-3-7">
        <title>This is the base SQIP model, which uses only product titles and their associated attribute values for training.</title>
        <p>4.2.3. Proposed</p>
      </sec>
      <sec id="sec-3-8">
        <title>This is a SQIP model trained with the final training data generated with our proposed method, as described in Section 3.</title>
        <p>4.2.4. Proposed-LM</p>
      </sec>
      <sec id="sec-3-9">
        <title>This is the same as Proposed except that it does not</title>
        <p>use the labeling model. Proposed-LM is then trained
with the candidate training data in the second phase; its
training process is similar to self-training. Note that the
diference in performances between Proposed-LM and
Proposed can be seen as indicating the efectiveness of
the labeling model.
4.2.5. Snorkel</p>
      </sec>
      <sec id="sec-3-10">
        <title>This baseline is the same as Proposed, except that the</title>
        <p>labeling model is replaced with Snorkel [14], a
training data generation method based on data programming
[13]. Like Proposed’s labeling model, Snorkel’s
labeling model can be learned without manual supervision.</p>
        <p>However, Snorkel requires labeling functions that
implement a variety of domain knowledge, heuristics, and
any kind of weak supervision that would be useful for a
given task. Each labeling function takes unlabeled data points as input and predicts their class labels. Snorkel then uses these weakly-labeled data points to train a generative labeling model that is supposed to label each data point more accurately than the labeling functions do. Snorkel has influenced subsequent studies on training data generation [17] and has also been adopted by the world's leading organizations, as described at https://www.snorkel.org/. We therefore think that comparing with the Snorkel-based baseline effectively shows Proposed's performance.</p>
        <p>Figure 3: Precision-recall curves for all experiments.</p>
        <p>4.2. Compared Methods</p>
        <p>We compared the following five methods.</p>
        <p>4.2.1. QueryClick</p>
        <p>The simplest baseline is QueryClick, which uses query-click logs to generate training data in a similar way to the previous methods [9, 10]. Specifically, we used seven</p>
        <p>Snorkel's labeling model was trained with 11 million (query, intent) pairs that had been weakly labeled with the following three labeling functions:</p>
        <p>1. If the given intent is associated with the given query in the query-click logs, return valid; otherwise return invalid.</p>
        <p>2. If the output probability of the base SQIP model for the given (query, intent) pair is 1.0, return valid; otherwise abstain.</p>
        <p>3. If the output probability is not greater than 0.995, return invalid; otherwise abstain.</p>
        <p>Proposed's labeling model was trained with 10.6 million (query, intent) pairs as described in Section 3.4.2.</p>
        <p>4.3. Results</p>
        <p>Table 5 illustrates examples of wrong predictions made by Proposed (k=1, threshold=0.21). Most of the errors were due to the class imbalance in the training data; i.e., the distribution of training instances across the intents is biased or skewed, and intents for which we have few or no instances tend to be difficult to predict [36]. Regarding the first example in Table 5, “prince” can be Brand: Prince and can also be part of Brand: Glen Prince. However, the frequency of the former intent in the final training data was 119,972, whereas that of the latter was only 33, which caused the SQIP model to choose the former for the query. Regarding the second example, the frequency of Color: Red was 1,486,315, while that of Brand: Red Wing was 15,592. For the last one, there was no training instance for Memory Standards: DDR3 in our final training data, and thus the SQIP model could not predict it.</p>
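The three labeling functions used for the Snorkel baseline in Section 4.2 can be sketched as plain Python predicates. This is a minimal illustration, not the paper's code: the click-log dictionary and the base-model probability function are hypothetical stand-ins, and the integer labels follow Snorkel's usual convention of -1 for abstention.

```python
# Snorkel-style integer labels; -1 conventionally means "abstain".
VALID, INVALID, ABSTAIN = 1, 0, -1

def lf_query_click(query, intent, click_log):
    """LF1: valid iff the intent co-occurs with the query in the logs."""
    return VALID if intent in click_log.get(query, set()) else INVALID

def lf_high_confidence(query, intent, base_model_prob):
    """LF2: valid iff the base SQIP model is fully confident (p == 1.0)."""
    return VALID if base_model_prob(query, intent) == 1.0 else ABSTAIN

def lf_low_confidence(query, intent, base_model_prob):
    """LF3: invalid unless the base model's probability exceeds 0.995."""
    return INVALID if base_model_prob(query, intent) <= 0.995 else ABSTAIN
```

In the actual baseline, the weak labels emitted by these three functions over 11 million (query, intent) pairs are then reconciled by Snorkel's generative labeling model.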
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Future Direction</title>
      <sec id="sec-4-1">
        <title>Training data generation</title>
        <p>For training data generation, one possible direction is to use product genre/category information.</p>
      </sec>
      <sec id="sec-4-2">
        <title>The relatively low performance of QueryClick and Base and the relatively high performance of Proposed and Snorkel indicate that query-click logs and catalog</title>
        <p>1. QueryClick's precision decreases sharply as we try to increase recall.</p>
        <p>2. Base generally outperforms QueryClick, though its performance is still unsatisfactory.</p>
        <p>3. Proposed outperforms all the other methods. Table 4 shows Proposed's best F1 scores and their corresponding precision, recall, and threshold values for each k.</p>
        <p>4. Proposed-LM's performance is worse than that of Proposed.</p>
        <p>5. Snorkel can deliver good performances but cannot outperform Proposed.</p>
        <p>4.5. Effect of Training Data Size</p>
        <p>Figure 4 shows the F1 scores of Proposed built with final training data of different sizes ('K' and 'M' stand for 'thousand' and 'million'). The k and the threshold of extremeText were set to 1 and 0.21 uniformly. The graph indicates that increasing the data size leads to better performance and that our final training data is effective for SQIP. Although the improvement from 10M to 145M is small, it is noteworthy that additional data could improve a model already trained with as many as 10M instances.</p>
        <p>Creating a query-to-product-genre mapping of reliable quality would let us filter (query, intent) pairs further and create higher-quality training data. Also, we could utilize neighbor signals, since similar queries should have more labels in common, to remove noise from the dataset further.</p>
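As a rough illustration of the neighbor-signal idea, the sketch below drops an intent from a query when too few sufficiently similar queries also carry it. The Jaccard token overlap, the similarity threshold, and the support count are assumptions for illustration, not the paper's method.

```python
def jaccard(a, b):
    """Token-overlap similarity between two query strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def denoise(pairs, min_support=1, sim_threshold=0.5):
    """pairs: dict mapping query -> set of intents; returns a filtered copy."""
    kept = {}
    for q, intents in pairs.items():
        neighbors = [p for p in pairs
                     if p != q and jaccard(p, q) >= sim_threshold]
        if not neighbors:              # no similar queries to vote: keep all
            kept[q] = set(intents)
            continue
        # keep an intent only if enough neighbors also carry it
        kept[q] = {i for i in intents
                   if sum(i in pairs[n] for n in neighbors) >= min_support}
    return kept
```

A production version would replace the token-overlap similarity with embedding-based query similarity and tune the thresholds against held-out data.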
        <p>For the classification model, one possibility is to use label (i.e., intent) context information to create the embedding vector of the input text (i.e., the shopping query). Similar previous work is by Chen et al. [37], who use LGuidedLearn [38] for product item category classification.</p>
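A toy sketch of label-guided encoding in the spirit of this direction (not LGuidedLearn itself): the query vector is augmented with an attention-weighted mixture of label (intent) embeddings. All vectors here are stand-ins; a real model would learn them jointly.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def label_guided_encoding(query_vec, label_vecs):
    """Concatenate the query vector with an attention-weighted
    mixture of label embeddings, so label context informs the input
    representation."""
    weights = softmax([dot(query_vec, lv) for lv in label_vecs])
    dim = len(query_vec)
    mixed = [sum(w * lv[d] for w, lv in zip(weights, label_vecs))
             for d in range(dim)]
    return query_vec + mixed
```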
        <p>Another possible method is the label-specific document representation for multi-label text classification of Xiao et al. [39]. Also, Cai et al. [40] propose a hybrid neural network model that simultaneously takes advantage of both label semantics and fine-grained text information. Other possibilities are contrastive learning and kNN-based methods [41, 42].</p>
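A minimal sketch of the kNN direction, under the assumption of a simple token-overlap similarity: each intent is scored by how often it appears among the k most similar training queries. This is an illustrative baseline, not the paper's implemented model.

```python
def knn_intents(query, train, k=2):
    """train: list of (query, set_of_intents) pairs.
    Returns a dict scoring each intent by its count among the
    k nearest training queries."""
    def sim(a, b):
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    ranked = sorted(train, key=lambda qi: sim(query, qi[0]),
                    reverse=True)[:k]
    scores = {}
    for _, intents in ranked:
        for intent in intents:
            scores[intent] = scores.get(intent, 0) + 1
    return scores
```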
        <p>Another direction is to extend our proposed method to other domains. If we could find a way to exploit weak supervision signals readily available in a domain for building the labeling model, we could easily apply our approach to that domain. In the case of text classification into Wikipedia categories [43], for instance, not only the category information in Wikipedia articles but also the links among corresponding articles in different languages and the class hierarchy in Wikidata [44] can be exploited.</p>
        <p>As we saw in Section 4.4, data imbalance is an issue; we aim to address it in future work.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>In this paper, we proposed a novel two-phased training data generation method for SQIP. The idea is to first build a labeling model that checks whether an intent is valid for a query. The model then works as an "annotator" who checks a number of pairs, each comprising an intent and a query, to generate training data for SQIP. We presented how to train such a model without manual supervision by utilizing a huge amount of online shopping data. Through a series of large-scale experiments with data from a real online shopping service, we demonstrated the effectiveness of our proposed method.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>We thank our annotator Saki Hiraga-san for helping us create the evaluation dataset. We thank all the researchers in RIT for their support of this project.</title>
        <p>fication with label features, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, 2021, pp. 49–57.</p>
        <p>[6] A. Mittal, N. Sachdeva, S. Agrawal, S. Agarwal, P. Kar, M. Varma, Eclare: Extreme classification with label graph correlations, in: Proceedings of the Web Conference 2021, WWW '21, 2021, pp. 3721–3732.</p>
        <p>[7] T. Jiang, D. Wang, L. Sun, H. Yang, Z. Zhao, F. Zhuang, Lightxml: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, 2021, pp. 7987–7994.</p>
        <p>[8] J. Zhang, W.-C. Chang, H.-F. Yu, I. S. Dhillon, Fast multi-resolution transformer fine-tuning for extreme multi-label text classification, in: 35th Conference on Neural Information Processing Systems, NeurIPS 2021, 2021.</p>
        <p>[9] C. Wu, A. Ahmed, G. R. Kumar, R. Datta, Predicting latent structured intents from shopping queries, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, 2017, pp. 1133–1141.</p>
        <p>[10] J. Zhao, H. Chen, D. Yin, A dynamic product-aware learning model for e-commerce query intent understanding, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1843–1852.</p>
        <p>[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '19, 2019, pp. 4171–4186.</p>
        <p>[12] M. Ben Noach, Y. Goldberg, Transfer learning between related tasks using expected label proportions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP '19, 2019, pp. 31–42.</p>
        <p>[13] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29 of NeurIPS '16, 2016.</p>
        <p>[14] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11 (2017) 269–282.</p>
        <p>[15] B. Hancock, M. Bringmann, P. Varma, P. Liang, S. Wang, C. Ré, Training classifiers with natural language explanations, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (2018) 1884–1895.</p>
        <p>[16] Y. Meng, J. Shen, C. Zhang, J. Han, Weakly-supervised neural text classification, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, 2018, pp. 983–992.</p>
        <p>[17] A. Awasthi, S. Ghosh, R. Goyal, S. Sarawagi, Learning from rules generalizing labeled exemplars, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeuexBtDr.</p>
        <p>[18] Y. Meng, Y. Zhang, J. Huang, C. Xiong, H. Ji, C. Zhang, J. Han, Text classification using label names only: A language model self-training approach, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP '20, 2020, pp. 9006–9017.</p>
        <p>[19] G. Patrini, A. Rozza, A. K. Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '17, 2017, pp. 2233–2241.</p>
        <p>[20] D. Hendrycks, M. Mazeika, D. Wilson, K. Gimpel, Using trusted data to train deep networks on labels corrupted by severe noise, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, 2018, pp. 10477–10486.</p>
        <p>[21] G. Zheng, A. H. Awadallah, S. Dumais, Meta label correction for noisy label learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35 of AAAI '21, 2021.</p>
        <p>[22] X. Li, Y.-Y. Wang, A. Acero, Extracting structured information from user queries with semi-supervised conditional random fields, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, 2009, pp. 572–579.</p>
        <p>[23] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, 2009, pp. 1003–1011.</p>
        <p>[24] F. Brahman, V. Shwartz, R. Rudinger, Y. Choi, Learning to rationalize for nonmonotonic reasoning with distant supervision, in: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI '21, AAAI Press, 2021, pp. 12592–12601.</p>
        <p>[25] J. Shen, W. Qiu, Y. Meng, J. Shang, X. Ren, J. Han, TaxoClass: Hierarchical multi-label text classifica-</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harsola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          , Parabel:
          <article-title>Partitioned label trees for extreme classification with application to dynamic search advertising</article-title>
          ,
          <source>in: Proceedings of the 2018 World Wide Web Conference, WWW '18</source>
          ,
          <year>2018</year>
          , p.
          <fpage>993</fpage>
          -
          <lpage>1002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khandagale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Babbar</surname>
          </string-name>
          ,
          <article-title>Bonsai - diverse and shallow trees for extreme multi-label classification</article-title>
          ,
          <year>2019</year>
          . arXiv:1904.08249.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <article-title>Taming Pretrained Transformers for Extreme Multi-Label Text Classification</article-title>
          ,
          <source>KDD '20</source>
          ,
          <year>2020</year>
          , p.
          <fpage>3163</fpage>
          -
          <lpage>3171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dahiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Saini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Deepxml: A deep extreme multi-label learning framework applied to short text documents</article-title>
          ,
          <source>in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21</source>
          ,
          <year>2021</year>
          , p.
          <fpage>31</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dahiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Saini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          , Decaf: Deep extreme classi-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>