Flexible Job Classification with Zero-Shot Learning

Thom Lake
Indeed

Abstract
Using a taxonomy to organize information requires classifying objects (documents, images, etc.) with appropriate taxonomic classes. The flexible nature of zero-shot learning is appealing for this task because it allows classifiers to naturally adapt to taxonomy modifications. This work studies zero-shot multi-label document classification with fine-tuned language models under realistic taxonomy expansion scenarios in the human resource domain. Experiments show that zero-shot learning can be highly effective in this setting. When controlling for training data budget, zero-shot classifiers achieve a 12% relative increase in macro-AP when compared to a traditional multi-label classifier trained on all classes. Counterintuitively, these results suggest that in some settings it would be preferable to adopt zero-shot techniques and spend resources annotating more documents with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional classification techniques. Additional experiments demonstrate that adopting the well-known filter/re-rank decomposition from the recommender systems literature can significantly reduce the computational burden of high-performance zero-shot classifiers, empirically resulting in a 98% reduction in computational overhead for only a 2% relative decrease in performance. The evidence presented here demonstrates that zero-shot learning has the potential to significantly increase the flexibility of taxonomies and highlights directions for future research.

Keywords
Taxonomy, zero-shot learning, multi-label classification, natural language processing

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18-23, 2022, Seattle, USA
tlake@indeed.com (T. Lake)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Taxonomies used to organize information must frequently be adapted to reflect external changes such as the introduction of new markets, the creation of specialized segments, or the addition of new features. This is especially true in the human resource (HR) domain, where new job, skill, and license categories must be created to accommodate a constantly evolving marketplace. Unfortunately, the techniques commonly used to label real-world objects (documents, images, etc.) with taxonomy classes are tightly coupled to the specific set of classes available when the classification system is developed. In order to add new classes, rule-based systems [1, 2] require the creation of new rules, and supervised machine learning techniques [3, 4, 5, 6] require labeling data with the new classes and training a new model. These requirements make operationalizing modifications of the underlying taxonomy cumbersome.

Unlike traditional supervised classification techniques, zero-shot learning (ZSL) techniques are able to generalize to new classes with minimal guidance [7, 8]. Applying ZSL to taxonomic classification therefore has the potential to increase the flexibility of organizational data structures while retaining the performance benefits of machine learning techniques.

Within this context, this work empirically studies the performance of ZSL techniques for document classification in the HR domain. Experiments designed to simulate realistic taxonomy expansion scenarios show that ZSL is highly effective, outperforming standard supervised classifiers in low-resource settings. Further experiments demonstrate that adopting well-known techniques can significantly reduce the computational overhead of high-performance zero-shot classifiers.

2. Related Work

There is a large body of previous work on ZSL [7, 8, 9, 10]. Early work in the computer vision domain [11] represented classes with pre-trained word embeddings [12] and trained models to align them with image embeddings in a shared vector space. Much of the subsequent work in ZSL has followed a similar embedding-based approach [13, 14, 15, 16].

A common assumption in ZSL is that the sets of train and test classes are disjoint. Noting that this is somewhat unrealistic, [17] proposed generalized zero-shot learning (GZSL), which assumes training classes are a strict subset of test classes [18, 19]. As this work is primarily concerned with classifiers that can adapt to a changing taxonomy, experiments are conducted within the GZSL framework.

While there has been less explicit research on ZSL for NLP, as noted by [20], most techniques for ad-hoc document retrieval [21, 22] can be leveraged for zero-shot document classification by treating the labels as queries. In [23], a standard classifier was applied to a combined representation of a document and label, produced with word embeddings or LSTMs [24].

Figure 1: Graphical representation of models used in experiments. Traditional multi-label classifiers (left) output a probability for each class. Zero-shot classifiers (right) model compatibility between an input and class description.
[20] applied convolutional neural networks [25] over features derived from interactions between token and class embeddings.

Following the rise of transfer learning via fine-tuning for NLP [26, 27], recent approaches to zero-shot document classification have adopted similar techniques. In [28], zero-shot document classification was formulated as an entailment task. Pre-trained language models were either fine-tuned on a dataset containing a subset of classes, or on datasets for natural language inference (NLI) [29]. An identical entailment formulation was used in [30], which studied zero-shot transfer between datasets. Pre-trained language models were also used for zero-shot document classification in [31], which explored the use of cloze-style templates for zero-shot and few-shot document classification.

Autoregressive neural language models have been shown to possess some ZSL capabilities with proper prompting [32]. Significantly larger models have improved these results [33]. However, the computational demands of such large models make them unsuitable for most practical applications.

The benefit of fine-tuning for entailment-based ZSL was studied in [34]. Their experiments showed that fine-tuning on generic NLI datasets often results in worse ZSL performance, and they hypothesize this is due to models exploiting lexical patterns and other spurious statistical cues [35, 36]. Experimental results presented here complement those in [34], suggesting their observations do not apply when even a small amount of task-specific training data is available.

The closely related work of [37] also studied GZSL for multi-label text classification. Their focus was on understanding the role of incorporating knowledge of the hierarchical label structure into models in both the few-shot and zero-shot settings. Instead, the work presented here specifically designs experiments to better understand the ability of standard GZSL techniques to generalize in realistic zero-shot settings when orders of magnitude less background training data are available.

3. Problem Formulation

Taxonomy classification is formulated as a multi-label text classification problem. Let Y be a set of classes, x_i ∈ X a document, and y_i ∈ {0, 1}^|Y| a corresponding binary label vector, where y_ij = 1 if document x_i is labeled with class j and 0 otherwise. A common probabilistic approach to multi-label text classification [38] is to assume conditional independence among labels,

    p(y_i | x_i) = ∏_j p(y_ij | x_i) = ∏_j q_ij^{y_ij} (1 - q_ij)^{1 - y_ij},    (1)

and approximate the parameters of the conditional Bernoulli distributions, 0 ≤ q_ij ≤ 1, using some model. A common choice is q_ij ≈ σ(r_ij) = (1 + e^{-r_ij})^{-1}, where

    r_ij = w_j^T f_θ(x_i),    (2)

w_j ∈ R^d is a vector of parameters, and f_θ : X → R^d is a function with parameters θ, e.g., a transformer neural network [39]. In the remainder, the above is simply referred to as the standard multi-label model.

Because each class j is associated with a distinct vector of parameters w_j in (2), the multi-label model is unable to generalize to classes not observed during training. To side-step this issue, ZSL assumes the existence of a textual class description z_j ∈ X for each class j ∈ Y, which can be leveraged to break the explicit dependency between model parameters and classes. This work considers two standard architectures from the literature [40], described below and depicted graphically in Figure 1, which can incorporate class descriptions. Models are designed to be relatively simple, reflective of common best practices, and as similar as possible in order to avoid confounding and draw clear inferences about general performance patterns.
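As a concrete illustration, the standard multi-label model of Equations (1) and (2) reduces to a per-class linear scorer on top of a shared document encoder. The sketch below is illustrative only and not the paper's implementation: the random document embeddings stand in for the fine-tuned transformer encoder f_θ, and the helper name `multilabel_probs` is hypothetical.

```python
import numpy as np

def sigmoid(r):
    # q_ij = sigma(r_ij) = (1 + exp(-r_ij))^-1, Equation (2)
    return 1.0 / (1.0 + np.exp(-r))

def multilabel_probs(doc_embeddings, W):
    """Standard multi-label model: r_ij = w_j^T f_theta(x_i).

    doc_embeddings: (num_docs, d) outputs of the encoder f_theta.
    W: (num_classes, d), one parameter vector w_j per class.
    Returns q: (num_docs, num_classes) independent Bernoulli parameters.
    """
    r = doc_embeddings @ W.T  # scores r_ij
    return sigmoid(r)         # q_ij in (0, 1)

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 16))  # stand-in for f_theta(x_i)
W = rng.normal(size=(10, 16))    # one row w_j per class
q = multilabel_probs(docs, W)
assert q.shape == (4, 10) and np.all((q > 0) & (q < 1))
```

Because every class j owns its own row of W, adding a class requires new parameters and newly labeled data, which is exactly the rigidity the zero-shot architectures described in this section avoid.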
Bi-Encoder: This model replaces the vector w_j with the output of an additional parameterized function taking class descriptions as input,

    r_ij = f_θ1(z_j)^T f_θ2(x_i).

Cross-Encoder: A parameterized function that takes as input a concatenated document and class description (concatenation denoted by ⊔). The model has a single additional parameter vector w ∈ R^d,

    r_ij = w^T f_θ(x_i ⊔ z_j).

3.1. Loss

Given a dataset D = {(x_1, y_1), ..., (x_|D|, y_|D|)}, model parameters can be optimized by minimizing the negative log-likelihood

    L(D) = |D|^{-1} ∑_i ℓ(i),

where

    ℓ(i) = -∑_j (y_ij log σ(r_ij) + (1 - y_ij) log(1 - σ(r_ij))).    (3)

Because the zero-shot approaches condition on class descriptions, computing the sum over each class in Equation (3) requires |Y| forward passes through the model. This results in significant computational overhead when training. To alleviate this issue, the commonly used negative sampling [12] strategy is used to approximate the loss ℓ(·),

    ℓ̂(i) = -log [ e^{r_ij} / (e^{r_ij} + ∑_{i'} e^{r_{i'j}} + ∑_{j'} e^{r_{ij'}}) ],    (4)

where i', j, j' are uniformly sampled such that y_ij = 1 and y_{i'j} = y_{ij'} = 0. The numbers of negative documents i' and negative classes j' are treated as hyper-parameters. Initial experiments also explored a Bernoulli rather than a categorical version of ℓ̂(·), but found the categorical version performed better.

4. Experiments

Experiments are designed to simulate real-world taxonomy expansion driven by domain experts. At a high level, all experiments follow the same process.

1. Modify or remove classes to obtain the Source Taxonomy. Critically, this is done in a way that incorporates the underlying structure of the taxonomy to ensure coherent modifications, rather than simply removing classes at random.
2. Train classifiers using a dataset of instances labeled with classes from the Source Taxonomy.
3. Expand the Source Taxonomy by undoing the modifications from Step 1 to obtain the Target Taxonomy.
4. Evaluate classifiers on a new dataset of instances labeled with classes from the Target Taxonomy.

Details of the taxonomy, datasets, and expansion types used in this work are given below.

4.1. Indeed Occupations

Indeed's internal U.S. occupation taxonomy was used as a representative source of structured knowledge. The taxonomy contains over a thousand occupations arranged hierarchically in a forest-like directed acyclic graph (DAG), with root nodes being general occupations, e.g., Healthcare Occupations, and leaf nodes being the most specific, e.g., Nurse Practitioners. In addition to their placement within the hierarchy, occupations are also associated with a natural language name and definition. Data formats are given in Table 1.

Table 1
The data representations used in this work. Jobs and occupations are converted to strings composed of multiple fields.

    Object       Text
    Job          Title: ..., Employer: ..., Description: ...
    Occupation   Name: ..., Definition: ...

Each job posted on Indeed is labeled with one or more occupations. The number of jobs per occupation for evaluation data is given in Table 2. Jobs were selected using stratified sampling by occupation. In particular, for each occupation, N jobs labeled with that occupation were randomly sampled without replacement. It should be noted that since jobs can be labeled with multiple occupations, this sampling strategy only guarantees that datasets contain at least N jobs per occupation, not that there are exactly N jobs per occupation. The same procedure was used to sample disjoint subsets of jobs for training, validation, and testing.

Table 2
Test jobs by number of labels. Five jobs were sampled for each occupation for evaluation.

    Labels   Jobs    Percent
    1        2,527   55%
    2        1,567   34%
    3          393    9%
    4           68    1%
    Total    4,555   100%
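The sampled objective in Equation (4) is a categorical cross-entropy in which one positive (document, class) pair competes against sampled negative documents for the same class and sampled negative classes for the same document. The following is a minimal numpy sketch, not code from the paper; the helper name `sampled_loss` is hypothetical.

```python
import numpy as np

def sampled_loss(r_pos, r_neg_docs, r_neg_classes):
    """Negative-sampling approximation of Equation (4).

    r_pos: scalar score r_ij for a positive (document, class) pair.
    r_neg_docs: scores r_{i'j} for sampled documents not labeled with class j.
    r_neg_classes: scores r_{ij'} for sampled classes not labeling document i.
    """
    logits = np.concatenate(([r_pos], r_neg_docs, r_neg_classes))
    logits = logits - logits.max()  # stabilize the softmax
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]          # -log p(positive pair)

loss = sampled_loss(2.0, np.array([0.1, -0.5]), np.array([0.3, 0.0, -1.0]))
assert loss > 0.0
```

Only 1 + (number of negatives) forward passes are needed per positive pair, instead of the |Y| passes required by the exact loss in Equation (3).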
Figure 2: Graphical representation of Refine (top) and Extend (bottom) taxonomy expansion operations. Each node represents a class. Models are evaluated on all classes. White and teal classes are observed during training. Magenta classes are not observed during training. Teal classes replace their children during training.

4.2. Expansion Operations

The two taxonomy expansion operations considered are described below and depicted graphically in Figure 2.

Refine: This setting simulates the scenario where a subset of leaf classes are subdivided into more fine-grained classes. This sort of refinement can occur when gaps in the taxonomy surface after use, or in situations where the set of initial classes naturally diversifies over time. For example, an academic field of study may subdivide into more specialized subfields as it matures. Zero-shot classifiers in this setting must generalize to classes that are more specific versions of those encountered during training.

To construct datasets in this setting, a random leaf class is selected. Any appearances of this class or its siblings are replaced with the parent class. This process is repeated until a fixed percentage of leaf classes have been replaced.

Extend: This setting simulates the scenario where a set of classes are added from an unrelated domain. This situation can occur when new use cases surface that require classes that were not previously necessary. For example, if an e-commerce company that had historically only sold goods like household items and clothing began offering groceries, the previous product taxonomy would not be useful for organizing the new items. Zero-shot classifiers in this setting must generalize to classes that are significantly different from those encountered during training.

To construct datasets in this setting, a random root class is selected. Any appearances of this class or its descendants are removed. This process is repeated until a fixed percentage of classes have been removed. At the end of the process, any document that no longer has any labels is removed from the training dataset.

4.3. Evaluation

Performance is evaluated in terms of a model's ability to rank relevant classes for a particular document, and to rank documents with respect to a class. In both cases, average precision (AP) is used to measure the quality of a predicted ordering relative to ground truth labels. The difference is whether AP is computed over all labels and averaged over documents, typically referred to as label ranking average precision (LRAP) [41], or computed over all documents and averaged over labels, typically referred to as macro-AP. Formally, for matrices Y ∈ {0, 1}^{|D| x |Y|} of ground truth binary labels and R ∈ R^{|D| x |Y|} of predicted scores,

    LRAP = |D|^{-1} ∑_i AP(Y_{i,:}, R_{i,:})
    macro-AP = |Y|^{-1} ∑_j AP(Y_{:,j}, R_{:,j}),

where, for vectors y ∈ {0, 1}^d and r ∈ R^d,

    AP(y, r) = (1 / ∑_i y_i) ∑_i y_i * |{k | y_k = 1 ∧ r_k ≥ r_i}| / |{k | r_k ≥ r_i}|.

4.4. Training Details

Following modern practices in NLP, models consist of a pre-trained transformer [39] backbone which is fine-tuned [26, 27] along with any additional parameters. All models use BERT-base [27] as the backbone language model. Hyper-parameters were manually tuned on a small subset of the training data using the multi-label model and fixed for all models and experiments. The Adam [42] optimizer was used with a learning rate of 2e-5 for pre-trained parameters and 2e-4 for randomly initialized parameters. Learning rate warm-up was applied for the first 10% of the updates, and the learning rate was then linearly decayed to zero. The maximum gradient norm was clipped to 10 [43]. All models were trained for 20 epochs with a batch size of 64. Models were evaluated after each epoch, and the final model was selected based on LRAP on the validation dataset. The bi-encoder and cross-encoder models were trained using negative sampling with 8 negative classes and 4 negative inputs per positive training document (Equation 4). Experiments utilized the PyTorch [44] and Huggingface Transformers [45] libraries. All hyper-parameters not listed explicitly above were left at their default values. Experiments were conducted using a single NVIDIA Tesla V100 GPU with 16GB of memory.

Table 3
LRAP and macro-AP by model, class coverage, minimum documents per class, and number of training documents in the extend setting. Models denoted by † do not observe any task-specific training data.

    Model                  Class Coverage   Documents Per Class   Documents   LRAP    macro-AP
    Multi-Label            100%             3                     2733        0.569   0.496
    Multi-Label            50%              5                     2500        0.294   0.249
    Bi-Encoder             50%              5                     2500        0.362   0.349
    Cross-Encoder          50%              5                     2500        0.645   0.553
    Multi-Label            100%             4                     3614        0.638   0.564
    Multi-Label            75%              5                     3628        0.493   0.438
    Bi-Encoder             75%              5                     3628        0.480   0.447
    Cross-Encoder          75%              5                     3628        0.654   0.590
    Multi-Label            100%             5                     4555        0.697   0.635
    Bi-Encoder             100%             5                     4555        0.570   0.521
    Cross-Encoder          100%             5                     4555        0.682   0.613
    Cross-Encoder (NSP)†   -                -                     -           0.419   0.242
    TF-IDF†                -                -                     -           0.397   0.294

5. Results

5.1. Generalizing to Novel Classes

Performance was evaluated at different percentages of classes observed during training (coverage) for both the refine and extend expansion operations. LRAP and macro-AP are shown in Figure 3. The cross-encoder classifier was robust to both taxonomy refinement and expansion. Minimal performance degradation was observed with decreasing coverage, even in settings where over 50% of the classes are new and approximately 60% of the jobs are labeled with a new occupation. The bi-encoder performed significantly worse than the cross-encoder. This observation is consistent with prior work in the retrieval domain [40, 46]. However, the bi-encoder also suffered more performance degradation with decreasing coverage.
For example, the bi-encoder's macro-AP dropped by 36% when 50% of the classes were new (extend), whereas the cross-encoder's macro-AP only decreased by 5%. Performance of the multi-label classifier degraded rapidly as coverage decreased, as it is unable to generalize to classes not observed during training.

Figure 3: LRAP (top) and macro-AP (bottom) under different taxonomy expansion operations. Models are identified by color and symbol. Line styles reflect the expansion operation, with dashed lines for refinement and solid lines for extension.

5.2. Learning on a Budget

Because the extend operation omits labels rather than relabeling them, zero-shot models had access to less training data in the previous experiments. To better understand the trade-off between fine-tuning and ZSL, experiments were conducted which controlled for the amount of data available for training. In particular, multi-label classifiers were trained on datasets where the number of documents was similar to the ZSL approaches, but fewer documents per class were observed. Full results are presented in Table 3. The ZSL cross-encoder with 50% coverage and five documents per class resulted in a 13% relative increase in LRAP over the multi-label classifier with 100% coverage and three documents per class (similar training set size). This result was unexpected, as it suggests that, given a small document labeling budget (<4K here), in some settings it would be preferable to adopt ZSL and spend resources annotating more documents with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional classifiers.

Further analysis of zero-shot performance is given in Table 4, which presents macro-AP by root class for unobserved classes in the extend setting with 50% coverage. Despite not being previously exposed to any classes from these domains, in all cases the cross-encoder outperformed the multi-label classifier explicitly trained on these classes.

Table 4
Zero-shot macro-AP for novel domains in the challenging extend scenario with 50% class coverage. † Because the multi-label classifier is not capable of zero-shot generalization, it is trained with 100% class coverage, but fewer documents per class.

    Domain                                      Classes   Bi-Encoder   Cross-Encoder   Multi-Label†
    Personal Service                            28        0.273        0.642           0.590
    Food & Beverage                             25        0.245        0.619           0.555
    Cleaning & Grounds Maintenance              25        0.277        0.584           0.563
    Marketing, Advertising & Public Relations   28        0.241        0.579           0.541
    Repair, Maintenance & Installation          34        0.276        0.533           0.494
    Healthcare                                  156       0.250        0.532           0.584
    Protective & Security                       27        0.302        0.529           0.509
    Construction & Extraction                   54        0.265        0.527           0.465
    Architecture & Engineering                  36        0.207        0.474           0.399
    Sales, Retail & Customer Support            31        0.244        0.472           0.453
    Supply Chain & Logistics                    32        0.243        0.457           0.435
    New Classes                                 478       0.251        0.534           0.523
    Old Classes                                 433       0.457        0.575           0.465
    All Classes                                 911       0.349        0.553           0.496
    Training Documents                                    2500         2500            2733
    Documents Per Class                                   5            5               3
    Class Coverage                                        50%          50%             100%

5.3. Efficient Zero-Shot Inference

As noted previously, there is a significant computational cost associated with training the transformer-based zero-shot learners due to the need to process each label for each document. While this cost can be amortized for the bi-encoder at inference time by pre-computing label embeddings, this is not possible for the cross-encoder architecture. Several works explore the architecture space between bi-encoders and cross-encoders to obtain a better trade-off between performance and latency [40, 46]. A simpler technique was explored in this work, inspired by the common decomposition of recommender systems into separate candidate retrieval and re-ranking [47] phases. In the first phase, the more efficient bi-encoder was used to identify a small subset of potentially relevant candidate classes.

Figure 4: LRAP for two-phase zero-shot classification for candidate set sizes from 2 to 64. Dashed lines depict the performance of standalone models.
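The filter/re-rank decomposition can be sketched as follows. This is an illustrative numpy sketch with hypothetical names, not the paper's implementation: in the experiments the cheap first-phase scores come from the bi-encoder and the expensive second-phase scorer is the cross-encoder.

```python
import numpy as np

def two_phase_scores(cheap_scores, expensive_scorer, k):
    """Filter/re-rank: keep the top-k candidates from a cheap first-phase
    scorer, re-score only those with the expensive model, and implicitly
    assign every other class a score of zero.

    cheap_scores: (num_classes,) first-phase (e.g., bi-encoder) scores.
    expensive_scorer: callable mapping a class index to a refined score.
    """
    candidates = np.argsort(-cheap_scores)[:k]  # top-k candidate classes
    final = np.zeros_like(cheap_scores)
    for j in candidates:                        # only k expensive calls
        final[j] = expensive_scorer(j)
    return final

cheap = np.array([0.9, 0.1, 0.8, 0.2, 0.7])
final = two_phase_scores(cheap, lambda j: 10.0 + j, k=2)
assert np.count_nonzero(final) == 2             # only the top-2 were re-scored
```

With roughly 900 classes in the taxonomy, re-scoring only 16 candidates replaces on the order of 900 cross-encoder passes per document with 16, consistent with the roughly 98% reduction in overhead reported in the abstract.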
This smaller set of candidates was then evaluated with the more computationally demanding, but higher-performance, cross-encoder. Classes not selected in the first phase were implicitly assumed to receive a score of zero. Results are shown in Figure 4 for candidate set sizes from 2 to 64. Scoring only 16 candidates resulted in a small drop in LRAP (-2%) while yielding a nearly 98% reduction in computational overhead.

6. Conclusion and Future Work

Taxonomies are widely used to organize knowledge and can easily incorporate important information from domain experts that may be difficult to obtain in a purely automated fashion. However, the ability to associate classes with real-world objects can be a bottleneck for the rapid expansion of taxonomies. Experiments presented here demonstrate that modern zero-shot classification techniques can sidestep this issue by classifying objects with novel classes using only minimal human guidance.
Better understanding and overcoming the failure modes of the bi-encoder architecture would result in more efficient systems capable of scaling to larger taxonomies, either as stand-alone systems or as part of a multi-phase approach such as that described in Section 5.3. Related work in the retrieval setting suggests that adopting pretext tasks [48] better aligned with the downstream task of interest could alleviate these issues [49]. Alternatively, more elaborate negative sampling strategies [50, 51] could improve both zero-shot techniques studied in this work, and close any observed gaps between zero-shot learners and traditional classifiers. Future work should explore zero-shot capabilities in more sophisticated knowledge bases (ontologies, knowledge graphs, etc.), a larger variety of class types, and different domains. Lastly, further experimentation is needed to fully explain the observed differences between the results presented here and those in [34], in order to better understand the success and failure modes of entailment-based ZSL.

Acknowledgments

Valuable insights, suggestions, and feedback were provided by numerous individuals at Indeed. The author would especially like to thank Suyi Tu, Josh Levy, Ethan Handel, Arvi Sreenivasan, and Donal McMahon.

References

[1] L. Chiticariu, Y. Li, F. Reiss, Rule-based information extraction is dead! Long live rule-based information extraction systems!, Empirical Methods in Natural Language Processing (2013).
[2] M. Kejriwal, R. Shao, P. Szekely, Expert-guided entity extraction using expressive rules, ACM SIGIR Conference on Research and Development in Information Retrieval (2019).
[3] S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, Empirical Methods in Natural Language Processing (2007).
[4] I. Karadeniz, A. Özgür, Linking entities through an ontology using word embeddings and syntactic re-ranking, BMC Bioinformatics (2019).
[5] T. Lee, Z. Wang, H. Wang, S.-w. Hwang, Attribute extraction and scoring: A probabilistic approach, IEEE International Conference on Data Engineering (2013).
[6] R. Ghani, K. Probst, Y. Liu, M. Krema, A. Fano, Text mining for product attribute extraction, ACM SIGKDD Explorations Newsletter (2006).
[7] H. Larochelle, D. Erhan, Y. Bengio, Zero-data learning of new tasks, AAAI Conference on Artificial Intelligence (2008).
[8] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic representation: Dataless classification, AAAI Conference on Artificial Intelligence (2008).
[9] Y. Xian, B. Schiele, Z. Akata, Zero-shot learning: The good, the bad and the ugly, IEEE Conference on Computer Vision and Pattern Recognition (2017).
[10] J. Chen, Y. Geng, Z. Chen, I. Horrocks, J. Z. Pan, H. Chen, Knowledge-aware zero-shot learning: Survey and perspective, International Joint Conference on Artificial Intelligence (2021).
[11] R. Socher, M. Ganjoo, C. D. Manning, A. Ng, Zero-shot learning through cross-modal transfer, Advances in Neural Information Processing Systems (2013).
[12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems (2013).
[13] B. Romera-Paredes, P. Torr, An embarrassingly simple approach to zero-shot learning, International Conference on Machine Learning (2015).
[14] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, B. Schiele, Latent embeddings for zero-shot classification, IEEE Conference on Computer Vision and Pattern Recognition (2016).
[15] R. Qiao, L. Liu, C. Shen, A. Van Den Hengel, Less is more: Zero-shot learning from online textual documents with noise suppression, IEEE Conference on Computer Vision and Pattern Recognition (2016).
[16] L. Zhang, T. Xiang, S. Gong, Learning a deep embedding model for zero-shot learning, IEEE Conference on Computer Vision and Pattern Recognition (2017).
[17] W.-L. Chao, S. Changpinyo, B. Gong, F. Sha, An empirical study and analysis of generalized zero-shot learning for object recognition in the wild, European Conference on Computer Vision (2016).
[18] S. Liu, M. Long, J. Wang, M. I. Jordan, Generalized zero-shot learning with deep calibration network, Advances in Neural Information Processing Systems (2018).
[19] F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, A review of generalized zero-shot learning methods, arXiv preprint arXiv:2011.08641 (2020).
[20] C. Li, W. Zhou, F. Ji, Y. Duan, H. Chen, A deep relevance model for zero-shot document filtering, Association for Computational Linguistics (2018).
[21] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009).
[22] X. Wei, W. B. Croft, LDA-based document models for ad-hoc retrieval, ACM SIGIR Conference on Research and Development in Information Retrieval (2006).
[23] P. K. Pushp, M. M. Srivastava, Train once, test anywhere: Zero-shot learning for text classification, arXiv preprint arXiv:1712.05972 (2017).
[24] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation (1997).
[25] Y. Kim, Convolutional neural networks for sentence classification, Empirical Methods in Natural Language Processing (2014).
[26] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, Association for Computational Linguistics (2018).
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, North American Chapter of the Association for Computational Linguistics (2019).
[28] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, Empirical Methods in Natural Language Processing (2019).
[29] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, North American Chapter of the Association for Computational Linguistics (2018).
[30] K. Halder, A. Akbik, J. Krapac, R. Vollgraf, Task-aware representation of sentences for generic text classification, International Conference on Computational Linguistics (2020).
[31] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, Conference of the European Chapter of the Association for Computational Linguistics (2021).
[32] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog (2019).
[33] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (2020).
[34] T. Ma, J.-G. Yao, C.-Y. Lin, T. Zhao, Issues with entailment-based zero-shot text classification, Association for Computational Linguistics (2021).
[35] S. Feng, E. Wallace, J. Boyd-Graber, Misleading failures of partial-input baselines, Association for Computational Linguistics (2019).
[36] T. Niven, H.-Y. Kao, Probing neural network comprehension of natural language arguments, Association for Computational Linguistics (2019).
[37] I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malakasiotis, N. Aletras, I. Androutsopoulos, An empirical study on large-scale multi-label text classification including few and zero-shot labels, Empirical Methods in Natural Language Processing (2020).
[38] K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[40] S. Humeau, K. Shuster, M.-A. Lachaux, J. Weston, Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring, International Conference on Learning Representations (2019).
[41] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, Data Mining and Knowledge Discovery Handbook (2009).
[42] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (2015).
[43] J. Zhang, T. He, S. Sra, A. Jadbabaie, Why gradient clipping accelerates training: A theoretical justification for adaptivity, International Conference on Learning Representations (2019).
[44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems (2019).
[45] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, Empirical Methods in Natural Language Processing: System Demonstrations (2020).
[46] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, ACM SIGIR Conference on Research and Development in Information Retrieval (2020).
[47] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, ACM Conference on Recommender Systems (2016).
[48] L. Jing, Y. Tian, Self-supervised visual feature learning with deep neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2020).
[49] W.-C. Chang, X. Y. Felix, Y.-W. Chang, Y. Yang, S. Kumar, Pre-training tasks for embedding-based large-scale retrieval, International Conference on Learning Representations (2019).
[50] J. Weston, S. Bengio, N. Usunier, WSABIE: Scaling up to large vocabulary image annotation, International Joint Conference on Artificial Intelligence (2011).
[51] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, ACM SIGIR Conference on Research and Development in Information Retrieval (2021).