<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zero-Shot</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>23</lpage>
      <abstract>
<p>Using a taxonomy to organize information requires classifying objects (documents, images, etc.) with appropriate taxonomic classes. The flexible nature of zero-shot learning is appealing for this task because it allows classifiers to naturally adapt to taxonomy modifications. This work studies zero-shot multi-label document classification with fine-tuned language models under realistic taxonomy expansion scenarios in the human resource domain. Experiments show that zero-shot learning can be highly effective in this setting. When controlling for training data budget, zero-shot classifiers achieve a 12% relative increase in macro-AP when compared to a traditional multi-label classifier trained on all classes. Counterintuitively, these results suggest that in some settings it would be preferable to adopt zero-shot techniques and spend resources annotating more documents with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional classification techniques. Additional experiments demonstrate that adopting the well-known filter/re-rank decomposition from the recommender systems literature can significantly reduce the computational burden of high-performance zero-shot classifiers, empirically resulting in a 98% reduction in computational overhead for only a 2% relative decrease in performance. The evidence presented here demonstrates that zero-shot learning has the potential to significantly increase the flexibility of taxonomies and highlights directions for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Taxonomy</kwd>
        <kwd>zero-shot learning</kwd>
        <kwd>multi-label classification</kwd>
        <kwd>natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Taxonomies used to organize information must frequently be adapted to reflect external changes such as the introduction of new markets, the creation of specialized segments, or the addition of new features. This is especially true in the human resource (HR) domain, where new job, skill, and license categories must be created to accommodate a constantly evolving marketplace. Unfortunately, the techniques commonly used to label real-world objects (documents, images, etc.) with taxonomy classes are tightly coupled to the specific set of classes available when the classification system is developed. In order to add new classes, rule-based systems [1, 2] require the creation of new rules, and supervised machine learning techniques [3, 4, 5, 6] require labeling data with the new classes and training a new model. These requirements make operationalizing modifications of the underlying taxonomy cumbersome.</p>
      <p>Unlike traditional supervised classification techniques, zero-shot learning (ZSL) techniques are able to generalize to new classes with minimal guidance [7, 8]. Applying ZSL to taxonomic classification has the potential to increase the flexibility of organizational data structures while retaining the performance benefits of machine learning techniques.</p>
      <p>Within this context, this work empirically studies the application of ZSL to taxonomic classification in the HR domain. Experiments designed to simulate realistic taxonomy expansion scenarios show that ZSL can be highly effective in this setting.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related</title>
    </sec>
    <sec id="sec-3">
      <title>Work</title>
      <p>There is a large body of previous work on ZSL [7, 8, 9, 10].</p>
      <p>Early work in the computer vision domain [11] represented classes with pre-trained word embeddings [12] and trained models to align them with image embeddings in a shared vector space. Much of the subsequent work in ZSL has followed a similar embedding-based approach.</p>
      <p>A common assumption in ZSL is that the sets of train and test classes are disjoint. Noting that this is somewhat unrealistic, [17] proposed generalized zero-shot learning (GZSL), which assumes training classes are a strict subset of test classes [18, 19]. As this work is primarily concerned with classifiers that can adapt to a changing taxonomy, experiments are conducted within the GZSL framework.</p>
      <p>While there has been less explicit research on ZSL for NLP, as noted by [20], most techniques for ad-hoc document retrieval [21, 22] can be leveraged for zero-shot document classification by treating the labels as queries. In [23], a standard classifier was applied to a combined representation of a document and label, produced with LSTMs [24]. Convolutional neural networks [25] have also been applied over features derived from interactions between token and class embeddings.</p>
      <p>[Figure 1: The bi-encoder and cross-encoder architectures. Each takes a document and a class description as input and produces a match probability.]</p>
      <p>Following the rise of transfer learning via fine-tuning for NLP [26, 27], recent approaches to zero-shot document classification have adopted similar techniques.</p>
      <p>In [28], zero-shot document classification was formulated as an entailment task. Pre-trained language models were either fine-tuned on a dataset containing a subset of classes, or on datasets for natural language inference (NLI) [29]. An identical entailment formulation was used in [30], which studied zero-shot transfer between datasets.</p>
      <p>Pre-trained language models were also used for zero-shot document classification in [31], which explored the use of cloze-style templates for zero-shot and few-shot document classification.</p>
      <p>Autoregressive neural language models have been
shown to possess some ZSL capabilities with proper
prompting [32]. Significantly larger models have
improved these results [33]. However, the computational
demands of such large models make them unsuitable for
most practical applications.</p>
      <p>The benefit of fine-tuning for entailment-based ZSL was studied in [34]. Their experiments showed that fine-tuning on generic NLI datasets often results in worse ZSL performance, and they hypothesize this is due to models exploiting lexical patterns and other spurious statistical cues [35, 36]. Experimental results presented here complement those in [34], suggesting their observations do not apply when even a small amount of task-specific training data is available.</p>
      <p>The closely related work of [37] also studied GZSL for multi-label text classification. Their focus was on understanding the role of incorporating knowledge of the hierarchical label structure into models in both the few-shot and zero-shot settings. Instead, the work presented here specifically designs experiments to better understand the ability of standard GZSL techniques to adapt to taxonomy changes in a setting where orders of magnitude less background training data are available.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Problem Formulation</title>
      <p>Taxonomy classification is formulated in terms of a multi-label text classification problem. Let $\mathcal{C}$ be a set of classes, $x \in \mathcal{X}$ a document, and $\mathbf{y} \in \{0, 1\}^{|\mathcal{C}|}$ a corresponding binary label vector, where $y_c = 1$ if document $x$ is labeled with class $c$ and $y_c = 0$ otherwise. A common probabilistic approach to multi-label text classification [38] is to assume conditional independence among labels,
$$p(\mathbf{y} \mid x) = \prod_{c} p(y_c \mid x) = \prod_{c} p_c^{y_c} (1 - p_c)^{1 - y_c}, \tag{1}$$
and approximate the parameters of the conditional Bernoulli distributions, $0 \le p_c \le 1$, using some model. A common choice is $p_c \approx \sigma(s_c) = (1 + e^{-s_c})^{-1}$, where
$$s_c = \mathbf{w}_c^\top f_\theta(x), \tag{2}$$
$\mathbf{w}_c \in \mathbb{R}^d$ is a vector of parameters, and $f_\theta : \mathcal{X} \to \mathbb{R}^d$ is a function with parameters $\theta$, e.g., a transformer neural network [39]. In the remainder, the above is simply referred to as the standard multi-label model.</p>
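      <p>To make the standard multi-label model concrete, the following is a minimal PyTorch sketch. The encoder checkpoint and the use of [CLS] pooling are illustrative assumptions, not details prescribed by the paper.</p>
      <preformat>
import torch.nn as nn
from transformers import AutoModel

class MultiLabelClassifier(nn.Module):
    """Standard multi-label model: s_c = w_c^T f_theta(x), p_c = sigmoid(s_c)."""

    def __init__(self, num_classes, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # f_theta
        # One parameter vector w_c per class, stacked into a single linear layer.
        self.heads = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        doc_vec = out.last_hidden_state[:, 0]  # [CLS] pooling of f_theta(x)
        return self.heads(doc_vec)             # one logit s_c per class
      </preformat>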
      <p>Because each class $c$ is associated with a distinct vector of parameters $\mathbf{w}_c$ in (2), the multi-label model is unable to generalize to classes not observed during training. To side-step this issue, ZSL assumes the existence of a textual class description $d_c$ for each class $c \in \mathcal{C}$, which can be leveraged to break the explicit dependency between model parameters and classes. This work considers two standard architectures from the literature [40], described below and depicted graphically in Figure 1, which can incorporate class descriptions. Models are designed to be relatively simple, reflective of common best practices, and as similar as possible to avoid confounding and draw clear inferences about general performance patterns.</p>
      <p>Bi-Encoder: This model replaces the vector $\mathbf{w}_c$ with the output of an additional parameterized function taking class descriptions as input,
$$s_c = f_{\phi_1}(d_c)^\top f_{\phi_2}(x).$$</p>
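      <p>A corresponding bi-encoder sketch, continuing the imports above; modeling $f_{\phi_1}$ and $f_{\phi_2}$ as two separate encoders (rather than shared weights) is an assumption made here for illustration.</p>
      <preformat>
class BiEncoder(nn.Module):
    """Bi-encoder: s_c = f_phi1(d_c)^T f_phi2(x)."""

    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.class_encoder = AutoModel.from_pretrained(encoder_name)  # f_phi1
        self.doc_encoder = AutoModel.from_pretrained(encoder_name)    # f_phi2

    @staticmethod
    def encode(encoder, input_ids, attention_mask):
        out = encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # [CLS] pooling

    def forward(self, doc_inputs, class_inputs):
        doc_vecs = self.encode(self.doc_encoder, **doc_inputs)        # (B, d)
        class_vecs = self.encode(self.class_encoder, **class_inputs)  # (C, d)
        return doc_vecs @ class_vecs.T  # (B, C) scores, one s_c per pair
      </preformat>
      <p>Because the class tower never sees the document, class embeddings can be pre-computed once per taxonomy and reused at inference time, a property exploited in Section 5.3.</p>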
      <p>Cross-Encoder: A parameterized function that takes as input a concatenated document and class description (denoted by $\sqcup$). The model has a single additional parameter vector $\mathbf{w} \in \mathbb{R}^d$,
$$s_c = \mathbf{w}^\top f_\theta(x \sqcup d_c).$$</p>
      <sec id="sec-4-1">
        <title>3.1. Loss</title>
        <p>Given a dataset $\mathcal{D} = \{(x_1, \mathbf{y}_1), \ldots, (x_{|\mathcal{D}|}, \mathbf{y}_{|\mathcal{D}|})\}$, model parameters can be optimized by minimizing the negative log-likelihood
$$\mathcal{L}(\mathcal{D}) = |\mathcal{D}|^{-1} \sum_{i} \ell(i),$$
where
$$\ell(i) = -\sum_{c} \left( y_{ic} \log \sigma(s_{ic}) + (1 - y_{ic}) \log(1 - \sigma(s_{ic})) \right). \tag{3}$$</p>
        <p>Due to zero-shot approaches conditioning on class descriptions, computing the sum over each class in Equation (3) requires $|\mathcal{C}|$ forward passes through the model. This results in significant computational overhead when training. To alleviate this issue, the commonly used negative sampling [12] strategy is used to approximate the loss $\ell(\cdot)$,
$$\hat{\ell}(i) = -\log \frac{e^{s_{ic}}}{e^{s_{ic}} + \sum_{c'} e^{s_{ic'}} + \sum_{i'} e^{s_{i'c}}}, \tag{4}$$
where $c'$ and $i'$ are uniformly sampled such that $y_{ic} = 1$ and $y_{ic'} = y_{i'c} = 0$. The numbers of negative documents $i'$ and classes $c'$ are treated as hyper-parameters. Initial experiments also explored a Bernoulli rather than a categorical version of $\hat{\ell}(\cdot)$, but found the categorical version performed better.</p>
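        <p>Under the same assumptions as the sketches above, the sampled categorical loss of Equation (4) reduces to a cross-entropy with the positive pair placed at index 0 of each row:</p>
        <preformat>
import torch
import torch.nn.functional as F

def sampled_loss(s_pos, s_neg_classes, s_neg_docs):
    """Categorical approximation of Equation (4).

    s_pos:         (B,)   scores s_ic of positive pairs (y_ic = 1)
    s_neg_classes: (B, K) scores s_ic' for sampled classes with y_ic' = 0
    s_neg_docs:    (B, M) scores s_i'c for sampled documents with y_i'c = 0
    """
    logits = torch.cat([s_pos.unsqueeze(1), s_neg_classes, s_neg_docs], dim=1)
    # Each row is a softmax over one positive and K + M negatives; the
    # positive always sits at index 0, so the loss is plain cross-entropy.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
        </preformat>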
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Indeed Occupations</title>
        <p>Indeed’s internal U.S. occupation taxonomy was used as a representative source of structured knowledge. The taxonomy contains over a thousand occupations arranged hierarchically in a forest-like directed acyclic graph (DAG), with root nodes being general occupations, e.g., Healthcare Occupations, and leaf nodes being the most specific, e.g., Nurse Practitioners. In addition to their placement within the hierarchy, occupations are also associated with a natural language name and definition. Data formats are given in Table 1.</p>
        <p>[Table 1: The data representations used in this work. Jobs and occupations are converted to strings composed of multiple fields.]</p>
        <p>Experiments are designed to simulate real-world taxonomy expansions:</p>
        <list list-type="order">
          <list-item><p>Modify the original taxonomy by removing or merging a subset of classes to obtain the Source Taxonomy.</p></list-item>
          <list-item><p>Train classifiers on a dataset of instances labeled with classes from the Source Taxonomy.</p></list-item>
          <list-item><p>Revert the modifications from Step 1 to obtain the Target Taxonomy.</p></list-item>
          <list-item><p>Evaluate classifiers on a new dataset of instances labeled with classes from the Target Taxonomy.</p></list-item>
        </list>
        <p>Details of the taxonomy, datasets, and expansion types used in this work are given below.</p>
        <table-wrap id="table-2">
          <caption><p>Table 2: Distribution of the number of occupation labels per job.</p></caption>
          <table>
            <thead><tr><th>Labels</th><th>Jobs</th><th>Percent</th></tr></thead>
            <tbody>
              <tr><td>1</td><td>2,527</td><td>55%</td></tr>
              <tr><td>2</td><td>1,567</td><td>34%</td></tr>
              <tr><td>3</td><td>393</td><td>9%</td></tr>
              <tr><td>4</td><td>68</td><td>1%</td></tr>
              <tr><td>Total</td><td>4,555</td><td>100%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>[Figure 2: The refine (top) and extend (bottom) taxonomy expansion operations, each shown as a train/test pair. Each node represents a class. Models are evaluated on all classes. White and teal classes are observed during training; magenta classes are not observed during training. Teal classes replace their children during training.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Expansion Operations</title>
        <p>The two taxonomy expansion operations considered are
described below and depicted graphically in Figure 2.</p>
        <sec id="sec-5-2-1">
          <title>Refine</title>
          <p>This setting simulates the scenario where a subset of leaf classes are subdivided into more fine-grained classes. This sort of refinement can occur when gaps in the taxonomy surface after use, or in situations when the set of initial classes naturally diversifies over time. For example, an academic field of study may subdivide into more specialized subfields as it matures. Zero-shot classifiers in this setting must generalize to classes that are more specific versions of those encountered during training.</p>
          <p>To construct datasets in this setting, a random leaf
class is selected. Any appearances of this class or its
siblings are replaced with the parent class. This process
is repeated until a fixed percentage of leaf classes have
been replaced.</p>
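          <p>A sketch of the refine construction under a hypothetical tree-shaped view of the taxonomy (the actual taxonomy is a DAG, which would require tracking multiple parents):</p>
          <preformat>
import random

def refine(parent, labels, fraction):
    """Replace random leaf classes (and their siblings) with their parent
    until `fraction` of the leaf classes have been replaced.

    parent: dict mapping each class to its parent (a hypothetical
            tree-shaped view of the taxonomy; None marks a root).
    labels: list of per-document label sets, modified in place.
    """
    leaves = {c for c in parent if c not in set(parent.values())}
    target = int(len(leaves) * fraction)
    replaced = set()
    while target > len(replaced):
        leaf = random.choice(sorted(leaves - replaced))
        siblings = {c for c in leaves if parent[c] == parent[leaf]}
        for label_set in labels:
            if not label_set.isdisjoint(siblings):
                label_set -= siblings        # drop the fine-grained classes
                label_set.add(parent[leaf])  # relabel with their parent
        replaced |= siblings
    return labels
          </preformat>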
        </sec>
        <sec id="sec-5-2-2">
          <title>Extend</title>
          <p>This setting simulates the scenario where a set of classes is added from an unrelated domain. This situation can occur when new use cases surface that require classes that were not previously necessary. For example, if an e-commerce company that had historically only sold goods like household items and clothing began offering groceries, the previous product taxonomy would not be useful for organizing the new items. Zero-shot classifiers in this setting must generalize to classes that are significantly different from those encountered during training.</p>
          <p>To construct datasets in this setting, a random root class is selected, and any appearances of it or its descendants are omitted from the training labels. This process is repeated until a fixed percentage of classes have been removed.</p>
          <p>All models were trained on a single NVIDIA Tesla V100 GPU with 16GB of memory.</p>
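          <p>An analogous sketch of the extend construction, reusing the hypothetical parent format from the refine sketch; note that labels are omitted rather than relabeled:</p>
          <preformat>
def extend(parent, labels, fraction):
    """Omit random root classes and all of their descendants from training
    labels until roughly `fraction` of the classes have been removed.
    Uses the same hypothetical `parent` format as the refine sketch.
    """
    def ancestors(c):
        while parent[c] is not None:
            c = parent[c]
            yield c

    roots = [c for c in parent if parent[c] is None]
    random.shuffle(roots)
    removed = set()
    while roots and int(len(parent) * fraction) > len(removed):
        root = roots.pop()
        subtree = {root} | {c for c in parent if root in ancestors(c)}
        for label_set in labels:
            label_set -= subtree  # omitted, not relabeled
        removed |= subtree
    return labels
          </preformat>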
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <sec id="sec-6-1">
        <title>5.1. Generalizing to Novel Classes</title>
        <p>Performance was evaluated for different percentages of observed classes during training (coverage) for both the refine and extend expansion operations. LRAP and macro-AP are shown in Figure 3. The cross-encoder classifier was robust to both taxonomy refinement and expansion. Minimal performance degradation was observed with decreasing coverage, even in settings where over 50% of the classes are new and approximately 60% of the jobs are labeled with a new occupation. The bi-encoder performed significantly worse than the cross-encoder. This observation is consistent with prior work in the retrieval domain [40, 46]. However, the bi-encoder also suffered more performance degradation with decreasing coverage. For example, the bi-encoder’s macro-AP dropped by 36% when 50% of the classes are new (extend), whereas the cross-encoder’s macro-AP only decreased by 5%. Performance of the multi-label classifier degraded rapidly as coverage decreased, as it is unable to generalize to classes not observed during training.</p>
        <p>[Figure 3: LRAP and macro-AP versus the percent of observed occupations (refine) and the percent of jobs with observed occupations (extend) for the multi-label, bi-encoder, and cross-encoder classifiers.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Learning on a Budget</title>
        <p>Because the extend operation omits labels rather than relabeling them, zero-shot models had access to less training data in the previous experiments. To better understand the trade-off between fine-tuning and ZSL, experiments were conducted which controlled for the amount of data available for training. In particular, multi-label classifiers were trained on datasets where the number of documents was similar to ZSL approaches, but fewer documents per class are observed. Full results are presented in Table 3. The ZSL cross-encoder with 50% coverage and five documents per class resulted in a 13% relative increase in LRAP over the multi-label classifier with 100% coverage and three documents per class (similar training set size). This result was unexpected, as it suggests that given a small document labeling budget (&lt;4K here), in some settings it would be preferable to adopt ZSL and spend resources annotating more documents with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional classifiers.</p>
        <p>Further analysis of zero-shot performance is given in Table 4, which presents macro-AP by root class for unobserved classes in the extend setting with 50% coverage. Despite not being previously exposed to any classes from these domains, in all cases the cross-encoder outperformed the multi-label classifier explicitly trained on these classes.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Efficient Zero-Shot Inference</title>
        <p>As noted previously, there is a significant computational cost associated with training the transformer-based zero-shot learners due to the need to process each label for each document. While this cost can be amortized for the bi-encoder at inference time by pre-computing label embeddings, this is not possible for the cross-encoder architecture. Several works explore the architecture space between bi-encoders and cross-encoders to obtain a better trade-off between performance and latency [40, 46]. A simpler technique was explored in this work, inspired by the common decomposition of recommender systems into separate candidate retrieval and re-ranking phases [47].</p>
        <p>In the first phase, the more efficient bi-encoder was used to identify a small subset of potentially relevant candidate classes. This smaller set of candidates was then evaluated with the more computationally demanding, but higher performance, cross-encoder. Classes not selected in the first phase were implicitly assumed to receive a score of zero. Results are shown in Figure 4 for candidate set sizes from 2 to 64. Scoring only 16 candidates resulted in a small drop in LRAP (-2%) while yielding a nearly 98% reduction in computational overhead.</p>
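        <p>A sketch of the resulting two-phase inference, reusing the bi-encoder and cross-encoder sketches from Section 3; the make_pair helper and the candidate set size default are illustrative assumptions.</p>
        <preformat>
import torch

@torch.no_grad()
def filter_rerank(doc_inputs, class_vecs, make_pair, bi_encoder, cross_encoder, k=16):
    """Two-phase inference: the bi-encoder selects k candidate classes,
    the cross-encoder re-scores only those, and every other class keeps
    an implicit score of zero. `make_pair` is a hypothetical helper that
    tokenizes the document together with one class description.
    """
    # Phase 1 (filter): cheap dot products against pre-computed class vectors.
    doc_vec = bi_encoder.encode(bi_encoder.doc_encoder, **doc_inputs)  # (1, d)
    coarse = (doc_vec @ class_vecs.T).squeeze(0)                       # (C,)
    candidates = coarse.topk(k).indices.tolist()

    # Phase 2 (re-rank): |C| cross-encoder passes are reduced to just k.
    scores = torch.zeros_like(coarse)
    for c in candidates:
        scores[c] = cross_encoder(make_pair(c))[0]
    return scores
        </preformat>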
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Valuable insights, suggestions, and feedback was
provided by numerous individuals at Indeed. The author
would especially like to thank Suyi Tu, Josh Levy, Ethan
Handel, Arvi Sreenivasan, and Donal McMahon.
main experts that may be dificult to obtain in a purely
automated fashion. However, the ability to associate
classes with real-world classes can be a bottleneck for the
rapid expansion of taxonomies. Experiments presented
here demonstrate that modern zero-shot classification
techniques can sidestep this issue by classifying objects
with novel classes using only minimal human guidance.</p>
      <p>Better understanding and overcoming the failure
modes of the bi-encoder architecture would result in
more eficient systems capable of scaling larger
taxonomies, either as stand-alone systems or as part of a
multi-phase such as that described in Section 5.3. Related
work in the retrieval setting suggests adopting pretext
[48] tasks that are better aligned with the downstream
task of interest could alleviate these issues [49].
Alternatively, more elaborate negative sampling strategies
[50, 51] could improve both zero-shot techniques
studied in this work, and close any observed gaps between
zero-shot learners and traditional classifiers. Future work
should explore zero-shot capabilities in more
sophisticated knowledge bases (ontologies, knowledge graphs,
etc), a larger variety of class types, and diferent domains.</p>
      <p>Lastly, further experimentation is needed to fully explain
observed diferences between the results presented here
and those in [34] in order to better understand the success
and failure modes of entailment-based ZSL.
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-21"><label>21</label><mixed-citation>S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009).</mixed-citation></ref>
      <ref id="ref-22"><label>22</label><mixed-citation>X. Wei, W. B. Croft, LDA-based document models for ad-hoc retrieval, ACM SIGIR Conference on Research and Development in Information Retrieval (2006).</mixed-citation></ref>
      <ref id="ref-23"><label>23</label><mixed-citation>P. K. Pushp, M. M. Srivastava, Train once, test anywhere: Zero-shot learning for text classification, arXiv preprint arXiv:1712.05972 (2017).</mixed-citation></ref>
      <ref id="ref-24"><label>24</label><mixed-citation>S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation (1997).</mixed-citation></ref>
      <ref id="ref-25"><label>25</label><mixed-citation>Y. Kim, Convolutional neural networks for sentence classification, Empirical Methods in Natural Language Processing (2014).</mixed-citation></ref>
      <ref id="ref-26"><label>26</label><mixed-citation>J. Howard, S. Ruder, Universal language model fine-tuning for text classification, Association for Computational Linguistics (2018).</mixed-citation></ref>
      <ref id="ref-27"><label>27</label><mixed-citation>J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, North American Chapter of the Association for Computational Linguistics (2019).</mixed-citation></ref>
      <ref id="ref-28"><label>28</label><mixed-citation>W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, Empirical Methods in Natural Language Processing (2019).</mixed-citation></ref>
      <ref id="ref-29"><label>29</label><mixed-citation>A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, North American Chapter of the Association for Computational Linguistics (2018).</mixed-citation></ref>
      <ref id="ref-30"><label>30</label><mixed-citation>K. Halder, A. Akbik, J. Krapac, R. Vollgraf, Task-aware representation of sentences for generic text classification, International Conference on Computational Linguistics (2020).</mixed-citation></ref>
      <ref id="ref-31"><label>31</label><mixed-citation>T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, Conference of the European Chapter of the Association for Computational Linguistics (2021).</mixed-citation></ref>
      <ref id="ref-32"><label>32</label><mixed-citation>A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog (2019).</mixed-citation></ref>
      <ref id="ref-33"><label>33</label><mixed-citation>T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (2020).</mixed-citation></ref>
      <ref id="ref-34"><label>34</label><mixed-citation>T. Ma, J.-G. Yao, C.-Y. Lin, T. Zhao, Issues with entailment-based zero-shot text classification, Association for Computational Linguistics (2021).</mixed-citation></ref>
      <ref id="ref-35"><label>35</label><mixed-citation>S. Feng, E. Wallace, J. Boyd-Graber, Misleading failures of partial-input baselines, Association for Computational Linguistics (2019).</mixed-citation></ref>
      <ref id="ref-36"><label>36</label><mixed-citation>T. Niven, H.-Y. Kao, Probing neural network comprehension of natural language arguments, Association for Computational Linguistics (2019).</mixed-citation></ref>
      <ref id="ref-37"><label>37</label><mixed-citation>I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malakasiotis, N. Aletras, I. Androutsopoulos, An empirical study on large-scale multi-label text classification including few and zero-shot labels, Empirical Methods in Natural Language Processing (2020).</mixed-citation></ref>
      <ref id="ref-38"><label>38</label><mixed-citation>K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.</mixed-citation></ref>
      <ref id="ref-39"><label>39</label><mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017).</mixed-citation></ref>
      <ref id="ref-40"><label>40</label><mixed-citation>S. Humeau, K. Shuster, M.-A. Lachaux, J. Weston, Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring, International Conference on Learning Representations (2019).</mixed-citation></ref>
      <ref id="ref-41"><label>41</label><mixed-citation>G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, Data Mining and Knowledge Discovery Handbook (2009).</mixed-citation></ref>
      <ref id="ref-42"><label>42</label><mixed-citation>D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (2015).</mixed-citation></ref>
      <ref id="ref-43"><label>43</label><mixed-citation>J. Zhang, T. He, S. Sra, A. Jadbabaie, Why gradient clipping accelerates training: A theoretical justification for adaptivity, International Conference on Learning Representations (2019).</mixed-citation></ref>
      <ref id="ref-44"><label>44</label><mixed-citation>A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems (2019).</mixed-citation></ref>
      <ref id="ref-45"><label>45</label><mixed-citation>T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, Empirical Methods in Natural Language Processing: System Demonstrations (2020).</mixed-citation></ref>
      <ref id="ref-46"><label>46</label><mixed-citation>O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, ACM SIGIR Conference on Research and Development in Information Retrieval (2020).</mixed-citation></ref>
      <ref id="ref-47"><label>47</label><mixed-citation>P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, ACM Conference on Recommender Systems (2016).</mixed-citation></ref>
      <ref id="ref-48"><label>48</label><mixed-citation>L. Jing, Y. Tian, Self-supervised visual feature learning with deep neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2020).</mixed-citation></ref>
      <ref id="ref-49"><label>49</label><mixed-citation>W.-C. Chang, X. Y. Felix, Y.-W. Chang, Y. Yang, S. Kumar, Pre-training tasks for embedding-based large-scale retrieval, International Conference on Learning Representations (2019).</mixed-citation></ref>
      <ref id="ref-50"><label>50</label><mixed-citation>J. Weston, S. Bengio, N. Usunier, WSABIE: Scaling up to large vocabulary image annotation, International Joint Conference on Artificial Intelligence (2011).</mixed-citation></ref>
      <ref id="ref-51"><label>51</label><mixed-citation>J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, ACM SIGIR Conference on Research and Development in Information Retrieval (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>