<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zero-Shot</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>23</lpage>
      <abstract>
<p>Using a taxonomy to organize information requires classifying objects (documents, images, etc.) with appropriate taxonomic classes. The flexible nature of zero-shot learning is appealing for this task because it allows classifiers to naturally adapt to taxonomy modifications. This work studies zero-shot multi-label document classification with fine-tuned language models under realistic taxonomy expansion scenarios in the human resource domain. Experiments show that zero-shot learning can be highly effective in this setting. When controlling for training data budget, zero-shot classifiers achieve a 12% relative increase in macro-AP when compared to a traditional multi-label classifier trained on all classes. Counterintuitively, these results suggest that in some settings it would be preferable to adopt zero-shot techniques and spend resources annotating more documents with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional classification techniques. Additional experiments demonstrate that adopting the well-known filter/re-rank decomposition from the recommender systems literature can significantly reduce the computational burden of high-performance zero-shot classifiers, empirically resulting in a 98% reduction in computational overhead for only a 2% relative decrease in performance. The evidence presented here demonstrates that zero-shot learning has the potential to significantly increase the flexibility of taxonomies and highlights directions for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Taxonomy</kwd>
        <kwd>zero-shot learning</kwd>
        <kwd>multi-label classification</kwd>
        <kwd>natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Taxonomies used to organize information must frequently be adapted to reflect external changes such as the introduction of new markets, the creation of specialized segments, or the addition of new features. This is especially true in the human resource (HR) domain, where new job, skill, and license categories must be created to accommodate a constantly evolving marketplace. Unfortunately, the techniques commonly used to label real-world objects (documents, images, etc.) with taxonomy classes are tightly coupled to the specific set of classes available when the classification system is developed. In order to add new classes, rule-based systems [1, 2] require the creation of new rules, and supervised machine learning techniques [3, 4, 5, 6] require labeling data with the new classes and training a new model. These requirements make operationalizing modifications of the underlying taxonomy cumbersome.</p>
      <p>Unlike traditional supervised classification techniques, zero-shot learning (ZSL) techniques are able to generalize to new classes with minimal guidance [7, 8]. Applying ZSL to taxonomic classification has the potential to increase the flexibility of organizational data structures while retaining the performance benefits of machine learning techniques.</p>
      <p>Within this context, this work empirically studies the application of ZSL to taxonomic classification in the HR domain. Experiments designed to simulate realistic taxonomy expansion scenarios show that ZSL can be highly effective in this setting.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related</title>
    </sec>
    <sec id="sec-3">
      <title>Work</title>
      <p>There is a large body of previous work on ZSL [7, 8, 9, 10].</p>
      <p>Early work in the computer vision domain [11] represented classes with pre-trained word embeddings [12] and trained models to align them with image embeddings in a shared vector space. Much of the subsequent work in ZSL has followed a similar embedding-based approach.</p>
      <p>A common assumption in ZSL is that the sets of train and test classes are disjoint. Noting that this is somewhat unrealistic, [17] proposed generalized zero-shot learning (GZSL), which assumes training classes are a strict subset of test classes [18, 19]. As this work is primarily concerned with classifiers that can adapt to a changing taxonomy, experiments are conducted within the GZSL framework.</p>
      <p>While there has been less explicit research on ZSL for NLP, as noted by [20], most techniques for ad-hoc document retrieval [21, 22] can be leveraged for zero-shot document classification by treating the labels as queries. In [23], a standard classifier was applied to a combined representation of a document and label, produced with LSTMs [24]. Convolutional neural networks [25] have also been applied over features derived from interactions between token and class embeddings.</p>
      <p>[Figure 1: The bi-encoder and cross-encoder architectures. Each takes a document and a class description as input and produces a match probability.]</p>
      <p>Following the rise of transfer learning via fine-tuning for NLP [26, 27], recent approaches to zero-shot document classification have adopted similar techniques.</p>
      <p>In [28], zero-shot document classification was formulated as an entailment task. Pre-trained language models were either fine-tuned on a dataset containing a subset of classes, or on datasets for natural language inference (NLI) [29]. An identical entailment formulation was used in [30], which studied zero-shot transfer between datasets.</p>
      <p>Pre-trained language models were also used for zero-shot document classification in [31], which explored the use of cloze-style templates for zero-shot and few-shot document classification.</p>
      <p>Autoregressive neural language models have been
shown to possess some ZSL capabilities with proper
prompting [32]. Significantly larger models have
improved these results [33]. However, the computational
demands of such large models make them unsuitable for
most practical applications.</p>
      <p>The benefit of fine-tuning for entailment-based ZSL was studied in [34]. Their experiments showed that fine-tuning on generic NLI datasets often results in worse ZSL performance, and they hypothesize this is due to models exploiting lexical patterns and other spurious statistical cues [35, 36]. Experimental results presented here complement those in [34], suggesting their observations do not apply when even a small amount of task-specific training data is available.</p>
      <p>The closely related work of [37] also studied GZSL for multi-label text classification. Their focus was on understanding the role of incorporating knowledge of the hierarchical label structure into models in both the few-shot and zero-shot settings. Instead, the work presented here specifically designs experiments to better understand the ability of standard GZSL techniques to adapt to taxonomy changes in a setting where orders of magnitude less background training data are available.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Problem Formulation</title>
      <p>Taxonomy classification is formulated in terms of a multi-label text classification problem. Let $\mathcal{C}$ be a set of classes, $x \in \mathcal{X}$ a document, and $\mathbf{y} \in \{0, 1\}^{|\mathcal{C}|}$ a corresponding binary label vector, where $y_c = 1$ if document $x$ is labeled with class $c$ and $y_c = 0$ otherwise. A common probabilistic approach to multi-label text classification [38] is to assume conditional independence among labels,
$$p(\mathbf{y} \mid x) = \prod_{c} p(y_c \mid x) = \prod_{c} p_c^{y_c} (1 - p_c)^{1 - y_c}, \tag{1}$$
and approximate the parameters of the conditional Bernoulli distributions, $0 \le p_c \le 1$, using some model. A common choice is $p_c \approx \sigma(s_c) = (1 + e^{-s_c})^{-1}$, where
$$s_c = \mathbf{w}_c^\top f_\theta(x), \tag{2}$$
$\mathbf{w}_c \in \mathbb{R}^d$ is a vector of parameters, and $f_\theta : \mathcal{X} \to \mathbb{R}^d$ is a function with parameters $\theta$, e.g., a transformer neural network [39]. In the remainder, the above is simply referred to as the standard multi-label model.</p>
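      <p>To make the standard multi-label model concrete, the following is a minimal PyTorch sketch. The encoder checkpoint and the use of [CLS] pooling are illustrative assumptions, not details prescribed by the paper.</p>
      <preformat>
import torch.nn as nn
from transformers import AutoModel

class MultiLabelClassifier(nn.Module):
    """Standard multi-label model: s_c = w_c^T f_theta(x), p_c = sigmoid(s_c)."""

    def __init__(self, num_classes, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # f_theta
        # One parameter vector w_c per class, stacked into a single linear layer.
        self.heads = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        doc_vec = out.last_hidden_state[:, 0]  # [CLS] pooling of f_theta(x)
        return self.heads(doc_vec)             # one logit s_c per class
      </preformat>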
      <p>Because each class $c$ is associated with a distinct vector of parameters $\mathbf{w}_c$ in (2), the multi-label model is unable to generalize to classes not observed during training. To side-step this issue, ZSL assumes the existence of a textual class description $d_c$ for each class $c \in \mathcal{C}$, which can be leveraged to break the explicit dependency between model parameters and classes. This work considers two standard architectures from the literature [40], described below and depicted graphically in Figure 1, which can incorporate class descriptions. Models are designed to be relatively simple, reflective of common best practices, and as similar as possible to avoid confounding and draw clear inferences about general performance patterns.</p>
      <p>Bi-Encoder: This model replaces the vector $\mathbf{w}_c$ with the output of an additional parameterized function taking class descriptions as input,
$$s_c = f_{\phi_1}(d_c)^\top f_{\phi_2}(x).$$</p>
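      <p>A corresponding bi-encoder sketch, continuing the imports above; modeling $f_{\phi_1}$ and $f_{\phi_2}$ as two separate encoders (rather than shared weights) is an assumption made here for illustration.</p>
      <preformat>
class BiEncoder(nn.Module):
    """Bi-encoder: s_c = f_phi1(d_c)^T f_phi2(x)."""

    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.class_encoder = AutoModel.from_pretrained(encoder_name)  # f_phi1
        self.doc_encoder = AutoModel.from_pretrained(encoder_name)    # f_phi2

    @staticmethod
    def encode(encoder, input_ids, attention_mask):
        out = encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # [CLS] pooling

    def forward(self, doc_inputs, class_inputs):
        doc_vecs = self.encode(self.doc_encoder, **doc_inputs)        # (B, d)
        class_vecs = self.encode(self.class_encoder, **class_inputs)  # (C, d)
        return doc_vecs @ class_vecs.T  # (B, C) scores, one s_c per pair
      </preformat>
      <p>Because the class tower never sees the document, class embeddings can be pre-computed once per taxonomy and reused at inference time, a property exploited in Section 5.3.</p>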
      <p>Cross-Encoder: A parameterized function that takes as input a concatenated document and class description (denoted by $\sqcup$). The model has a single additional parameter vector $\mathbf{w} \in \mathbb{R}^d$,
$$s_c = \mathbf{w}^\top f_\theta(x \sqcup d_c).$$</p>
      <sec id="sec-4-1">
        <title>3.1. Loss</title>
        <p>Given a dataset $\mathcal{D} = \{(x_1, \mathbf{y}_1), \ldots, (x_{|\mathcal{D}|}, \mathbf{y}_{|\mathcal{D}|})\}$, model parameters can be optimized by minimizing the negative log-likelihood
$$\mathcal{L}(\mathcal{D}) = |\mathcal{D}|^{-1} \sum_{i} \ell(i),$$
where
$$\ell(i) = -\sum_{c} \left( y_{ic} \log \sigma(s_{ic}) + (1 - y_{ic}) \log(1 - \sigma(s_{ic})) \right). \tag{3}$$</p>
        <p>Due to zero-shot approaches conditioning on class descriptions, computing the sum over each class in Equation (3) requires $|\mathcal{C}|$ forward passes through the model. This results in significant computational overhead when training. To alleviate this issue, the commonly used negative sampling [12] strategy is used to approximate the loss $\ell(\cdot)$,
$$\hat{\ell}(i) = -\log \frac{e^{s_{ic}}}{e^{s_{ic}} + \sum_{c'} e^{s_{ic'}} + \sum_{i'} e^{s_{i'c}}}, \tag{4}$$
where $c'$ and $i'$ are uniformly sampled such that $y_{ic} = 1$ and $y_{ic'} = y_{i'c} = 0$. The numbers of negative documents $i'$ and classes $c'$ are treated as hyper-parameters. Initial experiments also explored a Bernoulli rather than a categorical version of $\hat{\ell}(\cdot)$, but found the categorical version performed better.</p>
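        <p>Under the same assumptions as the sketches above, the sampled categorical loss of Equation (4) reduces to a cross-entropy with the positive pair placed at index 0 of each row:</p>
        <preformat>
import torch
import torch.nn.functional as F

def sampled_loss(s_pos, s_neg_classes, s_neg_docs):
    """Categorical approximation of Equation (4).

    s_pos:         (B,)   scores s_ic of positive pairs (y_ic = 1)
    s_neg_classes: (B, K) scores s_ic' for sampled classes with y_ic' = 0
    s_neg_docs:    (B, M) scores s_i'c for sampled documents with y_i'c = 0
    """
    logits = torch.cat([s_pos.unsqueeze(1), s_neg_classes, s_neg_docs], dim=1)
    # Each row is a softmax over one positive and K + M negatives; the
    # positive always sits at index 0, so the loss is plain cross-entropy.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
        </preformat>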
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Indeed Occupations</title>
        <p>Indeed’s internal U.S. occupation taxonomy was used as a representative source of structured knowledge. The taxonomy contains over a thousand occupations arranged hierarchically in a forest-like directed acyclic graph (DAG), with root nodes being general occupations, e.g., Healthcare Occupations, and leaf nodes being the most specific, e.g., Nurse Practitioners. In addition to their placement within the hierarchy, occupations are also associated with a natural language name and definition. Data formats are given in Table 1.</p>
        <p>[Table 1: The data representations used in this work. Jobs and occupations are converted to strings composed of multiple fields.]</p>
        <p>Experiments are designed to simulate real-world taxonomy expansions:</p>
        <list list-type="order">
          <list-item><p>Modify the original taxonomy by removing or merging a subset of classes to obtain the Source Taxonomy.</p></list-item>
          <list-item><p>Train classifiers on a dataset of instances labeled with classes from the Source Taxonomy.</p></list-item>
          <list-item><p>Revert the modifications from Step 1 to obtain the Target Taxonomy.</p></list-item>
          <list-item><p>Evaluate classifiers on a new dataset of instances labeled with classes from the Target Taxonomy.</p></list-item>
        </list>
        <p>Details of the taxonomy, datasets, and expansion types used in this work are given below.</p>
        <table-wrap id="table-2">
          <caption><p>Table 2: Distribution of the number of occupation labels per job.</p></caption>
          <table>
            <thead><tr><th>Labels</th><th>Jobs</th><th>Percent</th></tr></thead>
            <tbody>
              <tr><td>1</td><td>2,527</td><td>55%</td></tr>
              <tr><td>2</td><td>1,567</td><td>34%</td></tr>
              <tr><td>3</td><td>393</td><td>9%</td></tr>
              <tr><td>4</td><td>68</td><td>1%</td></tr>
              <tr><td>Total</td><td>4,555</td><td>100%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>[Figure 2: The refine (top) and extend (bottom) taxonomy expansion operations, each shown as a train/test pair. Each node represents a class. Models are evaluated on all classes. White and teal classes are observed during training; magenta classes are not observed during training. Teal classes replace their children during training.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Expansion Operations</title>
        <p>The two taxonomy expansion operations considered are
described below and depicted graphically in Figure 2.</p>
        <sec id="sec-5-2-1">
          <title>Refine</title>
          <p>This setting simulates the scenario where a subset of leaf classes are subdivided into more fine-grained classes. This sort of refinement can occur when gaps in the taxonomy surface after use, or in situations when the set of initial classes naturally diversifies over time. For example, an academic field of study may subdivide into more specialized subfields as it matures. Zero-shot classifiers in this setting must generalize to classes that are more specific versions of those encountered during training.</p>
          <p>To construct datasets in this setting, a random leaf
class is selected. Any appearances of this class or its
siblings are replaced with the parent class. This process
is repeated until a fixed percentage of leaf classes have
been replaced.</p>
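          <p>A sketch of the refine construction under a hypothetical tree-shaped view of the taxonomy (the actual taxonomy is a DAG, which would require tracking multiple parents):</p>
          <preformat>
import random

def refine(parent, labels, fraction):
    """Replace random leaf classes (and their siblings) with their parent
    until `fraction` of the leaf classes have been replaced.

    parent: dict mapping each class to its parent (a hypothetical
            tree-shaped view of the taxonomy; None marks a root).
    labels: list of per-document label sets, modified in place.
    """
    leaves = {c for c in parent if c not in set(parent.values())}
    target = int(len(leaves) * fraction)
    replaced = set()
    while target > len(replaced):
        leaf = random.choice(sorted(leaves - replaced))
        siblings = {c for c in leaves if parent[c] == parent[leaf]}
        for label_set in labels:
            if not label_set.isdisjoint(siblings):
                label_set -= siblings        # drop the fine-grained classes
                label_set.add(parent[leaf])  # relabel with their parent
        replaced |= siblings
    return labels
          </preformat>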
        </sec>
        <sec id="sec-5-2-2">
          <title>Extend</title>
          <p>This setting simulates the scenario where a set of classes is added from an unrelated domain. This situation can occur when new use cases surface that require classes that were not previously necessary. For example, if an e-commerce company that had historically only sold goods like household items and clothing began offering groceries, the previous product taxonomy would not be useful for organizing the new items. Zero-shot classifiers in this setting must generalize to classes that are significantly different from those encountered during training.</p>
          <p>To construct datasets in this setting, a random root class is selected, and any appearances of it or its descendants are omitted from the training labels. This process is repeated until a fixed percentage of classes have been removed.</p>
          <p>All models were trained on a single NVIDIA Tesla V100 GPU with 16GB of memory.</p>
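          <p>An analogous sketch of the extend construction, reusing the hypothetical parent format from the refine sketch; note that labels are omitted rather than relabeled:</p>
          <preformat>
def extend(parent, labels, fraction):
    """Omit random root classes and all of their descendants from training
    labels until roughly `fraction` of the classes have been removed.
    Uses the same hypothetical `parent` format as the refine sketch.
    """
    def ancestors(c):
        while parent[c] is not None:
            c = parent[c]
            yield c

    roots = [c for c in parent if parent[c] is None]
    random.shuffle(roots)
    removed = set()
    while roots and int(len(parent) * fraction) > len(removed):
        root = roots.pop()
        subtree = {root} | {c for c in parent if root in ancestors(c)}
        for label_set in labels:
            label_set -= subtree  # omitted, not relabeled
        removed |= subtree
    return labels
          </preformat>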
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <sec id="sec-6-1">
        <title>5.1. Generalizing to Novel Classes</title>
        <p>Performance was evaluated for different percentages of observed classes during training (coverage) for both the refine and extend expansion operations. LRAP and macro-AP are shown in Figure 3. The cross-encoder classifier was robust to both taxonomy refinement and expansion. Minimal performance degradation was observed with decreasing coverage, even in settings where over 50% of the classes are new and approximately 60% of the jobs are labeled with a new occupation. The bi-encoder performed significantly worse than the cross-encoder. This observation is consistent with prior work in the retrieval domain [40, 46]. However, the bi-encoder also suffered more performance degradation with decreasing coverage. For example, the bi-encoder’s macro-AP dropped by 36% when 50% of the classes are new (extend), whereas the cross-encoder’s macro-AP only decreased by 5%. Performance of the multi-label classifier degraded rapidly as coverage decreased, as it is unable to generalize to classes not observed during training.</p>
        <p>[Figure 3: LRAP and macro-AP versus the percent of observed occupations (refine) and the percent of jobs with observed occupations (extend) for the multi-label, bi-encoder, and cross-encoder classifiers.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Learning on a Budget</title>
        <p>Because the extend operation omits labels rather than relabeling them, zero-shot models had access to less training data in the previous experiments. To better understand the trade-off between fine-tuning and ZSL, experiments were conducted which controlled for the amount of data available for training. In particular, multi-label classifiers were trained on datasets where the number of documents was similar to ZSL approaches, but fewer documents per class are observed. Full results are presented in Table 3. The ZSL cross-encoder with 50% coverage and five documents per class resulted in a 13% relative increase in LRAP over the multi-label classifier with 100% coverage and three documents per class (similar training set size). This result was unexpected, as it suggests that given a small document labeling budget (&lt;4K here), in some settings it would be preferable to adopt ZSL and spend resources annotating more documents with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional classifiers.</p>
        <p>Further analysis of zero-shot performance is given in Table 4, which presents macro-AP by root class for unobserved classes in the extend setting with 50% coverage. Despite not being previously exposed to any classes from these domains, in all cases the cross-encoder outperformed the multi-label classifier explicitly trained on these classes.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Efficient Zero-Shot Inference</title>
        <p>As noted previously, there is a significant computational cost associated with training the transformer-based zero-shot learners due to the need to process each label for each document. While this cost can be amortized for the bi-encoder at inference time by pre-computing label embeddings, this is not possible for the cross-encoder architecture. Several works explore the architecture space between bi-encoders and cross-encoders to obtain a better trade-off between performance and latency [40, 46]. A simpler technique was explored in this work, inspired by the common decomposition of recommender systems into separate candidate retrieval and re-ranking phases [47].</p>
        <p>In the first phase, the more efficient bi-encoder was used to identify a small subset of potentially relevant candidate classes. This smaller set of candidates was then evaluated with the more computationally demanding, but higher performance, cross-encoder. Classes not selected in the first phase were implicitly assumed to receive a score of zero. Results are shown in Figure 4 for candidate set sizes from 2 to 64. Scoring only 16 candidates resulted in a small drop in LRAP (-2%) while yielding a nearly 98% reduction in computational overhead.</p>
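        <p>A sketch of the resulting two-phase inference, reusing the bi-encoder and cross-encoder sketches from Section 3; the make_pair helper and the candidate set size default are illustrative assumptions.</p>
        <preformat>
import torch

@torch.no_grad()
def filter_rerank(doc_inputs, class_vecs, make_pair, bi_encoder, cross_encoder, k=16):
    """Two-phase inference: the bi-encoder selects k candidate classes,
    the cross-encoder re-scores only those, and every other class keeps
    an implicit score of zero. `make_pair` is a hypothetical helper that
    tokenizes the document together with one class description.
    """
    # Phase 1 (filter): cheap dot products against pre-computed class vectors.
    doc_vec = bi_encoder.encode(bi_encoder.doc_encoder, **doc_inputs)  # (1, d)
    coarse = (doc_vec @ class_vecs.T).squeeze(0)                       # (C,)
    candidates = coarse.topk(k).indices.tolist()

    # Phase 2 (re-rank): |C| cross-encoder passes are reduced to just k.
    scores = torch.zeros_like(coarse)
    for c in candidates:
        scores[c] = cross_encoder(make_pair(c))[0]
    return scores
        </preformat>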
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Valuable insights, suggestions, and feedback was
provided by numerous individuals at Indeed. The author
would especially like to thank Suyi Tu, Josh Levy, Ethan
Handel, Arvi Sreenivasan, and Donal McMahon.
main experts that may be dificult to obtain in a purely
automated fashion. However, the ability to associate
classes with real-world classes can be a bottleneck for the
rapid expansion of taxonomies. Experiments presented
here demonstrate that modern zero-shot classification
techniques can sidestep this issue by classifying objects
with novel classes using only minimal human guidance.</p>
      <p>Better understanding and overcoming the failure
modes of the bi-encoder architecture would result in
more eficient systems capable of scaling larger
taxonomies, either as stand-alone systems or as part of a
multi-phase such as that described in Section 5.3. Related
work in the retrieval setting suggests adopting pretext
[48] tasks that are better aligned with the downstream
task of interest could alleviate these issues [49].
Alternatively, more elaborate negative sampling strategies
[50, 51] could improve both zero-shot techniques
studied in this work, and close any observed gaps between
zero-shot learners and traditional classifiers. Future work
should explore zero-shot capabilities in more
sophisticated knowledge bases (ontologies, knowledge graphs,
etc), a larger variety of class types, and diferent domains.</p>
      <p>Lastly, further experimentation is needed to fully explain
observed diferences between the results presented here
and those in [34] in order to better understand the success
and failure modes of entailment-based ZSL.
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-21"><label>21</label><mixed-citation>S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009).</mixed-citation></ref>
      <ref id="ref-22"><label>22</label><mixed-citation>X. Wei, W. B. Croft, LDA-based document models for ad-hoc retrieval, ACM SIGIR Conference on Research and Development in Information Retrieval (2006).</mixed-citation></ref>
      <ref id="ref-23"><label>23</label><mixed-citation>P. K. Pushp, M. M. Srivastava, Train once, test anywhere: Zero-shot learning for text classification, arXiv preprint arXiv:1712.05972 (2017).</mixed-citation></ref>
      <ref id="ref-24"><label>24</label><mixed-citation>S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation (1997).</mixed-citation></ref>
      <ref id="ref-25"><label>25</label><mixed-citation>Y. Kim, Convolutional neural networks for sentence classification, Empirical Methods in Natural Language Processing (2014).</mixed-citation></ref>
      <ref id="ref-26"><label>26</label><mixed-citation>J. Howard, S. Ruder, Universal language model fine-tuning for text classification, Association for Computational Linguistics (2018).</mixed-citation></ref>
      <ref id="ref-27"><label>27</label><mixed-citation>J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, North American Chapter of the Association for Computational Linguistics (2019).</mixed-citation></ref>
      <ref id="ref-28"><label>28</label><mixed-citation>W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, Empirical Methods in Natural Language Processing (2019).</mixed-citation></ref>
      <ref id="ref-29"><label>29</label><mixed-citation>A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, North American Chapter of the Association for Computational Linguistics (2018).</mixed-citation></ref>
      <ref id="ref-30"><label>30</label><mixed-citation>K. Halder, A. Akbik, J. Krapac, R. Vollgraf, Task-aware representation of sentences for generic text classification, International Conference on Computational Linguistics (2020).</mixed-citation></ref>
      <ref id="ref-31"><label>31</label><mixed-citation>T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, Conference of the European Chapter of the Association for Computational Linguistics (2021).</mixed-citation></ref>
      <ref id="ref-32"><label>32</label><mixed-citation>A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog (2019).</mixed-citation></ref>
      <ref id="ref-33"><label>33</label><mixed-citation>T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (2020).</mixed-citation></ref>
      <ref id="ref-34"><label>34</label><mixed-citation>T. Ma, J.-G. Yao, C.-Y. Lin, T. Zhao, Issues with entailment-based zero-shot text classification, Association for Computational Linguistics (2021).</mixed-citation></ref>
      <ref id="ref-35"><label>35</label><mixed-citation>S. Feng, E. Wallace, J. Boyd-Graber, Misleading failures of partial-input baselines, Association for Computational Linguistics (2019).</mixed-citation></ref>
      <ref id="ref-36"><label>36</label><mixed-citation>T. Niven, H.-Y. Kao, Probing neural network comprehension of natural language arguments, Association for Computational Linguistics (2019).</mixed-citation></ref>
      <ref id="ref-37"><label>37</label><mixed-citation>I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malakasiotis, N. Aletras, I. Androutsopoulos, An empirical study on large-scale multi-label text classification including few and zero-shot labels, Empirical Methods in Natural Language Processing (2020).</mixed-citation></ref>
      <ref id="ref-38"><label>38</label><mixed-citation>K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.</mixed-citation></ref>
      <ref id="ref-39"><label>39</label><mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017).</mixed-citation></ref>
      <ref id="ref-40"><label>40</label><mixed-citation>S. Humeau, K. Shuster, M.-A. Lachaux, J. Weston, Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring, International Conference on Learning Representations (2019).</mixed-citation></ref>
      <ref id="ref-41"><label>41</label><mixed-citation>G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, Data Mining and Knowledge Discovery Handbook (2009).</mixed-citation></ref>
      <ref id="ref-42"><label>42</label><mixed-citation>D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (2015).</mixed-citation></ref>
      <ref id="ref-43"><label>43</label><mixed-citation>J. Zhang, T. He, S. Sra, A. Jadbabaie, Why gradient clipping accelerates training: A theoretical justification for adaptivity, International Conference on Learning Representations (2019).</mixed-citation></ref>
      <ref id="ref-44"><label>44</label><mixed-citation>A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems (2019).</mixed-citation></ref>
      <ref id="ref-45"><label>45</label><mixed-citation>T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, Empirical Methods in Natural Language Processing: System Demonstrations (2020).</mixed-citation></ref>
      <ref id="ref-46"><label>46</label><mixed-citation>O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, ACM SIGIR Conference on Research and Development in Information Retrieval (2020).</mixed-citation></ref>
      <ref id="ref-47"><label>47</label><mixed-citation>P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, ACM Conference on Recommender Systems (2016).</mixed-citation></ref>
      <ref id="ref-48"><label>48</label><mixed-citation>L. Jing, Y. Tian, Self-supervised visual feature learning with deep neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2020).</mixed-citation></ref>
      <ref id="ref-49"><label>49</label><mixed-citation>W.-C. Chang, X. Y. Felix, Y.-W. Chang, Y. Yang, S. Kumar, Pre-training tasks for embedding-based large-scale retrieval, International Conference on Learning Representations (2019).</mixed-citation></ref>
      <ref id="ref-50"><label>50</label><mixed-citation>J. Weston, S. Bengio, N. Usunier, WSABIE: Scaling up to large vocabulary image annotation, International Joint Conference on Artificial Intelligence (2011).</mixed-citation></ref>
      <ref id="ref-51"><label>51</label><mixed-citation>J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, ACM SIGIR Conference on Research and Development in Information Retrieval (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>