=Paper= {{Paper |id=Vol-3218/paper8 |storemode=property |title=Flexible Job Classification with Zero-Shot Learning |pdfUrl=https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_8.pdf |volume=Vol-3218 |authors=Thom Lake |dblpUrl=https://dblp.org/rec/conf/hr-recsys/Lake22 }} ==Flexible Job Classification with Zero-Shot Learning== https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_8.pdf
Flexible Job Classification with Zero-Shot Learning
Thom Lake
Indeed


                                    Abstract
                                    Using a taxonomy to organize information requires classifying objects (documents, images, etc.) with appropriate taxonomic
                                    classes. The flexible nature of zero-shot learning is appealing for this task because it allows classifiers to naturally adapt to
                                    taxonomy modifications. This work studies zero-shot multi-label document classification with fine-tuned language models
                                    under realistic taxonomy expansion scenarios in the human resource domain. Experiments show that zero-shot learning can be
                                    highly effective in this setting. When controlling for training data budget, zero-shot classifiers achieve a 12% relative increase
                                    in macro-AP when compared to a traditional multi-label classifier trained on all classes. Counterintuitively, these results
                                    suggest in some settings it would be preferable to adopt zero-shot techniques and spend resources annotating more documents
                                    with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional
                                    classification techniques. Additional experiments demonstrate that adopting the well-known filter/re-rank decomposition
                                    from the recommender systems literature can significantly reduce the computational burden of high-performance zero-shot
                                    classifiers, empirically resulting in a 98% reduction in computational overhead for only a 2% relative decrease in performance.
                                    The evidence presented here demonstrates that zero-shot learning has the potential to significantly increase the flexibility of
                                    taxonomies and highlights directions for future research.

                                    Keywords
                                    Taxonomy, zero-shot learning, multi-label classification, natural language processing



1. Introduction

Taxonomies used to organize information must frequently be adapted to reflect external changes such as the introduction of new markets, the creation of specialized segments, or the addition of new features. This is especially true in the human resource (HR) domain, where new job, skill, and license categories must be created to accommodate a constantly evolving marketplace. Unfortunately, the techniques commonly used to label real-world objects (documents, images, etc.) with taxonomy classes are tightly coupled to the specific set of classes available when the classification system is developed. In order to add new classes, rule-based systems [1, 2] require the creation of new rules, and supervised machine learning techniques [3, 4, 5, 6] require labeling data with the new classes and training a new model. These requirements make operationalizing modifications of the underlying taxonomy cumbersome.

Unlike traditional supervised classification techniques, zero-shot learning (ZSL) techniques are able to generalize to new classes with minimal guidance [7, 8]. Applying ZSL to taxonomic classification has the potential to increase the flexibility of organizational data structures while retaining the performance benefits of machine learning techniques.

Within this context, this work empirically studies the performance of ZSL techniques for document classification in the HR domain. Experiments designed to simulate realistic taxonomy expansion scenarios show that ZSL is highly effective, outperforming standard supervised classifiers in low-resource settings. Further experiments demonstrate that adopting well-known techniques can significantly reduce the computational overhead of high-performance zero-shot classifiers.

RecSys in HR’22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA
tlake@indeed.com (T. Lake)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Related Work

There is a large body of previous work on ZSL [7, 8, 9, 10]. Early work in the computer vision domain [11] represented classes with pre-trained word embeddings [12] and trained models to align them with image embeddings in a shared vector space. Much of the subsequent work in ZSL has followed a similar embedding-based approach [13, 14, 15, 16].

A common assumption in ZSL is that the sets of training and test classes are disjoint. Noting that this is somewhat unrealistic, [17] proposed generalized zero-shot learning (GZSL), which assumes training classes are a strict subset of test classes [18, 19]. As this work is primarily concerned with classifiers that can adapt to a changing taxonomy, experiments are conducted within the GZSL framework.

While there has been less explicit research on ZSL for NLP, as noted by [20], most techniques for ad-hoc document retrieval [21, 22] can be leveraged for zero-shot document classification by treating the labels as queries. In [23], a standard classifier was applied to a combined representation of a document and label, produced with
[Figure 1 diagram. Traditional: Multi-Label (Input Tokens → Language Model → Class Probabilities). Zero-Shot: Bi-Encoder (Input Tokens → Language Model, Class Tokens → Language Model, combined into a Match Probability) and Cross-Encoder (Input + Class Tokens → Language Model → Match Probability).]
Figure 1: Graphical representation of models used in experiments. Traditional multi-label classifiers (left) output a probability
for each class. Zero-shot classifiers (right) model compatibility between an input and class description.



word embeddings or LSTMs [24]. [20] apply convolutional neural networks [25] over features derived from interactions between token and class embeddings.

Following the rise of transfer learning via fine-tuning for NLP [26, 27], recent approaches to zero-shot document classification have adopted similar techniques. In [28] zero-shot document classification was formulated as an entailment task. Pre-trained language models were fine-tuned either on a dataset containing a subset of classes, or on datasets for natural language inference (NLI) [29]. An identical entailment formulation was used in [30], which studied zero-shot transfer between datasets. Pre-trained language models were also used for zero-shot document classification in [31], which explored the use of cloze-style templates for zero-shot and few-shot document classification.

Autoregressive neural language models have been shown to possess some ZSL capabilities with proper prompting [32]. Significantly larger models have improved these results [33]. However, the computational demands of such large models make them unsuitable for most practical applications.

The benefit of fine-tuning for entailment-based ZSL was studied in [34]. Their experiments showed that fine-tuning on generic NLI datasets often results in worse ZSL performance, and the authors hypothesized that this is due to models exploiting lexical patterns and other spurious statistical cues [35, 36]. Experimental results presented here complement those in [34], suggesting their observations do not apply when even a small amount of task-specific training data is available.

The closely related work of [37] also studied GZSL for multi-label text classification. Their focus was on understanding the role of incorporating knowledge of the hierarchical label structure into models in both the few-shot and zero-shot settings. Instead, the work presented here specifically designs experiments to better understand the ability of standard GZSL techniques to generalize in realistic zero-shot settings when orders of magnitude less background training data are available.

3. Problem Formulation

Taxonomy classification is formulated in terms of a multi-label text classification problem. Let $Y$ be a set of classes, $x_i \in X$ a document, and $\mathbf{y}_i \in \{0, 1\}^{|Y|}$ a corresponding binary label vector where $y_{ij} = 1$ if document $x_i$ is labeled with class $j$ and 0 otherwise. A common probabilistic approach to multi-label text classification [38] is to assume conditional independence among labels,

$$p(\mathbf{y}_i \mid x_i) = \prod_j p(y_{ij} \mid x_i) = \prod_j q_{ij}^{y_{ij}} (1 - q_{ij})^{1 - y_{ij}}, \qquad (1)$$

and approximate the parameters of the conditional Bernoulli distributions, $0 \le q_{ij} \le 1$, using some model. A common choice is $q_{ij} \approx \sigma(r_{ij}) = (1 + e^{-r_{ij}})^{-1}$, where

$$r_{ij} = \mathbf{w}_j^T f_\theta(x_i), \qquad (2)$$

$\mathbf{w}_j \in \mathbb{R}^d$ is a vector of parameters, and $f_\theta : X \to \mathbb{R}^d$ is a function with parameters $\theta$, e.g., a transformer neural network [39]. In the remainder, the above is simply referred to as the standard multi-label model.

Because each class $j$ is associated with a distinct vector of parameters $\mathbf{w}_j$ in (2), the multi-label model is unable to generalize to classes not observed during training. To side-step this issue, ZSL assumes the existence of textual class descriptions $z_j \in X$ for each class $j \in Y$ which can be leveraged to break the explicit dependency between model parameters and classes. This work considers two standard architectures from the literature [40], described below and depicted graphically in Figure 1, which can incorporate class descriptions. Models are designed to be relatively simple, reflective of common best practices, and as similar as possible to avoid confounding and draw clear inferences about general performance patterns.
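
As a rough illustration of Equations (1) and (2), the following pure-Python sketch computes per-class probabilities and the label likelihood for a single document. It is not the paper's implementation: `doc_embedding` stands in for the output of the encoder (which would come from a transformer), and the function names and toy numbers are assumptions made for illustration only.

```python
import math


def sigmoid(r):
    """Logistic function sigma(r) = 1 / (1 + exp(-r))."""
    return 1.0 / (1.0 + math.exp(-r))


def multilabel_probs(doc_embedding, class_weights):
    """Return q_ij = sigma(w_j . f(x_i)) for every class j (Equation 2)."""
    return [
        sigmoid(sum(w * f for w, f in zip(w_j, doc_embedding)))
        for w_j in class_weights
    ]


def label_likelihood(labels, probs):
    """Return p(y_i | x_i) under conditional independence (Equation 1)."""
    p = 1.0
    for y, q in zip(labels, probs):
        p *= q if y == 1 else (1.0 - q)
    return p


# Toy stand-ins (hypothetical values): a 3-dimensional document
# embedding and one weight vector per class.
doc_embedding = [0.5, -1.0, 2.0]
class_weights = [
    [1.0, 0.0, 0.5],   # hypothetical w_1
    [-0.5, 1.0, 0.0],  # hypothetical w_2
]
probs = multilabel_probs(doc_embedding, class_weights)
likelihood = label_likelihood([1, 0], probs)
```

Because `class_weights` holds one row per class, scoring a class that was absent during training would require a new, untrained row. This is the dependency between parameters and classes that the zero-shot architectures remove.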
Bi-Encoder: This model replaces the vector $\mathbf{w}_j$ with the output of an additional parameterized function taking class descriptions as input,

$$r_{ij} = f_{\theta_1}(z_j)^T f_{\theta_2}(x_i).$$

Cross-Encoder: A parameterized function that takes as input a concatenated document and class description (denoted by $\sqcup$). The model has a single additional parameter vector $\mathbf{w} \in \mathbb{R}^d$,

$$r_{ij} = \mathbf{w}^T f_\theta(x_i \sqcup z_j).$$

3.1. Loss

Given a dataset $D = \{(x_1, \mathbf{y}_1), \ldots, (x_{|D|}, \mathbf{y}_{|D|})\}$, model parameters can be optimized by minimizing negative log-likelihood

$$\mathcal{L}(D) = |D|^{-1} \sum_i \ell(i),$$

where

$$\ell(i) = -\sum_j \left( y_{ij} \log \sigma(r_{ij}) + (1 - y_{ij}) \log (1 - \sigma(r_{ij})) \right). \qquad (3)$$

Because zero-shot approaches condition on class descriptions, computing the sum over each class in Equation (3) requires $|Y|$ forward passes through the model. This results in significant computational overhead when training. To alleviate this issue, the commonly used negative sampling [12] strategy is used to approximate the loss $\ell(\cdot)$,

$$\hat{\ell}(i) = -\log \frac{e^{r_{ij}}}{e^{r_{ij}} + \sum_{i'} e^{r_{i'j}} + \sum_{j'} e^{r_{ij'}}}, \qquad (4)$$

where $i'$, $j$, $j'$ are uniformly sampled such that $y_{ij} = 1$ and $y_{i'j} = y_{ij'} = 0$. The numbers of negative documents $i'$ and classes $j'$ are treated as hyper-parameters. Initial experiments also explored a Bernoulli rather than a categorical version of $\hat{\ell}(\cdot)$ but found the categorical version performed better.

4. Experiments

Experiments are designed to simulate real-world taxonomy expansion driven by domain experts. At a high level, all experiments follow the same process.

    1. Modify or remove classes to obtain the Source Taxonomy. Critically, this is done in a way that incorporates the underlying structure of the taxonomy to ensure coherent modifications, rather than simply removing classes at random.
    2. Train classifiers using a dataset of instances labeled with classes from the Source Taxonomy.
    3. Expand the Source Taxonomy by undoing the modifications from Step 1 to obtain the Target Taxonomy.
    4. Evaluate classifiers on a new dataset of instances labeled with classes from the Target Taxonomy.

Details of the taxonomy, datasets, and expansion types used in this work are given below.

4.1. Indeed Occupations

Indeed’s internal U.S. occupation taxonomy was used as a representative source of structured knowledge. The taxonomy contains over a thousand occupations arranged hierarchically in a forest-like directed acyclic graph (DAG), with root nodes being general occupations, e.g., Healthcare Occupations, and leaf nodes being the most specific, e.g., Nurse Practitioners. In addition to their placement within the hierarchy, occupations are also associated with a natural language name and definition. Data formats are given in Table 1.

Table 1
The data representations used in this work. Jobs and occupations are converted to strings composed of multiple fields.

          Object         Text
          Job            Title:  , Employer:     , Description:
          Occupation     Name:         , Definition:

Each job posted on Indeed is labeled with one or more occupations. The number of jobs per occupation for evaluation data is given in Table 2. Jobs were selected using stratified sampling by occupation. In particular, for each occupation $N$ jobs labeled with that occupation were randomly sampled without replacement. It should be noted that since jobs can be labeled with multiple occupations, this sampling strategy only guarantees datasets contain at least $N$ jobs per occupation, not that there are exactly $N$ jobs per occupation. The same procedure was used to sample disjoint subsets of jobs for training, validation, and testing.

Table 2
Test jobs by number of labels. Five jobs were sampled for each occupation for evaluation.

          Labels     Jobs    Percent
               1    2,527       55%
               2    1,567       34%
               3      393        9%
               4       68        1%
           Total    4,555      100%
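
The stratified sampling procedure described in Section 4.1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure actually used at Indeed: the function name `stratified_sample`, the `exclude` argument (used here to keep splits disjoint), and the toy job and occupation identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(job_labels, n_per_occupation, exclude=frozenset(), seed=0):
    """Sample up to n jobs per occupation, without replacement.

    job_labels maps a job id to the set of occupations it is labeled with.
    Jobs listed in `exclude` are never selected, which keeps splits disjoint.
    Returns the set of selected job ids.
    """
    rng = random.Random(seed)
    jobs_by_occupation = defaultdict(list)
    for job_id, occupations in job_labels.items():
        if job_id in exclude:
            continue
        for occupation in occupations:
            jobs_by_occupation[occupation].append(job_id)

    selected = set()
    for occupation in sorted(jobs_by_occupation):
        candidates = sorted(jobs_by_occupation[occupation])
        k = min(n_per_occupation, len(candidates))
        selected.update(rng.sample(candidates, k))
    return selected


# Hypothetical toy data: job2 carries two labels.
job_labels = {
    "job1": {"nurse"},
    "job2": {"nurse", "physician"},
    "job3": {"physician"},
    "job4": {"nurse"},
    "job5": {"physician"},
    "job6": {"nurse"},
    "job7": {"physician"},
}
test_jobs = stratified_sample(job_labels, n_per_occupation=1, seed=0)
train_jobs = stratified_sample(job_labels, n_per_occupation=1,
                               exclude=test_jobs, seed=1)
```

Since a multi-label job drawn for one occupation also counts toward its other occupations, an occupation can end up with more than `n_per_occupation` jobs, which matches the "at least N" guarantee noted in the text.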
[Figure 2 diagram: two class trees, Refine (top) and Extend (bottom), each marked with the classes seen at Train and Test time.]
Figure 2: Graphical representation of Refine (top) and Extend (bottom) taxonomy expansion operations. Each node represents a class. Models are evaluated on all classes. White and teal classes are observed during training. Magenta classes are not observed during training. Teal classes replace their children during training.

4.2. Expansion Operations

The two taxonomy expansion operations considered are described below and depicted graphically in Figure 2.

Refine: This setting simulates the scenario where a subset of leaf classes are subdivided into more fine-grained classes. This sort of refinement can occur when gaps in the taxonomy surface after use, or in situations when the set of initial classes naturally diversifies over time. For example, an academic field of study may subdivide into more specialized subfields as it matures. Zero-shot classifiers in this setting must generalize to classes that are more specific versions of those encountered during training.

To construct datasets in this setting, a random leaf class is selected. Any appearances of this class or its siblings are replaced with the parent class. This process is repeated until a fixed percentage of leaf classes have been replaced.

Extend: This setting simulates the scenario where a set of classes are added from an unrelated domain. This situation can occur when new use cases surface that require classes that were not previously necessary. For example, if an e-commerce company that had historically only sold goods like household items and clothing began offering groceries, the previous product taxonomy would not be useful for organizing the new items. Zero-shot classifiers in this setting must generalize to classes that are significantly different from those encountered during training.

To construct datasets in this setting, a random root class is selected. Any appearances of this class or its descendants are removed. This process is repeated until a fixed percentage of classes have been removed. At the end of the process, any document that no longer has any labels is removed from the training dataset.

4.3. Evaluation

Performance is evaluated in terms of a model’s ability to rank relevant classes for a particular document, and to rank documents with respect to a class. In both cases, average precision (AP) is used to measure the quality of a predicted ordering relative to ground truth labels. The difference is whether AP is computed for all labels and averaged over documents, typically referred to as label ranking average precision (LRAP) [41], or computed for all documents and averaged over labels, typically referred to as macro-AP. Formally, for matrices $\mathbf{Y} \in \{0, 1\}^{|D| \times |Y|}$ of ground truth binary labels and $\mathbf{R} \in \mathbb{R}^{|D| \times |Y|}$ of predicted scores,

$$\text{LRAP} = |D|^{-1} \sum_i \text{AP}(\mathbf{Y}_{i,:}, \mathbf{R}_{i,:})$$

$$\text{macro-AP} = |Y|^{-1} \sum_j \text{AP}(\mathbf{Y}_{:,j}, \mathbf{R}_{:,j})$$

where for vectors $\mathbf{y} \in \{0, 1\}^d$ and $\mathbf{r} \in \mathbb{R}^d$

$$\text{AP}(\mathbf{y}, \mathbf{r}) = \frac{1}{\sum_i y_i} \sum_i y_i \frac{|\{k \mid y_k = 1 \wedge r_k \ge r_i\}|}{|\{k \mid r_k \ge r_i\}|}.$$

4.4. Training Details

Following modern practices in NLP, models consist of a pre-trained transformer [39] backbone which is fine-tuned [26, 27] along with any additional parameters. All models use BERT-base [27] as a backbone language model. Hyper-parameters were manually tuned on a small subset of the training data using the multi-label model and fixed for all models and experiments. The Adam [42] optimizer was used with a learning rate of 2e-5 for pre-trained parameters and 2e-4 for randomly initialized parameters. The learning rate was warmed up over the first 10% of the updates and then linearly decayed to zero. The maximum gradient norm was clipped to 10 [43]. All models are trained for 20 epochs with a batch size of 64. Models are evaluated after each epoch and the final model is selected based on the LRAP on the validation dataset. The bi-encoder and cross-encoder models were trained using negative sampling with 8 negative classes and 4 negative inputs per positive training document (Equation 4). Experiments utilized the PyTorch [44] and Huggingface Transformers [45] libraries. All hyper-parameters not listed explicitly above are left to their default values. Experiments were conducted using a single NVIDIA Tesla V100 GPU with 16GB of memory.
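
The learning-rate schedule described in Section 4.4 (warm-up over the first 10% of updates, then linear decay to zero, with separate base rates for pre-trained and newly initialized parameters) can be sketched in plain Python as below. The paper does not say which scheduler implementation was used, so this is only one plausible reading. The function names are hypothetical, and the step count assumes the 4,555-document training set with batch size 64 and 20 epochs.

```python
import math

BACKBONE_LR = 2e-5  # pre-trained parameters (value from the text)
HEAD_LR = 2e-4      # randomly initialized parameters (value from the text)


def lr_multiplier(step, total_steps, warmup_fraction=0.1):
    """Scale factor: linear warm-up to 1.0, then linear decay to 0.0."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return step / warmup_steps
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


def learning_rates(step, total_steps):
    """Return (backbone_lr, head_lr) at a given update step."""
    scale = lr_multiplier(step, total_steps)
    return BACKBONE_LR * scale, HEAD_LR * scale


# Illustrative step count (an assumption): 4,555 documents,
# batch size 64, 20 epochs.
steps_per_epoch = math.ceil(4555 / 64)
total_steps = steps_per_epoch * 20
peak_rates = learning_rates(100, 1000)
```

Both parameter groups share the same multiplier, so the 10x ratio between the two learning rates is preserved throughout training.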
Table 3
LRAP and macro-AP by model, class coverage, minimum documents per class, and number of training documents in the
extend setting. Models denoted by † do not observe any task-specific training data.

          Model                    Class Coverage     Documents Per Class                Documents         LRAP        macro-AP
          Multi-Label                        100%                                 3             2733       0.569             0.496
          Multi-Label                         50%                                 5             2500       0.294             0.249
          Bi-Encoder                          50%                                 5             2500       0.362             0.349
          Cross-Encoder                       50%                                 5             2500       0.645             0.553
          Multi-Label                        100%                                 4             3614       0.638             0.564
          Multi-Label                         75%                                 5             3628       0.493             0.438
          Bi-Encoder                          75%                                 5             3628       0.480             0.447
          Cross-Encoder                       75%                                 5             3628       0.654             0.590
          Multi-Label                        100%                                 5             4555       0.697             0.635
          Bi-Encoder                         100%                                 5             4555       0.570             0.521
          Cross-Encoder                      100%                                 5             4555       0.682             0.613
          Cross-Encoder (NSP)†                  -                                 -                -       0.419             0.242
          TF-IDF†                               -                                 -                -       0.397             0.294
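
The LRAP and macro-AP columns in Table 3 follow the definitions in Section 4.3. As a rough aid to reading them, the sketch below implements those formulas directly in plain Python. It is an illustrative reading of the equations rather than the evaluation code used in the paper, and the function names and toy matrices are assumptions. Rows or columns without any positive label raise an error here, since AP is undefined for them.

```python
def average_precision(y, r):
    """AP(y, r) from Section 4.3; ties are handled via r_k >= r_i."""
    n_pos = sum(y)
    if n_pos == 0:
        raise ValueError("AP is undefined without a positive label")
    total = 0.0
    for y_i, r_i in zip(y, r):
        if y_i == 1:
            # Labels of all items scored at least as high as item i.
            at_or_above = [y_k for y_k, r_k in zip(y, r) if r_k >= r_i]
            total += sum(at_or_above) / len(at_or_above)
    return total / n_pos


def lrap(Y, R):
    """Mean AP over rows: rank classes for each document."""
    return sum(average_precision(y, r) for y, r in zip(Y, R)) / len(Y)


def macro_ap(Y, R):
    """Mean AP over columns: rank documents for each class."""
    cols_y, cols_r = list(zip(*Y)), list(zip(*R))
    aps = [average_precision(y, r) for y, r in zip(cols_y, cols_r)]
    return sum(aps) / len(aps)


# Toy example (hypothetical): 2 documents, 3 classes.
Y = [[1, 0, 1],
     [0, 1, 0]]
R = [[0.9, 0.8, 0.1],
     [0.2, 0.7, 0.4]]
lrap_value = lrap(Y, R)          # (5/6 + 1) / 2 = 11/12
macro_ap_value = macro_ap(Y, R)  # (1 + 1/2 + 1/2) / 3 = 2/3
```

The same score matrix gives different values under the two metrics because LRAP compares classes within a document, while macro-AP compares documents within a class.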



5. Results

5.1. Generalizing to Novel Classes

Performance was evaluated for different percentages of observed classes during training (coverage) for both the refine and extend expansion operations. LRAP and macro-AP are shown in Figure 3. The cross-encoder classifier was robust to both taxonomy refinement and expansion. Minimal performance degradation was observed with decreasing coverage, even in settings where over 50% of the classes are new and approximately 60% of the jobs are labeled with a new occupation. The bi-encoder performed significantly worse than the cross-encoder. This observation is consistent with prior work in the retrieval domain [40, 46]. However, the bi-encoder also suffered more performance degradation with decreasing coverage. For example, the bi-encoder’s macro-AP dropped by 36% when 50% of the classes are new (extend), whereas the cross-encoder’s macro-AP only decreased by 5%. Performance of the multi-label classifier degraded rapidly as coverage decreased, as it is unable to generalize to classes not observed during training.

[Figure 3 plots: LRAP (0.3 to 0.7) against Percent of Jobs with Observed Occupations (40% to 100%), and macro-AP (0.25 to 0.65) against Percent of Observed Occupations (50% to 100%). Legend: Multi-Label, Bi-Encoder, Cross-Encoder; Refine, Extend.]
Figure 3: LRAP (top) and macro-AP (bottom) under different taxonomy expansion operations. Models are identified by color and symbol. Line styles reflect the expansion operation, with dashed lines for refinement and solid lines for extension.

5.2. Learning on a Budget

Because the extend operation omits labels rather than relabeling them, zero-shot models had access to less training data in the previous experiments. To better understand the trade-off between fine-tuning and ZSL, experiments were conducted which controlled for the amount of data available for training. In particular, multi-label classifiers were trained on datasets where the number of documents was similar to ZSL approaches, but fewer documents per class are observed. Full results are presented in Table 3. The ZSL cross-encoder with 50% coverage and five documents per class resulted in a 13% relative increase in LRAP over the multi-label classifier with 100% coverage and three documents per class (similar training
Table 4
Zero-shot macro-AP for novel domains in the challenging extend scenario with 50% class coverage. † Because the multi-label
classifier is not capable of zero-shot generalization, it is trained with 100% class coverage, but fewer documents per class.

          Domain                                         Classes       Bi-Encoder   Cross-Encoder    Multi-Label†
          Personal Service                                    28            0.273           0.642            0.590
          Food & Beverage                                     25            0.245           0.619            0.555
          Cleaning & Grounds Maintenance                      25            0.277           0.584            0.563
          Marketing, Advertising & Public Relations           28            0.241           0.579            0.541
          Repair, Maintenance & Installation                  34            0.276           0.533            0.494
          Healthcare                                         156            0.250           0.532            0.584
          Protective & Security                               27            0.302           0.529            0.509
          Construction & Extraction                           54            0.265           0.527            0.465
          Architecture & Engineering                          36            0.207           0.474            0.399
          Sales, Retail & Customer Support                    31            0.244           0.472            0.453
          Supply Chain & Logistics                            32            0.243           0.457            0.435
          New Classes                                        478            0.251           0.534            0.523
          Old Classes                                        433            0.457           0.575            0.465
          All Classes                                        911            0.349           0.553            0.496
          Training Documents                                                2500             2500            2733
          Documents Per Class                                                  5                5               3
          Class Coverage                                                     50%              50%            100%



set size). This result was unexpected, as it suggests that given a small
document labeling budget (<4K here), in some settings it would be preferable to
adopt ZSL and spend resources annotating more documents with an incomplete set
of classes, rather than spreading the labeling budget uniformly over all classes
and using traditional classifiers.
   Further analysis of zero-shot performance is given in Table 4, which presents
macro-AP by root class for unobserved classes in the extend setting with 50%
coverage. Despite not being previously exposed to any classes from these
domains, the cross-encoder outperformed the multi-label classifier explicitly
trained on these classes in every domain except Healthcare.

Figure 4: LRAP for two-phase zero-shot classification for candidate set sizes
from 2 to 64 (x-axis: Number of Candidates; y-axis: LRAP; series: Cross-Encoder,
Bi-Encoder, Bi-Encoder + Cross-Encoder). Dashed lines depict the performance of
standalone models.
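
The relative differences implied by Table 4 can be checked with a few lines of
arithmetic. The sketch below is illustrative only: the macro-AP values are
copied from Table 4, while the names TABLE_4, relative_change, and summarize are
hypothetical helpers introduced here and are not part of the original work.

```python
# Macro-AP values copied from Table 4 as (cross-encoder, multi-label) pairs.
# The dictionary and helper names are illustrative, not from the paper.
TABLE_4 = {
    "Personal Service": (0.642, 0.590),
    "Food & Beverage": (0.619, 0.555),
    "Cleaning & Grounds Maintenance": (0.584, 0.563),
    "Marketing, Advertising & Public Relations": (0.579, 0.541),
    "Repair, Maintenance & Installation": (0.533, 0.494),
    "Healthcare": (0.532, 0.584),
    "Protective & Security": (0.529, 0.509),
    "Construction & Extraction": (0.527, 0.465),
    "Architecture & Engineering": (0.474, 0.399),
    "Sales, Retail & Customer Support": (0.472, 0.453),
    "Supply Chain & Logistics": (0.457, 0.435),
    "New Classes": (0.534, 0.523),
    "Old Classes": (0.575, 0.465),
    "All Classes": (0.553, 0.496),
}


def relative_change(new: float, baseline: float) -> float:
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


def summarize(table: dict) -> dict:
    """Relative macro-AP change of the cross-encoder over the multi-label model."""
    return {
        name: relative_change(cross_encoder, multi_label)
        for name, (cross_encoder, multi_label) in table.items()
    }


if __name__ == "__main__":
    for name, change in summarize(TABLE_4).items():
        print(f"{name:45s} {change:+.1%}")
```

With these numbers, the cross-encoder's macro-AP over all classes is roughly
11.5% higher in relative terms than that of the multi-label classifier, and
Healthcare is the only row where the difference is negative.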

5.3. Efficient Zero-Shot Inference
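
The two-phase (filter, then re-rank) inference scheme discussed in this section
can be sketched roughly as follows. This is a minimal illustration under stated
assumptions, not the implementation used in the experiments: the first-phase
score is assumed to be a dot product between a document embedding and
pre-computed label embeddings, and the cross-encoder is stood in for by a
hypothetical callable `rerank` that maps a label index to a score for the
current document.

```python
from typing import Callable, List, Sequence


def two_phase_scores(
    doc_vec: Sequence[float],
    label_vecs: Sequence[Sequence[float]],
    rerank: Callable[[int], float],
    num_candidates: int,
) -> List[float]:
    """Score every label for one document with a filter/re-rank scheme."""
    # Phase 1: cheap scoring against pre-computed label embeddings
    # (dot product is an assumption made for this sketch).
    cheap = [sum(d * l for d, l in zip(doc_vec, vec)) for vec in label_vecs]

    # Keep the indices of the highest-scoring candidate labels.
    order = sorted(range(len(cheap)), key=lambda i: cheap[i], reverse=True)
    candidates = order[:num_candidates]

    # Phase 2: run the expensive scorer on the candidates only.
    # Labels that were filtered out implicitly receive a score of zero.
    scores = [0.0] * len(label_vecs)
    for i in candidates:
        scores[i] = rerank(i)
    return scores


def expensive_call_reduction(num_candidates: int, num_classes: int) -> float:
    """Fraction of expensive scorer calls avoided per document."""
    return 1.0 - num_candidates / num_classes


if __name__ == "__main__":
    # Toy example with three labels and a stand-in for the cross-encoder.
    toy_rerank_scores = [0.9, 0.8, 0.7]
    result = two_phase_scores(
        doc_vec=[1.0, 0.0],
        label_vecs=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
        rerank=lambda i: toy_rerank_scores[i],
        num_candidates=2,
    )
    print(result)
    print(f"{expensive_call_reduction(16, 911):.1%}")
```

Assuming all 911 classes listed in Table 4 are scored for every document,
re-ranking only 16 candidates avoids about 98% of the expensive scorer calls,
which is in line with the reduction in computational overhead reported in this
section.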
As noted previously, there is a significant computational cost associated with
training the transformer-based zero-shot learners due to the need to process
each label for each document. While this cost can be amortized for the
bi-encoder at inference time by pre-computing label embeddings, this is not
possible for the cross-encoder architecture. Several works explore the
architecture space between bi-encoders and cross-encoders to obtain a better
trade-off between performance and latency [40, 46]. A simpler technique was
explored in this work, inspired by the common decomposition of recommender
systems into separate candidate retrieval and re-ranking phases [47].
   In the first phase, the more efficient bi-encoder was used to identify a
small subset of potentially relevant candidate classes. This smaller set of
candidates was then evaluated with the more computationally demanding, but
higher-performing, cross-encoder. Classes not selected in the first phase were
implicitly assumed to receive a score of zero. Results are shown in Figure 4 for
candidate set sizes from 2 to 64. Scoring only 16 candidates resulted in a small
drop in LRAP (-2%) while reducing computational overhead by nearly 98%.


6. Conclusion and Future Work

Taxonomies are widely used to organize knowledge and can easily incorporate
important information from domain experts that may be difficult to obtain in a
purely automated fashion. However, associating classes with real-world objects
can be a bottleneck for the rapid expansion of taxonomies. Experiments presented
here demonstrate that modern zero-shot classification techniques can sidestep
this issue by classifying objects with novel classes using only minimal human
guidance.
   Better understanding and overcoming the failure modes of the bi-encoder
architecture would result in more efficient systems capable of scaling to larger
taxonomies, either as stand-alone systems or as part of a multi-phase system
such as that described in Section 5.3. Related work in the retrieval setting
suggests that adopting pretext tasks [48] that are better aligned with the
downstream task of interest could alleviate these issues [49]. Alternatively,
more elaborate negative sampling strategies [50, 51] could improve both
zero-shot techniques studied in this work, and close any observed gaps between
zero-shot learners and traditional classifiers. Future work should explore
zero-shot capabilities in more sophisticated knowledge bases (ontologies,
knowledge graphs, etc.), a larger variety of class types, and different domains.
Lastly, further experimentation is needed to fully explain observed differences
between the results presented here and those in [34] in order to better
understand the success and failure modes of entailment-based ZSL.


Acknowledgments

Valuable insights, suggestions, and feedback were provided by numerous
individuals at Indeed. The author would especially like to thank Suyi Tu, Josh
Levy, Ethan Handel, Arvi Sreenivasan, and Donal McMahon.


References

 [1] L. Chiticariu, Y. Li, F. Reiss, Rule-based information extraction is dead!
     long live rule-based information extraction systems!, Empirical Methods in
     Natural Language Processing (2013).
 [2] M. Kejriwal, R. Shao, P. Szekely, Expert-guided entity extraction using
     expressive rules, ACM SIGIR Conference on Research and Development in
     Information Retrieval (2019).
 [3] S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia
     data, Empirical Methods in Natural Language Processing (2007).
 [4] I. Karadeniz, A. Özgür, Linking entities through an ontology using word
     embeddings and syntactic re-ranking, BMC Bioinformatics (2019).
 [5] T. Lee, Z. Wang, H. Wang, S.-w. Hwang, Attribute extraction and scoring: A
     probabilistic approach, IEEE International Conference on Data Engineering
     (2013).
 [6] R. Ghani, K. Probst, Y. Liu, M. Krema, A. Fano, Text mining for product
     attribute extraction, ACM SIGKDD Explorations Newsletter (2006).
 [7] H. Larochelle, D. Erhan, Y. Bengio, Zero-data learning of new tasks, AAAI
     Conference on Artificial Intelligence (2008).
 [8] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic
     representation: Dataless classification, AAAI Conference on Artificial
     Intelligence (2008).
 [9] Y. Xian, B. Schiele, Z. Akata, Zero-shot learning-the good, the bad and the
     ugly, IEEE Conference on Computer Vision and Pattern Recognition (2017).
[10] J. Chen, Y. Geng, Z. Chen, I. Horrocks, J. Z. Pan, H. Chen, Knowledge-aware
     zero-shot learning: Survey and perspective, Joint Conference on Artificial
     Intelligence (2021).
[11] R. Socher, M. Ganjoo, C. D. Manning, A. Ng, Zero-shot learning through
     cross-modal transfer, Advances in Neural Information Processing Systems
     (2013).
[12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed
     representations of words and phrases and their compositionality, Advances
     in Neural Information Processing Systems (2013).
[13] B. Romera-Paredes, P. Torr, An embarrassingly simple approach to zero-shot
     learning, International Conference on Machine Learning (2015).
[14] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, B. Schiele, Latent
     embeddings for zero-shot classification, IEEE Conference on Computer Vision
     and Pattern Recognition (2016).
[15] R. Qiao, L. Liu, C. Shen, A. Van Den Hengel, Less is more: zero-shot
     learning from online textual documents with noise suppression, IEEE
     Conference on Computer Vision and Pattern Recognition (2016).
[16] L. Zhang, T. Xiang, S. Gong, Learning a deep embedding model for zero-shot
     learning, IEEE Conference on Computer Vision and Pattern Recognition
     (2017).
[17] W.-L. Chao, S. Changpinyo, B. Gong, F. Sha, An empirical study and analysis
     of generalized zero-shot learning for object recognition in the wild,
     European Conference on Computer Vision (2016).
[18] S. Liu, M. Long, J. Wang, M. I. Jordan, Generalized zero-shot learning with
     deep calibration network, Advances in Neural Information Processing Systems
     (2018).
[19] F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, A
     review of generalized zero-shot learning methods, arXiv preprint
     arXiv:2011.08641 (2020).
[20] C. Li, W. Zhou, F. Ji, Y. Duan, H. Chen, A deep relevance model for
     zero-shot document filtering, Association for Computational Linguistics
     (2018).
[21] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework:
     BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009).
[22] X. Wei, W. B. Croft, LDA-based document models for ad-hoc retrieval, ACM
     SIGIR Conference on Research and Development in Information Retrieval
     (2006).
[23] P. K. Pushp, M. M. Srivastava, Train once, test anywhere: Zero-shot
     learning for text classification, arXiv preprint arXiv:1712.05972 (2017).
[24] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation
     (1997).
[25] Y. Kim, Convolutional neural networks for sentence classification,
     Empirical Methods in Natural Language Processing (2014).
[26] J. Howard, S. Ruder, Universal language model fine-tuning for text
     classification, Association for Computational Linguistics (2018).
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep
     bidirectional transformers for language understanding, North American
     Chapter of the Association for Computational Linguistics (2019).
[28] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification:
     Datasets, evaluation and entailment approach, Empirical Methods in Natural
     Language Processing (2019).
[29] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for
     sentence understanding through inference, North American Chapter of the
     Association for Computational Linguistics (2018).
[30] K. Halder, A. Akbik, J. Krapac, R. Vollgraf, Task-aware representation of
     sentences for generic text classification, International Conference on
     Computational Linguistics (2020).
[31] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text
     classification and natural language inference, Conference of the European
     Chapter of the Association for Computational Linguistics (2021).
[32] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al.,
     Language models are unsupervised multitask learners, OpenAI blog (2019).
[33] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
     A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are
     few-shot learners, Advances in Neural Information Processing Systems
     (2020).
[34] T. Ma, J.-G. Yao, C.-Y. Lin, T. Zhao, Issues with entailment-based
     zero-shot text classification, Association for Computational Linguistics
     (2021).
[35] S. Feng, E. Wallace, J. Boyd-Graber, Misleading failures of partial-input
     baselines, Association for Computational Linguistics (2019).
[36] T. Niven, H.-Y. Kao, Probing neural network comprehension of natural
     language arguments, Association for Computational Linguistics (2019).
[37] I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malakasiotis, N. Aletras,
     I. Androutsopoulos, An empirical study on large-scale multi-label text
     classification including few and zero-shot labels, Empirical Methods in
     Natural Language Processing (2020).
[38] K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
     Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural
     Information Processing Systems (2017).
[40] S. Humeau, K. Shuster, M.-A. Lachaux, J. Weston, Poly-encoders:
     Architectures and pre-training strategies for fast and accurate
     multi-sentence scoring, International Conference on Learning
     Representations (2019).
[41] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, Data Mining
     and Knowledge Discovery Handbook (2009).
[42] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization,
     International Conference on Learning Representations (2015).
[43] J. Zhang, T. He, S. Sra, A. Jadbabaie, Why gradient clipping accelerates
     training: A theoretical justification for adaptivity, International
     Conference on Learning Representations (2019).
[44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen,
     Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style,
     high-performance deep learning library, Advances in Neural Information
     Processing Systems (2019).
[45] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac,
     T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art
     natural language processing, Empirical Methods in Natural Language
     Processing: System Demonstrations (2020).
[46] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via
     contextualized late interaction over BERT, ACM SIGIR Conference on Research
     and Development in Information Retrieval (2020).
[47] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube
     recommendations, ACM Conference on Recommender Systems (2016).
[48] L. Jing, Y. Tian, Self-supervised visual feature learning with deep neural
     networks: A survey, IEEE Transactions on Pattern Analysis and Machine
     Intelligence 43 (2020).
[49] W.-C. Chang, F. X. Yu, Y.-W. Chang, Y. Yang, S. Kumar, Pre-training tasks
     for embedding-based large-scale retrieval, International Conference on
     Learning Representations (2019).
[50] J. Weston, S. Bengio, N. Usunier, WSABIE: Scaling
     up to large vocabulary image annotation, Joint
     Conference on Artificial Intelligence (2011).
[51] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Op-
     timizing dense retrieval model training with hard
     negatives, ACM SIGIR Conference on Research and
     Development in Information Retrieval (2021).