=Paper= {{Paper |id=Vol-3218/paper8 |storemode=property |title=Flexible Job Classification with Zero-Shot Learning |pdfUrl=https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_8.pdf |volume=Vol-3218 |authors=Thom Lake |dblpUrl=https://dblp.org/rec/conf/hr-recsys/Lake22 }} ==Flexible Job Classification with Zero-Shot Learning== https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_8.pdf

Flexible Job Classification with Zero-Shot Learning
Thom Lake
Indeed

Abstract
Using a taxonomy to organize information requires classifying objects (documents, images, etc) with appropriate taxonomic
classes. The flexible nature of zero-shot learning is appealing for this task because it allows classifiers to naturally adapt to
taxonomy modifications. This work studies zero-shot multi-label document classification with fine-tuned language models
under realistic taxonomy expansion scenarios in the human resource domain. Experiments show that zero-shot learning can be
highly effective in this setting. When controlling for training data budget, zero-shot classifiers achieve a 12% relative increase
in macro-AP when compared to a traditional multi-label classifier trained on all classes. Counterintuitively, these results
suggest in some settings it would be preferable to adopt zero-shot techniques and spend resources annotating more documents
with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional
classification techniques. Additional experiments demonstrate that adopting the well-known filter/re-rank decomposition
from the recommender systems literature can significantly reduce the computational burden of high-performance zero-shot
classifiers, empirically resulting in a 98% reduction in computational overhead for only a 2% relative decrease in performance.
The evidence presented here demonstrates that zero-shot learning has the potential to significantly increase the flexibility of
taxonomies and highlights directions for future research.

Keywords
Taxonomy, zero-shot learning, multi-label classification, natural language processing

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA
Contact: tlake@indeed.com (T. Lake)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Taxonomies used to organize information must frequently be adapted to reflect external changes such as the introduction of new markets, the creation of specialized segments, or the addition of new features. This is especially true in the human resource (HR) domain, where new job, skill, and license categories must be created to accommodate a constantly evolving marketplace. Unfortunately, the techniques commonly used to label real-world objects (documents, images, etc.) with taxonomy classes are tightly coupled to the specific set of classes available when the classification system is developed. In order to add new classes, rule-based systems [1, 2] require the creation of new rules, and supervised machine learning techniques [3, 4, 5, 6] require labeling data with the new classes and training a new model. These requirements make operationalizing modifications of the underlying taxonomy cumbersome.

Unlike traditional supervised classification techniques, zero-shot learning (ZSL) techniques are able to generalize to new classes with minimal guidance [7, 8]. Applying ZSL to taxonomic classification has the potential to increase the flexibility of organizational data structures while retaining the performance benefits of machine learning techniques.

Within this context, this work empirically studies the performance of ZSL techniques for document classification in the HR domain. Experiments designed to simulate realistic taxonomy expansion scenarios show that ZSL is highly effective, outperforming standard supervised classifiers in low-resource settings. Further experiments demonstrate that adopting well-known techniques can significantly reduce the computational overhead of high-performance zero-shot classifiers.

2. Related Work

There is a large body of previous work on ZSL [7, 8, 9, 10]. Early work in the computer vision domain [11] represented classes with pre-trained word embeddings [12] and trained models to align them with image embeddings in a shared vector space. Much of the subsequent work in ZSL has followed a similar embedding-based approach [13, 14, 15, 16].

A common assumption in ZSL is that the sets of train and test classes are disjoint. Noting that this is somewhat unrealistic, [17] proposed generalized zero-shot learning (GZSL), which assumes training classes are a strict subset of test classes [18, 19]. As this work is primarily concerned with classifiers that can adapt to a changing taxonomy, experiments are conducted within the GZSL framework.

While there has been less explicit research on ZSL for NLP, as noted by [20], most techniques for ad-hoc document retrieval [21, 22] can be leveraged for zero-shot document classification by treating the labels as queries. In [23], a standard classifier was applied to a combined representation of a document and label, produced with
word embeddings or LSTMs [24]. [20] apply convolutional neural networks [25] over features derived from interactions between token and class embeddings.

[Figure 1 diagram: three architectures, each built on a language model. Traditional: Multi-Label (input tokens to class probabilities). Zero-Shot: Bi-Encoder (input tokens and class tokens pass through separate language models to a match probability) and Cross-Encoder (input + class tokens pass through one language model to a match probability).]
Figure 1: Graphical representation of models used in experiments. Traditional multi-label classifiers (left) output a probability for each class. Zero-shot classifiers (right) model compatibility between an input and class description.

Following the rise of transfer learning via fine-tuning for NLP [26, 27], recent approaches to zero-shot document classification have adopted similar techniques. In [28], zero-shot document classification was formulated as an entailment task. Pre-trained language models were fine-tuned either on a dataset containing a subset of classes or on datasets for natural language inference (NLI) [29]. An identical entailment formulation was used in [30], which studied zero-shot transfer between datasets. Pre-trained language models were also used for zero-shot document classification in [31], which explored the use of cloze-style templates for zero-shot and few-shot document classification.

Autoregressive neural language models have been shown to possess some ZSL capabilities with proper prompting [32]. Significantly larger models have improved these results [33]. However, the computational demands of such large models make them unsuitable for most practical applications.

The benefit of fine-tuning for entailment-based ZSL was studied in [34]. Their experiments showed that fine-tuning on generic NLI datasets often results in worse ZSL performance, and they hypothesize this is due to models exploiting lexical patterns and other spurious statistical cues [35, 36]. Experimental results presented here complement those in [34], suggesting their observations do not apply when even a small amount of task-specific training data is available.

The closely related work of [37] also studied GZSL for multi-label text classification. Their focus was on understanding the role of incorporating knowledge of the hierarchical label structure into models in both the few-shot and zero-shot settings. Instead, the work presented here specifically designs experiments to better understand the ability of standard GZSL techniques to generalize in realistic zero-shot settings when orders of magnitude less background training data are available.

3. Problem Formulation

Taxonomy classification is formulated in terms of a multi-label text classification problem. Let $Y$ be a set of classes, $x_i \in X$ a document, and $\mathbf{y}_i \in \{0, 1\}^{|Y|}$ a corresponding binary label vector where $y_{ij} = 1$ if document $x_i$ is labeled with class $j$ and 0 otherwise. A common probabilistic approach to multi-label text classification [38] is to assume conditional independence among labels,

$$p(\mathbf{y}_i \mid x_i) = \prod_j p(y_{ij} \mid x_i) = \prod_j q_{ij}^{y_{ij}} (1 - q_{ij})^{1 - y_{ij}}, \tag{1}$$

and approximate the parameters of the conditional Bernoulli distributions, $0 \le q_{ij} \le 1$, using some model. A common choice is $q_{ij} \approx \sigma(r_{ij}) = (1 + e^{-r_{ij}})^{-1}$, where

$$r_{ij} = \mathbf{w}_j^T f_\theta(x_i), \tag{2}$$

$\mathbf{w}_j \in \mathbb{R}^d$ is a vector of parameters, and $f_\theta : X \to \mathbb{R}^d$ is a function with parameters $\theta$, e.g., a transformer neural network [39]. In the remainder, the above is simply referred to as the standard multi-label model.

Because each class $j$ is associated with a distinct vector of parameters $\mathbf{w}_j$ in (2), the multi-label model is unable to generalize to classes not observed during training. To side-step this issue, ZSL assumes the existence of textual class descriptions $z_j \in X$ for each class $j \in Y$ which can be leveraged to break the explicit dependency between model parameters and classes. This work considers two standard architectures from the literature [40], described below and depicted graphically in Figure 1, which can incorporate class descriptions. Models are designed to be relatively simple, reflective of common best practices, and as similar as possible to avoid confounding and draw clear inferences about general performance patterns.
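The standard multi-label model of Equations (1) and (2) can be illustrated with a minimal sketch in plain Python. The embedding and weight values below are toy numbers chosen for illustration, and the function names are hypothetical; the paper's models use a fine-tuned transformer for $f_\theta$, which is replaced here by a fixed vector.

```python
import math


def sigmoid(r):
    """Logistic function from the text: sigma(r) = 1 / (1 + exp(-r))."""
    return 1.0 / (1.0 + math.exp(-r))


def class_probabilities(doc_embedding, class_weights):
    """Equation (2): q_ij = sigma(w_j . f(x_i)) for every class j."""
    return [
        sigmoid(sum(w * f for w, f in zip(w_j, doc_embedding)))
        for w_j in class_weights
    ]


def label_likelihood(labels, probs):
    """Equation (1): product of independent Bernoulli terms."""
    likelihood = 1.0
    for y, q in zip(labels, probs):
        likelihood *= q if y == 1 else 1.0 - q
    return likelihood


# Toy values (assumptions): embedding size d = 2 and |Y| = 3 classes.
doc_embedding = [1.0, -1.0]  # stands in for f_theta(x_i)
class_weights = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]  # one w_j per class
probs = class_probabilities(doc_embedding, class_weights)
likelihood = label_likelihood([1, 0, 0], probs)
```

In this sketch, supporting a new class means adding a new row to `class_weights`, which has to be learned from labeled data. That is the dependency between parameters and classes that the zero-shot architectures remove.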
Bi-Encoder: This model replaces the vector $\mathbf{w}_j$ with the output of an additional parameterized function taking class descriptions as input,

$$r_{ij} = f_{\theta_1}(z_j)^T f_{\theta_2}(x_i).$$

Cross-Encoder: A parameterized function that takes as input a concatenated document and class description (concatenation is denoted by $\sqcup$). The model has a single additional parameter vector $\mathbf{w} \in \mathbb{R}^d$,

$$r_{ij} = \mathbf{w}^T f_\theta(x_i \sqcup z_j).$$

3.1. Loss

Given a dataset $D = \{(x_1, \mathbf{y}_1), \ldots, (x_{|D|}, \mathbf{y}_{|D|})\}$, model parameters can be optimized by minimizing the negative log-likelihood

$$\mathcal{L}(D) = |D|^{-1} \sum_i \ell(i),$$

where

$$\ell(i) = -\sum_j \left( y_{ij} \log \sigma(r_{ij}) + (1 - y_{ij}) \log (1 - \sigma(r_{ij})) \right). \tag{3}$$

Because zero-shot approaches condition on class descriptions, computing the sum over each class in Equation (3) requires $|Y|$ forward passes through the model. This results in significant computational overhead when training. To alleviate this issue, the commonly used negative sampling [12] strategy is used to approximate the loss $\ell(\cdot)$,

$$\hat{\ell}(i) = -\log \frac{e^{r_{ij}}}{e^{r_{ij}} + \sum_{i'} e^{r_{i'j}} + \sum_{j'} e^{r_{ij'}}}, \tag{4}$$

where $i'$, $j$, $j'$ are uniformly sampled such that $y_{ij} = 1$ and $y_{i'j} = y_{ij'} = 0$. The numbers of negative documents $i'$ and negative classes $j'$ are treated as hyper-parameters. Initial experiments also explored a Bernoulli rather than a categorical version of $\hat{\ell}(\cdot)$ but found the categorical version performed better.

4. Experiments

Experiments are designed to simulate real-world taxonomy expansion driven by domain experts. At a high level, all experiments follow the same process.

1. Modify or remove classes to obtain the Source Taxonomy. Critically, this is done in a way that incorporates the underlying structure of the taxonomy to ensure coherent modifications, rather than simply removing classes at random.
2. Train classifiers using a dataset of instances labeled with classes from the Source Taxonomy.
3. Expand the Source Taxonomy by undoing the modifications from Step 1 to obtain the Target Taxonomy.
4. Evaluate classifiers on a new dataset of instances labeled with classes from the Target Taxonomy.

Details of the taxonomy, datasets, and expansion types used in this work are given below.

4.1. Indeed Occupations

Indeed's internal U.S. occupation taxonomy was used as a representative source of structured knowledge. The taxonomy contains over a thousand occupations arranged hierarchically in a forest-like directed acyclic graph (DAG), with root nodes being general occupations (e.g., Healthcare Occupations) and leaf nodes being the most specific (e.g., Nurse Practitioners). In addition to their placement within the hierarchy, occupations are also associated with a natural language name and definition. Data formats are given in Table 1.

Table 1
The data representations used in this work. Jobs and occupations are converted to strings composed of multiple fields.

Object      Text
Job         Title: , Employer: , Description:
Occupation  Name: , Definition:

Each job posted on Indeed is labeled with one or more occupations. The distribution of the number of labels per job in the evaluation data is given in Table 2.

Table 2
Test jobs by number of labels. Five jobs were sampled for each occupation for evaluation.

Labels  Jobs   Percent
1       2,527  55%
2       1,567  34%
3       393    9%
4       68     1%
Total   4,555  100%

Jobs were selected using stratified sampling by occupation. In particular, for each occupation, $N$ jobs labeled with that occupation were randomly sampled without replacement. It should be noted that since jobs can be labeled with multiple occupations, this sampling strategy only guarantees that datasets contain at least $N$ jobs per occupation, not that there are exactly $N$ jobs per occupation. The same procedure was used to sample disjoint subsets of jobs for training, validation, and testing.
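The stratified sampling procedure of Section 4.1 can be sketched as follows. This is an illustrative reconstruction, not the author's implementation: the function names, the dictionary-based data layout, and the toy job identifiers are all assumptions. It shows why multi-label jobs lead to at least $N$, rather than exactly $N$, jobs per occupation.

```python
import random


def stratified_sample(jobs, n_per_occupation, seed=0):
    """Sample up to n jobs per occupation without replacement.

    `jobs` maps a job id to the set of occupations it is labeled with.
    Returns the set of selected job ids.
    """
    rng = random.Random(seed)
    by_occupation = {}
    for job_id, occupations in jobs.items():
        for occ in occupations:
            by_occupation.setdefault(occ, []).append(job_id)

    selected = set()
    for occ in sorted(by_occupation):
        candidates = sorted(by_occupation[occ])
        k = min(n_per_occupation, len(candidates))
        selected.update(rng.sample(candidates, k))
    return selected


def jobs_per_occupation(jobs, selected):
    """Count how many selected jobs carry each occupation label."""
    counts = {}
    for job_id in selected:
        for occ in jobs[job_id]:
            counts[occ] = counts.get(occ, 0) + 1
    return counts


# Toy data (hypothetical): job "j2" carries two labels, so it can raise
# the count of an occupation it was not explicitly sampled for.
jobs = {
    "j1": {"nurse"},
    "j2": {"nurse", "midwife"},
    "j3": {"midwife"},
    "j4": {"nurse"},
    "j5": {"midwife"},
}
selected = stratified_sample(jobs, n_per_occupation=2)
counts = jobs_per_occupation(jobs, selected)
```

Each occupation ends up with a count of at least 2 here, and possibly 3 when the multi-label job is drawn for the other occupation.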
4.2. Expansion Operations

The two taxonomy expansion operations considered are described below and depicted graphically in Figure 2.

[Figure 2 diagram: train and test taxonomy trees for the Refine operation (top) and the Extend operation (bottom).]
Figure 2: Graphical representation of Refine (top) and Extend (bottom) taxonomy expansion operations. Each node represents a class. Models are evaluated on all classes. White and teal classes are observed during training. Magenta classes are not observed during training. Teal classes replace their children during training.

Refine: This setting simulates the scenario where a subset of leaf classes are subdivided into more fine-grained classes. This sort of refinement can occur when gaps in the taxonomy surface after use, or in situations when the set of initial classes naturally diversifies over time. For example, an academic field of study may subdivide into more specialized subfields as it matures. Zero-shot classifiers in this setting must generalize to classes that are more specific versions of those encountered during training.

To construct datasets in this setting, a random leaf class is selected. Any appearances of this class or its siblings are replaced with the parent class. This process is repeated until a fixed percentage of leaf classes have been replaced.

Extend: This setting simulates the scenario where a set of classes are added from an unrelated domain. This situation can occur when new use cases surface that require classes that were not previously necessary. For example, if an e-commerce company that had historically only sold goods like household items and clothing began offering groceries, the previous product taxonomy would not be useful for organizing the new items. Zero-shot classifiers in this setting must generalize to classes that are significantly different from those encountered during training.

To construct datasets in this setting, a random root class is selected. Any appearances of this class or its descendants are removed. This process is repeated until a fixed percentage of classes have been removed. At the end of the process, any document that no longer has any labels is removed from the training dataset.

4.3. Evaluation

Performance is evaluated in terms of a model's ability to rank relevant classes for a particular document, and to rank documents with respect to a class. In both cases, average precision (AP) is used to measure the quality of a predicted ordering relative to ground truth labels. The difference is whether AP is computed for all labels and averaged over documents, typically referred to as label ranking average precision (LRAP) [41], or computed for all documents and averaged over labels, typically referred to as macro-AP. Formally, for matrices $\mathbf{Y} \in \{0, 1\}^{|D| \times |Y|}$ of ground truth binary labels and $\mathbf{R} \in \mathbb{R}^{|D| \times |Y|}$ of predicted scores,

$$\text{LRAP} = |D|^{-1} \sum_i \text{AP}(\mathbf{Y}_{i,:}, \mathbf{R}_{i,:}),$$

$$\text{macro-AP} = |Y|^{-1} \sum_j \text{AP}(\mathbf{Y}_{:,j}, \mathbf{R}_{:,j}),$$

where for vectors $\mathbf{y} \in \{0, 1\}^d$ and $\mathbf{r} \in \mathbb{R}^d$

$$\text{AP}(\mathbf{y}, \mathbf{r}) = \frac{1}{\sum_i y_i} \sum_i y_i \frac{|\{k \mid y_k = 1 \wedge r_k \ge r_i\}|}{|\{k \mid r_k \ge r_i\}|}.$$

4.4. Training Details

Following modern practices in NLP, models consist of a pre-trained transformer [39] backbone which is fine-tuned [26, 27] along with any additional parameters. All models use BERT-base [27] as a backbone language model. Hyper-parameters were manually tuned on a small subset of the training data using the multi-label model and fixed for all models and experiments. The Adam [42] optimizer was used with a learning rate of 2e-5 for pre-trained parameters and 2e-4 for randomly initialized parameters. The learning rate was warmed up over the first 10% of the updates and then linearly decayed to zero. Gradients were clipped to a maximum norm of 10 [43]. All models are trained for 20 epochs with a batch size of 64. Models are evaluated after each epoch and the final model is selected based on the LRAP on the validation dataset. The bi-encoder and cross-encoder models were trained using negative sampling with 8 negative classes and 4 negative inputs per positive training document (Equation 4). Experiments utilized the PyTorch [44] and Huggingface Transformers [45] libraries. All hyper-parameters not listed explicitly above are left at their default values. Experiments were conducted using a single NVIDIA Tesla V100 GPU with 16GB of memory.
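The learning-rate schedule described in Section 4.4 (warm-up over the first 10% of updates, then linear decay to zero) can be sketched as a pure function. A linear warm-up shape, the function name, and the total of 1000 updates are assumptions made for illustration. Only the two peak rates (2e-5 and 2e-4) and the 10% warm-up fraction come from the text.

```python
def learning_rate(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Linear warm-up to peak_lr, followed by linear decay to zero."""
    warmup_steps = max(1, round(warmup_fraction * total_steps))
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


TOTAL_STEPS = 1000  # hypothetical number of updates, for illustration only

# Two parameter groups as in the text: pre-trained vs. randomly initialized.
schedule = {
    step: (
        learning_rate(step, TOTAL_STEPS, peak_lr=2e-5),
        learning_rate(step, TOTAL_STEPS, peak_lr=2e-4),
    )
    for step in (0, 50, 100, 550, 1000)
}
```

Both parameter groups follow the same shape and differ only in their peak value, so the randomly initialized parameters are updated with a rate ten times larger throughout training.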
Table 3
LRAP and macro-AP by model, class coverage, minimum documents per class, and number of training documents in the extend setting. Models denoted by † do not observe any task-specific training data.

Model                 Class Coverage  Documents Per Class  Documents  LRAP   macro-AP
Multi-Label           100%            3                    2733       0.569  0.496
Multi-Label           50%             5                    2500       0.294  0.249
Bi-Encoder            50%             5                    2500       0.362  0.349
Cross-Encoder         50%             5                    2500       0.645  0.553

Multi-Label           100%            4                    3614       0.638  0.564
Multi-Label           75%             5                    3628       0.493  0.438
Bi-Encoder            75%             5                    3628       0.480  0.447
Cross-Encoder         75%             5                    3628       0.654  0.590

Multi-Label           100%            5                    4555       0.697  0.635
Bi-Encoder            100%            5                    4555       0.570  0.521
Cross-Encoder         100%            5                    4555       0.682  0.613

Cross-Encoder (NSP)†  -               -                    -          0.419  0.242
TF-IDF†               -               -                    -          0.397  0.294

5. Results

5.1. Generalizing to Novel Classes

Performance was evaluated for different percentages of observed classes during training (coverage) for both the refine and extend expansion operations. LRAP and macro-AP are shown in Figure 3. The cross-encoder classifier was robust to both taxonomy refinement and extension. Minimal performance degradation was observed with decreasing coverage, even in settings where over 50% of the classes are new and approximately 60% of the jobs are labeled with a new occupation. The bi-encoder performed significantly worse than the cross-encoder. This observation is consistent with prior work in the retrieval domain [40, 46]. However, the bi-encoder also suffered more performance degradation with decreasing coverage. For example, per Table 3, the bi-encoder's LRAP dropped by 36% when 50% of the classes are new (extend), whereas the cross-encoder's decreased by only 5%; the corresponding macro-AP drops are 33% and 10%. Performance of the multi-label classifier degraded rapidly as coverage decreased, as it is unable to generalize to classes not observed during training.

[Figure 3 plots: LRAP (top) against the percent of jobs with observed occupations (40% to 100%), and macro-AP (bottom) against the percent of observed occupations (50% to 100%). Series: Multi-Label, Bi-Encoder, Cross-Encoder, each under Refine and Extend.]
Figure 3: LRAP (top) and macro-AP (bottom) under different taxonomy expansion operations. Models are identified by color and symbol. Line styles reflect the expansion operation, with dashed lines for refinement and solid lines for extension.

5.2. Learning on a Budget

Because the extend operation omits labels rather than relabeling them, zero-shot models had access to less training data in the previous experiments. To better understand the trade-off between fine-tuning and ZSL, experiments were conducted which controlled for the amount of data available for training. In particular, multi-label classifiers were trained on datasets where the number of documents was similar to that used for the ZSL approaches, but with fewer documents per class. Full results are presented in Table 3. The ZSL cross-encoder with 50% coverage and five documents per class resulted in a 13% relative increase in LRAP over the multi-label classifier with 100% coverage and three documents per class (similar training
set size). This result was unexpected, as it suggests that given a small document labeling budget (<4K here), in some settings it would be preferable to adopt ZSL and spend resources annotating more documents with an incomplete set of classes, rather than spreading the labeling budget uniformly over all classes and using traditional classifiers.

Further analysis of zero-shot performance is given in Table 4, which presents macro-AP by root class for unobserved classes in the extend setting with 50% coverage. Despite not being previously exposed to any classes from these domains, the cross-encoder outperformed the multi-label classifier explicitly trained on these classes in all but one domain (Healthcare).

Table 4
Zero-shot macro-AP for novel domains in the challenging extend scenario with 50% class coverage. † Because the multi-label classifier is not capable of zero-shot generalization, it is trained with 100% class coverage, but fewer documents per class.

Domain                                     Classes  Bi-Encoder  Cross-Encoder  Multi-Label†
Personal Service                           28       0.273       0.642          0.590
Food & Beverage                            25       0.245       0.619          0.555
Cleaning & Grounds Maintenance             25       0.277       0.584          0.563
Marketing, Advertising & Public Relations  28       0.241       0.579          0.541
Repair, Maintenance & Installation         34       0.276       0.533          0.494
Healthcare                                 156      0.250       0.532          0.584
Protective & Security                      27       0.302       0.529          0.509
Construction & Extraction                  54       0.265       0.527          0.465
Architecture & Engineering                 36       0.207       0.474          0.399
Sales, Retail & Customer Support           31       0.244       0.472          0.453
Supply Chain & Logistics                   32       0.243       0.457          0.435

New Classes                                478      0.251       0.534          0.523
Old Classes                                433      0.457       0.575          0.465
All Classes                                911      0.349       0.553          0.496

Training Documents                                  2500        2500           2733
Documents Per Class                                 5           5              3
Class Coverage                                      50%         50%            100%

5.3. Efficient Zero-Shot Inference

As noted previously, there is a significant computational cost associated with training the transformer-based zero-shot learners due to the need to process each label for each document. While this cost can be amortized for the bi-encoder at inference time by pre-computing label embeddings, this is not possible for the cross-encoder architecture. Several works explore the architecture space between bi-encoders and cross-encoders to obtain a better trade-off between performance and latency [40, 46]. A simpler technique was explored in this work, inspired by the common decomposition of recommender systems into separate candidate retrieval and re-ranking [47] phases.

In the first phase, the more efficient bi-encoder was used to identify a small subset of potentially relevant candidate classes. This smaller set of candidates was then evaluated with the more computationally demanding, but higher performance, cross-encoder. Classes not selected in the first phase were implicitly assumed to receive a score of zero. Results are shown in Figure 4 for candidate set sizes from 2 to 64. Scoring only 16 candidates reduced LRAP by about 2% while cutting computational overhead by nearly 98%.

[Figure 4 plot: LRAP against the number of candidates (2, 4, 8, 16, 32, 64). Series: Cross-Encoder, Bi-Encoder, Bi-Encoder + Cross-Encoder.]
Figure 4: LRAP for two-phase zero-shot classification for candidate set sizes from 2 to 64. Dashed lines depict the performance of standalone models.

6. Conclusion and Future Work

Taxonomies are widely used to organize knowledge and
can easily incorporate important information from domain experts that may be difficult to obtain in a purely automated fashion. However, the ability to associate real-world objects with classes can be a bottleneck for the rapid expansion of taxonomies. Experiments presented here demonstrate that modern zero-shot classification techniques can sidestep this issue by classifying objects with novel classes using only minimal human guidance.

Better understanding and overcoming the failure modes of the bi-encoder architecture would result in more efficient systems capable of scaling to larger taxonomies, either as stand-alone systems or as part of a multi-phase system such as that described in Section 5.3. Related work in the retrieval setting suggests that adopting pretext [48] tasks that are better aligned with the downstream task of interest could alleviate these issues [49]. Alternatively, more elaborate negative sampling strategies [50, 51] could improve both zero-shot techniques studied in this work, and close any observed gaps between zero-shot learners and traditional classifiers. Future work should explore zero-shot capabilities in more sophisticated knowledge bases (ontologies, knowledge graphs, etc.), a larger variety of class types, and different domains. Lastly, further experimentation is needed to fully explain observed differences between the results presented here and those in [34] in order to better understand the success and failure modes of entailment-based ZSL.

Acknowledgments

Valuable insights, suggestions, and feedback were provided by numerous individuals at Indeed. The author would especially like to thank Suyi Tu, Josh Levy, Ethan Handel, Arvi Sreenivasan, and Donal McMahon.

References

[1] L. Chiticariu, Y. Li, F. Reiss, Rule-based information extraction is dead! long live rule-based information extraction systems!, Empirical Methods in Natural Language Processing (2013).
[2] M. Kejriwal, R. Shao, P. Szekely, Expert-guided entity extraction using expressive rules, ACM SIGIR Conference on Research and Development in Information Retrieval (2019).
[3] S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, Empirical Methods in Natural Language Processing (2007).
[4] I. Karadeniz, A. Özgür, Linking entities through an ontology using word embeddings and syntactic re-ranking, BMC Bioinformatics (2019).
[5] T. Lee, Z. Wang, H. Wang, S.-w. Hwang, Attribute extraction and scoring: A probabilistic approach, IEEE International Conference on Data Engineering (2013).
[6] R. Ghani, K. Probst, Y. Liu, M. Krema, A. Fano, Text mining for product attribute extraction, ACM SIGKDD Explorations Newsletter (2006).
[7] H. Larochelle, D. Erhan, Y. Bengio, Zero-data learning of new tasks, AAAI Conference on Artificial Intelligence (2008).
[8] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic representation: Dataless classification, AAAI Conference on Artificial Intelligence (2008).
[9] Y. Xian, B. Schiele, Z. Akata, Zero-shot learning-the good, the bad and the ugly, IEEE Conference on Computer Vision and Pattern Recognition (2017).
[10] J. Chen, Y. Geng, Z. Chen, I. Horrocks, J. Z. Pan, H. Chen, Knowledge-aware zero-shot learning: Survey and perspective, Joint Conference on Artificial Intelligence (2021).
[11] R. Socher, M. Ganjoo, C. D. Manning, A. Ng, Zero-shot learning through cross-modal transfer, Advances in Neural Information Processing Systems (2013).
[12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems (2013).
[13] B. Romera-Paredes, P. Torr, An embarrassingly simple approach to zero-shot learning, International Conference on Machine Learning (2015).
[14] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, B. Schiele, Latent embeddings for zero-shot classification, IEEE Conference on Computer Vision and Pattern Recognition (2016).
[15] R. Qiao, L. Liu, C. Shen, A. Van Den Hengel, Less is more: zero-shot learning from online textual documents with noise suppression, IEEE Conference on Computer Vision and Pattern Recognition (2016).
[16] L. Zhang, T. Xiang, S. Gong, Learning a deep embedding model for zero-shot learning, IEEE Conference on Computer Vision and Pattern Recognition (2017).
[17] W.-L. Chao, S. Changpinyo, B. Gong, F. Sha, An empirical study and analysis of generalized zero-shot learning for object recognition in the wild, European Conference on Computer Vision (2016).
[18] S. Liu, M. Long, J. Wang, M. I. Jordan, Generalized zero-shot learning with deep calibration network, Advances in Neural Information Processing Systems (2018).
[19] F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, A review of generalized zero-shot learning methods, arXiv preprint arXiv:2011.08641 (2020).
[20] C. Li, W. Zhou, F. Ji, Y. Duan, H. Chen, A deep relevance model for zero-shot document filtering, Association for Computational Linguistics (2018).
[21] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009).
[22] X. Wei, W. B. Croft, LDA-based document models for ad-hoc retrieval, ACM SIGIR Conference on Research and Development in Information Retrieval (2006).
[23] P. K. Pushp, M. M. Srivastava, Train once, test anywhere: Zero-shot learning for text classification, arXiv preprint arXiv:1712.05972 (2017).
[24] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation (1997).
[25] Y. Kim, Convolutional neural networks for sentence classification, Empirical Methods in Natural Language Processing (2014).
[26] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, Association for Computational Linguistics (2018).
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, North American Chapter of the Association for Computational Linguistics (2019).
[28] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, Empirical Methods in Natural Language Processing (2019).
[29] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, North American Chapter of the Association for Computational Linguistics (2018).
[30] K. Halder, A. Akbik, J. Krapac, R. Vollgraf, Task-aware representation of sentences for generic text classification, International Conference on Computational Linguistics (2020).
[31] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, Conference of the European Chapter of the Association for Computational Linguistics (2021).
[32] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog (2019).
[33] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (2020).
[34] T. Ma, J.-G. Yao, C.-Y. Lin, T. Zhao, Issues with entailment-based zero-shot text classification, Association for Computational Linguistics (2021).
[35] S. Feng, E. Wallace, J. Boyd-Graber, Misleading failures of partial-input baselines, Association for Computational Linguistics (2019).
[36] T. Niven, H.-Y. Kao, Probing neural network comprehension of natural language arguments, Association for Computational Linguistics (2019).
[37] I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malakasiotis, N. Aletras, I. Androutsopoulos, An empirical study on large-scale multi-label text classification including few and zero-shot labels, Empirical Methods in Natural Language Processing (2020).
[38] K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[40] S. Humeau, K. Shuster, M.-A. Lachaux, J. Weston, Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring, International Conference on Learning Representations (2019).
[41] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, Data Mining and Knowledge Discovery Handbook (2009).
[42] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (2015).
[43] J. Zhang, T. He, S. Sra, A. Jadbabaie, Why gradient clipping accelerates training: A theoretical justification for adaptivity, International Conference on Learning Representations (2019).
[44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems (2019).
[45] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, Empirical Methods in Natural Language Processing: System Demonstrations (2020).
[46] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, ACM SIGIR Conference on Research and Development in Information Retrieval (2020).
[47] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, ACM Conference on Recommender Systems (2016).
[48] L. Jing, Y. Tian, Self-supervised visual feature learning with deep neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2020).
[49] W.-C. Chang, F. X. Yu, Y.-W. Chang, Y. Yang, S. Kumar, Pre-training tasks for embedding-based large-scale retrieval, International Conference on Learning Representations (2019).
[50] J. Weston, S. Bengio, N. Usunier, WSABIE: Scaling up to large vocabulary image annotation, Joint Conference on Artificial Intelligence (2011).
[51] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, ACM SIGIR Conference on Research and Development in Information Retrieval (2021).