<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Optimal Embeddings to Learn New Intents with Few Examples: An Application in the Insurance Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shailesh Acharya</string-name>
          <email>sachary1@amfam.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Glenn Fung</string-name>
          <email>gfung@amfam.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>American Family Insurance, Machine Learning Research Group</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>The ubiquitous adoption of Conversational Agents (CA) in commercial settings is changing the way industries interact with their customers. Intent classification is an important first step in designing an efficient CA. Every intent that the CA can recognize is represented by a set of natural language examples that the system uses to learn how to map any user's utterance to the corresponding intent. However, when a new intent is introduced, there are usually not enough examples to train the intent appropriately. In this paper we propose a hybrid system that combines a traditional Deep Neural Network-based classification approach with few-shot learning strategies. The simple yet effective proposed approach achieves good performance for newly introduced intents with few training examples while maintaining performance for previously known intents. We show the potential of the proposed approach on data generated by a deployed chat system for the insurance domain. To demonstrate that the proposed approach can generalize to other domains, we also perform experiments on a publicly available dataset, where we obtain similar results that support the approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Dataset</kwd>
        <kwd>Fewshot learning</kwd>
        <kwd>Embedding</kwd>
        <kwd>Chatbot</kwd>
        <kwd>Intent Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Natural language processing;
Learning latent representations.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Conversational agents and chatbots are becoming increasingly
popular in industries that interact frequently with customers. A
successful conversational agent can alleviate the burden on
customer representatives by understanding the customer’s query
(often presented in natural language) and guiding the user towards
a solution. Intent classification is an important first step in
designing an intelligent chatbot. It allows the chatbot to understand the
customer’s intent and drive the conversation. For example, if a
customer of an insurance company asks “What is the minimum liability
limit for the state of Wisconsin?”, then the general intent or superclass
that generalizes this query or utterance can be something like
AUTO_COVERAGE_LIABILITY_LIMIT. The chatbot should be able
to identify the intent and retrieve the right answer from the knowledge
base.</p>
      <p>Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).</p>
      <p>KDD Converse’20, August 2020. © 2020 Copyright held by the owner/author(s).</p>
      <p>These intents are defined by the business and added or modified
according to business needs. Every time a new product or service is
launched, new intents relating to that service are added to the
chatbot. For instance, if a new discount scheme is introduced for
usage-based insurance (UBI), then an associated intent for the
chatbot can be UBI_ELIGIBILITY. There could be more than one intent
associated with the new service or offering, depending on its scope or
complexity. Often, the newly added intents are of
significant importance to the business for two reasons: first, they may be
associated with a newly introduced “hot” service or offering
and hence are much more likely to be queried or solicited by customers;
second, there may be a greater business drive to upsell this
newly introduced service. For these reasons, it is very important
for the chatbot to do well in the intent category or categories relating
to this service.</p>
      <p>An underlying problem with newly added intents is the lack
of enough training examples to train an accurate intent classifier,
since there is no past interaction with customers regarding the
topic. Normally, a subject matter expert comes up with different
ways of asking questions about the service in order to gather
several training samples for that intent class. Such a collection of
training examples is limited in number and variation. This
presents a challenge to the intent classifier: despite
having few training examples to begin with, the likelihood of
receiving queries related to this intent could be high. Furthermore,
the cost of misclassifying examples belonging to this new intent
class could be much higher for the business.</p>
      <p>Intent classification is usually formulated as a multiclass
(sometimes multilabel) classification problem, and
deep-neural-network-based classifiers are a very popular choice for it. A well-known trait of
deep neural network (NN) classifiers is that they need
large amounts of labeled training data to provide satisfactory
classification performance; they generally perform poorly on categories
with few training examples. The ability of deep NNs to
extract complex statistics and learn high-level features from vast
datasets is proven. However, most current deep learning approaches
have poor sample efficiency in contrast to human perception: even
a child could recognise a bird after seeing only a few bird
pictures.</p>
      <p>
        There is abundant recent work on few-shot learning [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], an
area of machine learning dedicated to solving problems with very
few examples per class (but a large number of class labels).
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        The main idea behind few-shot learning is to efficiently
combine meta-learning, which uses prior knowledge to define learning
strategies, metric learning, which aims to learn semantic
embeddings using a distance loss function, and data augmentation (by
synthesizing data) to facilitate the learning process using fewer
labeled examples. Many few-shot learning models such as
Prototypical Networks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Facenet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Matching Network [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] use
optimized embedding transformations and distance-based learning
to embed examples belonging to the same class close to each other
in a newly generated embedding space. This approach has proven to
be more effective than a regular fully connected NN classifier when the
dataset has a large number of classes but few examples per class.
      </p>
      <p>
        Few-shot approaches have also been successfully applied to the intent
classification domain [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The new intent classification problem described above is an
interesting combination of traditional learning and few-shot learning:
we have a large number of intent classes with a sufficiently
large number of examples collected from past interactions with
customers, but we also have newly added intents with very few
examples. Recent work aims to address the problem of learning
new intents with few examples. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the authors propose
combining data augmentation with few-shot
learning, with modest results, while in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] only the effect of few-shot
learning on the same task is studied. However, in contrast with our
work, these two recent papers do not study how to incorporate the
newly learned intents into the existing intent classification system,
as described next.
      </p>
      <p>These ideas inspire a simple but effective approach: combine
the strength and low data dependence of distance-based learning
with a regular industry-standard NN classifier trained on classes
(intents) for which we have enough data, to build an ensemble
classifier that improves performance on recently added
intent categories while maintaining performance on the already
existing ones. This approach plays well with currently deployed
CA implementations, making it very practical in most industrial
settings.</p>
    </sec>
    <sec id="sec-3">
      <title>PRELIMINARIES</title>
    </sec>
    <sec id="sec-4">
      <title>Triplet Network</title>
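      <p>
        Facenet introduced an image embedding model that encodes face
images as vectors in a Euclidean space where distances directly
correspond to facial similarities [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The model consists of a
weight-sharing triplet network trained to minimize a loss function based
on large margin nearest neighbours [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A triplet consists of
two examples <inline-formula><tex-math>x_1, x_2</tex-math></inline-formula> from the same class, which form the anchor-positive pair,
and one example <inline-formula><tex-math>x_n</tex-math></inline-formula> from a different class,
known as the “negative”. Triplet loss minimizes the distance between the
anchor and the positive while maximizing the distance between the anchor
and the negative example. This idea is illustrated in Figure 1.
The objective function to minimize is:
        <disp-formula>
          <label>(1)</label>
          <tex-math>\sum_{i=1}^{N} \Big[ \lVert f_\theta(x_1^{(i)}) - f_\theta(x_2^{(i)}) \rVert_2^2 - \lVert f_\theta(x_1^{(i)}) - f_\theta(x_n^{(i)}) \rVert_2^2 + \alpha \Big]_+</tex-math>
        </disp-formula>
where <inline-formula><tex-math>f_\theta</tex-math></inline-formula> is the encoder, <inline-formula><tex-math>[z]_+ = \max(z, 0)</tex-math></inline-formula>, <inline-formula><tex-math>\alpha</tex-math></inline-formula> is a margin enforced between positive and negative pairs,
and <inline-formula><tex-math>N</tex-math></inline-formula> is the number of triplets in the training set.
      </p>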
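      <p>As a rough illustration of the triplet objective, the following minimal sketch computes a hinge-style triplet loss on already-embedded points using only the Python standard library. The function names, the toy vectors and the margin value of 0.2 are assumptions made for this example; they are not taken from the experiments in this paper.</p>
      <preformat><![CDATA[
```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, margin=0.2):
    """Sum of hinge triplet losses over (anchor, positive, negative) embeddings.

    `margin` plays the role of alpha; 0.2 is an arbitrary illustrative value.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        d_pos = squared_distance(anchor, positive)
        d_neg = squared_distance(anchor, negative)
        # A triplet contributes only while the negative is not yet at least
        # `margin` farther from the anchor than the positive.
        total += max(d_pos - d_neg + margin, 0.0)
    return total


# Toy embeddings (hypothetical): the first triplet is already well separated,
# the second has a negative almost as close to the anchor as the positive.
easy = ([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
hard = ([1.0, 0.0], [1.0, 0.5], [1.0, 0.6])
loss = triplet_loss([easy, hard])
print(loss)
```
]]></preformat>
      <p>In this toy example only the second triplet contributes to the loss, which is the behaviour the margin is meant to produce: well-separated triplets stop influencing training.</p>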
      <p>
        One key advantage of this approach is its ability to learn from
very few examples per class, provided that there is a sufficiently large
number of classes. Instead of classifying images into different
categories, the Facenet model learns the notion of similarity between
faces. For downstream tasks, the encoder model can be used
to encode images into embedding vectors for tasks such as
clustering, nearest-neighbor-based classification, ranking, etc. The general
approach used to train Facenet can be easily applied in any other
domain, including text. For the intent classification problem, we can
instantiate our problem as a triplet network formulation. Instead
of the encoder used in Facenet, we include our own text encoder
by adding fully connected layers on top of the Universal Sentence
Encoder [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The loss formulation in equation 1 and the training
procedure we used, however, are akin to the ones described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Our proposed model learns a transformation that maps sentences
belonging to the same intent close to one another in the Euclidean
space and far apart from sentences belonging to different
intent classes.
      </p>
      <p>Prototypical networks learn a non-linear mapping of the input
feature space into an embedding space and define each class’s prototype
<inline-formula><tex-math>c_k \in \mathbb{R}^M</tex-math></inline-formula> to be the mean of the embedded support points belonging
to the class <inline-formula><tex-math>k</tex-math></inline-formula>:
        <disp-formula>
          <label>(2)</label>
          <tex-math>c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i)</tex-math>
        </disp-formula>
where <inline-formula><tex-math>S_k = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}</tex-math></inline-formula> is the set of support points
belonging to class <inline-formula><tex-math>k</tex-math></inline-formula> and <inline-formula><tex-math>f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}^M</tex-math></inline-formula> is a non-linear
transformation with trainable parameters. For a query point <inline-formula><tex-math>x</tex-math></inline-formula>, prototypical
networks produce a distribution over classes based on a softmax
over distances to the prototypes in the embedding space:
        <disp-formula>
          <label>(3)</label>
          <tex-math>p_\theta(y = k \mid x) = \frac{\exp(-d(f_\theta(x), c_k))}{\sum_{k'} \exp(-d(f_\theta(x), c_{k'}))}</tex-math>
        </disp-formula>
The corresponding loss function is then derived as the negative
log-probability <inline-formula><tex-math>-\log p_\theta(y = k \mid x)</tex-math></inline-formula> of the true class <inline-formula><tex-math>k</tex-math></inline-formula>.</p>
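      <p>The prototype and softmax computations can be sketched in a few lines of standard-library Python. This is only an illustration on pre-computed embeddings: the class labels, the two-dimensional vectors and the choice of squared Euclidean distance for the distance function are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import math


def prototype(embeddings):
    """Mean of a list of equal-length embedding vectors (one class prototype)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def squared_distance(u, v):
    """Squared Euclidean distance, assumed here as the distance function d."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def class_probabilities(query, prototypes):
    """Softmax over negative distances from `query` to every class prototype."""
    logits = {k: -squared_distance(query, c) for k, c in prototypes.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


# Hypothetical embedded support points for two intent classes.
support = {
    "billing": [[0.0, 0.0], [0.0, 2.0]],
    "claims": [[4.0, 0.0], [4.0, 2.0]],
}
prototypes = {k: prototype(v) for k, v in support.items()}
probs = class_probabilities([1.0, 1.0], prototypes)
loss = -math.log(probs["billing"])  # negative log-probability of the true class
print(prototypes, probs, loss)
```
]]></preformat>
      <p>The query point lies much closer to the "billing" prototype, so almost all of the probability mass goes to that class and the loss is close to zero.</p>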
      <p>
        Similar to triplet networks, we can interpret prototypical
networks as a combination of an encoder model and a loss formulation
that ensures similar examples cluster in a tight formation. Thus, we
can use a similar approach to frame our intent classification
problem in the prototypical network framework. To
do this, we replace the encoder with our text encoder (USE followed
by two FC layers) and follow a training procedure identical to
the one described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, we needed to make some
modifications to the inference procedure to fit our specific use case.
We describe our modified inference algorithm in more detail in the
next section.
      </p>
      <p>The key aspect of both the triplet network and the prototypical network
is the encoder function <inline-formula><tex-math>f_\theta : V_1 \rightarrow V_2</tex-math></inline-formula> that transforms input from one
vector space (<inline-formula><tex-math>V_1</tex-math></inline-formula>) to another (<inline-formula><tex-math>V_2</tex-math></inline-formula>) based on minimizing a distance-based loss.
As a result, each class forms a tight cluster separated
from the others in this new vector space, as seen in Figure 2. One
key strength of this approach compared to a regular Deep Neural
Network (DNN) multiclass classifier is its ability to learn from very
few examples per class, provided that there is a sufficiently large
number of classes. DNN classifiers, while generally performing well on
classification tasks with big datasets, perform poorly on categories
with few examples. This is the behaviour we observe in our
experiments as well. In general, intent classification datasets are highly
unbalanced, with some intents having a large number of examples
while some newly added intents have very few. To
combine the strength of distance-based learning (triplet or
prototypical network) with the practicality and accuracy of regular
deep NN multiclass classifiers, we propose a hybrid approach that
combines both models to build a classifier that is more robust to
less frequent intent classes.</p>
      <p>The Amfam chatbot intent dataset (ACID) contains 174 unique
intents related to topics driven by customers contacting an
insurance company. Each intent represents a particular course of
action for the chatbot. For example, if a customer says “I got a new
cell number and need to update my account info”, then the utterance belongs
to the intent class INFO_UPDATE_PHONE_NUM. The intent
prediction dataset was collected from past interactions of customers
with our service representatives at American Family Insurance.
Subject matter experts created the different intent categories and
handpicked examples belonging to each intent category. The dataset
contains 175 unique intents. It is split into a training set that
contains a total of 11,130 examples and a test set that contains a total of
11,042 examples. The distribution across the intent classes is highly
skewed, with the smallest intent class containing 10 examples and
the biggest class containing 378 examples in the training set. The
dataset will be released to the public and can be downloaded from:
https://github.com/AmFamMLTeam/ACID</p>
    </sec>
    <sec id="sec-5">
      <title>CLINC150 dataset</title>
      <p>
        CLINC150 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is another publicly available intent classification
dataset that contains a large collection of in-scope and out-of-scope
intents and examples from different domains. The dataset has
several variants: Full, Small, Imbalanced and OOS+. Depending on the
variant, it has a different train/test split. For our experiments, we take
the Full version, which contains 150 "in-scope" intent classes, each
with 100 examples in train, 30 examples in test and 20 in validation.
The dataset also contains out-of-scope examples ("out-of-domain"
or "out-of-distribution"), which we do not include in our
experiments. The dataset contains intents from different domains related
to travel, expenses, reminders, fees, and several intents associated
with smart assistants such as joke, play_music, etc.
      </p>
    </sec>
    <sec id="sec-6">
      <title>MODEL AND ALGORITHM</title>
    </sec>
    <sec id="sec-7">
      <title>Model</title>
      <p>
        We used a pretrained Universal Sentence Encoder (USE) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as a
base encoder. USE projects input text into a 512 dimensional vector
space. We treat this base encoder as a preprocessing step and all
subsequent models use these sentence embeddings as their input.
      </p>
      <p>We created a baseline model by adding two fully connected
layers to the output from USE. In the last layer of this network we
incorporated a softmax activation function to produce a probability
distribution over the intent classes.</p>
      <p>Similarly, we also created both a triplet and a prototypical
network by adding two fully connected layers to the output from the
USE network. We followed the same training process described
in the original papers for these two networks. During inference,
however, we use a different strategy, because we are only
interested in the final embedding vectors these networks produce,
which we use to perform our own distance-based classification.
The number of layers, the dimension of the layers
and other network parameters for all three networks were
determined by hyperparameter search. Details of those hyperparameters
are provided in the appendix.</p>
      <p>It is important to note that our goal is not to compare the
baseline model against the triplet network or the prototypical
network. Instead, our goal is to compare the performance of the
baseline model against an ensemble model (baseline combined with
triplet/prototypical) on the newly added intent classes. We use the
ensemble model during inference by combining predictions from
the baseline model and the corresponding triplet/prototypical
network. The underlying idea of this ensemble is very simple: we use
the embedding model (triplet/prototypical) to determine whether the
incoming test example belongs to the newly added intent class;
if not, we pass it through the baseline classifier to get the
corresponding class label. We describe the inference strategy in
more detail in Algorithm 1 and Algorithm 2. For simplicity, we
use the term "embedding model" to refer to the triplet/prototypical
networks in the following sections.</p>
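      <p>Algorithm</p>
      <p>4.2.1 Adding a new intent. To simulate a scenario where a new
intent class is added with few available examples, we randomly
select one intent class <inline-formula><tex-math>c</tex-math></inline-formula> from the dataset and retain only 13 examples
(randomly chosen) for that class in the training set. The number 13
is chosen considering the minimum number of examples required to train the
prototypical network with the settings we used. For the ACID dataset,
we choose <inline-formula><tex-math>c</tex-math></inline-formula> from the subset of intent classes with at least 80
examples, because the ACID dataset is highly unbalanced. This guarantees
that the test set is big enough to have a diverse distribution of
samples and correctly simulates the real-world scenario in which new intent
categories have few examples to begin with, but the queries
coming from customers during inference are much more diverse. Inference
is done in two stages. First, we use the embedding model to decide
whether an example belongs to the selected intent class <inline-formula><tex-math>c</tex-math></inline-formula>. This is done
by comparing the average distance between the test example and
all training examples belonging to class <inline-formula><tex-math>c</tex-math></inline-formula> with a predetermined
threshold. If the average Euclidean distance is greater than the
predetermined threshold, we pass the example through the baseline NN
classifier to get the class label. We use the following algorithm
during inference to score a new example:</p>
      <p>Algorithm 1 Inference Single:</p>
      <list list-type="order">
        <list-item><p>Calculate the threshold <inline-formula><tex-math>\tau</tex-math></inline-formula>.</p></list-item>
        <list-item><p>Transform the input <inline-formula><tex-math>x</tex-math></inline-formula> to an embedding vector, <inline-formula><tex-math>e = f_\theta(x)</tex-math></inline-formula>.</p></list-item>
        <list-item><p>Calculate the distance between <inline-formula><tex-math>e</tex-math></inline-formula> and every training set example belonging to intent class <inline-formula><tex-math>c</tex-math></inline-formula>, <inline-formula><tex-math>d_i = \mathrm{dist}(e, e_i)</tex-math></inline-formula> for <inline-formula><tex-math>e_i</tex-math></inline-formula> in class <inline-formula><tex-math>c</tex-math></inline-formula>, followed by the average distance <inline-formula><tex-math>\bar{d} = \mathrm{mean}(d_i)</tex-math></inline-formula>.</p></list-item>
        <list-item><p>If <inline-formula><tex-math>\bar{d} &lt; \tau</tex-math></inline-formula>, assign the example to class <inline-formula><tex-math>c</tex-math></inline-formula>.</p></list-item>
        <list-item><p>If <inline-formula><tex-math>\bar{d} &gt; \tau</tex-math></inline-formula>, pass the example through the baseline classifier to get the class label.</p></list-item>
      </list>
      <p>4.2.2 Threshold. The distance threshold <inline-formula><tex-math>\tau</tex-math></inline-formula> is an important
parameter calculated from the intra-class distance matrix for the intent
class <inline-formula><tex-math>c</tex-math></inline-formula> in both the training and the validation set. We choose 10
equally spaced points between the median and the maximum of the distance
matrix as candidates for <inline-formula><tex-math>\tau</tex-math></inline-formula> and select the value that maximizes some
score on the validation set. This score is generally defined based
on the relative business value of the newly added intent class. For
example, if the gain for correctly predicting examples in the newly
added intent class is <inline-formula><tex-math>w</tex-math></inline-formula> times higher than the gain of correctly
predicting any other class, then a simple formulation for the score is
        <disp-formula>
          <tex-math>\text{score} = w \times (\text{number of examples correctly classified to } c) + (\text{number of examples correctly classified to any other class})</tex-math>
        </disp-formula>
We defined a more
intuitive normalized score as the ratio of this score over the total
possible score in the test set, as follows:
        <disp-formula>
          <label>(4)</label>
          <tex-math>\text{normalized score} = \frac{\text{score}}{\text{total possible score}}, \qquad \text{total possible score} = w \times (\#\text{ examples in } c) + (\#\text{ examples in any other class})</tex-math>
        </disp-formula>
When <inline-formula><tex-math>w = 1</tex-math></inline-formula>, the normalized score equals the accuracy, and thus the grid
search returns the value of <inline-formula><tex-math>\tau</tex-math></inline-formula> that maximizes the validation accuracy.</p>
      <p>4.2.3 Extending to multiple intents. We also simulated a scenario
in which multiple new intents are added to the classifier. Instead of
selecting a single intent, we randomly selected five intents <inline-formula><tex-math>\{c_1, c_2, c_3, c_4, c_5\}</tex-math></inline-formula>
using the same criteria described in the previous section. The
training procedure remains the same, but we slightly modified the
inference algorithm to incorporate multiple "one vs. rest" classifiers.
Details are described in Algorithm 2 below.</p>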
      <p>Algorithm 2 Inference Multiple:</p>
      <preformat>
Calculate the best combination of distance thresholds {τ_1, τ_2, .., τ_5}
    for classes {c_1, c_2, .., c_5}
Transform input x to embedding vector, e = f_θ(x)
for k ← {1, 2, 3, 4, 5} do
    Calculate distance between e and all training examples
        belonging to intent class c_k; d_i = dist(e, e_i), e_i in class c_k,
        followed by average distance d̄_k = mean(d_i)
    If d̄_k &lt; τ_k, add c_k to candidate classes
end for
if there are one or more candidate classes then
    Assign example to the c_k from the candidate classes with minimum
        ratio of d̄_k / τ_k
else
    Pass the example through baseline classifier to get the class label
end if
      </preformat>
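      <p>The two-stage inference can be sketched as follows. This is a minimal standard-library illustration operating on embeddings that are assumed to have been computed already; the function names, the toy vectors, the threshold values, the NEW_DISCOUNT and EXISTING_INTENT labels and the stand-in baseline classifier are all hypothetical. With a single entry in <monospace>new_classes</monospace> the routine reduces to the single-intent case.</p>
      <preformat><![CDATA[
```python
import math


def mean_distance(embedding, class_embeddings):
    """Average Euclidean distance from `embedding` to a class's training embeddings."""
    total = sum(math.dist(embedding, e) for e in class_embeddings)
    return total / len(class_embeddings)


def predict(embedding, new_classes, thresholds, baseline_predict):
    """Route one embedded example to a new intent or to the baseline classifier.

    new_classes: dict mapping label -> training embeddings of that new intent
    thresholds: dict mapping label -> distance threshold for that intent
    baseline_predict: callable standing in for the baseline NN classifier
    """
    candidates = {}
    for label, examples in new_classes.items():
        d_avg = mean_distance(embedding, examples)
        if thresholds[label] > d_avg:
            # Store the distance-to-threshold ratio used to break ties.
            candidates[label] = d_avg / thresholds[label]
    if candidates:
        return min(candidates, key=candidates.get)
    return baseline_predict(embedding)


# Hypothetical embedded training examples for two newly added intents.
new_classes = {
    "UBI_ELIGIBILITY": [[0.0, 0.0], [0.0, 1.0]],
    "NEW_DISCOUNT": [[5.0, 5.0], [5.0, 6.0]],
}
thresholds = {"UBI_ELIGIBILITY": 1.5, "NEW_DISCOUNT": 1.0}


def baseline(embedding):
    """Placeholder for the baseline classifier's prediction."""
    return "EXISTING_INTENT"


print(predict([0.0, 0.5], new_classes, thresholds, baseline))
print(predict([10.0, 10.0], new_classes, thresholds, baseline))
```
]]></preformat>
      <p>Keeping the baseline behind a plain callable mirrors the design described above: the embedding model only decides whether an example belongs to a new intent, and everything else is delegated to the existing classifier unchanged.</p>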
      <p>The threshold selection follows a procedure similar to the
one described for Algorithm 1, except that the grid search evaluates
the score for all possible combinations of thresholds. Each selected intent
can have a different threshold <inline-formula><tex-math>\tau_k</tex-math></inline-formula>, so the objective is to find the combination
<inline-formula><tex-math>(\tau_1, \tau_2, \ldots, \tau_5)</tex-math></inline-formula> that gives the highest score on the validation set. The
complexity of this optimization grows exponentially with the number
of selected intents. Using the same candidate selection strategy as
described for a single threshold, there is a total of <inline-formula><tex-math>10^5</tex-math></inline-formula> combinations to choose from.
A flowchart illustrating Algorithm 2 and the threshold selection
strategy explained above is shown in Figure 3.</p>
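      <p>For a single new intent, the candidate generation, the normalized score and the grid search can be sketched as below. This is an illustrative standard-library sketch under simplifying assumptions: the function names and toy data are invented for the example, and the toy <monospace>predict_with</monospace> function labels every example that is not assigned to the new class as "old" instead of calling a real baseline classifier.</p>
      <preformat><![CDATA[
```python
import statistics


def threshold_candidates(intra_class_distances, n_points=10):
    """Equally spaced thresholds from the median to the maximum distance."""
    lo = statistics.median(intra_class_distances)
    hi = max(intra_class_distances)
    step = (hi - lo) / (n_points - 1)
    return [lo + i * step for i in range(n_points)]


def normalized_score(y_true, y_pred, new_class, w=1.0):
    """Weighted count of correct predictions divided by the best possible score."""
    score = total = 0.0
    for truth, guess in zip(y_true, y_pred):
        gain = w if truth == new_class else 1.0
        total += gain
        if truth == guess:
            score += gain
    return score / total


def best_threshold(candidates, predict_with, y_true, new_class, w=1.0):
    """Pick the candidate whose validation predictions maximize the score.

    predict_with(tau) must return the list of predictions for threshold tau.
    """
    def score_for(tau):
        return normalized_score(y_true, predict_with(tau), new_class, w)

    return max(candidates, key=score_for)


# Toy validation set: average distance of each example to the new class.
y_true = ["new", "new", "old", "old"]
val_dists = [0.5, 1.5, 3.0, 4.0]


def predict_with(tau):
    """Toy stand-in for ensemble inference at threshold tau."""
    return ["new" if tau > d else "old" for d in val_dists]


best = best_threshold([1.0, 2.0, 3.5], predict_with, y_true, "new", w=3.0)
print(best)
```
]]></preformat>
      <p>Extending this to several new intents would mean scoring every combination of per-class candidates, for example by iterating over their Cartesian product, which is where the exponential growth in the number of combinations comes from.</p>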
    </sec>
    <sec id="sec-8">
      <title>EMPIRICAL RESULTS AND ANALYSIS</title>
      <p>For the rest of the paper, Baseline refers to the baseline model
described in subsection 4.1: a shallow neural network with two fully connected layers
that takes the output of the Universal Sentence Encoder (USE)
as input, followed by a softmax activation function to produce a
probability distribution over all the intent classes.
B+T denotes the ensemble of the baseline classifier
and the triplet network, and B+P denotes the ensemble of the
baseline and the prototypical network, using the inference strategies
described in Algorithms 1 and 2.</p>
      <p>Our goal for the ensemble is to boost performance on the newly
added intent class. We want to correctly classify more examples in
the newly added intent class (i.e., improve the recall of that class), but at
the same time it is important that the ensemble does not negatively
affect the overall test accuracy. That is why we report recall on the
newly added class as well as overall test accuracy. Table 1 shows the
recall of different classifiers on the selected new intent class on the
ACID dataset. The table shows that for the majority of the runs, the
Baseline multiclass classifier has below-average performance on
the selected new intent class compared to the overall accuracy on
the rest of the test set. It is evident that a traditional DNN multiclass
classifier performs better on intents (classes) with a larger number
of examples and tends to perform poorly on new intent classes
with few examples. For example, for run 6 the Baseline model
correctly predicts 23% of the examples from the selected intent class,
but the overall test accuracy for that run is much higher at 87.7%.</p>
      <p>From Tables 1 and 2, we also see that the ensemble gives a boost
in performance for the selected intent class. We see this
improvement on all 10 runs, and it is very significant in
some of them. However, it is important to ensure that higher
recall on the selected intent class does not lead to reduced overall test
accuracy due to false positives in that class. The desired outcome
is therefore an increase in recall on that class without affecting the overall test
accuracy. For example, for run 1 in Table 1 the recall of the selected
intent class improved from 0.409 with the baseline to 0.909 with B+T,
and the overall test accuracy also improved from 88.1% to 88.5%. We
can see that the overall test accuracy either improves marginally
or remains unchanged in all the runs, while the improvement on the
selected intent class is significant, which is often the desired outcome
for newly added intents.</p>
      <p>Tables 3 and 4 present results for five random experiments
representing a scenario where five new intents <inline-formula><tex-math>\{c_1, c_2, \ldots, c_5\}</tex-math></inline-formula> are added
to the CA. We observe improvement in both the recall score and
the overall test accuracy for all five new intent classes with our
proposed ensemble classifier. These improvements are consistent across
all five runs. The choice of adding five new intents was based
on preliminary experiments, which showed that beyond five
intents it was hard to gain improvement in recall without affecting
the test accuracy.</p>
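      <p>The results presented in Tables 1-4 are special cases of our
proposed approach in which the distance threshold <inline-formula><tex-math>\tau</tex-math></inline-formula> of the ensemble
is tuned to optimize test accuracy. As described in the previous
section, the distance threshold <inline-formula><tex-math>\tau</tex-math></inline-formula> can be optimized to maximize a
normalized score (equation 4) in order to give more importance to the
newly added intents <inline-formula><tex-math>\{c_1, \ldots, c_5\}</tex-math></inline-formula>.</p>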
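      <p>The importance associated with correctly classifying examples
into one of these five classes can be represented by the weighting
factor <inline-formula><tex-math>w</tex-math></inline-formula>, which is often determined according to business needs. A
higher value of <inline-formula><tex-math>w</tex-math></inline-formula> (<inline-formula><tex-math>w &gt; 1</tex-math></inline-formula>) indicates that the new intent class is more
important than the existing intents.</p>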
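      <p>To illustrate this, Table 5 reports the normalized score for three
different values of <inline-formula><tex-math>w</tex-math></inline-formula> for the five runs corresponding to Table 3 (ACID
dataset). At <inline-formula><tex-math>w = 1</tex-math></inline-formula>, the normalized score is equivalent to accuracy
and equals the result from Table 3. As the value of <inline-formula><tex-math>w</tex-math></inline-formula> increases, the
boost in performance resulting from the ensemble becomes more
significant. For instance, at <inline-formula><tex-math>w = 3</tex-math></inline-formula> we see an improvement of 3% to 5% in the
normalized score across all five runs.</p>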
      <p>Tables 2 and 4 show results for the CLINC150 dataset for the
same set of experiments run on the ACID dataset in Tables 1 and 3,
respectively. Both Table 2 and Table 4 show that recall on the newly added intent
class(es) increases for all the runs and that the overall test accuracy
either improves or remains unchanged. This result is consistent
with the results on the ACID dataset, suggesting that the
approach can be applied to other intent classification domains to address
the problem of adding "new intent" and/or "prioritizing new intent"
categories with few examples.</p>
    </sec>
    <sec id="sec-9">
      <title>6 CONCLUSIONS AND FUTURE WORK</title>
      <p>We show that a simple but effective combination of an embedding
transformation model and a standard neural network multiclass
classifier can achieve significant improvements in performance on
newly added intents for which training examples are scarce, while
maintaining and sometimes improving the overall intent classifier
performance. This ensemble approach can be easily optimized to
maximize performance on a subset of intent classes that are deemed
important by business needs. Results on a publicly available dataset
that contains intents from different industries demonstrate that the
proposed approach can be easily extended to other domains.</p>
      <p>Our approach is easy to implement and could be easily combined
or appended to current deployed CA implementations making it
very practical in most industrial settings.</p>
      <p>
        As future work, we would like to explore creating a system that
can automatically decide when enough samples have been provided to
transfer intents into the main neural network classifier in
order to maximize accuracy. This would be a key piece of a
self-maintained intent classification system. Another topic of
interest for future work is how recent state-of-the-art language
representations (like BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) perform when combined with our
proposed approach. We would also like to share insights about
post-deployment evaluation after updating the deployed system
with several cycles of newly added intent classes.
      </p>
    </sec>
    <sec id="sec-10">
      <title>A.1 Baseline Model Parameters</title>
      <p>
        As explained in the paper, we create the Baseline model by adding two
fully connected (FC) layers to the output of the transformer-based
Universal Sentence Encoder (USE). We use the transformer-based USE
encoder [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which targets high accuracy at the cost of greater
model complexity and resource utilization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We keep the USE
weights fixed during training. We use the pretrained model
available in TensorFlow Hub.
      </p>
      <list list-type="bullet">
        <list-item><p>Hidden layer FC1: 512 units, activation = relu, dropout ratio = 0.25</p></list-item>
        <list-item><p>Output layer FC2: 250 units, activation = softmax</p></list-item>
        <list-item><p>Learning rate = 5e-5</p></list-item>
        <list-item><p>Batch size = 32</p></list-item>
        <list-item><p>Number of epochs: 50</p></list-item>
        <list-item><p>Loss = categorical crossentropy</p></list-item>
      </list>
      <p>Hyperparameter tuning: we did a
hyperparameter search on the number of hidden layers, hidden layer
dimension and learning rate.</p>
    </sec>
    <sec id="sec-11">
      <title>A.2 Triplet Network Parameters</title>
      <p>
        As explained in the paper, we create the triplet network by adding
two fully connected (FC) layers to the Universal Sentence Encoder (USE)
output. The parameters are:
        <list list-type="bullet">
          <list-item><p>Hidden layer FC1: 512 units, activation = ReLU, dropout ratio = 0.25</p></list-item>
          <list-item><p>Output layer FC2: 250 units, activation = none, followed by L2 normalization</p></list-item>
          <list-item><p>Learning rate = 5e-5</p></list-item>
          <list-item><p>Batch size = 1000 (a large batch size is suggested in the paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ])</p></list-item>
          <list-item><p>Number of steps: 5000</p></list-item>
          <list-item><p>Loss = triplet loss</p></list-item>
          <list-item><p>Triplet selection strategy: random triplet mining</p></list-item>
        </list>
        We did a hyperparameter search over the number of hidden layers, the hidden layer
dimension, the learning rate, and the triplet selection strategy. We tried
random, hard, and semi-hard triplet
mining, as suggested in the FaceNet paper, but did not observe a difference in
performance. The rest of the training procedure is identical to the one
described in the FaceNet paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
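      <p>
        As a rough illustration of the triplet setup described above, the sketch
below L2-normalizes a batch of embeddings, draws random triplets, and evaluates
a FaceNet-style triplet loss. It rests on assumptions that are not stated in
the appendix: the margin of 0.2 is the value used in the FaceNet paper rather
than one reported here, the loss is averaged instead of summed over triplets,
every class is assumed to have at least two examples, and the function names
are hypothetical.
      </p>
      <preformat><![CDATA[
```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def random_triplets(labels, n_triplets, rng):
    """Random triplet mining: positive shares the anchor's label, negative does not."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    triplets = []
    for _ in range(n_triplets):
        a = int(rng.choice(idx))
        same = labels == labels[a]
        pos_pool = idx[np.logical_and(same, idx != a)]
        neg_pool = idx[np.logical_not(same)]
        if pos_pool.size == 0 or neg_pool.size == 0:
            raise ValueError("need 2+ examples per class and 2+ classes")
        triplets.append((a, int(rng.choice(pos_pool)), int(rng.choice(neg_pool))))
    return np.array(triplets)

def triplet_loss(emb, triplets, margin=0.2):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) with squared Euclidean d.

    margin=0.2 is an assumption borrowed from FaceNet, not from this paper.
    """
    a = emb[triplets[:, 0]]
    p = emb[triplets[:, 1]]
    n = emb[triplets[:, 2]]
    d_ap = np.sum((a - p) ** 2, axis=1)
    d_an = np.sum((a - n) ** 2, axis=1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 4)                 # 5 toy intents, 4 utterances each
emb = l2_normalize(rng.normal(size=(20, 250)))      # stand-in for FC2 output
trip = random_triplets(labels, 100, rng)
loss = triplet_loss(emb, trip)
```
]]></preformat>
      <p>
        Hard and semi-hard mining would replace random_triplets with a selection
based on the current distances; the loss function itself stays the same.
      </p>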
    </sec>
    <sec id="sec-12">
      <title>A.3 Prototypical Network Parameters</title>
      <p>
        As explained in the paper, we create the prototypical network by
adding two fully connected (FC) layers to the Universal Sentence
Encoder (USE) output. The parameters are:
        <list list-type="bullet">
          <list-item><p>Hidden layer FC1: 250 units, activation = ReLU, dropout ratio = 0.25</p></list-item>
          <list-item><p>Output layer FC2: 250 units, activation = none, followed by L2 normalization</p></list-item>
          <list-item><p>Learning rate = 5e-4</p></list-item>
          <list-item><p>Number of steps: 5000</p></list-item>
          <list-item><p>N_way = 20, N_query = 8, N_shot = 5</p></list-item>
        </list>
        We did a hyperparameter search over the number of hidden layers, the hidden layer dimension,
the learning rate, N_way, N_query, and N_shot. Please refer to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
for a fuller explanation of these parameters. The rest of the training
procedure is identical to the one described in that paper.
      </p>
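      <p>
        To make the episode parameters concrete, the sketch below samples one
training episode with N_way = 20, N_shot = 5 and N_query = 8 and computes the
prototypical-network loss from class prototypes. It is an illustration under
assumptions rather than the code used in the paper: the embeddings are synthetic
clusters standing in for the FC2 output, every class is assumed to have at least
N_shot + N_query examples, and the function names are hypothetical.
      </p>
      <preformat><![CDATA[
```python
import numpy as np

N_WAY, N_SHOT, N_QUERY = 20, 5, 8   # values reported in Appendix A.3

def sample_episode(labels, rng, n_way=N_WAY, n_shot=N_SHOT, n_query=N_QUERY):
    """Pick n_way classes, then disjoint support and query indices per class.

    Assumes every class has at least n_shot + n_query examples.
    """
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        members = rng.permutation(np.flatnonzero(labels == c))
        support.append(members[:n_shot])
        query.append(members[n_shot:n_shot + n_query])
    return np.array(support), np.array(query)   # (n_way, n_shot), (n_way, n_query)

def prototypical_loss(emb, support, query):
    """Cross-entropy over negative squared distances to class prototypes."""
    n_way, n_query = query.shape
    prototypes = emb[support].mean(axis=1)      # one mean embedding per class
    q = emb[query.reshape(-1)]                  # queries, grouped by class
    d2 = ((q[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target = np.repeat(np.arange(n_way), n_query)
    loss = -log_p[np.arange(target.size), target].mean()
    acc = (log_p.argmax(axis=1) == target).mean()
    return float(loss), float(acc)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(30), 15)           # 30 toy intents, 15 utterances each
centers = rng.normal(size=(30, 250))
emb = centers[labels] + 0.01 * rng.normal(size=(450, 250))  # tight synthetic clusters
support, query = sample_episode(labels, rng)
loss, acc = prototypical_loss(emb, support, query)
```
]]></preformat>
      <p>
        At inference time the same prototype computation could be reused with
the few available examples of a new intent as its support set, so adding an
intent would not require retraining the embedding layers.
      </p>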
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Iñigo</given-names>
            <surname>Casanueva</surname>
          </string-name>
          , Tadas Temčinas, Daniela Gerz, Matthew Henderson, and
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Vulić</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Efficient Intent Detection with Dual Sentence Encoders</article-title>
          . arXiv preprint arXiv:2003.04807 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Cer</surname>
          </string-name>
          , Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al.
          <year>2018</year>
          .
          <article-title>Universal sentence encoder</article-title>
          . arXiv preprint arXiv:1803.11175 (2018).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In NAACL-HLT.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Krone</surname>
          </string-name>
          , Yi Zhang, and
          <string-name>
            <given-names>Mona</given-names>
            <surname>Diab</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Learning to Classify Intents and Slot Labels Given a Handful of Examples</article-title>
          . arXiv:cs.CL/2004.10793
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Varun</given-names>
            <surname>Kumar</surname>
          </string-name>
          , Hadrien Glaude, Cyprien Lichy, and William Campbell.
          <year>2019</year>
          .
          <article-title>A Closer Look At Feature Space Data Augmentation For Few-Shot Intent Classification</article-title>
          .
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . https://doi.org/10.18653/v1/D19-6101
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Larson</surname>
          </string-name>
          , Anish Mahendran,
          <string-name>
            <given-names>Joseph J.</given-names>
            <surname>Peper</surname>
          </string-name>
          , Christopher Clarke,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Lee</surname>
          </string-name>
          , Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars
          .
          <year>2019</year>
          .
          <article-title>An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          . https://www.aclweb.org/anthology/D19-1131
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Schroff</surname>
          </string-name>
          , Dmitry Kalenichenko, and
          <string-name>
            <given-names>James</given-names>
            <surname>Philbin</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Facenet: A unified embedding for face recognition and clustering</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jake</given-names>
            <surname>Snell</surname>
          </string-name>
          , Kevin Swersky, and Richard Zemel.
          <year>2017</year>
          .
          <article-title>Prototypical networks for few-shot learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>4077</fpage>
          -
          <lpage>4087</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Charles Blundell, Timothy Lillicrap,
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Matching networks for one shot learning</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>3630</fpage>
          -
          <lpage>3638</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Kilian Q</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , John Blitzer, and Lawrence K Saul.
          <year>2006</year>
          .
          <article-title>Distance metric learning for large margin nearest neighbor classification</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>1473</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>