Enhancing gazetteers for named entity recognition in conversational recommender systems

Nicholas Dingwall (1), Vianne R. Gao (2)
(1) Amazon, 525 Market St, San Francisco, CA 94105
(2) Weill Cornell Medicine, 1300 York Ave, New York, NY 10065 (work conducted during an internship at Amazon)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, September 27 to October 1, 2021, Amsterdam, Netherlands
nickding@amazon.com (N. Dingwall); vrg4001@med.cornell.edu (V. R. Gao)
ORCID: 0000-0003-0026-2740 (N. Dingwall); 0000-0001-8990-0897 (V. R. Gao)

Abstract
Named Entity Recognition (NER) is a crucial building block of a conversational agent, but remains challenging in real-world settings. It is particularly challenging for domains where the entities are linguistically complex and resemble common phrases (e.g. music and movies). While gazetteer features have been shown to improve NER performance, their utility is undermined by pervasive spurious entity matching. We propose a framework for gazetteer knowledge integration that incorporates external knowledge about entity popularity (e.g. a song's play count) to reduce spurious entity matching and improve the robustness of gazetteer features. Our experimental evaluations show that using unfiltered gazetteers degrades performance, but that incorporating external information improves it compared to a baseline model that doesn't use gazetteer information. Further, our framework can efficiently adapt to new entities in gazetteers without additional training, which is crucial for rapidly growing domains like music.

Keywords
natural language understanding, named-entity recognition, gazetteer, conversational recommender, music

1. Introduction

Voice assistants (Siri, Alexa, Google Assistant) are becoming increasingly popular, and music has emerged as a primary use case for them [1, 2]. Without a screen for browsing, conversational recommenders are an appealing avenue to help users navigate their favorite music.

But four factors make identifying mentions of these entities difficult in the music domain. First, there are a lot of songs and artists: thousands of artists release millions of songs each year, and a modern deep learning system must store their names in its weights. Second, song and artist names can often resemble ordinary parts of speech, and so the system must disambiguate genuine references to musical entities from spurious matches. Third, users misremember the titles of songs or use abbreviations to refer to artists, limiting the applicability of canonical data sources. And fourth, new songs are continually being released – some of which immediately achieve their peak popularity – which obliges the owners of a model to regularly retrain it.

Conversational systems make NER even more challenging: while single-turn commands are often well-structured and include indicators that a sequence tagging system can utilize to estimate a prior likelihood that a token represents an entity ("Alexa play X", etc.), conversational responses lack such affordances. The system must also distinguish system-directed speech from background conversation overheard while waiting for a response: "who let the dogs out" is likely to be a song request (Baha Men), but "who let the cats out" is more likely to be a frustrated parent chastising their children. Finally, errors made by users in recalling an entity name or by a voice recognition system, alongside nonstandard spelling of artist and song titles, frustrate attempts at simple string matching against canonical entities [3].
Gazetteers – flat lists of entity names – can provide a source of valid entity names. But incorporating them into modern NER models has proved difficult (see Section 2.2), and the music domain makes their application even more precarious: any song title gazetteer will include common phrases like "yes" (LMFAO), "something like that" (Tim McGraw), and "stop" (Spice Girls), resulting in frequent false positive matches. Nevertheless, incorporating them into models is appealing since they could allow a production system to generalize beyond examples seen during training, and to decouple updates to entity lists from model training.

In this paper, we experiment with utterance data and music domain knowledge data. In the conversational music recommender setting, a user is prompted to specify genres, moods or artists and hears samples of playlists matching the criteria they have provided so far (e.g. including a specified artist). The conversation continues until a sample is accepted, the user requests to play a specific song or artist, or the user explicitly ends the conversation or stops responding. The natural language interpretation component must therefore be able to recognize any song or artist name mentioned by the user in order to select matching playlists to recommend and to avoid recommending playlists that do not match the user's request.

This paper explores different methods to extract value from gazetteers enriched with popularity information about songs, artists and albums. In all cases, we add token-level features indicating the presence or absence of each token (or sequence of tokens) within a gazetteer. We vary the preprocessing applied to the gazetteers and show that neither full gazetteers nor gazetteers filtered to include only the most popular entities outperform a baseline gazetteer-free model. However, after a more careful filtering of entities, adding a gazetteer does help the model to robustly extract music entity names. In doing so, the model improves its ability to classify a user's overall intent.
2. Background and prior work

2.1. Named entity recognition

Named entity recognition (NER) is the task of associating each word in a sentence with a label indicating its type. In typical settings, the type may be a person, a location, or an organization. In our domain, we are interested in music entities: artist names, song titles and album names.

In practice we refer to tokens instead of words, allowing rare words to be split into subwords to limit the vocabulary size necessary to cover the entire dataset. For example, 'ed sheeran' is represented as the three tokens 'ed', 'sheer' and 'an'. We hope to train a model that associates all three tokens with the artist_name tag.

2.2. Gazetteers for NER

Gazetteers were common in pre-neural NER architectures: indeed, Mikheev et al in 1999 were notable for performing NER without gazetteers [4]. Their use has fallen out of fashion with the recent dominance of large pre-trained language models for NER, since these models can better leverage contextual information to detect entity mentions [5]. More recent work has demonstrated that gazetteers can still improve NER performance with neural architectures, especially where training data is limited [6, 7, 8, 9, 10, 11].

However, the improved performance of modern NER models exposes the noise in gazetteers: Magnolini et al showed that filtering rarely-occurring values from large gazetteers boosts performance more than using the unfiltered gazetteer [7]. But in the music domain, the noise comes principally from linguistic ambiguity: entity names can be homographs of non-entity words and phrases. Filtering based on corpus frequency would retain many of these homographs (phrases like "yes", "something like that" and "stop"), which are particularly common in conversational responses, and exclude many genuine references to entities. Moreover, we wish to use gazetteers precisely because they will help generalize beyond the training data, especially for low-context inputs, like an artist name on its own.

These works also either did not use pre-trained language models [6, 7, 8, 11] or did not fine-tune the weights of the language models [10, 9]. Large pre-trained language models based on the transformer architecture [12] have achieved state-of-the-art results across a variety of natural language processing tasks [13], but successfully integrating gazetteers remains elusive.

In these prior works, the gazetteers used were all flat lists of entity names, and so the systems could only consider the surface form of each entity (i.e. any string matching the name of the entity, regardless of the intended referent of that string). Oramas et al introduce a framework that leverages the popularity of each associated entity to distinguish between ambiguous and non-ambiguous names [14]. For each entity, they compute a ratio between the rank of the entity's popularity and the rank of the number of occurrences of its surface form in their corpus:

    r(e) = popularityRank(e) / frequencyRank(e)    (1)

Mentions of entities that occur more frequently in their corpus than would be expected based on their popularity rank (i.e. r(e) is small) are likely to be spurious matches. They use this to automatically label a training set: some entities can be confidently labeled as songs or artists, some – like "Could You", "Play Music" and "Xmas" – are ignored, and inputs containing potentially-confusing mentions like "Country Joe" and "Spanish House" are excluded entirely. However, this informs only the dataset generation; their model does not have access to the underlying popularities or the ranks.

Meng et al propose a mixture-of-experts model for NER that directly models how much weight to give to features derived from the context (using a BERT encoder) and from gazetteers (using a BiLSTM built on gazetteer matches) for each token. This substantially improves performance on their datasets, but still relies exclusively on linguistic features [15].

In this work, we show that filtering gazetteers using a formula similar to Equation 1 [14] allows an NER model to leverage gazetteer information. We compare against two other preprocessing methods, which both degrade performance.
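To make the rank-ratio in Equation 1 concrete, the sketch below computes r(e) for a few invented surface forms. The rank direction is chosen here so that a surface form mentioned in the corpus far more often than its popularity would suggest receives a small r(e), matching the description above; the dictionaries, function names and rank convention are our own illustration rather than Oramas et al's implementation.

```python
# Illustration of the rank-ratio in Equation 1. Ranks are assigned in
# ascending order of count (rank 1 = smallest count) so that over-frequent,
# likely-ambiguous surface forms receive a small r(e); this convention is an
# assumption made for the example.

def ascending_rank(counts):
    """Map each key to its 1-based rank, smallest count first."""
    ordered = sorted(counts, key=counts.get)
    return {name: i + 1 for i, name in enumerate(ordered)}

def rank_ratio(popularity, corpus_frequency):
    pop_rank = ascending_rank(popularity)
    freq_rank = ascending_rank(corpus_frequency)
    return {e: pop_rank[e] / freq_rank[e] for e in popularity if e in freq_rank}

# Invented counts: play counts vs. occurrences of the surface form in a corpus.
popularity = {"dancing queen": 90_000, "yes": 4_000, "stop": 3_000}
corpus_frequency = {"dancing queen": 1_200, "yes": 50_000, "stop": 40_000}
print(rank_ratio(popularity, corpus_frequency))
# dancing queen: 3.0, yes: ~0.67, stop: 0.5 -- the ambiguous surface forms get small ratios.
```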
3. Datasets

3.1. Corpus

We train our model on historical user utterances. The data are labeled using a hand-crafted set of grammatical rules designed to match the most frequently-occurring utterances. The rules consist of a pattern and allowed slot values. For example, the pattern "just play <artist-name>" includes an <artist-name> slot; this slot is associated with a list of popular artists. The combination of pattern and slot list allows us to match "just play the beatles" and "just play rihanna", but not "just play trivial pursuit", since trivial pursuit is a game and not an artist.

In practice, we abstract common phrases and nest rules: "<negative-trigger> <artist-name>" matches a predetermined set of negative trigger phrases like "not", "i don't like", etc., along with an artist name.

To control the latency of these rules, we limit slot lists to popular entities. The rules therefore fail on a long tail of infrequent utterances, either because the utterance contains an entity outside our canonical lists, or because the pattern is unusual. We observe that statistical models trained on these labeled utterances can generalize to long tail utterances, as proposed in [16].

The rules cover multiple intents, including standard ones like YesIntent, NoIntent and StopIntent, but also an AddMusicConstraintIntent for when users constrain recommended music entities. Slot types include entity types as well as trigger phrases that indicate negation, instructions to go immediately to playback, and so on.

To evaluate the model's ability to discriminate between spurious mentions of entity names, we hand-label independently-collected validation and test sets where each utterance contains a substring included in the song title gazetteer (see Section 3.2). We ignore the 50 most frequently-occurring spurious matches (e.g. "play", "just", "yeah", etc.) so that approximately 50% of utterances express an AddMusicConstraintIntent, and about 50% of these AddMusicConstraintIntent utterances contain a true reference to a specific song. Only AddMusicConstraintIntent can include song titles. These titles include some that were misremembered or that contain voice recognition errors.

Our grammatical rules can interpret about 30% of utterances in the test set and we consider these to be in domain; the remaining 70% test the model's generalization capability, either to new utterance patterns or to novel entities.

The training, validation and test datasets are fixed for all experiments.
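As a concrete illustration of the pattern-plus-slot-list rules described at the start of this subsection, the following sketch labels an utterance with an intent and slot values. The rule encoding, slot names and tiny entity lists are hypothetical; the production grammar is considerably larger and handles many more patterns.

```python
import re

# Minimal sketch of pattern + slot-list labeling rules. All names and lists
# below are illustrative, not the production grammar.
ARTIST_NAMES = {"the beatles", "rihanna"}           # slot list of popular artists
NEGATIVE_TRIGGERS = {"not", "i don't like"}

RULES = [
    # (intent, regex pattern with named slots, {slot name: allowed values})
    ("AddMusicConstraintIntent",
     r"^just play (?P<artist_name>.+)$",
     {"artist_name": ARTIST_NAMES}),
    ("AddMusicConstraintIntent",
     r"^(?P<negative_trigger>.+) (?P<artist_name>.+)$",
     {"negative_trigger": NEGATIVE_TRIGGERS, "artist_name": ARTIST_NAMES}),
]

def label(utterance):
    """Return (intent, slots) for the first rule whose pattern and slot lists match."""
    for intent, pattern, slot_values in RULES:
        m = re.match(pattern, utterance)
        if m and all(m.group(slot) in allowed for slot, allowed in slot_values.items()):
            return intent, m.groupdict()
    return None  # long-tail utterance: not covered by the rules

print(label("just play rihanna"))          # matched: artist_name is in the slot list
print(label("just play trivial pursuit"))  # None: "trivial pursuit" is not a known artist
```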
3.2. Gazetteers

We use gazetteers derived from historical single-turn utterances that expressed a music request to a voice assistant. Since these utterances are well-structured (usually of the form "play X"), the entity name can be extracted and associated with an entity type by an entity resolution system.

These gazetteers can include entities with user or voice recognition errors as long as the entity resolution system was able to resolve them to a canonical entity. As such, the gazetteers consist of multiple strings corresponding to the same entity: e.g. "blink one eighty two" and "blink one eight two" both appear in our gazetteer, even though the canonical name is "Blink 182".

Each entity in a gazetteer is augmented with the number of times it was requested (which we refer to as its popularity).

4. Methodology

4.1. Candidate entity matching

Before passing a user's utterance to the model, we must first determine which gazetteer entries appear in it. Note that at this stage, we do not distinguish between true positives (the user was referring to the entity) and false positives (spurious matches like 'yes', 'play', etc.).

We find these candidates using a regex search for each gazetteer entry, enforcing that the match must terminate at whitespace or at the beginning or end of the utterance. Next, we associate all of the tokens covered by a candidate with the entity type of that candidate. We summarize this information with a binary vector for each word, where each dimension corresponds to one of the gazetteers (artist name, song title, album name). See Table 1 for an example.

Table 1
Example showing which gazetteer embeddings trigger for each token in the utterance "play dancing queen". In this case, "dancing queen" is recognized as a song, and "queen" as an artist. For simplicity, we assume no other gazetteer entries match the utterance.

  entity type    play   dancing   queen
  artist name     ✗       ✗        ✓
  song title      ✗       ✓        ✓
  album name      ✗       ✗        ✗

We add a dimension to this vector with its value fixed to 1 as a bias term. This entry can be thought of as the "no entity" dimension, which captures the possibility that the gazetteer matches were false positives.
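A minimal sketch of this candidate-matching step, assuming whitespace tokenization; the gazetteer contents, function name and dictionary layout are illustrative. Dimension 0 of each vector is the always-on bias ("no entity") term, and dimensions 1 to 3 correspond to the three gazetteers.

```python
import re

# Toy gazetteers for illustration only.
GAZETTEERS = {
    "artist name": {"queen", "ed sheeran"},
    "song title": {"dancing queen", "yes"},
    "album name": set(),
}
TYPES = list(GAZETTEERS)  # dims 1-3 of the feature vector; dim 0 is the bias

def gazetteer_vectors(utterance):
    words = utterance.split()
    # Character offset at which each word starts, to map regex matches to words.
    starts, pos = [], 0
    for w in words:
        starts.append(pos)
        pos += len(w) + 1
    vectors = [[1, 0, 0, 0] for _ in words]  # dim 0: always-on "no entity" bias
    for dim, gaz_type in enumerate(TYPES, start=1):
        for entry in GAZETTEERS[gaz_type]:
            # Matches must be delimited by whitespace or the utterance boundary.
            for m in re.finditer(rf"(?:^|\s){re.escape(entry)}(?:\s|$)", utterance):
                for i, start in enumerate(starts):
                    if m.start() <= start < m.end():
                        vectors[i][dim] = 1
    return words, vectors

words, vectors = gazetteer_vectors("play dancing queen")
for word, vec in zip(words, vectors):
    print(word, vec)
# play [1, 0, 0, 0]; dancing [1, 0, 1, 0]; queen [1, 1, 1, 0] -- cf. Table 1.
```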
4.2. Gazetteer filtering

In this work, we use a straightforward technique to incorporate the summary gazetteer vectors into our baseline model (see Section 4.3). The baseline model does not have access to this vector, and so we can evaluate whether gazetteers improve or degrade performance. We compare three methods to preprocess the gazetteers.

First, we use the full, unfiltered gazetteer. Following Magnolini et al [7], we expect the noise introduced by false positive matches to outweigh any information provided by the true positives.

Second, we filter out the least popular entities in the gazetteer by thresholding the popularity (see Section 3.2). This is equivalent to using a shorter collection window to gather candidates. We expect that this does little to exclude ambiguous entities.

Third, we threshold the ratio between each entity's popularity and the number of occurrences of its surface form in our training corpus, similar to Oramas et al [14]. While Oramas et al used ranks, we use raw counts to capture the assumption that the number of genuine mentions of an entity is proportional to the underlying popularity of that entity. We call our version r̂ to avoid confusion:

    r̂(e) = popularity(e) / mention_frequency(e)    (2)

This has the practical benefit of allowing new entities to be added without re-ranking.

We rejected a fourth candidate method of thresholding based on the corpus frequency, since the resulting filtered gazetteers preferentially included exactly the entities we wished to exclude, like "yes", "play" and "stop", and excluded entities not mentioned in our corpus, limiting a model's ability to generalize beyond its training data.

Where we filter gazetteers, we treat the percentage to filter out as a hyperparameter (25%, 50% or 75%) and select the model that performed best on the validation set.
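A sketch of this filtering step, assuming the filtered-out fraction is taken from the low-r̂ end of the gazetteer (the surface forms most likely to be spurious); the data structures, function name and example counts are invented.

```python
# Sketch of the r-hat filter from Equation 2. `gazetteer` is a hypothetical list
# of (surface form, popularity) pairs; `mention_frequency` counts occurrences of
# each surface form in the training corpus.

def r_hat_filter(gazetteer, mention_frequency, drop_fraction=0.75):
    scored = []
    for surface, popularity in gazetteer:
        freq = mention_frequency.get(surface, 0)
        # Surface forms unseen in the corpus keep an infinite ratio, so new or
        # rare entities are never filtered out: one benefit of using raw counts.
        r_hat = popularity / freq if freq else float("inf")
        scored.append((r_hat, surface))
    scored.sort()  # smallest ratios (likely spurious surface forms) first
    keep_from = int(drop_fraction * len(scored))
    return {surface for _, surface in scored[keep_from:]}

gazetteer = [("dancing queen", 90_000), ("yes", 4_000),
             ("stop", 3_000), ("cruella de vil", 500)]
mention_frequency = {"dancing queen": 1_200, "yes": 50_000, "stop": 40_000}
print(r_hat_filter(gazetteer, mention_frequency, drop_fraction=0.5))
# {"dancing queen", "cruella de vil"}: the ambiguous surface forms are dropped.
```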
4.3. Model

To understand a user's utterance, we need to predict the user's intent (classification) and label any entities they mentioned (NER). We start with a standard BERT-base model [17], pretrained on book_corpus_wiki_en_uncased (from https://nlp.gluon.ai/model_zoo/bert/index.html).

To represent information from the gazetteers, we start by randomly initializing four 64-dimensional 'ingredient' embeddings corresponding to the four-dimensional gazetteer vector described in Section 4.1 (no entity, artist name, song title, album name). Each token in the utterance is represented as the average of the ingredient embeddings for the entity types of the candidates in which that token appears. Note that every token receives the no entity embedding as a bias term. This is illustrated in Figure 1.

Figure 1: Gazetteer features are computed as the average of associated 'ingredient' embeddings. Here, for example, 'queen' appears in the artist and song title gazetteers (via the artist "Queen" and the song "Dancing Queen"), so we take the average of those along with the no entity embedding.

We concatenate these gazetteer embeddings with the BERT output embeddings and add a single transformer encoder layer (i.e. self-attention with position embeddings and a fully-connected output layer) so that the gazetteer information can be shared among all the tokens. The [CLS] token, which represents the entire utterance, receives the average gazetteer embedding taken over all tokens in the utterance.

The outputs of the final transformer layer are passed to prediction heads for each token. The [CLS] token predicts the user intent, and the remaining tokens predict their own entity type label (or OTHER). Note that each token can have only one label, and the utterance is associated with exactly one intent. This architecture is illustrated in Figure 2.

Figure 2: Model architecture. Contextual embeddings (from the BERT encoder) are concatenated with gazetteer embeddings (see Figure 1), and the resulting representation is passed through a transformer layer to prediction heads for both intent classification (IC) and entity labeling (NER).

The baseline model is identical, except that nothing is concatenated with the BERT outputs.

The model is fine-tuned using cross-entropy with label smoothing [18], where the total loss is the sum of the classification loss and the slot tagging loss for each token. We update all parameters during fine-tuning, including the gazetteer 'ingredients'.

This architecture resembles the joint intent classification and slot filling model introduced in Chen et al [19], except for the gazetteer embeddings, the additional transformer encoder layer, and the use of label smoothing. The first two of these additions provide a method to fuse gazetteer information into the model before the prediction heads. Label smoothing helps restrain the model's overconfidence on 'easy' examples, resulting in more robust performance on utterances outside the training distribution [18].

Aside from the percentage of each gazetteer to filter out (in the popularity-filtered and r̂(e)-filtered experiments), we do not conduct any hyperparameter selection. We find in both cases that filtering out 75% of the gazetteer gives the best performance on the validation set. For other hyperparameters, we use values that previously performed well with a simplified baseline model that does not include the final transformer layer: since the utterances are typically short, we truncate them to 16 tokens (this affects fewer than 0.1% of utterances), use a batch size of 128, a label smoothing α = 0.1, and train for 10,000 updates. We checkpoint every 100 updates and choose the version of the model that achieved the highest intent classification F1 score on the validation set. Other hyperparameters follow those in Chen et al [19].
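A minimal PyTorch sketch of the fusion step described above, assuming the BERT outputs are computed separately. Position embeddings in the extra encoder layer and the label-smoothed loss are omitted, and the class name, intent/slot counts and all layer sizes other than the 64-dimensional ingredients are invented for the example.

```python
import torch
import torch.nn as nn

class GazetteerFusion(nn.Module):
    """Sketch of the fusion in Figure 2: average gazetteer 'ingredient' embeddings,
    concatenate them with BERT outputs, share information via one transformer
    encoder layer, then predict the intent (from [CLS]) and per-token slot labels."""

    def __init__(self, bert_dim=768, gaz_dim=64, num_intents=13, num_slot_labels=7):
        super().__init__()
        # One ingredient per gazetteer dimension: no entity, artist, song, album.
        self.ingredients = nn.Parameter(torch.randn(4, gaz_dim))
        fused_dim = bert_dim + gaz_dim
        layer = nn.TransformerEncoderLayer(d_model=fused_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # position embeddings omitted here
        self.intent_head = nn.Linear(fused_dim, num_intents)       # applied to [CLS]
        self.slot_head = nn.Linear(fused_dim, num_slot_labels)     # applied to every other token

    def forward(self, bert_out, gaz_vectors):
        # bert_out: (batch, seq, bert_dim); gaz_vectors: (batch, seq, 4) binary
        # features from Section 4.1, with position 0 holding the [CLS] token.
        gaz_vectors = gaz_vectors.float()
        counts = gaz_vectors.sum(dim=-1, keepdim=True).clamp(min=1)
        gaz_emb = (gaz_vectors @ self.ingredients) / counts        # average of matched ingredients
        cls_emb = gaz_emb[:, 1:].mean(dim=1, keepdim=True)         # [CLS] gets the utterance average
        gaz_emb = torch.cat([cls_emb, gaz_emb[:, 1:]], dim=1)
        fused = self.encoder(torch.cat([bert_out, gaz_emb], dim=-1))
        return self.intent_head(fused[:, 0]), self.slot_head(fused[:, 1:])
```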
5. Results

For each experiment, we evaluate the model's ability to discriminate AddMusicConstraintIntent from other intents, and its ability to extract correct song titles. Song title detection is particularly challenging for a conversational music recommender due to song titles' variability, cardinality and resemblance to normal speech.

We report this metric using the SemEval 'strict' methodology [20]. That means the span must exactly match the annotated span to be counted as a true positive; predicting the wrong span counts as both a false positive (the incorrectly-predicted span) and a false negative (the missed prediction). We choose this metric because we require substantially-complete predictions for the downstream entity resolver to associate the span with the correct entity. A simple token-by-token evaluation showed similar differences between models.

Table 2
Results of experiments, shown as percentage increases or decreases from the baseline model.

(a) Song title detection
  Gazetteer             Precision   Recall    F1
  None                  -           -         -
  Full                  −2.72%      +0.58%    −1.17%
  Popularity-filtered   −3.18%      +1.57%    −0.96%
  r̂(e)-filtered         +3.60%      +3.82%    +3.70%

(b) Intent classification (AddMusicConstraintIntent)
  Gazetteer             Precision   Recall    F1
  None                  -           -         -
  Full                  +0.21%      −0.23%    −0.05%
  Popularity-filtered   −0.21%      +1.85%    +0.98%
  r̂(e)-filtered         +0.25%      +2.58%    +1.70%

Table 2 shows the results of our experiments. As expected, we observe that using the full gazetteers increases the recall of song titles at the expense of precision, resulting in a drop in F1 score of 1.17%. Filtering based on popularity seems to exaggerate these differences, further diminishing precision but boosting recall even more, presumably because the model becomes too trusting of information from the gazetteers, which still include spurious matches. The overall effect is that F1 dropped by slightly less: 0.96% from the baseline model.

Filtering based on the ratio r̂(e) addresses this issue. Common-but-spurious mentions are now excluded, leaving a cleaner gazetteer that contains unambiguous entities, and which results in improved precision and recall and an overall increase in F1 of 3.70%.

These results seem to be correlated with intent classification performance, with the worst song title detection F1 corresponding to the worst intent classification F1 (full gazetteers), and the best with the best (r̂(e)-filtered). This is to be expected: correctly recognizing the presence or absence of a song title (or artist name) makes distinguishing intents easier.

Figure 3: Song title F1 during training. Results shown every 10% up to 30% of training data, when F1 has begun to converge. Note that actual F1 scores are redacted due to their commercial sensitivity.

Figure 3 shows that the model with access to the r̂(e)-filtered gazetteer learns most quickly. The full and popularity-filtered gazetteers give an early boost to F1, when model performance is poor, but are quickly overtaken by the baseline model without gazetteers. This supports our hypothesis that information from noisy gazetteers helps weak models, but when the model is better able to leverage contextual cues, the noise begins to dominate any signal they provide. The model would by now perform better by ignoring the information, but it may have approached a local minimum in the loss surface from which it cannot escape, resulting in poorer performance at convergence (as shown in Table 2).

Table 3 shows some example user inputs that highlight how gazetteers help the model. Each utterance is shown with the song title predicted by the model learned under each experiment. While all the models are usually able to detect the presence of a song title, only the model trained using the r̂(e)-filtered gazetteers is able to reliably detect the boundaries of the mention.

Table 3
Examples of errors made by models trained with different gazetteer information. The expected song titles are underlined. Note that the model handles over a dozen intents, and so identifying song titles even in somewhat structured utterances (e.g. "X by Y") is nontrivial. Blank cells indicate that no song title was predicted.

  Utterance                      No gazetteers            Full gazetteers
  rolling in the deep            in the deep              in the deep
  play cruella de vil            cruella de
  high voltage
  just the way you are by        the way you are          the way you are
  you dropped the bomb on me     dropped the bomb on me   dropped the bomb on me
  green eyed lady by sugarloaf   eyed lady                eyed lady
  monsters by shinedown

  Utterance                      Popularity-filtered      r̂(e)-filtered
  rolling in the deep            rolling in the deep      rolling in the deep
  play cruella de vil            cruella de vil           cruella de vil
  high voltage                                            high voltage
  just the way you are by                                 just the way you are
  you dropped the bomb on me     dropped the bomb on me   you dropped the bomb on me
  green eyed lady by sugarloaf   eyed lady                green eyed lady
  monsters by shinedown                                   monsters
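For reference, the strict span-level scoring described above can be summarised in a few lines. The (start, end, label) span representation and function name below are our own; the example mirrors the "green eyed lady" row of Table 3, where a boundary error produces both a false positive and a false negative.

```python
# Sketch of SemEval-style "strict" span scoring [20]: a prediction counts as a
# true positive only if its boundaries and label exactly match a gold span.

def strict_prf(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)
    fp = len(pred - gold)   # includes wrongly-bounded spans
    fn = len(gold - pred)   # includes gold spans missed because of boundary errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# "green eyed lady by sugarloaf": predicting "eyed lady" (token span 1-3) instead
# of "green eyed lady" (token span 0-3) yields one false positive and one false negative.
gold = [(0, 3, "song_title")]
pred = [(1, 3, "song_title")]
print(strict_prf(gold, pred))  # (0.0, 0.0, 0.0)
```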
6. Limitations and future work

We note that this work only considers utterances in English. The technique described here should apply to other languages, but in some, whitespace cannot be used to delimit entities, making candidate matching more challenging.

We only briefly experimented with the impact of changes to the gazetteer after model training (e.g. due to new releases or changing popularity of existing releases). While these initial results are promising, we would want to conduct more thorough research to evaluate how predictions are affected.

We have also not explored the impact of false negatives (i.e. real entities not matched in the gazetteers, either because the entity is not sufficiently popular, because it has been recently released, or due to a user or voice recognition error). Our evaluation shows an overall improvement in precision and recall, but there may be individual cases where the baseline model better leverages contextual clues to predict entity mentions. Randomly dropping out gazetteer features during training (i.e. replacing a 1 with a 0 in the gazetteer vector described in Section 4.1 some fraction of the time) might force a model to learn how to use gazetteer features where available, but to continue to attend to contextual information otherwise, further improving overall performance.

This work was evaluated on manually-annotated offline datasets, but we have planned an A/B test to measure the downstream impact of improved NER performance on the rate with which users accept the system's recommendations. We expect to see an improvement corresponding to the system's ability to correctly interpret our users' wishes.

In future work, we intend to fuse popularity and r̂(e) directly into the model, rather than using them to filter the gazetteers. Incorporating the ratio r̂(e) into the model as a feature would allow it to attend more heavily to gazetteer features where the entity is unambiguous, and use contextual cues to disambiguate less obvious examples. It also avoids introducing an arbitrary cut-off: small values of r̂(e) would be almost, but not quite, equivalent to excluding the entity entirely. We hope that such an approach will yield further improvements and be a step towards a general approach to integrating gazetteers with pre-trained transformers.

7. Conclusion

In this paper, we demonstrate that a rather simple architecture with carefully filtered gazetteers can greatly improve NER performance in a conversational recommendation system for the music domain. By augmenting gazetteers with information about the underlying likelihood of a mention of each entity, the models can avoid false positives and are better able to rely on large gazetteers.

This finding could apply to other domains where large gazetteers are common and where relevant frequency information is available. Examples might include place names along with their populations, or diseases with the number of diagnoses mentioned in discharge notes.

Acknowledgments

The authors would like to thank Tao Ye, Justin Hugues-Nuger, Chelsea Weaver and Vlad Magdin for their help through conversations regarding the evaluation, technical implementation and presentation of this work. We also thank the reviewers for their valuable comments.

References

[1] T. Ammari, J. Kaye, J. Y. Tsai, F. Bentley, Music, Search, and IoT: How people (really) use voice assistants, ACM Transactions on Computer-Human Interaction 26 (2019). doi:10.1145/3311956.
[2] J. Thom, A. Nazarian, R. Brillman, H. Cramer, S. Mennicken, "Play Music": User Motivations and Expectations for Non-Specific Voice Queries, in: 21st International Society for Music Information Retrieval Conference, 2020.
[3] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100–126. URL: https://doi.org/10.1016/j.aiopen.2021.06.002. doi:10.1016/j.aiopen.2021.06.002. arXiv:2101.09459.
[4] A. Mikheev, M. Moens, C. Grover, Named Entity recognition without gazetteers, in: Proceedings of EACL '99, 1999, p. 1. doi:10.3115/977035.977037.
[5] S. Peshterliev, C. Dupuy, I. Kiss, Self-Attention Gazetteer Embeddings for Named-Entity Recognition (2020). URL: http://arxiv.org/abs/2004.04060. arXiv:2004.04060.
[6] S. Rijhwani, S. Zhou, G. Neubig, J. Carbonell, Soft Gazetteers for Low-Resource Named Entity Recognition, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8118–8123. doi:10.18653/v1/2020.acl-main.722. arXiv:2005.01866.
[7] S. Magnolini, V. Piccioni, V. Balaraman, M. Guerini, B. Magnini, How to Use Gazetteers for Entity Recognition with Neural Models, in: Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5), 2019, pp. 40–49. URL: https://www.aclweb.org/anthology/W19-5807.
[8] T. Liu, J. G. Yao, C. Y. Lin, Towards improving neural named entity recognition with gazetteers, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5301–5307. doi:10.18653/v1/p19-1524.
[9] C. H. Song, D. Lawrie, T. Finin, J. Mayfield, Improving Neural Named Entity Recognition with Gazetteers, in: The 33rd International FLAIRS Conference, 2020, p. 8. URL: https://arxiv.org/abs/2003.03072.
[10] H. Lin, Y. Lu, X. Han, L. Sun, B. Dong, S. Jiang, Gazetteer-Enhanced Attentive Neural Networks for Named Entity Recognition, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 6232–6237.
[11] O. Agarwal, A. Nenkova, The Utility and Interplay of Gazetteers and Entity Segmentation for Named Entity Recognition in English, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP, Association for Computational Linguistics, 2021, pp. 3990–4002. doi:10.18653/v1/2021.findings-acl.349.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[13] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers: "The End of History" for NLP? (2021). URL: http://arxiv.org/abs/2105.00813. arXiv:2105.00813.
[14] S. Oramas, M. Quadrana, F. Gouyon, Bootstrapping a Music Voice Assistant with Weak Supervision, in: Proceedings of NAACL HLT 2021: Industry Track, 2021, pp. 49–55.
[15] T. Meng, A. Fang, O. Rokhlenko, S. Malmasi, GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2021, pp. 1499–1512. doi:10.18653/v1/2021.naacl-main.118.
[16] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011. URL: https://aclanthology.org/P09-1113. doi:10.3115/1690219.1690287.
[17] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL HLT 2019 - Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, 2019, pp. 4171–4186. arXiv:1810.04805.
[18] R. Müller, S. Kornblith, G. Hinton, When does label smoothing help?, in: Advances in Neural Information Processing Systems, 2019. arXiv:1906.02629.
[19] Q. Chen, Z. Zhuo, W. Wang, BERT for joint intent classification and slot filling, 2019. arXiv:1902.10909.
[20] D. S. Batista, Named-entity evaluation metrics based on entity-level, 2019. URL: http://www.davidsbatista.net/.