<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing gazetteers for named entity recognition in conversational recommender systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicholas Dingwall</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vianne R. Gao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>3rd Edition of Knowledge-aware and Conversational Recommender</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Amazon</institution>
          ,
          <addr-line>525 Market St, San Francisco, CA 94105</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Environments (ComplexRec) Joint Workshop @ RecSys 2021</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Named Entity Recognition (NER) is a crucial building block of a conversational agent, but remains challenging in real-world settings. It is particularly challenging for domains where the entities are linguistically complex and resemble common phrases (e.g. music and movies). While gazetteer features have been shown to improve NER performance, their utility is undermined by pervasive spurious entity matching. We propose a framework for gazetteer knowledge integration that incorporates external knowledge about entity popularity (e.g. a song's play count) to reduce spurious entity matching and improve the robustness of gazetteer features. Our experimental evaluations show that using unfiltered gazetteers degrades performance, but that incorporating external information improves it compared to a baseline model that doesn't use gazetteer information. Further, our framework can efficiently adapt to new entities in gazetteers without additional training, which is crucial for rapidly growing domains like music.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language understanding</kwd>
        <kwd>named-entity recognition</kwd>
        <kwd>gazetteer</kwd>
        <kwd>conversational recommender</kwd>
        <kwd>music</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Conversational recommender systems are becoming increasingly popular, and music has emerged as a primary use case for them [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. Without a screen for browsing, conversational recommenders offer a promising avenue to help users navigate their favorite music.</p>
      <p>But four factors make identifying mentions of these entities difficult in the music domain. First, there are a lot of songs and artists: thousands of artists release millions of songs each year, and a modern deep learning system must store their names in its weights. Second, song and artist names can often resemble ordinary parts of speech, and so the system must disambiguate genuine references to musical entities from spurious matches. Third, users misremember the titles of songs or use abbreviations to refer to artists, limiting the applicability of canonical data sources. And fourth, new songs are continually being released – some of which immediately achieve their peak popularity – which obliges the owners of a model to regularly retrain the model.</p>
      <p>Gazetteers – lists of known entity names – are a natural source of external knowledge for this task, but many entries are homographs of common conversational phrases like “something like that” (Tim McGraw) and “stop” (Spice Girls), resulting in frequent false positive matches.</p>
      <sec id="sec-1-1">
        <title>Nevertheless, incorporating them into models is ap</title>
        <p>generalize beyond examples seen during training, and to</p>
        <p>In this paper, we experiment with utterance data and
music domain knowledge data. In the conversational
music recommender setting, a user is prompted to specify
genres, moods or artists and hears samples of playlists
matching the criteria they have provided so far (e.g.
including a specified artist). The conversation continues
until a sample is accepted, the user requests to play a
specific song or artist, or the user explicitly ends the
conversation or stops responding. The natural language</p>
        <p>Conversational systems make NER even more chal- pealing since they could allow a production system to
lenging: while single-turn commands are often
wellstructured and include indicators that a sequence tag- decouple updates to entity lists from model training.
interpretation component must therefore be able to rec- retain many of these homographs (phrases like ”yes”,
ognize any song or artist name mentioned by the user “something like that” and ”stop”), which are particularly
in order to select matching playlists to recommend and common in conversational responses, and exclude many
to avoid recommending playlists that do not match the genuine references to entities. Moreover, we wish to use
user’s request. gazetteers precisely because they will help generalize
be</p>
        <p>
          This paper explores diferent methods to extract value yond the training data, especially for low-context inputs,
from gazetteers enriched with popularity information like an artist name on its own.
about songs, artists and albums. In all cases, we add These works also either did not use pre-trained
lantoken-level features indicating the presence or absence guage models [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7, 8, 11</xref>
          ] or did not fine-tune the weights
of that token (or sequence of tokens) within a gazetteer. of the language models [10, 9]. Large pre-trained
lanWe vary the preprocessing applied to the gazetteers and guage models based on the transformer architecture [12]
show that neither full gazetteers nor gazetteers filtered have achieved state-of-the-art results across a variety of
to include only the most popular entities outperform a natural language processing tasks [13] but successfully
baseline gazetteer-free model. However, after a more integrating gazetteers remains elusive.
careful filtering of entities, adding a gazetteer does help In these prior works, the gazetteers used were all flat
the model to robustly extract music entity names. In lists of entity names, and so the systems could only
condoing so, the model improves its ability to classify a user’s sider the surface form of each entity (i.e. any string
overall intent. matching the name of the entity, regardless of the
intended referent of that string). Oramas et al introduces
a framework to leverage the popularity of each
associ2. Background and prior work ated entity to distinguish between ambiguous and
nonambiguous names [14]. For each entity, they compute a
2.1. Named entity recognition ratio between the rank of the entity’s popularity and the
rank of the number of occurrences of its surface form in
        </p>
      </sec>
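        <p>To make the rank-based ratio concrete, the sketch below computes Equation 1 for a toy gazetteer. The entity names and counts are hypothetical, and the helper reflects our reading of [14] rather than their implementation.</p>
        <preformat>
# Toy illustration of the rank ratio in Equation 1 (hypothetical counts).

def ranks(counts):
    """Map each entity to a 1-based rank; the highest count gets rank 1."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {entity: i + 1 for i, entity in enumerate(ordered)}

popularity = {"dancing queen": 900, "stop": 40, "yes": 10}     # play counts
frequency = {"dancing queen": 300, "stop": 2000, "yes": 5000}  # corpus counts

pop_rank, freq_rank = ranks(popularity), ranks(frequency)

for entity in popularity:
    r = freq_rank[entity] / pop_rank[entity]  # Equation 1
    # Small r: the surface form occurs far more often than its popularity
    # predicts, so matches of it are likely to be spurious.
    print(entity, round(r, 2))  # 'yes' scores 0.33, 'dancing queen' 3.0
        </preformat>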
      <sec id="sec-1-2">
        <title>Gazetteers were common in pre-neural NER architec</title>
        <p>
          tures: indeed, Mikheev et al in 1999 was notable for
doing it without gazetteers [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Their use has fallen out of fashion with the recent
dominance of large pre-trained language models for NER,
since these models can better leverage contextual
information to detect entity mentions [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. More recent work
has demonstrated that gazetteers can still improve NER
performance with neural architectures, especially where
training data is limited [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7, 8, 9, 10, 11</xref>
          ].
        </p>
        <p>
          However, the improved performance of modern NER
models exposes the noise in gazetteers: Magnolini et
al showed that filtering rarely-occurring values from
large gazetteers boosts performance more than using
the unfiltered gazetteer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. But in the music domain,
the noise comes principally from linguistic ambiguity:
entity names can be homographs of non-entity words
and phrases. Filtering based on corpus frequency would
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Named entity recognition (NER) is the task of associating</title>
        <p>each word in a sentence with a label indicating its type. their corpus.</p>
        <p>In typical settings, the type may be a person, a location,
or an organization. In our domain, we are interested in  () =   () (1)
music entities: artist names, song titles and album names.    ()</p>
        <p>In practice we refer to tokens instead of words, allow- Mentions of entities that occur more frequently in their
ing for rare words to be split into subwords to limit the corpus than would be expected based on their popularity
vocabulary size necessary to cover the entire dataset. For rank (i.e.  () is small) are likely to be spurious matches.
example, ‘ed sheeran’ is represented as the three tokens They use this to automatically label a training set: some
‘ed’, ‘sheer ’ and ‘an’. We hope to train a model that asso- entities can be confidently labeled as songs or artists,
ciates all three tokens with the artist_name tag. some – like “Could You”, “Play Music” and “Xmas” –
are ignored, and inputs containing potentially-confusing
2.2. Gazetteers for NER mentions like “Country Joe” and “Spanish House” are
excluded entirely. However, this informs only the dataset
generation; their model does not have access to the
underlying popularities or the rank.</p>
        <p>Meng et al propose a mixture of experts model for
NER that directly models how much weight to give to
features derived from the context (using a BERT encoder)
and from gazetteers (using a BiLSTM built on gazetteer
matches) for each token. This substantially improves
performance on their datasets, but still relies exclusively
on linguistic features [15].</p>
        <p>In this work, we show that filtering gazetteers using a
similar formula to Equation 1 [14] allows an NER model
to leverage gazetteer information. We compare against
two other preprocessing methods which both degrade
performance.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
      <sec id="sec-2-1">
        <title>3.1. Corpus</title>
        <p>We train our model on historical user utterances. The
data are labeled using a hand-crafted set of grammatical
rules designed to match the most frequently-occurring
utterances. The rules consist of a pattern and allowed
slot values. For example, the pattern “just play &lt;artist-name&gt;” includes an &lt;artist-name&gt; slot; this slot is associated with a list of popular artists. The combination of pattern and slot list allows us to match “just play the beatles” and “just play rihanna”, but not “just play trivial pursuit”, since trivial pursuit is a game and not an artist.</p>
        <p>In practice, we abstract common phrases and nest rules: &lt;negative-trigger&gt; &lt;artist-name&gt; matches a predetermined set of negative trigger phrases like “not”, “i don’t like”, etc, along with an artist name.</p>
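        <p>A minimal sketch of this kind of pattern-plus-slot-list matching; the pattern, slot list and values below are hypothetical simplifications:</p>
        <preformat>
import re

# Hypothetical slot list; the real lists contain only popular entities.
ARTISTS = {"the beatles", "rihanna"}

# Pattern for "just play &lt;artist-name&gt;": capture the slot, then check
# the captured value against the allowed slot values.
PATTERN = re.compile(r"just play (?P&lt;artist&gt;.+)")

def interpret(utterance):
    m = PATTERN.fullmatch(utterance)
    if m and m.group("artist") in ARTISTS:
        return {"artist-name": m.group("artist")}
    return None  # either the pattern or the slot value failed

print(interpret("just play rihanna"))          # {'artist-name': 'rihanna'}
print(interpret("just play trivial pursuit"))  # None: not a known artist
        </preformat>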
        <p>To control the latency of these rules, we limit slot lists
to popular entities. The rules therefore fail on a long tail
of infrequent utterances, either because the utterance
contains an entity outside our canonical lists, or because
the pattern is unusual. We observe that statistical models
trained on these labeled utterances can generalize to long
tail utterances, as proposed in [16].</p>
        <p>The rules cover multiple intents, including standard ones like YesIntent, NoIntent and StopIntent, but also an AddMusicConstraintIntent for when users constrain recommended music entities. Slot types include entity types as well as trigger phrases that indicate negation, instructions to go immediately to playback, and so on.</p>
        <p>To evaluate the model’s ability to discriminate between spurious mentions of entity names, we hand label independently-collected validation and test sets where each utterance contains a substring included in the song title gazetteer (see Section 3.2). We ignore the 50 most frequently-occurring spurious matches (e.g. “play”, “just”, “yeah”, etc) so that approximately 50% of utterances express an AddMusicConstraintIntent, and about 50% of these AddMusicConstraintIntent utterances contain a true reference to a specific song. Only AddMusicConstraintIntent utterances can include song titles. These titles include some that were misremembered or that contain voice recognition errors.</p>
        <p>Our grammatical rules can interpret about 30% of
utterances in the test set and we consider these to be in
domain; the remaining 70% test the model’s
generalization capability, either to new utterance patterns or to
novel entities.</p>
        <p>The training, validation and test datasets are fixed for
all experiments.</p>
        <sec id="sec-2-1-1">
          <title>We use gazetteers derived from historical single-turn</title>
          <p>utterances that expressed a music request to a voice
assistant. Since these utterances are well-structured (usually
of the form “play X”), the entity name can be extracted
and associated with an entity type by an entity resolution
system.</p>
          <p>These gazetteers can include entities with user or voice
recognition errors as long as the entity resolution system
was able to resolve them to a canonical entity.</p>
          <p>As such, the gazetteers consist of multiple strings
corresponding to the same entity: e.g. “blink one eighty two”
and “blink one eight two” both appear in our gazetteer,
even though the canonical name is “Blink 182”.</p>
          <p>Each entity in a gazetteer is augmented with the
number of times it was requested (which we refer to as its
popularity).</p>
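        <p>Schematically, each gazetteer can be thought of as a mapping from surface form to request count; the entries and counts below are invented for illustration.</p>
        <preformat>
# Hypothetical gazetteer fragment: surface forms (including user and
# voice recognition variants) mapped to request counts ('popularity').
artist_gazetteer = {
    "blink one eighty two": 41_200,  # canonical entity: "Blink 182"
    "blink one eight two": 3_750,    # ASR variant of the same entity
    "ed sheeran": 98_000,
}
song_gazetteer = {
    "dancing queen": 31_000,
}
        </preformat>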
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <sec id="sec-3-1">
        <title>4.1. Candidate entity matching</title>
        <sec id="sec-3-1-1">
          <title>Before passing a user’s utterance to the model, we must</title>
          <p>ifrst determine which gazetteer entries appear in it. Note
that at this stage, we do not distinguish between true
positives (the user was referring to the entity) and false
positives (spurious matches like ‘yes’, ‘play’, etc).</p>
          <p>We find these candidates using a regex search for each
gazetteer entry, enforcing that the match must terminate
at whitespace or at the beginning or end of the utterance.</p>
          <p>Next, we associate all of the tokens in an utterance with
the entity type of the candidate. We summarize this
information with a binary vector for each word, where each
dimension corresponds to one of the gazetteers (artist
name, song title, album name). See Table 1 for an
example.</p>
          <p>We add a dimension to this vector with its value fixed
to 1 as a bias term. This entry can be thought of as the
no entity dimension, which captures the possibility that
the gazetteer matches were false positives.</p>
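        <p>A minimal sketch of the matching and feature construction, using token n-grams in place of the regex search and a hypothetical gazetteer:</p>
        <preformat>
# Sketch of candidate matching and the per-token binary features
# (token n-grams stand in for the regex search; gazetteer contents
# are hypothetical).
GAZETTEERS = {
    "artist": {"queen"},
    "song": {"dancing queen", "yes"},
    "album": set(),
}
TYPES = ["artist", "song", "album"]

def gazetteer_vectors(utterance):
    """Return one binary vector per token: [no-entity bias, artist, song, album]."""
    tokens = utterance.split()
    vectors = [[1, 0, 0, 0] for _ in tokens]  # bias dimension fixed to 1
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            span = " ".join(tokens[i:j])
            for d, entity_type in enumerate(TYPES, start=1):
                if span in GAZETTEERS[entity_type]:
                    for k in range(i, j):  # flag every token in the candidate
                        vectors[k][d] = 1
    return tokens, vectors

print(gazetteer_vectors("play dancing queen"))
# 'dancing' and 'queen' get the song dimension; 'queen' also the artist one.
        </preformat>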
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Gazetteer filtering</title>
        <sec id="sec-3-2-1">
          <title>In this work, we use a straightforward technique to in</title>
          <p>corporate the summary gazetteer vectors into our
baseline model (see Section 4.3). The baseline model does
not have access to this vector, and so we can evaluate
whether gazetteers improve or degrade performance.</p>
          <p>We compare three methods to preprocess the
gazetteers: Figure 1: Gazetteer features are computed as the sum of
as</p>
          <p>
            First, we use the full, unfiltered gazetteer. Following sociated ‘ingredient’ embeddings. Here, for example, ‘queen’
Magnolini et al [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], we expect the noise introduced by appears in the artist and song title gazetteers (via the artist
false positive matches to outweigh any information pro- “Queen” and the song “Dancing Queen”), so we take the
avervided by the true positives. age of those along with the no entity embedding.
          </p>
          <p>Second, we filter out the least popular entities in the
gazetteer by thresholding the popularity (see Section 3.2).</p>
          <p>This is equivalent to using a shorter collection window
to gather candidates. We expect that this does little to
exclude ambiguous entities.</p>
          <p>Third, we threshold the ratio between the number of
occurrences of each entity’s surface form in our
training corpus and its popularity, similar to Oramas et al
[14]. While Oramas et al used ranks, we use raw counts
to capture the assumption that the number of genuine
mentions of an entity is proportional to the underlying
popularity of that entity. We call our version  ̂ to avoid
confusion:
name, song title, album name). Each token in the
utterance is represented as the average of the ingredient
embeddings for the entity types matched which that token
appears in a candidate. Note that every token receives
the no entity embedding as a bias term. This is illustrated
in Figure 1.</p>
          <p>We concatenate these gazetteer embeddings with the
BERT output embeddings and add a single transformer
encoder layer (i.e. self-attention with position
embeddings and a fully-connected output layer) so that the
gazetteer information can be shared among all the
topopularity() kens.
 (̂ ) = mention_frequency() (2) The [ C L S ] token, which represents the entire utterance,
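        <p>A sketch of this filter under hypothetical counts; filtering out a fixed percentage of the gazetteer corresponds to keeping the entities with the largest r̂ (Equation 2):</p>
        <preformat>
# Sketch of r̂-filtering (Equation 2) over hypothetical counts.
popularity = {"dancing queen": 900, "bad habits": 450,
              "stop": 40, "yes": 10}             # request counts
mention_frequency = {"dancing queen": 300, "bad habits": 120,
                     "stop": 2000, "yes": 5000}  # corpus occurrences

def r_hat(entity):
    # Raw counts rather than ranks: a new entity can be scored
    # immediately, with no re-ranking of the gazetteer.
    return popularity[entity] / mention_frequency[entity]

def filter_gazetteer(entities, drop_fraction=0.75):
    """Keep the share of entities with the largest r̂ after dropping
    drop_fraction of the gazetteer (the hyperparameter above)."""
    ordered = sorted(entities, key=r_hat, reverse=True)
    keep = max(1, round(len(ordered) * (1 - drop_fraction)))
    return set(ordered[:keep])

print(filter_gazetteer(popularity))  # {'bad habits'}: highest r̂ survives
        </preformat>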
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Model</title>
        <p>To understand a user’s utterance, we need to predict the user’s intent (classification) and label any entities they mentioned (NER). We start with a standard BERT-base model [17], pretrained on book_corpus_wiki_en_uncased (from https://nlp.gluon.ai/model_zoo/bert/index.html).</p>
        <p>To represent information from the gazetteers, we start by randomly initializing four 64-dimensional ‘ingredient’ embeddings corresponding to the four-dimensional gazetteer vector described in Section 4.1 (no entity, artist name, song title, album name). Each token in the utterance is represented as the average of the ingredient embeddings for the entity types of the candidates in which that token appears. Note that every token receives the no entity embedding as a bias term. This is illustrated in Figure 1.</p>
        <p>Figure 1: Gazetteer features are computed as the sum of associated ‘ingredient’ embeddings. Here, for example, ‘queen’ appears in the artist and song title gazetteers (via the artist “Queen” and the song “Dancing Queen”), so we take the average of those along with the no entity embedding.</p>
        <p>We concatenate these gazetteer embeddings with the BERT output embeddings and add a single transformer encoder layer (i.e. self-attention with position embeddings and a fully-connected output layer) so that the gazetteer information can be shared among all the tokens. The [CLS] token, which represents the entire utterance, receives the average embedding taken over all tokens in the utterance.</p>
        <p>The outputs of the final transformer layer are passed to prediction heads for each token. The [CLS] token predicts the user intent, and the remaining tokens predict their own entity type label (or OTHER). Note that each token can have only one label, and the utterance is associated with exactly one intent. This architecture is illustrated in Figure 2. The baseline model is identical, except that nothing is concatenated with the BERT outputs.</p>
        <p>The model is fine-tuned using cross-entropy with label smoothing [18], where the total loss is the sum of the classification loss and the slot tagging loss for each token. We update all parameters during fine-tuning, including the gazetteer ‘ingredients’.</p>
        <p>This architecture resembles the joint intent classification and slot filling model introduced in Chen et al [19], except for the gazetteer embeddings, the additional transformer encoder layer, and the use of label smoothing. The first two of these additions provide a method to fuse gazetteer information into the model before the prediction heads. Label smoothing helps restrain the model’s overconfidence on ‘easy’ examples, resulting in more robust performance on utterances outside the training distribution [18].</p>
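        <p>A condensed PyTorch-style sketch of this architecture: the dimensions (four ingredients, 64-dimensional embeddings, BERT-base) follow the text, while the class and parameter names are ours and the encoder is assumed to expose a HuggingFace-style last_hidden_state.</p>
        <preformat>
import torch
import torch.nn as nn

class GazetteerJointModel(nn.Module):
    """Sketch: BERT token embeddings concatenated with averaged gazetteer
    'ingredient' embeddings, fused by one extra transformer encoder layer."""

    def __init__(self, bert, n_intents, n_tags, gaz_dims=4, gaz_width=64):
        super().__init__()
        self.bert = bert  # pre-trained BERT-base encoder (768-dim outputs)
        self.ingredients = nn.Parameter(torch.randn(gaz_dims, gaz_width))
        width = 768 + gaz_width
        self.fuser = nn.TransformerEncoderLayer(d_model=width, nhead=8,
                                                batch_first=True)
        self.intent_head = nn.Linear(width, n_intents)  # applied to [CLS]
        self.tag_head = nn.Linear(width, n_tags)        # applied per token

    def forward(self, input_ids, gaz_vectors):
        # gaz_vectors: (batch, seq, 4) float matrix from Section 4.1.
        h = self.bert(input_ids).last_hidden_state     # (batch, seq, 768)
        gaz = gaz_vectors @ self.ingredients           # sum of ingredients...
        gaz = gaz / gaz_vectors.sum(-1, keepdim=True)  # ...then the average
        fused = self.fuser(torch.cat([h, gaz], dim=-1))
        # Position 0 is [CLS]; the paper additionally gives it the average
        # gazetteer embedding over all tokens, which this sketch omits.
        return self.intent_head(fused[:, 0]), self.tag_head(fused[:, 1:])
        </preformat>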
        <p>Aside from the percentage of each gazetteer to filter out (in the popularity-filtered and r̂-filtered experiments), we do not conduct any hyperparameter selection. We find in both cases that filtering out 75% of the gazetteer gives the best performance on the validation set. For other hyperparameters, we use values that previously performed well with a simplified baseline model that does not include the final transformer layer: since the utterances are typically short, we truncate them to 16 tokens (this affects fewer than 0.1% of utterances), use a batch size of 128, a label smoothing ε = 0.1, and train for 10,000 updates. We checkpoint every 100 updates and choose the version of the model that achieved the highest intent classification F1 score on the validation set. Other hyperparameters follow those in Chen et al [19].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>For each experiment, we evaluate the model’s ability</title>
        <p>to discriminate A d d M u s i c C o n s t r a i n t I n t e n t from other
intents, and its ability to extract correct song titles. Song
title detection is particularly challenging for a
conversational music recommender due to song titles’ variability,
cardinality and resemblance to normal speech.</p>
      <p>We report this metric using the SemEval ‘strict’ methodology [20]. That means the span must exactly match the annotated span to be counted as a true positive; predicting the wrong span counts as both a false positive (the incorrectly-predicted span) and a false negative (the missed prediction). We choose this metric because we require substantially-complete predictions for the downstream entity resolver to associate the span with the correct entity. A simple token-by-token evaluation showed similar differences between models.</p>
      <p>[Table 2: relative change in performance versus the gazetteer-free baseline for the Full, Popularity-filtered and r̂-filtered gazetteers. Only the precision columns survived extraction: −2.72%, −3.18% and +3.60% in one panel, and +0.21%, −0.21% and +0.25% in the other.]</p>
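      <p>For concreteness, a minimal sketch of strict span matching (our illustration of [20], not the evaluation code used here):</p>
      <preformat>
def strict_prf(gold_spans, pred_spans):
    """SemEval 'strict' matching: a predicted (start, end, type) span is a
    true positive only if it exactly equals an annotated span."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold.intersection(pred))
    fp = len(pred.difference(gold))  # includes wrong-boundary predictions
    fn = len(gold.difference(pred))  # the missed gold span is counted too
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A boundary error is punished twice, one false positive plus one false
# negative, so no partial credit is awarded.
print(strict_prf({(2, 4, "song")}, {(2, 3, "song")}))  # (0.0, 0.0, 0.0)
      </preformat>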
      <p>Table 2 shows the results of our experiments. As expected, we observe that using the full gazetteers increases the recall of song titles at the expense of precision, resulting in a drop in F1 score of 1.17%. Filtering based on popularity seems to exaggerate these differences, further diminishing precision but boosting recall even more, presumably because the model becomes too trusting of information from the gazetteers, which still include spurious matches. The overall effect is that F1 dropped by slightly less: 0.96% from the baseline model.</p>
      <p>Filtering based on the ratio r̂(e) addresses this issue. Common-but-spurious mentions are now excluded from the gazetteer, leaving a cleaner gazetteer that contains unambiguous entities, and which results in improved precision and recall and an overall increase in F1 of 3.70%.</p>
      <p>These results seem to be correlated with intent classification performance, with the worst song title detection F1 corresponding to the worst intent classification F1 (full gazetteers), and the best with the best (r̂-filtered). This is to be expected: correctly recognizing the presence or absence of a song title (or artist name) makes distinguishing intents easier.</p>
      <p>Figure 3 shows that the model with access to the r̂-filtered gazetteer learns most quickly. The full and popularity-filtered gazetteers give an early boost to F1, when model performance is poor, but are quickly overtaken by the baseline model without gazetteers. This supports our hypothesis that information from noisy gazetteers helps weak models, but when the model is better able to leverage contextual cues, the noise begins to dominate any signal they provide. The model would by now perform better by ignoring the information, but it may have approached a local minimum in the loss surface from which it cannot escape, resulting in poorer performance at convergence (as shown in Table 2).</p>
      <p>Table 3 shows some example user inputs that highlight how gazetteers help the model. Each utterance is shown with the song title predicted by the model learned under each experiment. While all the models are usually able to detect the presence of a song title, only the model trained using the r̂-filtered gazetteers is able to reliably detect the boundaries of the mention.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Limitations and future work</title>
      <sec id="sec-5-1">
        <title>We note that this work only considers utterances in En</title>
        <p>glish. The technique described here should apply to other
languages, but in some, whitespace cannot be used to
delimit entities, making candidate matching more
challenging.</p>
      <p>We only briefly experimented with the impact of changes to the gazetteer after model training (e.g. due to new releases or the changing popularity of existing releases). While these initial results are promising, we would want to conduct more thorough research to evaluate how predictions are affected.</p>
      <p>We have also not explored the impact of false negatives (i.e. real entities not matched in the gazetteers, either because the entity is not sufficiently popular, because it has been recently released, or due to a user or voice recognition error). Our evaluation shows an overall improvement in precision and recall, but there may be individual cases where the baseline model better leverages contextual clues to predict entity mentions. Randomly dropping out gazetteer features during training (i.e. replacing a 1 with a 0 in the gazetteer vector described in Section 4.1 some fraction of the time) might force a model to learn how to use gazetteer features where available, but to continue to attend to contextual information otherwise, further improving overall performance.</p>
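      <p>Such gazetteer-feature dropout might look like the following sketch (a suggestion, not an implemented component of this work):</p>
      <preformat>
import torch

def drop_gazetteer_features(gaz_vectors, p=0.1):
    """Randomly zero each gazetteer match indicator with probability p
    (never the bias dimension), forcing the model to also use context."""
    keep = torch.bernoulli(torch.full_like(gaz_vectors, 1.0 - p))
    keep[..., 0] = 1.0  # the 'no entity' bias dimension stays at 1
    return gaz_vectors * keep
      </preformat>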
      <p>This work was evaluated on manually-annotated offline datasets, but we have planned an A/B test to measure the downstream impact of improved NER performance on the rate with which users accept the system’s recommendations. We expect to see an improvement corresponding to the system’s ability to correctly interpret our users’ wishes.</p>
      <p>In future work, we intend to fuse popularity and r̂ directly into the model, rather than using them to filter the gazetteers. Incorporating the ratio r̂ into the model as a feature would allow it to attend more heavily to gazetteer features where the entity is unambiguous, and use contextual cues to disambiguate less obvious examples. It also avoids introducing an arbitrary cut-off: small values of r̂ would be almost, but not quite, equivalent to excluding the entity entirely. We hope that such an approach will yield further improvements and be a step towards a general approach to integrating gazetteers with pre-trained transformers.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>In this paper, we demonstrate that a rather simple
architecture with carefully filtered gazetteers can greatly
improve NER performance in a conversational
recommendation system for the music domain. By
augmenting gazetteers with information about the underlying
likelihood of a mention of each entity, the models can
avoid false positives, and are better able to rely on large
gazetteers.</p>
      <p>This finding could apply to other domains where large
gazetteers are common and where relevant frequency
information is available. Examples might include place
names along with their populations, or diseases with the
number of diagnoses mentioned in discharge notes.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <title>The authors would like to thank Tao Ye, Justin Hugues</title>
        <p>Nuger, Chelsea Weaver and Vlad Magdin for their help
through conversations regarding the evaluation,
technical implementation and presentation of this work. We
also thank the reviewers for their valuable comments.
tity Recognition with Neural Models, in: Pro- tant supervision for relation extraction without
ceedings of the 5th Workshop on Semantic labeled data, in: Proceedings of the Joint
ConDeep Learning (SemDeep-5), 2019, pp. 40–49. ference of the 47th Annual Meeting of the ACL
URL: https://github.com/XuezheMax/NeuroNLP2% and the 4th International Joint Conference on
Nat0Ahttps://www.aclweb.org/anthology/W19-5807. ural Language Processing of the AFNLP, 2009, pp.
[8] T. Liu, J. G. Yao, C. Y. Lin, Towards improving 1003–1011. URL: https://aclanthology.org/P09-1113.
neural named entity recognition with gazetteers, doi:1 0 . 3 1 1 5 / 1 6 9 0 2 1 9 . 1 6 9 0 2 8 7 .
in: Proceedings ofthe 57th Annual Meeting ofthe [17] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT:
Association for Computational Linguistics, 2019, pp. Pre-training of deep bidirectional transformers for
5301–5307. doi:1 0 . 1 8 6 5 3 / v 1 / p 1 9 - 1 5 2 4 . language understanding, NAACL HLT 2019 - 2019
[9] C. H. Song, D. Lawrie, T. Finin, J. Mayfield, Im- Conference of the North American Chapter of the
proving Neural Named Entity Recognition with Association for Computational Linguistics: Human
Gazetteers, in: The 33rd International FLAIRS Con- Language Technologies - Proceedings of the
Conference, 2020, p. 8. URL: https://arxiv.org/abs/2003. ference 1 (2019) 4171–4186. a r X i v : 1 8 1 0 . 0 4 8 0 5 .
03072. [18] R. Müller, S. Kornblith, G. Hinton, When does label
[10] H. Lin, Y. Lu, X. Han, L. Sun, B. Dong, S. Jiang, smoothing help?, in: Advances in Neural
InformaGazetteer-Enhanced Attentive Neural Networks for tion Processing Systems, 2019. a r X i v : 1 9 0 6 . 0 2 6 2 9 .
Named Entity Recognition, in: Proceedings ofthe [19] Q. Chen, Z. Zhuo, W. Wang, Bert for joint
2019 Conference on Empirical Methods in Natural intent classification and slot filling, 2019.
Language Processing and the 9th International Joint a r X i v : 1 9 0 2 . 1 0 9 0 9 .</p>
        <p>Conference on Natural Language Processing, 2019, [20] D. S. Batista, Named-entity evaluation metrics
pp. 6232–6237. based on entity-level, 2019. URL: http://www.
[11] O. Agarwal, A. Nenkova, The Utility and Inter- davidsbatista.net/.</p>
        <p>play of Gazetteers and Entity Segmentation for
Named Entity Recognition in English, in: Findings
ofthe Association for Computational Linguistics:
ACL-IJCNLP, Association for Computational
Linguistics, 2021, pp. 3990–4002. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 1 .</p>
        <p>f i n d i n g s - a c l . 3 4 9 .
[12] A. Vaswani, N. Shazeer, N. Parmar, J.
Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, I.
Polosukhin, Attention is all you need, in: I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R.
Fergus, S. Vishwanathan, R. Garnett (Eds.),
Advances in Neural Information Processing
Systems, volume 30, Curran Associates, Inc., 2017.</p>
        <p>URL: https://proceedings.neurips.cc/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[13] A. Chernyavskiy, D. Ilvovsky, P. Nakov,
Transformers: ”The End of History” for NLP? (2021). URL:
http://arxiv.org/abs/2105.00813. a r X i v : 2 1 0 5 . 0 0 8 1 3 .
[14] S. Oramas, M. Quadrana, F. Gouyon, P. M. Llc,
Bootstrapping a Music Voice Assistant with Weak
Supervision, in: Proceedings of NAACL HLT 2021:</p>
        <p>Industry Track, 2021, pp. 49–55.
[15] T. Meng, A. Fang, O. Rokhlenko, S. Malmasi,
GEM</p>
        <p>NET: Efective Gated Gazetteer Representations
for Recognizing Complex Entities in Low-context
Input, in: Proceedings of the 2021 Conference
of the North American Chapter of theAssociation
for Computational Linguistics: Human Language
Technologies, Association for Computational
Linguistics, 2021, pp. 1499–1512. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 1 .</p>
        <p>n a a c l - m a i n . 1 1 8 .
[16] M. Mintz, S. Bills, R. Snow, D. Jurafsky,
Dis</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ammari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bentley</surname>
          </string-name>
          , Music, Search, and
          <article-title>IoT: How people (really) use voice assistants</article-title>
          ,
          <source>ACM Transactions on Computer-Human Interaction</source>
          <volume>26</volume>
          (
          <year>2019</year>
          ).
          doi:10.1145/3311956.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nazarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brillman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cramer</surname>
          </string-name>
          , S. Mennicken, ”Play Music”
          <article-title>: User Motivations and Expectations for Non-Specific Voice Queries</article-title>
          , in: 21st
          <source>International Society for Music Information Retrieval Conference</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. de Rijke</surname>
          </string-name>
          , T.-S. Chua,
          <article-title>Advances and challenges in conversational recommender systems: A survey</article-title>
          ,
          <source>AI Open</source>
          <volume>2</volume>
          (<year>2021</year>)
          <fpage>100</fpage>-<lpage>126</lpage>.
          URL: https://doi.org/10.1016/j.aiopen.2021.06.002.
          doi:10.1016/j.aiopen.2021.06.002. arXiv:2101.09459.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikheev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <article-title>Named Entity recognition without gazetteers</article-title>
          ,
          <source>in: Proceedings of EACL '99</source>
          ,
          <year>1999</year>
          , p.
          <fpage>1</fpage>
          .
          doi:10.3115/977035.977037.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Peshterliev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dupuy</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kiss</surname>
          </string-name>
          ,
          <article-title>Self-Attention Gazetteer Embeddings for Named-Entity Recognition</article-title>
          (<year>2020</year>). URL: http://arxiv.org/abs/2004.04060. arXiv:2004.04060.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rijhwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Neubig, J. Carbonell, Soft Gazetteers for
          <article-title>Low-Resource Named Entity Recognition</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8118</fpage>
          -
          <lpage>8123</lpage>
          . doi:10.18653/v1/2020.acl-main.722. arXiv:2005.01866.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Magnolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Piccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Balaraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <article-title>How to Use Gazetteers for Entity Recognition with Neural Models</article-title>
          , in:
          <source>Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>49</lpage>
          . URL: https://www.aclweb.org/anthology/W19-5807.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Liu, J. G. Yao, C. Y. Lin, Towards improving neural named entity recognition with gazetteers, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5301-5307. doi:10.18653/v1/p19-1524.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] C. H. Song, D. Lawrie, T. Finin, J. Mayfield, Improving Neural Named Entity Recognition with Gazetteers, in: The 33rd International FLAIRS Conference, 2020, p. 8. URL: https://arxiv.org/abs/2003.03072.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Lin, Y. Lu, X. Han, L. Sun, B. Dong, S. Jiang, Gazetteer-Enhanced Attentive Neural Networks for Named Entity Recognition, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 6232-6237.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] O. Agarwal, A. Nenkova, The Utility and Interplay of Gazetteers and Entity Segmentation for Named Entity Recognition in English, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP, Association for Computational Linguistics, 2021, pp. 3990-4002. doi:10.18653/v1/2021.findings-acl.349.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers: “The End of History” for NLP? (2021). URL: http://arxiv.org/abs/2105.00813. arXiv:2105.00813.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Oramas, M. Quadrana, F. Gouyon, Bootstrapping a Music Voice Assistant with Weak Supervision, in: Proceedings of NAACL HLT 2021: Industry Track, 2021, pp. 49-55.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] T. Meng, A. Fang, O. Rokhlenko, S. Malmasi, GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2021, pp. 1499-1512. doi:10.18653/v1/2021.naacl-main.118.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003-1011. URL: https://aclanthology.org/P09-1113. doi:10.3115/1690219.1690287.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, NAACL HLT 2019 - Proceedings of the Conference 1 (2019) 4171-4186. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Müller, S. Kornblith, G. Hinton, When does label smoothing help?, in: Advances in Neural Information Processing Systems, 2019. arXiv:1906.02629.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Q. Chen, Z. Zhuo, W. Wang, BERT for joint intent classification and slot filling, 2019. arXiv:1902.10909.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] D. S. Batista, Named-entity evaluation metrics based on entity-level, 2019. URL: http://www.davidsbatista.net/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>