Enhancing gazetteers for named entity recognition in conversational recommender systems

Nicholas Dingwall (1), Vianne R. Gao (2)
(1) Amazon, 525 Market St, San Francisco, CA 94105
(2) Weill Cornell Medicine, 1300 York Ave, New York, NY 10065 (work conducted during an internship at Amazon)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, September 27 to October 1, 2021, Amsterdam, Netherlands
nickding@amazon.com (N. Dingwall); vrg4001@med.cornell.edu (V. R. Gao)
ORCID: 0000-0003-0026-2740 (N. Dingwall); 0000-0001-8990-0897 (V. R. Gao)

Abstract
Named Entity Recognition (NER) is a crucial building block of a conversational agent, but remains challenging in real-world settings. It is particularly challenging for domains where the entities are linguistically complex and resemble common phrases (e.g. music and movies). While gazetteer features have been shown to improve NER performance, their utility is undermined by pervasive spurious entity matching. We propose a framework for gazetteer knowledge integration that incorporates external knowledge about entity popularity (e.g. a song's play count) to reduce spurious entity matching and improve the robustness of gazetteer features. Our experimental evaluations show that using unfiltered gazetteers degrades performance, but that incorporating external information improves it compared to a baseline model that doesn't use gazetteer information. Further, our framework can efficiently adapt to new entities in gazetteers without additional training, which is crucial for rapidly growing domains like music.

Keywords
natural language understanding, named-entity recognition, gazetteer, conversational recommender, music

1. Introduction

Voice assistants (Siri, Alexa, Google Assistant) are becoming increasingly popular, and music has emerged as a primary use case for them [1, 2]. Without a screen for browsing, conversational recommenders are an appealing avenue to help users navigate their favorite music.

But four factors make identifying mentions of these entities difficult in the music domain. First, there are a lot of songs and artists: thousands of artists release millions of songs each year, and a modern deep learning system must store their names in its weights. Second, song and artist names can often resemble ordinary parts of speech, and so the system must disambiguate genuine references to musical entities from spurious matches. Third, users misremember the titles of songs or use abbreviations to refer to artists, limiting the applicability of canonical data sources. And fourth, new songs are continually being released – some of which immediately achieve their peak popularity – which obliges the owners of a model to regularly retrain it.

Conversational systems make NER even more challenging: while single-turn commands are often well-structured and include indicators that a sequence tagging system can utilize to estimate a prior likelihood that a token represents an entity ("Alexa play X", etc.), conversational responses lack such affordances. The system must also distinguish system-directed speech from background conversation overheard while waiting for a response: "who let the dogs out" is likely to be a song request (Baha Men), but "who let the cats out" is more likely to be a frustrated parent chastising their children. Finally, errors made by users in recalling an entity name or by a voice recognition system, alongside nonstandard spelling of artist and song titles, frustrate attempts at simple string matching against canonical entities [3].
Gazetteers – flat lists of entity names – can provide a source of valid entity names. But incorporating them into modern NER models has proved difficult (see Section 2.2), and the music domain makes their application even more precarious: any song title gazetteer will include common phrases like "yes" (LMFAO), "something like that" (Tim McGraw), and "stop" (Spice Girls), resulting in frequent false positive matches. Nevertheless, incorporating them into models is appealing since they could allow a production system to generalize beyond examples seen during training, and to decouple updates to entity lists from model training.

In this paper, we experiment with utterance data and music domain knowledge data. In the conversational music recommender setting, a user is prompted to specify genres, moods or artists and hears samples of playlists matching the criteria they have provided so far (e.g. including a specified artist). The conversation continues until a sample is accepted, the user requests to play a specific song or artist, or the user explicitly ends the conversation or stops responding. The natural language interpretation component must therefore be able to recognize any song or artist name mentioned by the user in order to select matching playlists to recommend and to avoid recommending playlists that do not match the user's request.

This paper explores different methods to extract value from gazetteers enriched with popularity information about songs, artists and albums. In all cases, we add token-level features indicating the presence or absence of each token (or sequence of tokens) within a gazetteer. We vary the preprocessing applied to the gazetteers and show that neither full gazetteers nor gazetteers filtered to include only the most popular entities outperform a baseline gazetteer-free model. However, after a more careful filtering of entities, adding a gazetteer does help the model to robustly extract music entity names. In doing so, the model improves its ability to classify a user's overall intent.
2. Background and prior work

2.1. Named entity recognition

Named entity recognition (NER) is the task of associating each word in a sentence with a label indicating its type. In typical settings, the type may be a person, a location, or an organization. In our domain, we are interested in music entities: artist names, song titles and album names.

In practice we refer to tokens instead of words, allowing rare words to be split into subwords to limit the vocabulary size necessary to cover the entire dataset. For example, 'ed sheeran' is represented as the three tokens 'ed', 'sheer' and 'an'. We hope to train a model that associates all three tokens with the artist_name tag.

2.2. Gazetteers for NER

Gazetteers were common in pre-neural NER architectures: indeed, Mikheev et al in 1999 were notable for performing NER without gazetteers [4]. Their use has fallen out of fashion with the recent dominance of large pre-trained language models for NER, since these models can better leverage contextual information to detect entity mentions [5]. More recent work has demonstrated that gazetteers can still improve NER performance with neural architectures, especially where training data is limited [6, 7, 8, 9, 10, 11].

However, the improved performance of modern NER models exposes the noise in gazetteers: Magnolini et al showed that filtering rarely-occurring values from large gazetteers boosts performance more than using the unfiltered gazetteer [7]. But in the music domain, the noise comes principally from linguistic ambiguity: entity names can be homographs of non-entity words and phrases. Filtering based on corpus frequency would retain many of these homographs (phrases like "yes", "something like that" and "stop"), which are particularly common in conversational responses, and exclude many genuine references to entities. Moreover, we wish to use gazetteers precisely because they will help generalize beyond the training data, especially for low-context inputs, like an artist name on its own.

These works also either did not use pre-trained language models [6, 7, 8, 11] or did not fine-tune the weights of the language models [10, 9]. Large pre-trained language models based on the transformer architecture [12] have achieved state-of-the-art results across a variety of natural language processing tasks [13], but successfully integrating gazetteers remains elusive.

In these prior works, the gazetteers used were all flat lists of entity names, and so the systems could only consider the surface form of each entity (i.e. any string matching the name of the entity, regardless of the intended referent of that string). Oramas et al introduce a framework that leverages the popularity of each associated entity to distinguish between ambiguous and non-ambiguous names [14]. For each entity, they compute a ratio between the rank of the entity's popularity and the rank of the number of occurrences of its surface form in their corpus:

    r(e) = popularityRank(e) / frequencyRank(e)    (1)

Mentions of entities that occur more frequently in their corpus than would be expected based on their popularity rank (i.e. r(e) is small) are likely to be spurious matches. They use this to automatically label a training set: some entities can be confidently labeled as songs or artists, some – like "Could You", "Play Music" and "Xmas" – are ignored, and inputs containing potentially-confusing mentions like "Country Joe" and "Spanish House" are excluded entirely. However, this informs only the dataset generation; their model does not have access to the underlying popularities or the ranks.

Meng et al propose a mixture-of-experts model for NER that directly models how much weight to give to features derived from the context (using a BERT encoder) and from gazetteers (using a BiLSTM built on gazetteer matches) for each token. This substantially improves performance on their datasets, but still relies exclusively on linguistic features [15].

In this work, we show that filtering gazetteers using a formula similar to Equation 1 [14] allows an NER model to leverage gazetteer information. We compare against two other preprocessing methods, which both degrade performance.
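To make the rank-ratio in Equation 1 concrete, the sketch below computes r(e) for a few invented surface forms. The rank direction is chosen here so that a surface form mentioned in the corpus far more often than its popularity would suggest receives a small r(e), matching the description above; the dictionaries, function names and rank convention are our own illustration rather than Oramas et al's implementation.

```python
# Illustration of the rank-ratio in Equation 1. Ranks are assigned in
# ascending order of count (rank 1 = smallest count) so that over-frequent,
# likely-ambiguous surface forms receive a small r(e); this convention is an
# assumption made for the example.

def ascending_rank(counts):
    """Map each key to its 1-based rank, smallest count first."""
    ordered = sorted(counts, key=counts.get)
    return {name: i + 1 for i, name in enumerate(ordered)}

def rank_ratio(popularity, corpus_frequency):
    pop_rank = ascending_rank(popularity)
    freq_rank = ascending_rank(corpus_frequency)
    return {e: pop_rank[e] / freq_rank[e] for e in popularity if e in freq_rank}

# Invented counts: play counts vs. occurrences of the surface form in a corpus.
popularity = {"dancing queen": 90_000, "yes": 4_000, "stop": 3_000}
corpus_frequency = {"dancing queen": 1_200, "yes": 50_000, "stop": 40_000}
print(rank_ratio(popularity, corpus_frequency))
# dancing queen: 3.0, yes: ~0.67, stop: 0.5 -- the ambiguous surface forms get small ratios.
```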
3. Datasets

3.1. Corpus

We train our model on historical user utterances. The data are labeled using a hand-crafted set of grammatical rules designed to match the most frequently-occurring utterances. The rules consist of a pattern and allowed slot values. For example, the pattern "just play <artist-name>" includes an <artist-name> slot; this slot is associated with a list of popular artists. The combination of pattern and slot list allows us to match "just play the beatles" and "just play rihanna", but not "just play trivial pursuit", since trivial pursuit is a game and not an artist.

In practice, we abstract common phrases and nest rules: "<negative-trigger> <artist-name>" matches a predetermined set of negative trigger phrases like "not", "i don't like", etc., along with an artist name.

To control the latency of these rules, we limit slot lists to popular entities. The rules therefore fail on a long tail of infrequent utterances, either because the utterance contains an entity outside our canonical lists, or because the pattern is unusual. We observe that statistical models trained on these labeled utterances can generalize to long tail utterances, as proposed in [16].

The rules cover multiple intents, including standard ones like YesIntent, NoIntent and StopIntent, but also an AddMusicConstraintIntent for when users constrain recommended music entities. Slot types include entity types as well as trigger phrases that indicate negation, instructions to go immediately to playback, and so on.

To evaluate the model's ability to discriminate between spurious mentions of entity names, we hand-label independently-collected validation and test sets where each utterance contains a substring included in the song title gazetteer (see Section 3.2). We ignore the 50 most frequently-occurring spurious matches (e.g. "play", "just", "yeah", etc.) so that approximately 50% of utterances express an AddMusicConstraintIntent, and about 50% of these AddMusicConstraintIntent utterances contain a true reference to a specific song. Only AddMusicConstraintIntent can include song titles. These titles include some that were misremembered or that contain voice recognition errors.

Our grammatical rules can interpret about 30% of utterances in the test set and we consider these to be in domain; the remaining 70% test the model's generalization capability, either to new utterance patterns or to novel entities.

The training, validation and test datasets are fixed for all experiments.
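As a concrete illustration of the pattern-plus-slot-list rules described at the start of this subsection, the following sketch labels an utterance with an intent and slot values. The rule encoding, slot names and tiny entity lists are hypothetical; the production grammar is considerably larger and handles many more patterns.

```python
import re

# Minimal sketch of pattern + slot-list labeling rules. All names and lists
# below are illustrative, not the production grammar.
ARTIST_NAMES = {"the beatles", "rihanna"}           # slot list of popular artists
NEGATIVE_TRIGGERS = {"not", "i don't like"}

RULES = [
    # (intent, regex pattern with named slots, {slot name: allowed values})
    ("AddMusicConstraintIntent",
     r"^just play (?P<artist_name>.+)$",
     {"artist_name": ARTIST_NAMES}),
    ("AddMusicConstraintIntent",
     r"^(?P<negative_trigger>.+) (?P<artist_name>.+)$",
     {"negative_trigger": NEGATIVE_TRIGGERS, "artist_name": ARTIST_NAMES}),
]

def label(utterance):
    """Return (intent, slots) for the first rule whose pattern and slot lists match."""
    for intent, pattern, slot_values in RULES:
        m = re.match(pattern, utterance)
        if m and all(m.group(slot) in allowed for slot, allowed in slot_values.items()):
            return intent, m.groupdict()
    return None  # long-tail utterance: not covered by the rules

print(label("just play rihanna"))          # matched: artist_name is in the slot list
print(label("just play trivial pursuit"))  # None: "trivial pursuit" is not a known artist
```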
3.2. Gazetteers

We use gazetteers derived from historical single-turn utterances that expressed a music request to a voice assistant. Since these utterances are well-structured (usually of the form "play X"), the entity name can be extracted and associated with an entity type by an entity resolution system.

These gazetteers can include entities with user or voice recognition errors as long as the entity resolution system was able to resolve them to a canonical entity. As such, the gazetteers consist of multiple strings corresponding to the same entity: e.g. "blink one eighty two" and "blink one eight two" both appear in our gazetteer, even though the canonical name is "Blink 182".

Each entity in a gazetteer is augmented with the number of times it was requested (which we refer to as its popularity).

4. Methodology

4.1. Candidate entity matching

Before passing a user's utterance to the model, we must first determine which gazetteer entries appear in it. Note that at this stage, we do not distinguish between true positives (the user was referring to the entity) and false positives (spurious matches like 'yes', 'play', etc.).

We find these candidates using a regex search for each gazetteer entry, enforcing that the match must terminate at whitespace or at the beginning or end of the utterance. Next, we associate all of the tokens covered by a candidate with the entity type of that candidate. We summarize this information with a binary vector for each word, where each dimension corresponds to one of the gazetteers (artist name, song title, album name). See Table 1 for an example.

Table 1
Example showing which gazetteer embeddings trigger for each token in the utterance "play dancing queen". In this case, "dancing queen" is recognized as a song, and "queen" as an artist. For simplicity, we assume no other gazetteer entries match the utterance.

  entity type    play   dancing   queen
  artist name     ✗       ✗        ✓
  song title      ✗       ✓        ✓
  album name      ✗       ✗        ✗

We add a dimension to this vector with its value fixed to 1 as a bias term. This entry can be thought of as the "no entity" dimension, which captures the possibility that the gazetteer matches were false positives.
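A minimal sketch of this candidate-matching step, assuming whitespace tokenization; the gazetteer contents, function name and dictionary layout are illustrative. Dimension 0 of each vector is the always-on bias ("no entity") term, and dimensions 1 to 3 correspond to the three gazetteers.

```python
import re

# Toy gazetteers for illustration only.
GAZETTEERS = {
    "artist name": {"queen", "ed sheeran"},
    "song title": {"dancing queen", "yes"},
    "album name": set(),
}
TYPES = list(GAZETTEERS)  # dims 1-3 of the feature vector; dim 0 is the bias

def gazetteer_vectors(utterance):
    words = utterance.split()
    # Character offset at which each word starts, to map regex matches to words.
    starts, pos = [], 0
    for w in words:
        starts.append(pos)
        pos += len(w) + 1
    vectors = [[1, 0, 0, 0] for _ in words]  # dim 0: always-on "no entity" bias
    for dim, gaz_type in enumerate(TYPES, start=1):
        for entry in GAZETTEERS[gaz_type]:
            # Matches must be delimited by whitespace or the utterance boundary.
            for m in re.finditer(rf"(?:^|\s){re.escape(entry)}(?:\s|$)", utterance):
                for i, start in enumerate(starts):
                    if m.start() <= start < m.end():
                        vectors[i][dim] = 1
    return words, vectors

words, vectors = gazetteer_vectors("play dancing queen")
for word, vec in zip(words, vectors):
    print(word, vec)
# play [1, 0, 0, 0]; dancing [1, 0, 1, 0]; queen [1, 1, 1, 0] -- cf. Table 1.
```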
4.2. Gazetteer filtering

In this work, we use a straightforward technique to incorporate the summary gazetteer vectors into our baseline model (see Section 4.3). The baseline model does not have access to this vector, and so we can evaluate whether gazetteers improve or degrade performance. We compare three methods to preprocess the gazetteers.

First, we use the full, unfiltered gazetteer. Following Magnolini et al [7], we expect the noise introduced by false positive matches to outweigh any information provided by the true positives.

Second, we filter out the least popular entities in the gazetteer by thresholding the popularity (see Section 3.2). This is equivalent to using a shorter collection window to gather candidates. We expect that this does little to exclude ambiguous entities.

Third, we threshold the ratio between each entity's popularity and the number of occurrences of its surface form in our training corpus, similar to Oramas et al [14]. While Oramas et al used ranks, we use raw counts to capture the assumption that the number of genuine mentions of an entity is proportional to the underlying popularity of that entity. We call our version r̂ to avoid confusion:

    r̂(e) = popularity(e) / mention_frequency(e)    (2)

This has the practical benefit of allowing new entities to be added without re-ranking.

We rejected a fourth candidate method of thresholding based on the corpus frequency, since the resulting filtered gazetteers preferentially included exactly the entities we wished to exclude, like "yes", "play" and "stop", and excluded entities not mentioned in our corpus, limiting a model's ability to generalize beyond its training data.

Where we filter gazetteers, we treat the percentage to filter out as a hyperparameter (25%, 50% or 75%) and select the model that performed best on the validation set.
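A sketch of this filtering step, assuming the filtered-out fraction is taken from the low-r̂ end of the gazetteer (the surface forms most likely to be spurious); the data structures, function name and example counts are invented.

```python
# Sketch of the r-hat filter from Equation 2. `gazetteer` is a hypothetical list
# of (surface form, popularity) pairs; `mention_frequency` counts occurrences of
# each surface form in the training corpus.

def r_hat_filter(gazetteer, mention_frequency, drop_fraction=0.75):
    scored = []
    for surface, popularity in gazetteer:
        freq = mention_frequency.get(surface, 0)
        # Surface forms unseen in the corpus keep an infinite ratio, so new or
        # rare entities are never filtered out: one benefit of using raw counts.
        r_hat = popularity / freq if freq else float("inf")
        scored.append((r_hat, surface))
    scored.sort()  # smallest ratios (likely spurious surface forms) first
    keep_from = int(drop_fraction * len(scored))
    return {surface for _, surface in scored[keep_from:]}

gazetteer = [("dancing queen", 90_000), ("yes", 4_000),
             ("stop", 3_000), ("cruella de vil", 500)]
mention_frequency = {"dancing queen": 1_200, "yes": 50_000, "stop": 40_000}
print(r_hat_filter(gazetteer, mention_frequency, drop_fraction=0.5))
# {"dancing queen", "cruella de vil"}: the ambiguous surface forms are dropped.
```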
4.3. Model

To understand a user's utterance, we need to predict the user's intent (classification) and label any entities they mentioned (NER). We start with a standard BERT-base model [17], pretrained on book_corpus_wiki_en_uncased (from https://nlp.gluon.ai/model_zoo/bert/index.html).

To represent information from the gazetteers, we start by randomly initializing four 64-dimensional 'ingredient' embeddings corresponding to the four-dimensional gazetteer vector described in Section 4.1 (no entity, artist name, song title, album name). Each token in the utterance is represented as the average of the ingredient embeddings for the entity types of the candidates in which that token appears. Note that every token receives the no entity embedding as a bias term. This is illustrated in Figure 1.

Figure 1: Gazetteer features are computed as the average of associated 'ingredient' embeddings. Here, for example, 'queen' appears in the artist and song title gazetteers (via the artist "Queen" and the song "Dancing Queen"), so we take the average of those along with the no entity embedding.

We concatenate these gazetteer embeddings with the BERT output embeddings and add a single transformer encoder layer (i.e. self-attention with position embeddings and a fully-connected output layer) so that the gazetteer information can be shared among all the tokens. The [CLS] token, which represents the entire utterance, receives the average gazetteer embedding taken over all tokens in the utterance.

The outputs of the final transformer layer are passed to prediction heads for each token. The [CLS] token predicts the user intent, and the remaining tokens predict their own entity type label (or OTHER). Note that each token can have only one label, and the utterance is associated with exactly one intent. This architecture is illustrated in Figure 2.

Figure 2: Model architecture. Contextual embeddings (from the BERT encoder) are concatenated with gazetteer embeddings (see Figure 1), and the resulting representation is passed through a transformer layer to prediction heads for both intent classification (IC) and entity labeling (NER).

The baseline model is identical, except that nothing is concatenated with the BERT outputs.

The model is fine-tuned using cross-entropy with label smoothing [18], where the total loss is the sum of the classification loss and the slot tagging loss for each token. We update all parameters during fine-tuning, including the gazetteer 'ingredients'.

This architecture resembles the joint intent classification and slot filling model introduced in Chen et al [19], except for the gazetteer embeddings, the additional transformer encoder layer, and the use of label smoothing. The first two of these additions provide a method to fuse gazetteer information into the model before the prediction heads. Label smoothing helps restrain the model's overconfidence on 'easy' examples, resulting in more robust performance on utterances outside the training distribution [18].

Aside from the percentage of each gazetteer to filter out (in the popularity-filtered and r̂(e)-filtered experiments), we do not conduct any hyperparameter selection. We find in both cases that filtering out 75% of the gazetteer gives the best performance on the validation set. For other hyperparameters, we use values that previously performed well with a simplified baseline model that does not include the final transformer layer: since the utterances are typically short, we truncate them to 16 tokens (this affects fewer than 0.1% of utterances), use a batch size of 128, a label smoothing α = 0.1, and train for 10,000 updates. We checkpoint every 100 updates and choose the version of the model that achieved the highest intent classification F1 score on the validation set. Other hyperparameters follow those in Chen et al [19].
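A minimal PyTorch sketch of the fusion step described above, assuming the BERT outputs are computed separately. Position embeddings in the extra encoder layer and the label-smoothed loss are omitted, and the class name, intent/slot counts and all layer sizes other than the 64-dimensional ingredients are invented for the example.

```python
import torch
import torch.nn as nn

class GazetteerFusion(nn.Module):
    """Sketch of the fusion in Figure 2: average gazetteer 'ingredient' embeddings,
    concatenate them with BERT outputs, share information via one transformer
    encoder layer, then predict the intent (from [CLS]) and per-token slot labels."""

    def __init__(self, bert_dim=768, gaz_dim=64, num_intents=13, num_slot_labels=7):
        super().__init__()
        # One ingredient per gazetteer dimension: no entity, artist, song, album.
        self.ingredients = nn.Parameter(torch.randn(4, gaz_dim))
        fused_dim = bert_dim + gaz_dim
        layer = nn.TransformerEncoderLayer(d_model=fused_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # position embeddings omitted here
        self.intent_head = nn.Linear(fused_dim, num_intents)       # applied to [CLS]
        self.slot_head = nn.Linear(fused_dim, num_slot_labels)     # applied to every other token

    def forward(self, bert_out, gaz_vectors):
        # bert_out: (batch, seq, bert_dim); gaz_vectors: (batch, seq, 4) binary
        # features from Section 4.1, with position 0 holding the [CLS] token.
        gaz_vectors = gaz_vectors.float()
        counts = gaz_vectors.sum(dim=-1, keepdim=True).clamp(min=1)
        gaz_emb = (gaz_vectors @ self.ingredients) / counts        # average of matched ingredients
        cls_emb = gaz_emb[:, 1:].mean(dim=1, keepdim=True)         # [CLS] gets the utterance average
        gaz_emb = torch.cat([cls_emb, gaz_emb[:, 1:]], dim=1)
        fused = self.encoder(torch.cat([bert_out, gaz_emb], dim=-1))
        return self.intent_head(fused[:, 0]), self.slot_head(fused[:, 1:])
```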
5. Results

For each experiment, we evaluate the model's ability to discriminate AddMusicConstraintIntent from other intents, and its ability to extract correct song titles. Song title detection is particularly challenging for a conversational music recommender due to song titles' variability, cardinality and resemblance to normal speech.

We report this metric using the SemEval 'strict' methodology [20]. That means the span must exactly match the annotated span to be counted as a true positive; predicting the wrong span counts as both a false positive (the incorrectly-predicted span) and a false negative (the missed prediction). We choose this metric because we require substantially-complete predictions for the downstream entity resolver to associate the span with the correct entity. A simple token-by-token evaluation showed similar differences between models.

Table 2
Results of experiments, shown as percentage increases or decreases from the baseline model.

(a) Song title detection
  Gazetteer             Precision   Recall    F1
  None                  -           -         -
  Full                  −2.72%      +0.58%    −1.17%
  Popularity-filtered   −3.18%      +1.57%    −0.96%
  r̂(e)-filtered         +3.60%      +3.82%    +3.70%

(b) Intent classification (AddMusicConstraintIntent)
  Gazetteer             Precision   Recall    F1
  None                  -           -         -
  Full                  +0.21%      −0.23%    −0.05%
  Popularity-filtered   −0.21%      +1.85%    +0.98%
  r̂(e)-filtered         +0.25%      +2.58%    +1.70%

Table 2 shows the results of our experiments. As expected, we observe that using the full gazetteers increases the recall of song titles at the expense of precision, resulting in a drop in F1 score of 1.17%. Filtering based on popularity seems to exaggerate these differences, further diminishing precision but boosting recall even more, presumably because the model becomes too trusting of information from the gazetteers, which still include spurious matches. The overall effect is that F1 dropped by slightly less: 0.96% from the baseline model.

Filtering based on the ratio r̂(e) addresses this issue. Common-but-spurious mentions are now excluded, leaving a cleaner gazetteer that contains unambiguous entities, and which results in improved precision and recall and an overall increase in F1 of 3.70%.

These results seem to be correlated with intent classification performance, with the worst song title detection F1 corresponding to the worst intent classification F1 (full gazetteers), and the best with the best (r̂(e)-filtered). This is to be expected: correctly recognizing the presence or absence of a song title (or artist name) makes distinguishing intents easier.

Figure 3: Song title F1 during training. Results shown every 10% up to 30% of training data, when F1 has begun to converge. Note that actual F1 scores are redacted due to their commercial sensitivity.

Figure 3 shows that the model with access to the r̂(e)-filtered gazetteer learns most quickly. The full and popularity-filtered gazetteers give an early boost to F1, when model performance is poor, but are quickly overtaken by the baseline model without gazetteers. This supports our hypothesis that information from noisy gazetteers helps weak models, but when the model is better able to leverage contextual cues, the noise begins to dominate any signal they provide. The model would by now perform better by ignoring the information, but it may have approached a local minimum in the loss surface from which it cannot escape, resulting in poorer performance at convergence (as shown in Table 2).

Table 3 shows some example user inputs that highlight how gazetteers help the model. Each utterance is shown with the song title predicted by the model learned under each experiment. While all the models are usually able to detect the presence of a song title, only the model trained using the r̂(e)-filtered gazetteers is able to reliably detect the boundaries of the mention.

Table 3
Examples of errors made by models trained with different gazetteer information. The expected song titles are underlined. Note that the model handles over a dozen intents, and so identifying song titles even in somewhat structured utterances (e.g. "X by Y") is nontrivial. Blank cells indicate that no song title was predicted.

  Utterance                      No gazetteers            Full gazetteers
  rolling in the deep            in the deep              in the deep
  play cruella de vil            cruella de
  high voltage
  just the way you are by        the way you are          the way you are
  you dropped the bomb on me     dropped the bomb on me   dropped the bomb on me
  green eyed lady by sugarloaf   eyed lady                eyed lady
  monsters by shinedown

  Utterance                      Popularity-filtered      r̂(e)-filtered
  rolling in the deep            rolling in the deep      rolling in the deep
  play cruella de vil            cruella de vil           cruella de vil
  high voltage                                            high voltage
  just the way you are by                                 just the way you are
  you dropped the bomb on me     dropped the bomb on me   you dropped the bomb on me
  green eyed lady by sugarloaf   eyed lady                green eyed lady
  monsters by shinedown                                   monsters
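For reference, the strict span-level scoring described above can be summarised in a few lines. The (start, end, label) span representation and function name below are our own; the example mirrors the "green eyed lady" row of Table 3, where a boundary error produces both a false positive and a false negative.

```python
# Sketch of SemEval-style "strict" span scoring [20]: a prediction counts as a
# true positive only if its boundaries and label exactly match a gold span.

def strict_prf(gold_spans, predicted_spans):
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)
    fp = len(pred - gold)   # includes wrongly-bounded spans
    fn = len(gold - pred)   # includes gold spans missed because of boundary errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# "green eyed lady by sugarloaf": predicting "eyed lady" (token span 1-3) instead
# of "green eyed lady" (token span 0-3) yields one false positive and one false negative.
gold = [(0, 3, "song_title")]
pred = [(1, 3, "song_title")]
print(strict_prf(gold, pred))  # (0.0, 0.0, 0.0)
```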
6. Limitations and future work

We note that this work only considers utterances in English. The technique described here should apply to other languages, but in some, whitespace cannot be used to delimit entities, making candidate matching more challenging.

We only briefly experimented with the impact of changes to the gazetteer after model training (e.g. due to new releases or changing popularity of existing releases). While these initial results are promising, we would want to conduct more thorough research to evaluate how predictions are affected.

We have also not explored the impact of false negatives (i.e. real entities not matched in the gazetteers, either because the entity is not sufficiently popular, because it has been recently released, or due to a user or voice recognition error). Our evaluation shows an overall improvement in precision and recall, but there may be individual cases where the baseline model better leverages contextual clues to predict entity mentions. Randomly dropping out gazetteer features during training (i.e. replacing a 1 with a 0 in the gazetteer vector described in Section 4.1 some fraction of the time) might force a model to learn how to use gazetteer features where available, but to continue to attend to contextual information otherwise, further improving overall performance.

This work was evaluated on manually-annotated offline datasets, but we have planned an A/B test to measure the downstream impact of improved NER performance on the rate with which users accept the system's recommendations. We expect to see an improvement corresponding to the system's ability to correctly interpret our users' wishes.

In future work, we intend to fuse popularity and r̂(e) directly into the model, rather than using them to filter the gazetteers. Incorporating the ratio r̂(e) into the model as a feature would allow it to attend more heavily to gazetteer features where the entity is unambiguous, and use contextual cues to disambiguate less obvious examples. It also avoids introducing an arbitrary cut-off: small values of r̂(e) would be almost, but not quite, equivalent to excluding the entity entirely. We hope that such an approach will yield further improvements and be a step towards a general approach to integrating gazetteers with pre-trained transformers.

7. Conclusion

In this paper, we demonstrate that a rather simple architecture with carefully filtered gazetteers can greatly improve NER performance in a conversational recommendation system for the music domain. By augmenting gazetteers with information about the underlying likelihood of a mention of each entity, the models can avoid false positives and are better able to rely on large gazetteers.

This finding could apply to other domains where large gazetteers are common and where relevant frequency information is available. Examples might include place names along with their populations, or diseases with the number of diagnoses mentioned in discharge notes.

Acknowledgments

The authors would like to thank Tao Ye, Justin Hugues-Nuger, Chelsea Weaver and Vlad Magdin for their help through conversations regarding the evaluation, technical implementation and presentation of this work. We also thank the reviewers for their valuable comments.

References

[1] T. Ammari, J. Kaye, J. Y. Tsai, F. Bentley, Music, Search, and IoT: How people (really) use voice assistants, ACM Transactions on Computer-Human Interaction 26 (2019). doi:10.1145/3311956.
[2] J. Thom, A. Nazarian, R. Brillman, H. Cramer, S. Mennicken, "Play Music": User Motivations and Expectations for Non-Specific Voice Queries, in: 21st International Society for Music Information Retrieval Conference, 2020.
[3] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100–126. URL: https://doi.org/10.1016/j.aiopen.2021.06.002. doi:10.1016/j.aiopen.2021.06.002. arXiv:2101.09459.
[4] A. Mikheev, M. Moens, C. Grover, Named Entity recognition without gazetteers, in: Proceedings of EACL '99, 1999, p. 1. doi:10.3115/977035.977037.
[5] S. Peshterliev, C. Dupuy, I. Kiss, Self-Attention Gazetteer Embeddings for Named-Entity Recognition (2020). URL: http://arxiv.org/abs/2004.04060. arXiv:2004.04060.
[6] S. Rijhwani, S. Zhou, G. Neubig, J. Carbonell, Soft Gazetteers for Low-Resource Named Entity Recognition, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8118–8123. doi:10.18653/v1/2020.acl-main.722. arXiv:2005.01866.
[7] S. Magnolini, V. Piccioni, V. Balaraman, M. Guerini, B. Magnini, How to Use Gazetteers for Entity Recognition with Neural Models, in: Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5), 2019, pp. 40–49. URL: https://www.aclweb.org/anthology/W19-5807.
[8] T. Liu, J. G. Yao, C. Y. Lin, Towards improving neural named entity recognition with gazetteers, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5301–5307. doi:10.18653/v1/p19-1524.
[9] C. H. Song, D. Lawrie, T. Finin, J. Mayfield, Improving Neural Named Entity Recognition with Gazetteers, in: The 33rd International FLAIRS Conference, 2020, p. 8. URL: https://arxiv.org/abs/2003.03072.
[10] H. Lin, Y. Lu, X. Han, L. Sun, B. Dong, S. Jiang, Gazetteer-Enhanced Attentive Neural Networks for Named Entity Recognition, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 6232–6237.
[11] O. Agarwal, A. Nenkova, The Utility and Interplay of Gazetteers and Entity Segmentation for Named Entity Recognition in English, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP, Association for Computational Linguistics, 2021, pp. 3990–4002. doi:10.18653/v1/2021.findings-acl.349.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[13] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers: "The End of History" for NLP? (2021). URL: http://arxiv.org/abs/2105.00813. arXiv:2105.00813.
[14] S. Oramas, M. Quadrana, F. Gouyon, Bootstrapping a Music Voice Assistant with Weak Supervision, in: Proceedings of NAACL HLT 2021: Industry Track, 2021, pp. 49–55.
[15] T. Meng, A. Fang, O. Rokhlenko, S. Malmasi, GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2021, pp. 1499–1512. doi:10.18653/v1/2021.naacl-main.118.
[16] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011. URL: https://aclanthology.org/P09-1113. doi:10.3115/1690219.1690287.
[17] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL HLT 2019 - Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, 2019, pp. 4171–4186. arXiv:1810.04805.
[18] R. Müller, S. Kornblith, G. Hinton, When does label smoothing help?, in: Advances in Neural Information Processing Systems, 2019. arXiv:1906.02629.
[19] Q. Chen, Z. Zhuo, W. Wang, BERT for joint intent classification and slot filling, 2019. arXiv:1902.10909.
[20] D. S. Batista, Named-entity evaluation metrics based on entity-level, 2019. URL: http://www.davidsbatista.net/.