<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing gazetteers for named entity recognition in conversational recommender systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicholas Dingwall</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vianne R. Gao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>3rd Edition of Knowledge-aware and Conversational Recommender</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Amazon</institution>
          ,
          <addr-line>525 Market St, San Francisco, CA 94105</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Environments (ComplexRec) Joint Workshop @ RecSys 2021</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Named Entity Recognition (NER) is a crucial building block of a conversational agent, but remains challenging in real-world settings. It is particularly challenging for domains where the entities are linguistically complex and resemble common phrases (e.g. music and movies). While gazetteer features have been shown to improve NER performance, their utility is undermined by pervasive spurious entity matching. We propose a framework for gazetteer knowledge integration that incorporates external knowledge about entity popularity (e.g. a song's play count) to reduce spurious entity matching and improve the robustness of gazetteer features. Our experimental evaluations show that using unfiltered gazetteers degrades performance, but that incorporating external information improves it compared to a baseline model that doesn't use gazetteer information. Further, our framework can efficiently adapt to new entities in gazetteers without additional training, which is crucial for rapidly growing domains like music.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language understanding</kwd>
        <kwd>named-entity recognition</kwd>
        <kwd>gazetteer</kwd>
        <kwd>conversational recommender</kwd>
        <kwd>music</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Conversational recommender systems are becoming increasingly popular, and music has emerged as a primary use case for them [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. Without a screen for browsing, conversational recommenders offer a promising avenue to help users navigate their favorite music.</p>
      <p>But four factors make identifying mentions of these entities difficult in the music domain. First, there are a lot of songs and artists: thousands of artists release millions of songs each year, and a modern deep learning system must store their names in its weights. Second, song and artist names can often resemble ordinary parts of speech, and so the system must disambiguate genuine references to musical entities from spurious matches. Third, users misremember the titles of songs or use abbreviations to refer to artists, limiting the applicability of canonical data sources. And fourth, new songs are continually being released – some of which immediately achieve their peak popularity – which obliges the owners of a model to regularly retrain the model.</p>
      <p>Gazetteers – lists of known entity names – are a natural source of external knowledge for this task, but many entries are homographs of common conversational phrases like “something like that” (Tim McGraw) and “stop” (Spice Girls), resulting in frequent false positive matches.</p>
      <sec id="sec-1-1">
        <title>Nevertheless, incorporating them into models is ap</title>
        <p>generalize beyond examples seen during training, and to</p>
        <p>In this paper, we experiment with utterance data and
music domain knowledge data. In the conversational
music recommender setting, a user is prompted to specify
genres, moods or artists and hears samples of playlists
matching the criteria they have provided so far (e.g.
including a specified artist). The conversation continues
until a sample is accepted, the user requests to play a
specific song or artist, or the user explicitly ends the
conversation or stops responding. The natural language</p>
        <p>Conversational systems make NER even more chal- pealing since they could allow a production system to
lenging: while single-turn commands are often
wellstructured and include indicators that a sequence tag- decouple updates to entity lists from model training.
interpretation component must therefore be able to rec- retain many of these homographs (phrases like ”yes”,
ognize any song or artist name mentioned by the user “something like that” and ”stop”), which are particularly
in order to select matching playlists to recommend and common in conversational responses, and exclude many
to avoid recommending playlists that do not match the genuine references to entities. Moreover, we wish to use
user’s request. gazetteers precisely because they will help generalize
be</p>
        <p>
          This paper explores diferent methods to extract value yond the training data, especially for low-context inputs,
from gazetteers enriched with popularity information like an artist name on its own.
about songs, artists and albums. In all cases, we add These works also either did not use pre-trained
lantoken-level features indicating the presence or absence guage models [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7, 8, 11</xref>
          ] or did not fine-tune the weights
of that token (or sequence of tokens) within a gazetteer. of the language models [10, 9]. Large pre-trained
lanWe vary the preprocessing applied to the gazetteers and guage models based on the transformer architecture [12]
show that neither full gazetteers nor gazetteers filtered have achieved state-of-the-art results across a variety of
to include only the most popular entities outperform a natural language processing tasks [13] but successfully
baseline gazetteer-free model. However, after a more integrating gazetteers remains elusive.
careful filtering of entities, adding a gazetteer does help In these prior works, the gazetteers used were all flat
the model to robustly extract music entity names. In lists of entity names, and so the systems could only
condoing so, the model improves its ability to classify a user’s sider the surface form of each entity (i.e. any string
overall intent. matching the name of the entity, regardless of the
intended referent of that string). Oramas et al introduces
a framework to leverage the popularity of each
associ2. Background and prior work ated entity to distinguish between ambiguous and
nonambiguous names [14]. For each entity, they compute a
2.1. Named entity recognition ratio between the rank of the entity’s popularity and the
rank of the number of occurrences of its surface form in
        </p>
      </sec>
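        <p>To make the rank-based ratio concrete, the sketch below computes Equation 1 for a toy gazetteer. The entity names and counts are hypothetical, and the helper reflects our reading of [14] rather than their implementation.</p>
        <preformat>
# Toy illustration of the rank ratio in Equation 1 (hypothetical counts).

def ranks(counts):
    """Map each entity to a 1-based rank; the highest count gets rank 1."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {entity: i + 1 for i, entity in enumerate(ordered)}

popularity = {"dancing queen": 900, "stop": 40, "yes": 10}     # play counts
frequency = {"dancing queen": 300, "stop": 2000, "yes": 5000}  # corpus counts

pop_rank, freq_rank = ranks(popularity), ranks(frequency)

for entity in popularity:
    r = freq_rank[entity] / pop_rank[entity]  # Equation 1
    # Small r: the surface form occurs far more often than its popularity
    # predicts, so matches of it are likely to be spurious.
    print(entity, round(r, 2))  # 'yes' scores 0.33, 'dancing queen' 3.0
        </preformat>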
      <sec id="sec-1-2">
        <title>Gazetteers were common in pre-neural NER architec</title>
        <p>
          tures: indeed, Mikheev et al in 1999 was notable for
doing it without gazetteers [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Their use has fallen out of fashion with the recent
dominance of large pre-trained language models for NER,
since these models can better leverage contextual
information to detect entity mentions [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. More recent work
has demonstrated that gazetteers can still improve NER
performance with neural architectures, especially where
training data is limited [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7, 8, 9, 10, 11</xref>
          ].
        </p>
        <p>
          However, the improved performance of modern NER
models exposes the noise in gazetteers: Magnolini et
al showed that filtering rarely-occurring values from
large gazetteers boosts performance more than using
the unfiltered gazetteer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. But in the music domain,
the noise comes principally from linguistic ambiguity:
entity names can be homographs of non-entity words
and phrases. Filtering based on corpus frequency would
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Named entity recognition (NER) is the task of associating</title>
        <p>each word in a sentence with a label indicating its type. their corpus.</p>
        <p>In typical settings, the type may be a person, a location,
or an organization. In our domain, we are interested in  () =   () (1)
music entities: artist names, song titles and album names.    ()</p>
        <p>In practice we refer to tokens instead of words, allow- Mentions of entities that occur more frequently in their
ing for rare words to be split into subwords to limit the corpus than would be expected based on their popularity
vocabulary size necessary to cover the entire dataset. For rank (i.e.  () is small) are likely to be spurious matches.
example, ‘ed sheeran’ is represented as the three tokens They use this to automatically label a training set: some
‘ed’, ‘sheer ’ and ‘an’. We hope to train a model that asso- entities can be confidently labeled as songs or artists,
ciates all three tokens with the artist_name tag. some – like “Could You”, “Play Music” and “Xmas” –
are ignored, and inputs containing potentially-confusing
2.2. Gazetteers for NER mentions like “Country Joe” and “Spanish House” are
excluded entirely. However, this informs only the dataset
generation; their model does not have access to the
underlying popularities or the rank.</p>
        <p>Meng et al propose a mixture of experts model for
NER that directly models how much weight to give to
features derived from the context (using a BERT encoder)
and from gazetteers (using a BiLSTM built on gazetteer
matches) for each token. This substantially improves
performance on their datasets, but still relies exclusively
on linguistic features [15].</p>
        <p>In this work, we show that filtering gazetteers using a
similar formula to Equation 1 [14] allows an NER model
to leverage gazetteer information. We compare against
two other preprocessing methods which both degrade
performance.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
      <sec id="sec-2-1">
        <title>3.1. Corpus</title>
        <p>We train our model on historical user utterances. The
data are labeled using a hand-crafted set of grammatical
rules designed to match the most frequently-occurring
utterances. The rules consist of a pattern and allowed
slot values. For example, the pattern “just play &lt;artist-name&gt;” includes an &lt;artist-name&gt; slot; this slot is associated with a list of popular artists. The combination of pattern and slot list allows us to match “just play the beatles” and “just play rihanna”, but not “just play trivial pursuit”, since trivial pursuit is a game and not an artist.</p>
        <p>In practice, we abstract common phrases and nest rules: &lt;negative-trigger&gt; &lt;artist-name&gt; matches a predetermined set of negative trigger phrases like “not”, “i don’t like”, etc, along with an artist name.</p>
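        <p>A minimal sketch of this kind of pattern-plus-slot-list matching; the pattern, slot list and values below are hypothetical simplifications:</p>
        <preformat>
import re

# Hypothetical slot list; the real lists contain only popular entities.
ARTISTS = {"the beatles", "rihanna"}

# Pattern for "just play &lt;artist-name&gt;": capture the slot, then check
# the captured value against the allowed slot values.
PATTERN = re.compile(r"just play (?P&lt;artist&gt;.+)")

def interpret(utterance):
    m = PATTERN.fullmatch(utterance)
    if m and m.group("artist") in ARTISTS:
        return {"artist-name": m.group("artist")}
    return None  # either the pattern or the slot value failed

print(interpret("just play rihanna"))          # {'artist-name': 'rihanna'}
print(interpret("just play trivial pursuit"))  # None: not a known artist
        </preformat>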
        <p>To control the latency of these rules, we limit slot lists
to popular entities. The rules therefore fail on a long tail
of infrequent utterances, either because the utterance
contains an entity outside our canonical lists, or because
the pattern is unusual. We observe that statistical models
trained on these labeled utterances can generalize to long
tail utterances, as proposed in [16].</p>
        <p>The rules cover multiple intents, including standard ones like YesIntent, NoIntent and StopIntent, but also an AddMusicConstraintIntent for when users constrain recommended music entities. Slot types include entity types as well as trigger phrases that indicate negation, instructions to go immediately to playback, and so on.</p>
        <p>To evaluate the model’s ability to discriminate between spurious mentions of entity names, we hand label independently-collected validation and test sets where each utterance contains a substring included in the song title gazetteer (see Section 3.2). We ignore the 50 most frequently-occurring spurious matches (e.g. “play”, “just”, “yeah”, etc) so that approximately 50% of utterances express an AddMusicConstraintIntent, and about 50% of these AddMusicConstraintIntent utterances contain a true reference to a specific song. Only AddMusicConstraintIntent utterances can include song titles. These titles include some that were misremembered or that contain voice recognition errors.</p>
        <p>Our grammatical rules can interpret about 30% of
utterances in the test set and we consider these to be in
domain; the remaining 70% test the model’s
generalization capability, either to new utterance patterns or to
novel entities.</p>
        <p>The training, validation and test datasets are fixed for
all experiments.</p>
        <sec id="sec-2-1-1">
          <title>We use gazetteers derived from historical single-turn</title>
          <p>utterances that expressed a music request to a voice
assistant. Since these utterances are well-structured (usually
of the form “play X”), the entity name can be extracted
and associated with an entity type by an entity resolution
system.</p>
          <p>These gazetteers can include entities with user or voice
recognition errors as long as the entity resolution system
was able to resolve them to a canonical entity.</p>
          <p>As such, the gazetteers consist of multiple strings
corresponding to the same entity: e.g. “blink one eighty two”
and “blink one eight two” both appear in our gazetteer,
even though the canonical name is “Blink 182”.</p>
          <p>Each entity in a gazetteer is augmented with the
number of times it was requested (which we refer to as its
popularity).</p>
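        <p>Schematically, each gazetteer can be thought of as a mapping from surface form to request count; the entries and counts below are invented for illustration.</p>
        <preformat>
# Hypothetical gazetteer fragment: surface forms (including user and
# voice recognition variants) mapped to request counts ('popularity').
artist_gazetteer = {
    "blink one eighty two": 41_200,  # canonical entity: "Blink 182"
    "blink one eight two": 3_750,    # ASR variant of the same entity
    "ed sheeran": 98_000,
}
song_gazetteer = {
    "dancing queen": 31_000,
}
        </preformat>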
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <sec id="sec-3-1">
        <title>4.1. Candidate entity matching</title>
        <sec id="sec-3-1-1">
          <title>Before passing a user’s utterance to the model, we must</title>
          <p>ifrst determine which gazetteer entries appear in it. Note
that at this stage, we do not distinguish between true
positives (the user was referring to the entity) and false
positives (spurious matches like ‘yes’, ‘play’, etc).</p>
          <p>We find these candidates using a regex search for each
gazetteer entry, enforcing that the match must terminate
at whitespace or at the beginning or end of the utterance.</p>
          <p>Next, we associate all of the tokens in an utterance with
the entity type of the candidate. We summarize this
information with a binary vector for each word, where each
dimension corresponds to one of the gazetteers (artist
name, song title, album name). See Table 1 for an
example.</p>
          <p>We add a dimension to this vector with its value fixed
to 1 as a bias term. This entry can be thought of as the
no entity dimension, which captures the possibility that
the gazetteer matches were false positives.</p>
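        <p>A minimal sketch of the matching and feature construction, using token n-grams in place of the regex search and a hypothetical gazetteer:</p>
        <preformat>
# Sketch of candidate matching and the per-token binary features
# (token n-grams stand in for the regex search; gazetteer contents
# are hypothetical).
GAZETTEERS = {
    "artist": {"queen"},
    "song": {"dancing queen", "yes"},
    "album": set(),
}
TYPES = ["artist", "song", "album"]

def gazetteer_vectors(utterance):
    """Return one binary vector per token: [no-entity bias, artist, song, album]."""
    tokens = utterance.split()
    vectors = [[1, 0, 0, 0] for _ in tokens]  # bias dimension fixed to 1
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            span = " ".join(tokens[i:j])
            for d, entity_type in enumerate(TYPES, start=1):
                if span in GAZETTEERS[entity_type]:
                    for k in range(i, j):  # flag every token in the candidate
                        vectors[k][d] = 1
    return tokens, vectors

print(gazetteer_vectors("play dancing queen"))
# 'dancing' and 'queen' get the song dimension; 'queen' also the artist one.
        </preformat>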
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Gazetteer filtering</title>
        <sec id="sec-3-2-1">
          <title>In this work, we use a straightforward technique to in</title>
          <p>corporate the summary gazetteer vectors into our
baseline model (see Section 4.3). The baseline model does
not have access to this vector, and so we can evaluate
whether gazetteers improve or degrade performance.</p>
          <p>We compare three methods to preprocess the
gazetteers: Figure 1: Gazetteer features are computed as the sum of
as</p>
          <p>
            First, we use the full, unfiltered gazetteer. Following sociated ‘ingredient’ embeddings. Here, for example, ‘queen’
Magnolini et al [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], we expect the noise introduced by appears in the artist and song title gazetteers (via the artist
false positive matches to outweigh any information pro- “Queen” and the song “Dancing Queen”), so we take the
avervided by the true positives. age of those along with the no entity embedding.
          </p>
          <p>Second, we filter out the least popular entities in the
gazetteer by thresholding the popularity (see Section 3.2).</p>
          <p>This is equivalent to using a shorter collection window
to gather candidates. We expect that this does little to
exclude ambiguous entities.</p>
          <p>Third, we threshold the ratio between the number of
occurrences of each entity’s surface form in our
training corpus and its popularity, similar to Oramas et al
[14]. While Oramas et al used ranks, we use raw counts
to capture the assumption that the number of genuine
mentions of an entity is proportional to the underlying
popularity of that entity. We call our version  ̂ to avoid
confusion:
name, song title, album name). Each token in the
utterance is represented as the average of the ingredient
embeddings for the entity types matched which that token
appears in a candidate. Note that every token receives
the no entity embedding as a bias term. This is illustrated
in Figure 1.</p>
          <p>We concatenate these gazetteer embeddings with the
BERT output embeddings and add a single transformer
encoder layer (i.e. self-attention with position
embeddings and a fully-connected output layer) so that the
gazetteer information can be shared among all the
topopularity() kens.
 (̂ ) = mention_frequency() (2) The [ C L S ] token, which represents the entire utterance,
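        <p>A sketch of this filter under hypothetical counts; filtering out a fixed percentage of the gazetteer corresponds to keeping the entities with the largest r̂ (Equation 2):</p>
        <preformat>
# Sketch of r̂-filtering (Equation 2) over hypothetical counts.
popularity = {"dancing queen": 900, "bad habits": 450,
              "stop": 40, "yes": 10}             # request counts
mention_frequency = {"dancing queen": 300, "bad habits": 120,
                     "stop": 2000, "yes": 5000}  # corpus occurrences

def r_hat(entity):
    # Raw counts rather than ranks: a new entity can be scored
    # immediately, with no re-ranking of the gazetteer.
    return popularity[entity] / mention_frequency[entity]

def filter_gazetteer(entities, drop_fraction=0.75):
    """Keep the share of entities with the largest r̂ after dropping
    drop_fraction of the gazetteer (the hyperparameter above)."""
    ordered = sorted(entities, key=r_hat, reverse=True)
    keep = max(1, round(len(ordered) * (1 - drop_fraction)))
    return set(ordered[:keep])

print(filter_gazetteer(popularity))  # {'bad habits'}: highest r̂ survives
        </preformat>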
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Model</title>
        <p>To understand a user’s utterance, we need to predict the user’s intent (classification) and label any entities they mentioned (NER). We start with a standard BERT-base model [17], pretrained on book_corpus_wiki_en_uncased (from https://nlp.gluon.ai/model_zoo/bert/index.html).</p>
        <p>To represent information from the gazetteers, we start by randomly initializing four 64-dimensional ‘ingredient’ embeddings corresponding to the four-dimensional gazetteer vector described in Section 4.1 (no entity, artist name, song title, album name). Each token in the utterance is represented as the average of the ingredient embeddings for the entity types of the candidates in which that token appears. Note that every token receives the no entity embedding as a bias term. This is illustrated in Figure 1.</p>
        <p>Figure 1: Gazetteer features are computed as the sum of associated ‘ingredient’ embeddings. Here, for example, ‘queen’ appears in the artist and song title gazetteers (via the artist “Queen” and the song “Dancing Queen”), so we take the average of those along with the no entity embedding.</p>
        <p>We concatenate these gazetteer embeddings with the BERT output embeddings and add a single transformer encoder layer (i.e. self-attention with position embeddings and a fully-connected output layer) so that the gazetteer information can be shared among all the tokens. The [CLS] token, which represents the entire utterance, receives the average embedding taken over all tokens in the utterance.</p>
        <p>The outputs of the final transformer layer are passed to prediction heads for each token. The [CLS] token predicts the user intent, and the remaining tokens predict their own entity type label (or OTHER). Note that each token can have only one label, and the utterance is associated with exactly one intent. This architecture is illustrated in Figure 2. The baseline model is identical, except that nothing is concatenated with the BERT outputs.</p>
        <p>The model is fine-tuned using cross-entropy with label smoothing [18], where the total loss is the sum of the classification loss and the slot tagging loss for each token. We update all parameters during fine-tuning, including the gazetteer ‘ingredients’.</p>
        <p>This architecture resembles the joint intent classification and slot filling model introduced in Chen et al [19], except for the gazetteer embeddings, the additional transformer encoder layer, and the use of label smoothing. The first two of these additions provide a method to fuse gazetteer information into the model before the prediction heads. Label smoothing helps restrain the model’s overconfidence on ‘easy’ examples, resulting in more robust performance on utterances outside the training distribution [18].</p>
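        <p>A condensed PyTorch-style sketch of this architecture: the dimensions (four ingredients, 64-dimensional embeddings, BERT-base) follow the text, while the class and parameter names are ours and the encoder is assumed to expose a HuggingFace-style last_hidden_state.</p>
        <preformat>
import torch
import torch.nn as nn

class GazetteerJointModel(nn.Module):
    """Sketch: BERT token embeddings concatenated with averaged gazetteer
    'ingredient' embeddings, fused by one extra transformer encoder layer."""

    def __init__(self, bert, n_intents, n_tags, gaz_dims=4, gaz_width=64):
        super().__init__()
        self.bert = bert  # pre-trained BERT-base encoder (768-dim outputs)
        self.ingredients = nn.Parameter(torch.randn(gaz_dims, gaz_width))
        width = 768 + gaz_width
        self.fuser = nn.TransformerEncoderLayer(d_model=width, nhead=8,
                                                batch_first=True)
        self.intent_head = nn.Linear(width, n_intents)  # applied to [CLS]
        self.tag_head = nn.Linear(width, n_tags)        # applied per token

    def forward(self, input_ids, gaz_vectors):
        # gaz_vectors: (batch, seq, 4) float matrix from Section 4.1.
        h = self.bert(input_ids).last_hidden_state     # (batch, seq, 768)
        gaz = gaz_vectors @ self.ingredients           # sum of ingredients...
        gaz = gaz / gaz_vectors.sum(-1, keepdim=True)  # ...then the average
        fused = self.fuser(torch.cat([h, gaz], dim=-1))
        # Position 0 is [CLS]; the paper additionally gives it the average
        # gazetteer embedding over all tokens, which this sketch omits.
        return self.intent_head(fused[:, 0]), self.tag_head(fused[:, 1:])
        </preformat>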
        <p>Aside from the percentage of each gazetteer to filter out (in the popularity-filtered and r̂-filtered experiments), we do not conduct any hyperparameter selection. We find in both cases that filtering out 75% of the gazetteer gives the best performance on the validation set. For other hyperparameters, we use values that previously performed well with a simplified baseline model that does not include the final transformer layer: since the utterances are typically short, we truncate them to 16 tokens (this affects fewer than 0.1% of utterances), use a batch size of 128, a label smoothing ε = 0.1, and train for 10,000 updates. We checkpoint every 100 updates and choose the version of the model that achieved the highest intent classification F1 score on the validation set. Other hyperparameters follow those in Chen et al [19].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>For each experiment, we evaluate the model’s ability</title>
        <p>to discriminate A d d M u s i c C o n s t r a i n t I n t e n t from other
intents, and its ability to extract correct song titles. Song
title detection is particularly challenging for a
conversational music recommender due to song titles’ variability,
cardinality and resemblance to normal speech.</p>
      <p>We report this metric using the SemEval ‘strict’ methodology [20]. That means the span must exactly match the annotated span to be counted as a true positive; predicting the wrong span counts as both a false positive (the incorrectly-predicted span) and a false negative (the missed prediction). We choose this metric because we require substantially-complete predictions for the downstream entity resolver to associate the span with the correct entity. A simple token-by-token evaluation showed similar differences between models.</p>
      <p>[Table 2: relative change in performance versus the gazetteer-free baseline for the Full, Popularity-filtered and r̂-filtered gazetteers. Only the precision columns survived extraction: −2.72%, −3.18% and +3.60% in one panel, and +0.21%, −0.21% and +0.25% in the other.]</p>
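      <p>For concreteness, a minimal sketch of strict span matching (our illustration of [20], not the evaluation code used here):</p>
      <preformat>
def strict_prf(gold_spans, pred_spans):
    """SemEval 'strict' matching: a predicted (start, end, type) span is a
    true positive only if it exactly equals an annotated span."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold.intersection(pred))
    fp = len(pred.difference(gold))  # includes wrong-boundary predictions
    fn = len(gold.difference(pred))  # the missed gold span is counted too
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A boundary error is punished twice, one false positive plus one false
# negative, so no partial credit is awarded.
print(strict_prf({(2, 4, "song")}, {(2, 3, "song")}))  # (0.0, 0.0, 0.0)
      </preformat>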
      <p>Table 2 shows the results of our experiments. As expected, we observe that using the full gazetteers increases the recall of song titles at the expense of precision, resulting in a drop in F1 score of 1.17%. Filtering based on popularity seems to exaggerate these differences, further diminishing precision but boosting recall even more, presumably because the model becomes too trusting of information from the gazetteers, which still include spurious matches. The overall effect is that F1 dropped by slightly less: 0.96% from the baseline model.</p>
      <p>Filtering based on the ratio r̂(e) addresses this issue. Common-but-spurious mentions are now excluded from the gazetteer, leaving a cleaner gazetteer that contains unambiguous entities, and which results in improved precision and recall and an overall increase in F1 of 3.70%.</p>
      <p>These results seem to be correlated with intent classification performance, with the worst song title detection F1 corresponding to the worst intent classification F1 (full gazetteers), and the best with the best (r̂-filtered). This is to be expected: correctly recognizing the presence or absence of a song title (or artist name) makes distinguishing intents easier.</p>
      <p>Figure 3 shows that the model with access to the r̂-filtered gazetteer learns most quickly. The full and popularity-filtered gazetteers give an early boost to F1, when model performance is poor, but are quickly overtaken by the baseline model without gazetteers. This supports our hypothesis that information from noisy gazetteers helps weak models, but when the model is better able to leverage contextual cues, the noise begins to dominate any signal they provide. The model would by now perform better by ignoring the information, but it may have approached a local minimum in the loss surface from which it cannot escape, resulting in poorer performance at convergence (as shown in Table 2).</p>
      <p>Table 3 shows some example user inputs that highlight how gazetteers help the model. Each utterance is shown with the song title predicted by the model learned under each experiment. While all the models are usually able to detect the presence of a song title, only the model trained using the r̂-filtered gazetteers is able to reliably detect the boundaries of the mention.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Limitations and future work</title>
      <sec id="sec-5-1">
        <title>We note that this work only considers utterances in En</title>
        <p>glish. The technique described here should apply to other
languages, but in some, whitespace cannot be used to
delimit entities, making candidate matching more
challenging.</p>
      <p>We only briefly experimented with the impact of changes to the gazetteer after model training (e.g. due to new releases or the changing popularity of existing releases). While these initial results are promising, we would want to conduct more thorough research to evaluate how predictions are affected.</p>
      <p>We have also not explored the impact of false negatives (i.e. real entities not matched in the gazetteers, either because the entity is not sufficiently popular, because it has been recently released, or due to a user or voice recognition error). Our evaluation shows an overall improvement in precision and recall, but there may be individual cases where the baseline model better leverages contextual clues to predict entity mentions. Randomly dropping out gazetteer features during training (i.e. replacing a 1 with a 0 in the gazetteer vector described in Section 4.1 some fraction of the time) might force a model to learn how to use gazetteer features where available, but to continue to attend to contextual information otherwise, further improving overall performance.</p>
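      <p>Such gazetteer-feature dropout might look like the following sketch (a suggestion, not an implemented component of this work):</p>
      <preformat>
import torch

def drop_gazetteer_features(gaz_vectors, p=0.1):
    """Randomly zero each gazetteer match indicator with probability p
    (never the bias dimension), forcing the model to also use context."""
    keep = torch.bernoulli(torch.full_like(gaz_vectors, 1.0 - p))
    keep[..., 0] = 1.0  # the 'no entity' bias dimension stays at 1
    return gaz_vectors * keep
      </preformat>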
      <p>This work was evaluated on manually-annotated offline datasets, but we have planned an A/B test to measure the downstream impact of improved NER performance on the rate with which users accept the system’s recommendations. We expect to see an improvement corresponding to the system’s ability to correctly interpret our users’ wishes.</p>
      <p>In future work, we intend to fuse popularity and r̂ directly into the model, rather than using them to filter the gazetteers. Incorporating the ratio r̂ into the model as a feature would allow it to attend more heavily to gazetteer features where the entity is unambiguous, and use contextual cues to disambiguate less obvious examples. It also avoids introducing an arbitrary cut-off: small values of r̂ would be almost, but not quite, equivalent to excluding the entity entirely. We hope that such an approach will yield further improvements and be a step towards a general approach to integrating gazetteers with pre-trained transformers.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>In this paper, we demonstrate that a rather simple
architecture with carefully filtered gazetteers can greatly
improve NER performance in a conversational
recommendation system for the music domain. By
augmenting gazetteers with information about the underlying
likelihood of a mention of each entity, the models can
avoid false positives, and are better able to rely on large
gazetteers.</p>
      <p>This finding could apply to other domains where large
gazetteers are common and where relevant frequency
information is available. Examples might include place
names along with their populations, or diseases with the
number of diagnoses mentioned in discharge notes.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <title>The authors would like to thank Tao Ye, Justin Hugues</title>
        <p>Nuger, Chelsea Weaver and Vlad Magdin for their help
through conversations regarding the evaluation,
technical implementation and presentation of this work. We
also thank the reviewers for their valuable comments.
tity Recognition with Neural Models, in: Pro- tant supervision for relation extraction without
ceedings of the 5th Workshop on Semantic labeled data, in: Proceedings of the Joint
ConDeep Learning (SemDeep-5), 2019, pp. 40–49. ference of the 47th Annual Meeting of the ACL
URL: https://github.com/XuezheMax/NeuroNLP2% and the 4th International Joint Conference on
Nat0Ahttps://www.aclweb.org/anthology/W19-5807. ural Language Processing of the AFNLP, 2009, pp.
[8] T. Liu, J. G. Yao, C. Y. Lin, Towards improving 1003–1011. URL: https://aclanthology.org/P09-1113.
neural named entity recognition with gazetteers, doi:1 0 . 3 1 1 5 / 1 6 9 0 2 1 9 . 1 6 9 0 2 8 7 .
in: Proceedings ofthe 57th Annual Meeting ofthe [17] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT:
Association for Computational Linguistics, 2019, pp. Pre-training of deep bidirectional transformers for
5301–5307. doi:1 0 . 1 8 6 5 3 / v 1 / p 1 9 - 1 5 2 4 . language understanding, NAACL HLT 2019 - 2019
[9] C. H. Song, D. Lawrie, T. Finin, J. Mayfield, Im- Conference of the North American Chapter of the
proving Neural Named Entity Recognition with Association for Computational Linguistics: Human
Gazetteers, in: The 33rd International FLAIRS Con- Language Technologies - Proceedings of the
Conference, 2020, p. 8. URL: https://arxiv.org/abs/2003. ference 1 (2019) 4171–4186. a r X i v : 1 8 1 0 . 0 4 8 0 5 .
03072. [18] R. Müller, S. Kornblith, G. Hinton, When does label
[10] H. Lin, Y. Lu, X. Han, L. Sun, B. Dong, S. Jiang, smoothing help?, in: Advances in Neural
InformaGazetteer-Enhanced Attentive Neural Networks for tion Processing Systems, 2019. a r X i v : 1 9 0 6 . 0 2 6 2 9 .
Named Entity Recognition, in: Proceedings ofthe [19] Q. Chen, Z. Zhuo, W. Wang, Bert for joint
2019 Conference on Empirical Methods in Natural intent classification and slot filling, 2019.
Language Processing and the 9th International Joint a r X i v : 1 9 0 2 . 1 0 9 0 9 .</p>
        <p>Conference on Natural Language Processing, 2019, [20] D. S. Batista, Named-entity evaluation metrics
pp. 6232–6237. based on entity-level, 2019. URL: http://www.
[11] O. Agarwal, A. Nenkova, The Utility and Inter- davidsbatista.net/.</p>
        <p>play of Gazetteers and Entity Segmentation for
Named Entity Recognition in English, in: Findings
ofthe Association for Computational Linguistics:
ACL-IJCNLP, Association for Computational
Linguistics, 2021, pp. 3990–4002. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 1 .</p>
        <p>f i n d i n g s - a c l . 3 4 9 .
[12] A. Vaswani, N. Shazeer, N. Parmar, J.
Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, I.
Polosukhin, Attention is all you need, in: I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R.
Fergus, S. Vishwanathan, R. Garnett (Eds.),
Advances in Neural Information Processing
Systems, volume 30, Curran Associates, Inc., 2017.</p>
        <p>URL: https://proceedings.neurips.cc/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[13] A. Chernyavskiy, D. Ilvovsky, P. Nakov,
Transformers: ”The End of History” for NLP? (2021). URL:
http://arxiv.org/abs/2105.00813. a r X i v : 2 1 0 5 . 0 0 8 1 3 .
[14] S. Oramas, M. Quadrana, F. Gouyon, P. M. Llc,
Bootstrapping a Music Voice Assistant with Weak
Supervision, in: Proceedings of NAACL HLT 2021:</p>
        <p>Industry Track, 2021, pp. 49–55.
[15] T. Meng, A. Fang, O. Rokhlenko, S. Malmasi,
GEM</p>
        <p>NET: Efective Gated Gazetteer Representations
for Recognizing Complex Entities in Low-context
Input, in: Proceedings of the 2021 Conference
of the North American Chapter of theAssociation
for Computational Linguistics: Human Language
Technologies, Association for Computational
Linguistics, 2021, pp. 1499–1512. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 1 .</p>
        <p>n a a c l - m a i n . 1 1 8 .
[16] M. Mintz, S. Bills, R. Snow, D. Jurafsky,
Dis</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ammari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bentley</surname>
          </string-name>
          , Music, Search, and
          <article-title>IoT: How people (really) use voice assistants</article-title>
          ,
          <source>ACM Transactions on Computer-Human Interaction</source>
          <volume>26</volume>
          (
          <year>2019</year>
          ).
          doi:10.1145/3311956.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nazarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brillman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cramer</surname>
          </string-name>
          , S. Mennicken, ”Play Music”
          <article-title>: User Motivations and Expectations for Non-Specific Voice Queries</article-title>
          , in: 21st
          <source>International Society for Music Information Retrieval Conference</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. de Rijke</surname>
          </string-name>
          , T.-S. Chua,
          <article-title>Advances and challenges in conversational recommender systems: A survey</article-title>
          ,
          <source>AI Open</source>
          <volume>2</volume>
          (<year>2021</year>)
          <fpage>100</fpage>-<lpage>126</lpage>.
          URL: https://doi.org/10.1016/j.aiopen.2021.06.002.
          doi:10.1016/j.aiopen.2021.06.002. arXiv:2101.09459.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikheev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <article-title>Named Entity recognition without gazetteers</article-title>
          ,
          <source>in: Proceedings of EACL '99</source>
          ,
          <year>1999</year>
          , p.
          <fpage>1</fpage>
          .
          doi:10.3115/977035.977037.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Peshterliev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dupuy</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kiss</surname>
          </string-name>
          ,
          <article-title>Self-Attention Gazetteer Embeddings for Named-Entity Recognition</article-title>
          (<year>2020</year>). URL: http://arxiv.org/abs/2004.04060. arXiv:2004.04060.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rijhwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Neubig, J. Carbonell, Soft Gazetteers for
          <article-title>Low-Resource Named Entity Recognition</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8118</fpage>
          -
          <lpage>8123</lpage>
          . doi:10.18653/v1/2020.acl-main.722. arXiv:2005.01866.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Magnolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Piccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Balaraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <article-title>How to Use Gazetteers for Entity Recognition with Neural Models</article-title>
          , in:
          <source>Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>49</lpage>
          . URL: https://www.aclweb.org/anthology/W19-5807.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Liu, J. G. Yao, C. Y. Lin, Towards improving neural named entity recognition with gazetteers, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5301-5307. doi:10.18653/v1/p19-1524.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] C. H. Song, D. Lawrie, T. Finin, J. Mayfield, Improving Neural Named Entity Recognition with Gazetteers, in: The 33rd International FLAIRS Conference, 2020, p. 8. URL: https://arxiv.org/abs/2003.03072.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Lin, Y. Lu, X. Han, L. Sun, B. Dong, S. Jiang, Gazetteer-Enhanced Attentive Neural Networks for Named Entity Recognition, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 6232-6237.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] O. Agarwal, A. Nenkova, The Utility and Interplay of Gazetteers and Entity Segmentation for Named Entity Recognition in English, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP, Association for Computational Linguistics, 2021, pp. 3990-4002. doi:10.18653/v1/2021.findings-acl.349.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers: “The End of History” for NLP? (2021). URL: http://arxiv.org/abs/2105.00813. arXiv:2105.00813.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Oramas, M. Quadrana, F. Gouyon, Bootstrapping a Music Voice Assistant with Weak Supervision, in: Proceedings of NAACL HLT 2021: Industry Track, 2021, pp. 49-55.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] T. Meng, A. Fang, O. Rokhlenko, S. Malmasi, GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2021, pp. 1499-1512. doi:10.18653/v1/2021.naacl-main.118.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003-1011. URL: https://aclanthology.org/P09-1113. doi:10.3115/1690219.1690287.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, NAACL HLT 2019 - Proceedings of the Conference 1 (2019) 4171-4186. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Müller, S. Kornblith, G. Hinton, When does label smoothing help?, in: Advances in Neural Information Processing Systems, 2019. arXiv:1906.02629.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Q. Chen, Z. Zhuo, W. Wang, BERT for joint intent classification and slot filling, 2019. arXiv:1902.10909.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] D. S. Batista, Named-entity evaluation metrics based on entity-level, 2019. URL: http://www.davidsbatista.net/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>