<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Spatial Entities Involved in the Description of a Movement Action Using Deep Learning Methods: A Comparative Study of Three Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdelkrim Tafer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Gaio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pau and the Adour Region, Laboratory of Mathematics and Their Applications</institution>
          ,
          <addr-line>Pau</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Zaragoza, Aragón Institute for Engineering Research, Advanced Information Systems Laboratory</institution>
          ,
          <addr-line>Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper proposes a methodology to automatically extract spatial information from itinerary descriptions in French. We compare three models: BiLSTM-CRF, CamemBERT, and GLiNER, focusing on the recognition of nested spatial entities, motion verbs, expressions of spatial relation or condition, and measures. Preliminary results demonstrate the potential of these models in accurately identifying and classifying the spatial elements necessary for the annotation of movement actions evoked in textual descriptions.</p>
      </abstract>
      <kwd-group>
<kwd>automatic annotation</kwd>
        <kwd>classification</kwd>
        <kwd>deep learning</kwd>
        <kwd>nested spatial named entities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Probabilistic models such as Conditional Random Fields (CRF) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have been widely used for structured
sequence prediction tasks. When combined with recurrent neural networks such as Long Short-Term
Memory (LSTM) networks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], these models effectively capture local and contextual dependencies while
improving the accuracy of named entity recognition (NER) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Transformer-based language models [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have significantly advanced the modeling of linguistic
structures through large-scale pre-training on extensive text corpora. Among these, BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduced a
bidirectional transformer architecture that substantially improved performance across various NLP
tasks and can be further specialized for NER through targeted fine-tuning. Additionally, newer
approaches such as GLiNER [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] exploit pre-trained language models as a backbone, such as DeBERTa v3
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] in the original paper, to develop low-resource NER systems that require minimal or no fine-tuning and achieve state-of-the-art performance in zero-shot NER. Transformers have also been adapted for
domain-specific applications, such as place name extraction from unstructured text [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Although recurrent neural networks offer moderate computational efficiency, their inherently sequential training and inference can limit parallelization and make it difficult to capture long-range dependencies. In contrast, transformer-based architectures leverage self-attention to process entire sequences in parallel, facilitating more effective modeling of distant context and exploiting modern GPU resources efficiently. However, transformers can become computationally demanding for very long inputs, as the self-attention mechanism scales quadratically with sequence length.</p>
      <p>
        For these models, a labeled training corpus is required. Texts are first tokenized into word or subword units, which are then transformed into numerical vector representations
[
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. A classification layer is then applied to predict the token labels.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>In the location category, a strong named entity, hereafter simply called Named Entity (NE), is built from a toponym (i.e. a proper name, such as in Figure 1: "Saint-Ybars", "Porte de Mazet"). A weak spatial named entity is built from a noun phrase describing the feature of the object to be referenced, such as a building, river, or path (e.g., "medieval street", "church tower"); for ease of reference, it is henceforth termed nominal entity (NoE). As mentioned earlier, the combination of the first two categories of spatial entities makes up the category of spatial Nested Named Entity (NNE). For instance, in Figure 1, the phrase "hôtel de ville de Saint-Ybars" ('Saint-Ybars city hall') exemplifies an NNE, where the NoE "hôtel de ville" functions as the feature and the NE "Saint-Ybars" as the reference; the same applies to the NNE "le clocher de l'église" ('the church bell tower'), where the first NoE "clocher" acts as a feature for the second NoE "église".</p>
      <p>In addition to NNE, movement verbs or movement verbal phrases, such as "traversez" ('cross') or "tourner pour descendre" ('turn down') in Figure 1, delineate a moving action. Finally, expressions like "à gauche" ('left'), "au bout" ('at the end of'), "à côté" ('next to'), and/or "200 m" provide fine-grained spatial context; these expressions will hereafter be called Offsets or Measures.</p>
      <p>(1) [. . . ] Traversez la route en diagonale et montez dans la rue de la Porte de Lezat. Après 200 m, tournez immédiatement à gauche vers le clocher de l'église au bout de cette rue, la rue de Dessous. Admirer l'imposante façade de l'hôtel de ville de Saint-Ybars sur la place, puis tourner pour descendre la rue Porte de Mazet à côté de la pharmacie. [. . . ]
Translation: [. . . ] Cross the road diagonally and go up Porte de Lezat street. After 200 m, turn immediately left towards the church bell tower at the end of this street, de Dessous street. Admire the imposing facade of Saint-Ybars city hall in the square, then turn down Porte de Mazet street next to the pharmacy. [. . . ]
[Figure 1 legend: Named Entity (NE), Nominal Entity (NoE), Offset, Measure, Verb of Movement (Motion), Nested Named Entity (NNE)]</p>
      <p>Rule-based approaches such as the Perdido system [14] have traditionally been employed for structured spatial tagging by combining morpho-syntactic and semantic constraints. Although effective for predefined structures, these methods are inherently limited in adaptability, often failing to detect variations in nominal entities and their relationships. This rigidity underscores the necessity for more flexible methodologies capable of dynamically learning entity representations and dependencies.</p>
      <p>To address these challenges, deep-learning-based approaches offer a promising alternative. These
models, trained on annotated corpora, exhibit strong generalization capabilities, allowing them to
classify and extract spatial entities even in previously unseen contexts. Unlike rule-based systems,
deep learning models learn implicit representations of spatial languages and capture hierarchical
dependencies and context-aware entity relationships.</p>
      <p>The aim of this study is to evaluate three models for recognizing NNE and their contextual references.
These models were selected based on their significance in Named Entity Recognition (NER) research,
each representing a distinct approach to structured sequence prediction:
1. Bidirectional Long Short-Term Memory with a Conditional Random Field Layer
(BiLSTM-CRF): A well-established standard in NER using recurrent neural networks (RNNs).
2. Pre-Trained Bidirectional Transformer (CamemBERT): A transformer-based bidirectional
language model (BiLM) with a classification head for token labeling.
3. Generalist Named Entity Recognition Using Bidirectional Transformers (GLiNER): An
innovative zero-shot and few-shot learning model introducing a new paradigm for NER.</p>
      <p>Each selected model represents a different paradigm in NER, providing a comparative analysis of
their performance on structured sequence prediction tasks.</p>
      <p>
        BiLSTM-CRF This model [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">6, 5, 4</xref>
        ] is a widely adopted architecture for Named Entity Recognition
(NER) and structured sequence labeling. It integrates a BiLSTM network with a CRF layer to efficiently
capture contextual dependencies while enforcing valid label transitions.
      </p>
      <p>The BiLSTM component processes input sequences in both forward and backward directions. Given a sequence of tokens $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$, two LSTM networks generate a forward hidden state $\overrightarrow{h_t}$ and a backward hidden state $\overleftarrow{h_t}$ for each token. The final representation is obtained by concatenating these states, yielding $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ with a total hidden state dimension $d$. This bidirectional encoding allows the model to incorporate context from both past and future tokens.</p>
      <p>A dense layer projects each hidden representation $h_t$ into a score vector $e_t(\cdot) \in \mathbb{R}^K$, where $K$ is the number of possible labels. Instead of predicting labels independently, the CRF layer models dependencies between adjacent labels. The probability of a label sequence $\mathbf{y} = \{y_1, \ldots, y_n\}$ is defined as:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{n} \exp\big(T_{y_{t-1}, y_t} + e_t(y_t)\big), \qquad (1)$$
where $T_{y_{t-1}, y_t}$ is the transition score from label $y_{t-1}$ to $y_t$, and $e_t(y_t)$ is the BiLSTM emission score at position $t$. The partition function $Z(\mathbf{x})$ normalizes over all possible label sequences:
$$Z(\mathbf{x}) = \sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})} \prod_{t=1}^{n} \exp\big(T_{y'_{t-1}, y'_t} + e_t(y'_t)\big). \qquad (2)$$</p>
      <p>The model is optimized by minimizing the negative log-likelihood loss:
$$\mathcal{L} = -\sum_{t=1}^{n} \big(T_{y_{t-1}, y_t} + e_t(y_t)\big) + \log Z(\mathbf{x}). \qquad (3)$$</p>
      <p>During inference, the CRF layer selects the most probable label sequence by considering both emission
scores from the BiLSTM and transition scores from the CRF. Figure 2 presents an overview of the model
architecture.</p>
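      <p>As an illustration of this decoding step, the following minimal NumPy sketch computes the unnormalized sequence score of equation (1) and recovers the most probable label sequence with the Viterbi algorithm; the emission and transition matrices are random stand-ins for the BiLSTM outputs and the learned CRF parameters.</p>
      <preformat>
# Minimal NumPy sketch of CRF scoring and Viterbi decoding.
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 4                           # sequence length, number of labels
emissions = rng.normal(size=(n, K))   # stand-in for BiLSTM scores e_t(y)
T = rng.normal(size=(K, K))           # T[i, j]: transition score i to j

def sequence_score(y):
    """Unnormalized log-score of a label sequence (the sum inside exp)."""
    s = emissions[0, y[0]]
    for t in range(1, n):
        s += T[y[t - 1], y[t]] + emissions[t, y[t]]
    return s

def viterbi():
    """Most probable label sequence under the CRF."""
    delta = emissions[0].copy()       # best score ending in each label
    back = np.zeros((n, K), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + T + emissions[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    y = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):     # follow back-pointers
        y.append(int(back[t][y[-1]]))
    return y[::-1]

best = viterbi()
print(best, sequence_score(best))
      </preformat>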
      <p>CamemBERT This model [15] is a transformer-based model designed specifically for the French
language. Unlike BiLSTM-CRF, which processes sequences token by token, CamemBERT employs
self-attention mechanisms that allow all tokens in a sequence to be processed in parallel, capturing long-range dependencies more efficiently.</p>
      <p>Given an input sequence $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$, CamemBERT encodes each token using multiple transformer layers. At the core of its architecture is the self-attention mechanism, which computes contextualized representations by attending to all tokens in the sequence. The attention score between token $i$ and token $j$ is computed as in (4), and the output representation is then obtained as in (5):
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \qquad e_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}, \qquad (4)$$
$$h'_i = \sum_{j=1}^{n} \alpha_{ij} \mathbf{v}_j, \qquad \mathbf{v}_j = W_V h_j, \qquad (5)$$
where $\mathbf{q}_i = W_Q h_i$, $\mathbf{k}_j = W_K h_j$, $d_k$ is the head dimension, and $W_Q$, $W_K$, and $W_V$ are learnable projection matrices.</p>
      <p>Unlike BiLSTM, which encodes sequential dependencies using recurrence, self-attention allows each
token to directly incorporate information from all other tokens in a single operation.</p>
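      <p>The following minimal NumPy sketch reproduces equations (4) and (5) for a single attention head; the projection matrices are random stand-ins rather than trained CamemBERT weights.</p>
      <preformat>
# Minimal NumPy sketch of single-head scaled dot-product self-attention.
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8                  # tokens, model dim, head dim
H = rng.normal(size=(n, d))           # token representations h_1 ... h_n
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

Q, K, V = H @ W_Q, H @ W_K, H @ W_V   # q_i, k_j, v_j
E = Q @ K.T / np.sqrt(d_k)            # e_ij = q_i . k_j / sqrt(d_k)
A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)   # alpha_ij (softmax)
H_out = A @ V                         # h'_i = sum_j alpha_ij v_j
print(H_out.shape)                    # (5, 8): one contextual vector per token
      </preformat>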
      <p>For Named Entity Recognition (NER), CamemBERT employs a classification head that assigns labels
to tokens. A dense layer maps the final hidden representation into logits $\mathbf{z} \in \mathbb{R}^K$, where $K$ is the number of entity labels.</p>
      <p>The model is trained using the cross-entropy loss. Compared to BiLSTM-CRF, which explicitly models
label dependencies via a CRF layer, CamemBERT implicitly learns contextual relationships through
self-attention. Figure 3 presents an overview of the model architecture.</p>
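      <p>As a usage sketch, the Hugging Face transformers library exposes this token-classification setup for the camembert-base checkpoint; the label set below follows Figure 1 plus an outside tag O, and the freshly initialized classification head would of course need fine-tuning before its predictions are meaningful.</p>
      <preformat>
# Sketch of the token-labeling setup with camembert-base (transformers).
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "NE", "NoE", "NNE", "Motion", "Offset", "Measure"]
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "camembert-base", num_labels=len(labels)
)

enc = tokenizer("Traversez la route en diagonale", return_tensors="pt")
logits = model(**enc).logits          # (1, n_subwords, len(labels))
pred = logits.argmax(-1)[0]           # one label id per subword token
print([labels[int(i)] for i in pred])
      </preformat>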
      <p>
        GLiNER This third and last model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is a transformer-based NER model that introduces span-based classification with zero-shot learning capabilities. By modeling spans instead of tokens, it allows more flexible boundary detection and can better handle nested structures. The token encoder processes a unified input consisting of both entity type tokens and the input text, generating contextualized representations. Let $\mathbf{p} = \{p_i\}_{i=0}^{M-1} \in \mathbb{R}^{M \times D}$ denote the entity type representations, where $M$ is the number of entity types and $D$ is the dimensionality of each representation. Similarly, let $\mathbf{h} = \{h_i\}_{i=0}^{N-1} \in \mathbb{R}^{N \times D}$ represent the contextual embeddings for each token in the input text, with $N$ being the number of tokens. The entity type representations are refined through a two-layer feedforward network, producing $\mathbf{q} = \{q_i\}_{i=0}^{M-1} \in \mathbb{R}^{M \times D}$.
      </p>
      <p>The representation of a span from position $i$ to $j$ is computed as $S_{ij} = \mathrm{FFN}(h_i \otimes h_j)$, where $\otimes$ denotes concatenation. To determine whether a span $(i, j)$ corresponds to entity type $c$, a matching score is computed as:
$$\phi(i, j, c) = \sigma\big(S_{ij}^{\top} q_c\big), \qquad (6)$$
where $\sigma$ is the sigmoid activation function. This score represents the probability that the span $(i, j)$ belongs to entity type $c$.</p>
      <p>During training, the model distinguishes between positive pairs (spans correctly labeled with type $c$) and negative pairs (incorrect associations) using a binary cross-entropy loss:
$$\mathcal{L} = -\sum_{s \in S \times C} \Big[ \mathbb{I}_{s \in P} \log \phi(s) + \mathbb{I}_{s \in N} \log\big(1 - \phi(s)\big) \Big], \qquad (7)$$
where $\mathbb{I}$ is the indicator function. This loss encourages high matching scores for correct span-type pairs while penalizing incorrect associations.</p>
      <p>GLiNER differs fundamentally from BiLSTM-CRF, which explicitly models sequence dependencies
via a CRF layer, and CamemBERT, which performs token-level classification. By employing span-based
prediction and textual entailment-style classification, GLiNER enhances generalization across domains
and under certain conditions, it enables entity recognition in low-resource and zero-shot settings.
Figure 4 presents an overview of the model architecture.</p>
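      <p>For reference, the released gliner Python package exposes this zero-shot interface directly; in the sketch below, the multilingual checkpoint name, the label phrasing, and the threshold are illustrative choices rather than the exact setup of this study.</p>
      <preformat>
# Zero-shot usage sketch with the gliner package.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi")
text = "Traversez la route et montez dans la rue de la Porte de Lezat."
labels = ["named entity", "nominal entity", "motion verb", "offset", "measure"]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
      </preformat>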
      <p>By comparing these approaches, this study provides insights into the effectiveness of different NER
paradigms in extracting spatial movement actions from descriptive texts.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>An initial pilot study aimed to assess the performance of these three models¹ in accurately annotating text segments with six predefined labels.</p>
      <sec id="sec-4-1">
        <title>Training Dataset and Annotation Process</title>
        <p>The dataset² consists of 1,897 French hiking descriptions, totaling 27,083 sentences and 569,214 tokens.
Spatial expressions are categorized using the annotation labels given in Figure 1: strong named entities Named Entities (NE), weak named entities Nominal Entities (NoE), motion verbs or verbal phrases (Motion), expressions evoking a spatial relation or condition (Offset), numerical expressions followed by a unit of measurement (Measure), and finally Nested Named Entities (NNE). These labels are inspired by previous rule-based approaches [16].</p>
        <p>It is well known that producing an annotated dataset is a cumbersome and time-consuming task. It was therefore decided to use Perdido [14] as the annotator for this first study. However, Perdido was not designed to annotate nominal entities directly, and integrating this capability would be a real challenge. It was decided that this annotation would proceed in two stages (Figure 5). First, following the annotation carried out by Perdido, all the words or phrases involved in the annotation of a spatial named entity and having received the part-of-speech label "Noun" were extracted. A dictionary was created from these words or phrases, which then enabled all occurrences of the lexical entries in this dictionary to be labelled in the dataset as NoE.</p>
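        <p>The second stage can be pictured with the following minimal sketch, in which a dictionary of nouns extracted from Perdido's output is projected back onto the text as NoE labels; the dictionary contents and the matching policy shown here are illustrative.</p>
        <preformat>
# Sketch of the second annotation stage: label every occurrence of a
# dictionary entry as NoE (dictionary contents are illustrative).
import re

noe_dictionary = {"rue", "clocher", "église", "pharmacie", "hôtel de ville"}
# Longest entries first, so "hôtel de ville" wins over shorter overlaps.
alternation = "|".join(
    map(re.escape, sorted(noe_dictionary, key=len, reverse=True))
)
pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)

sentence = "tournez immédiatement à gauche vers le clocher de l'église"
annotations = [(m.start(), m.end(), "NoE") for m in pattern.finditer(sentence)]
print(annotations)   # character spans labelled NoE
        </preformat>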
        <p>The result is a silver-standard corpus—potentially containing errors due to fully automated annotation.</p>
        <p>All models were trained and tested on an identical dataset extracted from the silver-standard corpus.
Evaluation metrics include Precision, Recall, and micro F1-score, as summarized in Table 1.
Tokenization Tokenization is a crucial preprocessing step that can significantly affect model
performance. In our experiments, each model uses a distinct strategy. The BiLSTM-CRF model employs
rule-based, word-level tokenization with TreeTagger [17], configured for French. Camembert-base uses
subword tokenization based on Byte Pair Encoding (BPE) [18, 19] as implemented by SentencePiece [20]
to decompose rare and compound words. GLiNER, which leverages a multilingual DeBERTa backbone,
adopts a unigram-based subword tokenization strategy [21] via SentencePiece.</p>
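        <p>The difference between the two subword strategies can be inspected directly with the corresponding Hugging Face tokenizers; in this sketch, microsoft/mdeberta-v3-base stands in for GLiNER's multilingual DeBERTa backbone, and the actual splits depend on the learned vocabularies.</p>
        <preformat>
# Sketch comparing the two subword tokenizers on a French word;
# TreeTagger's rule-based word tokenization is run separately.
from transformers import AutoTokenizer

word = "incontournable"
for name in ["camembert-base", "microsoft/mdeberta-v3-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(word))   # BPE vs. unigram subword splits
        </preformat>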
      </sec>
      <sec id="sec-4-2">
        <title>Models Parameters</title>
        <p>For each model, the following parameter settings were used, without applying additional hyperparameter tuning techniques.
¹ Model implementation: https://git.univ-pau.fr/atafer/sner
² Dataset: https://git.univ-pau.fr/atafer/hiking-dataset
BiLSTM-CRF The BiLSTM-CRF model employs two LSTM cells (one for the forward and one for the backward direction) with an embedding size of 300 and a hidden dimension of 512 (256 per cell). The model is trained using a learning rate of 0.001.</p>
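        <p>A minimal PyTorch skeleton with these dimensions might look as follows; the vocabulary size, the label count, and the choice of CRF implementation are placeholders rather than the exact configuration used here.</p>
        <preformat>
# PyTorch skeleton matching the stated BiLSTM sizes (emission part only).
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=20_000, num_labels=7,
                 embed_dim=300, hidden_per_dir=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_per_dir,
                            batch_first=True, bidirectional=True)
        # projects the 512-dim state to emission scores e_t(y) for the CRF
        self.emit = nn.Linear(2 * hidden_per_dir, num_labels)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))   # (batch, n, 512)
        return self.emit(h)                       # (batch, n, num_labels)

model = BiLSTMEncoder()
emissions = model(torch.randint(0, 20_000, (1, 12)))
print(emissions.shape)   # torch.Size([1, 12, 7])
# trained with a learning rate of 0.001, plus a CRF loss on top
# (e.g. the pytorch-crf package's CRF layer)
        </preformat>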
        <p>Camembert-base Camembert-base is configured with an embedding/hidden size of 768, utilizes 12 transformer layers, and is trained with a learning rate of 2 × 10⁻⁵.</p>
        <p>GLiNER GLiNER utilizes the mDeBERTa-v3-large backbone (a multilingual variant of DeBERTaV3) with an embedding/hidden size of 1024 and 12 transformer layers. The model is optimized using a learning rate of 5 × 10⁻⁶.</p>
        <p>Overall Analysis
Camembert-base achieved the highest overall performance with an F1-score of 0.9534, followed closely by GLiNER with an F1-score of 0.9355. The superior performance of Camembert-base could perhaps be explained by its pre-training on French-language data [15], which enhances its ability to capture fine linguistic nuances inherent in the corpus. In contrast, GLiNER employs a backbone pretrained on CC100, a multilingual corpus [22] in which French comprises only about 3% of the tokens; this may partially explain its slightly lower performance on French-language data.</p>
        <p>Interestingly, despite the absence of a dedicated pre-trained language model, the BiLSTM-CRF model
effectively captured the specific characteristics of the hiking description corpus, achieving an F1-score
of 0.9269 while maintaining a moderate number of parameters and lower computational cost.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Model Memory Footprint, Parameter Count, and Efficiency</title>
        <p>Despite its compact architecture of approximately 8.73 million parameters and a minimal GPU memory
allocation of 53.82 MB along with only 8.98 MB CPU memory during inference, the BiLSTM-CRF model
demonstrates competitive performance relative to more complex transformer-based models. In contrast,
CamemBERT-base, with 110.05 million parameters, requires substantially greater computational
resources (430.07 MB allocated on the GPU and 324.73 MB on the CPU), achieving enhanced performance
through richer language representations. The GLiNER model, which leverages a large multilingual
DeBERTa backbone, comprises approximately 288.95 million parameters and incurs the highest memory
demands (2206.75 MB allocated on the GPU and 1709.67 MB on the CPU).</p>
        <p>These results highlight that small, specialized architectures such as BiLSTM-CRF can yield near-comparable performance with significantly lower memory and parameter footprints, making them particularly advantageous for deployment in resource-constrained settings, while the choice of a larger model backbone in GLiNER underlines the trade-off between resource investment and the potential for improved cross-lingual generalization. In addition, although our evaluation does not formally assess cross-lingual transfer performance, preliminary examples in English suggest that GLiNER's multilingual pretraining enables effective transfer of representations learned on French to other languages. Moreover, the GLiNER framework is inherently modular, allowing for the replacement of its resource-intensive multilingual DeBERTa backbone with alternatives such as CamemBERT, which offers a lower memory footprint. This flexibility provides a promising avenue for optimizing the balance between computational efficiency and performance in Named Entity Recognition tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Perspectives</title>
      <p>In this study, we compared three deep learning models—our specialized BiLSTM-CRF model,
CamemBERT-base, and GLiNER—for the extraction of spatial entities (nested or not, strong or weak)
and movement actions from French itinerary descriptions. The experimental results indicate that
transformer-based models, such as CamemBERT, effectively capture complex spatial patterns, while our specialized BiLSTM-CRF model, designed specifically for this task, offers a competitive alternative with substantially lower computational requirements. The efficiency of the BiLSTM-CRF model makes it well
suited for resource-constrained environments, and incorporating subword tokenization could further
enhance its ability to handle out-of-vocabulary terms—an issue highlighted by the misclassification of
certain named entities.</p>
      <p>The GLiNER model, which utilizes a large multilingual DeBERTa backbone, was not subjected to
a detailed cross-lingual transfer analysis; however, its design suggests that multilingual pretraining
may support transferring representations learned on French data to other languages. Moreover, its
modular architecture permits the substitution of its resource-intensive backbone with alternatives such
as CamemBERT, potentially reducing memory usage while maintaining good performance.</p>
      <p>Future work will focus on several key directions. First, the development of a gold-standard corpus
(especially for the test dataset) with manually corrected annotations is essential to overcome the
limitations of our current silver-standard dataset and to provide a more reliable benchmark. Second,
integrating higher-level structural annotations—particularly syntax-semantic dependencies linking
spatial entities with their contextual elements—could refine the extraction process. Lastly, we will
continue to investigate and refine model architectures to optimize the automated extraction and
categorization of spatial entities and movement actions from descriptive texts.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 for grammar and spelling checking. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. Grishman, B. Sundheim, Message Understanding Conference-6: a brief history, in: Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING '96, Association for Computational Linguistics, USA, 1996, pp. 466–471. URL: https://doi.org/10.3115/992628.992709. doi:10.3115/992628.992709.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. R. Vicente, La glose comme outil de désambiguïsation référentielle des noms propres purs, Corela. Cognition, représentation, langage (2005). URL: http://journals.openedition.org/corela/1212. doi:10.4000/corela.1212.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. R. Finkel, C. D. Manning, Nested named entity recognition, in: P. Koehn, R. Mihalcea (Eds.), Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2009, pp. 141–150. URL: https://aclanthology.org/D09-1015/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] N. Patil, A. Patil, B. Pawar, Named entity recognition using conditional random fields, Procedia Computer Science 167 (2020) 1181–1188. URL: https://www.sciencedirect.com/science/article/pii/S1877050920308978. doi:10.1016/j.procs.2020.03.431. International Conference on Computational Intelligence and Data Science.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780. URL: https://doi.org/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: K. Knight, A. Nenkova, O. Rambow (Eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 260–270. URL: https://aclanthology.org/N16-1030/. doi:10.18653/v1/N16-1030.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000–6010.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] C. Berragan, A. Singleton, A. Calafiore, J. Morley, Transformer based named entity recognition for place name extraction from unstructured text, International Journal of Geographical Information Science 37 (2023) 747–766. URL: https://doi.org/10.1080/13658816.2022.2133125. doi:10.1080/13658816.2022.2133125.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, GLiNER: Generalist model for named entity recognition using bidirectional transformer, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 5364–5376. URL: https://aclanthology.org/2024.naacl-long.300/. doi:10.18653/v1/2024.naacl-long.300.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, 2023. URL: http://arxiv.org/abs/2111.09543. doi:10.48550/arXiv.2111.09543. arXiv:2111.09543 [cs].</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: R. Caruana, S. Lawrence, C. Giles (Eds.), Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc., 2013, pp. 3111–3119. Introduced the Skip-gram model and Negative Sampling, foundational for word embeddings.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146. URL: https://aclanthology.org/Q17-1010/. doi:10.1162/tacl_a_00051.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] L. Moncla, M. Gaio, Perdido: Python library for geoparsing and geocoding French texts, in: First International Workshop on Geographic Information Extraction from Texts (GeoExT), Dublin, Ireland, 2023. URL: https://hal.science/hal-04049794.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. Martin, B. Muller, P. J. Ortiz Suárez, Y. Dupont, L. Romary, É. de la Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language model, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7203–7219. URL: https://aclanthology.org/2020.acl-main.645/. doi:10.18653/v1/2020.acl-main.645.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Gaio, L. Moncla, Extended Named Entity Recognition Using Finite-State Transducers: An Application To Place Names, in: The Ninth International Conference on Advanced Geographic Information Systems, Applications, and Services (GEOProcessing 2017), Nice, France, 2017. URL: https://hal.science/hal-01492994.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] H. Schmid, Probabilistic part-of-speech tagging using decision trees, in: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP 1994), 1994, pp. 44–49.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] P. Gage, A new algorithm for data compression, C Users Journal 12 (1994) 23–38.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1715–1725. URL: https://aclanthology.org/P16-1162/. doi:10.18653/v1/P16-1162.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, in: E. Blanco, W. Lu (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 66–71. URL: https://aclanthology.org/D18-2012/. doi:10.18653/v1/D18-2012.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 66–75. URL: https://aclanthology.org/P18-1007/. doi:10.18653/v1/P18-1007.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747/. doi:10.18653/v1/2020.acl-main.747.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>