Finding the Optimal Vocabulary Size for Turkish
Named Entity Recognition
Yiğit Bekir Kaya1, A. Cüneyd Tantuğ1
1 Istanbul Technical University, Faculty of Computer Engineering, Computer Engineering Department, Istanbul, Turkey


Abstract
Transformer-based language models such as BERT [1] and its optimized variants have outperformed
previous models, achieving state-of-the-art results on many English benchmark tasks. These
multi-layered, self-attention-based architectures produce contextual word vector representations.
However, the tokens created in the tokenization preprocessing step are not necessarily words,
particularly for languages with complex morphology. The granularity of the generated tokens is a
hyperparameter to be tuned, determined chiefly by the vocabulary size. Remarkably, the effect of
this hyperparameter is not widely studied; in practice, it is either chosen arbitrarily or via
trial and error. Given Turkish's productive and complex morphology, this granularity
hyperparameter is far more important to tune than it is for English. In this work, we present
novel BERT models (named ITUTurkBERT) pretrained from scratch with various vocabulary sizes on
the BERTurk corpus [2] and fine-tuned for the named entity recognition (NER) downstream task in
Turkish, achieving state-of-the-art performance (average 5-fold CoNLL F1 score of 0.9372) on the
WikiANN dataset [3]. Our empirical experiments demonstrate that increasing the vocabulary size
lowers token granularity (fewer tokens per word), which in turn yields better NER performance.

Keywords
named entity recognition, Turkish, BERT, hyperparameter tuning




1. Introduction
The multilingual BERT model was pretrained on Wikipedia data for 104 languages and has proved
effective on a wide range of NLP tasks. Even though Google designed multilingual BERT as a
universal language model, it does not consistently achieve state-of-the-art results on
downstream tasks in non-English languages because it cannot fully capture the specific linguistic
characteristics of every language. To address this shortcoming, we pretrained language models
that incorporate the missing knowledge of the Turkish language.
   Developing a BERT model for Turkish NLP tasks is challenging because Turkish is an
agglutinative language, morphologically richer than inflectional languages such as English and
German. Because of this, the out-of-vocabulary (OOV) problem is relatively more severe for
Turkish. Moreover, NER is a token-based downstream task, i.e., the model labels each token in a
sentence for named entities. Thus, tokenization is even more crucial for NER than for
sentence-level tasks such as sentiment analysis.

The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural
Language Processing (ALTNLP), June 7-8, 2022, Online
kayayig@itu.edu.tr (Y. B. Kaya); tantug@itu.edu.tr (A. C. Tantuğ)
    Tokenization algorithms such as WordPiece [4] or SentencePiece [5], which decompose words
into pieces called tokens or subwords, address the OOV problem by representing any word as a
sequence of subwords. The granularity of the resulting tokens is governed by a few
hyperparameters, the most essential being the vocabulary size. However, the effect of this
hyperparameter is not well understood; in practice, it is either chosen arbitrarily or via
trial and error [6].
    In this work, we attempt to answer the following questions: "What value of WordPiece
vocabulary size is best for NER?", "Does normalization improve NER performance?", and, more
importantly, "Why this specific vocabulary size?". These questions boil down to how NER
utilizes the subwords produced by the tokenizer. Since NER fine-tuning uses only the first
subword of each word by default, the question becomes: which tokenization setting (normalization
and vocabulary size) yields the most informative first token? We answer this question in Section 3.
   The contributions of this paper are as follows: we pretrained novel Turkish-specific BERT
language models with different vocabulary sizes and normalization settings, described in
Sections 4.2 and 4.3. Section 4.2 also explains how we fine-tuned our language models for the
best NER performance. Section 3 explains, with evidence, why some vocabulary sizes are better
than others for Turkish. We describe our experimental setup in Section 4 and our results in
Section 4.4.


2. Related Work
2.1. Language-Specific BERT Models for Morphologically Rich Languages
Multilingual BERT is pretrained on Wikipedia data for 104 languages. Although multilingual
BERT shows remarkable cross-lingual ability, language-specific BERT models have been proposed
to improve performance for morphologically rich languages. Such models have been trained for
Korean [7], Finnish [8], Hebrew [9], and Turkish [2], with which we share our pretraining corpus.

2.2. Effect of Vocabulary Size on NLP Performance
Vocabulary size is a commonly tuned hyperparameter across NLP problems. Its effect has been
studied for neural machine translation [10], [11], and for natural language understanding tasks
such as document classification and natural language inference [12].

2.3. Turkish Named Entity Recognition
The named entity recognition downstream task has been widely studied for Turkish, and several
of these studies share the same named entity recognition dataset [13], [14], [15]. The first
Turkish NER study dates back to 1999 [16].
3. Tokenization and Granularity
We started our tokenization process by replacing the standard WordPiece [4] algorithm with
rule-based tokens generated by the Zemberek [17] morphological analyzer and disambiguator. To
use this tokenization scheme, we lowercased the datasets.
   Using morphological tags did not improve performance compared to surface forms. After
investigating the cause of this performance decrease, we determined that the increased
granularity meant less information fit into the maximum sequence length of 512 tokens. We then
used inflection groups to combine tags into chunks, which improved the results but still did
not surpass surface forms. Finally, we decided to increase the vocabulary size to minimize
granularity and obtain the best possible results.
   During fine-tuning for the named entity recognition task, the model uses only the first
token of each word; the remaining tokens are excluded from label prediction, making the first
token of each word very important. The first token depends on the granularity of a word, which
is determined by either the vocabulary size or how the tokens are separated; e.g., with a
morphological tokenizer, the first token would usually be the root of the word. Table 1
demonstrates that increasing the vocabulary size alters how the WordPiece algorithm tokenizes
words and what the first token will be. With a 256k vocabulary, whole words can be processed as
single tokens, increasing the depth of language understanding.

                                       Vocab Size (k)
   Word        32                   64                128              256         Label
   velazquez   vel ##az ##que ##z   vela ##z ##quez   vela ##z ##quez  velazquez   I-PER
   winehouse   win ##eh ##ous ##e   win ##eh ##ouse   wine ##house     winehouse   I-PER
   komnenos    kom ##ne ##no ##s    kom ##nen ##os    komnen ##os      komnenos    I-PER
   yıldızeli   yıldız ##eli         yıldız ##eli      yıldızeli        yıldızeli   I-LOC
Table 1
Example tokenizations of each word for different vocabulary sizes, trained using the same corpus

                                        Turkish               English
                       Vocab Size   All Words   NEs       All Words   NEs
                       32k          1.45        1.77      1.24        1.56
                       64k          1.33        1.56      N/A         N/A
                       128k         1.24        1.41      N/A         N/A
                       256k         1.18        1.31      N/A         N/A
Table 2
Granularity (tokens per word) for the WikiANN dataset for Turkish and the CoNLL-2003 dataset [18]
for English, for different vocabulary sizes, considering all words and named entities only



   Average granularity, i.e., tokens per word, is 1.116 for English on BookCorpus [19] with a
32k vocabulary, while it is 2.886 for Turkish on the BERTurk corpus with the same vocabulary
size. This significant difference requires Turkish to have a much larger vocabulary to maintain
a similar granularity.
   The granularity on the downstream datasets follows the same pattern, as seen in Table 2. A
128k Turkish vocabulary has the same granularity as the 32k English vocabulary when considering
all words, while a 64k Turkish vocabulary matches it when considering only named entities.
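   To make the granularity comparison concrete, the following is a minimal sketch (not the exact
script used in this work) of how tokens per word can be computed with HuggingFace's Tokenizers
library, assuming WordPiece vocabulary files such as vocab-32k.txt have already been trained; the
file names and the example sentence are illustrative.

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical vocabulary files trained on the same corpus with different sizes.
VOCAB_FILES = {"32k": "vocab-32k.txt", "64k": "vocab-64k.txt",
               "128k": "vocab-128k.txt", "256k": "vocab-256k.txt"}

def granularity(vocab_file, sentences):
    """Average number of WordPiece tokens per whitespace-separated word."""
    tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=True)
    n_tokens, n_words = 0, 0
    for sentence in sentences:
        words = sentence.split()
        n_words += len(words)
        # Tokenize word by word so that subwords are attributed to their word.
        for word in words:
            n_tokens += len(tokenizer.encode(word, add_special_tokens=False).tokens)
    return n_tokens / n_words

sample = ["winehouse yıldızeli yakınlarında bir konser verdi"]
for size, path in VOCAB_FILES.items():
    print(size, round(granularity(path, sample), 3))
```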


4. Experiments and Results
Our experiments involve pretraining BERT models on the BERTurk corpus, which combines four
datasets. The BERTurk model [2] was trained on this corpus, and we generated an additional
normalized version of it using the Zemberek [17] normalizer and other minor normalization
steps, explained in Section 4.2. Vocabularies are learned with HuggingFace's Tokenizers [20]
library, presented in Section 4.3 and depicted in Fig. 1, followed by fine-tuning for named
entity recognition on the WikiANN dataset. We compared the results against two baselines: the
uncased pretrained multilingual BERT model and BERTurk's 32k and 128k vocabulary BERT models.




Figure 1: Data Flow of our Pretraining and Fine-tuning pipeline for Named Entity Recognition
4.1. WikiANN Dataset
The WikiANN dataset [3] is a collection of Turkish Wikipedia sentences with semi-automatic
annotations for three entity types: Person, Location, and Organization; Table 3 presents their
distribution.

                          WikiANN         Training   Validation   Test
                          Location        9679       5014         4914
                          Organization    7970       4129         4154
                          Person          8833       4374         4519
                          Total words     149786     75930        75731
Table 3
WikiANN Dataset Named Entity Classes for Training, Validation and Test Splits
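   For reference, the Turkish portion of WikiANN is available through HuggingFace's Datasets
library; the sketch below shows one convenient way to obtain it, which is our assumption rather
than necessarily how the dataset was retrieved in this work.

```python
from datasets import load_dataset

# The Turkish split of WikiANN as hosted on the HuggingFace Hub (an assumption;
# the original WikiANN release could be used directly instead).
wikiann_tr = load_dataset("wikiann", "tr")
print(wikiann_tr)  # train / validation / test splits with tokens and ner_tags
label_names = wikiann_tr["train"].features["ner_tags"].feature.names
print(label_names)  # ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```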



4.2. Preprocessing and Normalization
We merged the training, validation, and test splits and turned them into a 5-fold dataset to
average out the bias of the given partition.
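   A minimal sketch of this re-partitioning is shown below; the fold seed and the use of
scikit-learn are illustrative choices rather than the exact setup used in our experiments.

```python
from datasets import concatenate_datasets, load_dataset
from sklearn.model_selection import KFold

# Merge the provided splits and re-split them into 5 folds.
wikiann_tr = load_dataset("wikiann", "tr")
full = concatenate_datasets(
    [wikiann_tr["train"], wikiann_tr["validation"], wikiann_tr["test"]]
)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # seed is illustrative
for fold, (train_idx, test_idx) in enumerate(kfold.split(list(range(len(full))))):
    train_ds, test_ds = full.select(train_idx), full.select(test_idx)
    print(f"fold {fold}: {len(train_ds)} train / {len(test_ds)} test sentences")
```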
   We normalized the dataset iteratively. First, all words are lowercased. Next, because the
BERTurk data contains characters in multiple encodings, some letters are replaced with their
counterparts so that the text is reduced to a single encoding. Using the Zemberek normalizer,
unknown words are detected and manually replaced with their correct forms. Extra cleaning for
HTML tags and URLs is also applied. Finally, we removed all characters that are not letters,
numbers, or a small set of punctuation marks. We apply this normalization to both the
pretraining and fine-tuning corpora, so there are no out-of-vocabulary issues.
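   A minimal sketch of the character-level part of this normalization is given below; the
allowed punctuation set, the character mapping, and the regular expressions are illustrative
assumptions rather than the exact rules used for ITUTurkBERT, and the Zemberek-based
unknown-word replacement (a manual step) is omitted.

```python
import re

ALLOWED_PUNCT = ".,;:!?'\"()-"  # illustrative; the exact set is not listed here
# Example mapping of look-alike characters from other encodings to a single form.
CHAR_MAP = {"’": "'", "‘": "'", "“": '"', "”": '"', "–": "-", "—": "-"}

def normalize(text):
    # Note: Turkish-aware lowercasing (İ -> i, I -> ı) needs special handling
    # that str.lower() does not provide; Zemberek covers this in our pipeline.
    text = text.lower()
    for src, dst in CHAR_MAP.items():          # unify encodings
        text = text.replace(src, dst)
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    # keep only letters (including Turkish ones), digits, whitespace, allowed punctuation
    text = re.sub(r"[^0-9a-zçğıöşüâîû\s" + re.escape(ALLOWED_PUNCT) + r"]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Ankara’da <b>konser</b>: https://example.com %100 güzel!"))
# -> "ankara'da konser : 100 güzel!"
```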

4.3. Pretraining and Fine-tuning
We used the original BERT-base hyperparameters from the Transformers library. The original
BERT-base model is an encoder-only Transformer with 12 hidden layers, each having 768 hidden
vector units and a feed-forward intermediate size of 3072, with GELU activation and hidden and
attention dropout probabilities of 0.1.
   Our ITUTurkBERT models learn their vocabularies with the BERT WordPiece tokenizer from
HuggingFace's Tokenizers library on the BERTurk corpus, using different vocabulary sizes.
Furthermore, we generated normalized vocabularies by applying the same process to the
normalized version of the BERTurk corpus.
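   A sketch of this vocabulary-learning step is given below, assuming the BERTurk corpus is
available as plain-text files; the file names, output directories, and trainer settings other
than the vocabulary size are illustrative.

```python
import os
from tokenizers import BertWordPieceTokenizer

# Illustrative corpus file names; the BERTurk corpus combines four datasets.
corpus_files = ["berturk_part1.txt", "berturk_part2.txt"]

for vocab_size in (32_000, 64_000, 128_000, 256_000):
    tokenizer = BertWordPieceTokenizer(lowercase=True)
    tokenizer.train(
        files=corpus_files,
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        wordpieces_prefix="##",
    )
    out_dir = f"vocab-{vocab_size // 1000}k"
    os.makedirs(out_dir, exist_ok=True)
    tokenizer.save_model(out_dir)  # writes vocab.txt into out_dir
```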
   Fine-tuning is run by adding a dropout and a linear classification layer on top of BERT's
output. For fine-tuning, we used the hyperparameters listed in Table 4.
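   Because only the first subword of each word is labeled (Section 3), fine-tuning requires
aligning word-level NER labels to subword tokens. The sketch below shows the standard alignment
scheme with a fast HuggingFace tokenizer; the checkpoint name and label values are hypothetical.

```python
from transformers import BertTokenizerFast

# Hypothetical path to one of the pretrained ITUTurkBERT checkpoints.
tokenizer = BertTokenizerFast.from_pretrained("itu-turkbert-256k")

def align_labels(words, word_labels, max_length=512):
    """Label only the first subword of each word; ignore the rest with -100."""
    encoding = tokenizer(words, is_split_into_words=True,
                         truncation=True, max_length=max_length)
    labels, previous_word = [], None
    for word_id in encoding.word_ids():
        if word_id is None:                # [CLS], [SEP]
            labels.append(-100)
        elif word_id != previous_word:     # first subword of a word
            labels.append(word_labels[word_id])
        else:                              # subsequent subwords are ignored
            labels.append(-100)
        previous_word = word_id
    encoding["labels"] = labels
    return encoding

# Label ids are illustrative (e.g., 1 = B-PER, 2 = I-PER, 5 = B-LOC).
example = align_labels(["amy", "winehouse", "yıldızeli'nde"], [1, 2, 5])
print(example["labels"])  # -100 for special tokens and non-first subwords
```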
   We ran vocabulary generation and fine-tuning tasks on ITU AI Center's infrastructure using
Nvidia V100 GPUs and HuggingFace's PyTorch-based Transformers library. All pretraining from
scratch was performed on virtual machines with Google's v3-8 TPUs, provided by the TRC program
on Google Cloud, using the original TensorFlow-based pretraining script. After pretraining, we
converted the TensorFlow checkpoints to PyTorch-compatible bin files so that the models trained
from scratch could be used with the PyTorch-based fine-tuning scripts provided by Transformers.
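   A sketch of this checkpoint conversion is shown below, assuming a TF1-style checkpoint
directory produced by the original BERT pretraining script; the paths and the output directory
name are illustrative, and TensorFlow must be installed to read the checkpoint.

```python
from transformers import BertConfig, BertForPreTraining

# Illustrative paths to the TPU pretraining output and the target directory.
config = BertConfig.from_json_file("tf_output/bert_config.json")

# from_tf=True loads the weights from a TensorFlow checkpoint into the PyTorch model.
model = BertForPreTraining.from_pretrained(
    "tf_output/bert_model.ckpt.index", from_tf=True, config=config
)
model.save_pretrained("itu-turkbert-256k")  # writes pytorch_model.bin and config.json
```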
                                  Hyper Parameter         Value
                                  BATCH_SIZE              16
                                  NUM_EPOCHS              10
                                  MAX_SEQ_LENGTH          512
Table 4
We fine-tuned the ITUTurkBERT models in mini-batches of size 16 for ten epochs with a maximum
sequence length of 512, meaning that longer sequences are truncated




4.4. Results

                                                       Vocabulary Size (k)
                          Model                 32         64     128      256
                    Multilingual BERT          N/A       N/A     0.931     N/A
                        BERTurk               0.9271     N/A     0.933     N/A
                      ITUTurkBERT             0.9152     0.931   0.935 0.9372
                 ITUTurkBERT Normalized       0.9271    0.9292 0.9351 0.9366
Table 5
Turkish Named Entity Recognition CoNLL F1 (5-fold Avg.) for each Turkish language model, including
baselines



  We present the results of the named entity recognition task in Table 5. We observed that
using a more extensive vocabulary improved the CoNLL F1 score, whereas normalizing the input
slightly decreased the performance.
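  The reported scores are entity-level (CoNLL-style) F1 values; the sketch below illustrates how
such a score can be computed from gold and predicted tag sequences with the seqeval package,
which is an assumed choice of tooling rather than a statement of the exact evaluation script.

```python
from seqeval.metrics import f1_score

# Gold and predicted IOB2 tag sequences for two illustrative sentences.
y_true = [["B-PER", "I-PER", "O"], ["B-LOC", "O", "O"]]
y_pred = [["B-PER", "I-PER", "O"], ["B-ORG", "O", "O"]]

# Entity-level F1 in the CoNLL-2003 style: an entity counts as correct only if
# both its type and its full span match the gold annotation.
print(f1_score(y_true, y_pred))  # 0.5
```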
  Since Multilingual BERT is pretrained on a corpus spanning multiple languages, it has an
inherent advantage for named entity recognition, as many named entities (such as person and
place names) appear across languages. Also, having a more extensive corpus than BERTurk and a
large vocabulary made Multilingual BERT more successful than the models with smaller vocabulary
sizes. However, every model with a more extensive vocabulary outperformed multilingual BERT,
indicating the importance of the granularity of the input data.


5. Conclusion
We have demonstrated that our Turkish-specific ITUTurkBERT models deal effectively with the
Turkish named entity recognition task. The ITUTurkBERT models outperform multilingual BERT on
named entity recognition, which indicates that they capture language-specific linguistic
phenomena.
  We also showed that increasing the vocabulary size consistently improved NER performance,
while normalization of the data was only partially effective.
   Compared to the BERTurk model, the ITUTurkBERT models achieve higher or comparable named
entity recognition performance. We investigated the effect of vocabulary size on the
granularity of WordPiece tokenization for a morphologically rich language and can confirm that
vocabulary size is a crucial parameter to consider for the NER task. Our contribution was to
test the efficacy of vocabulary size on the named entity recognition task in a morphologically
rich language, Turkish.
   For future work, we plan to extend our tests to other named entity recognition datasets in
the literature and to enrich the NER label computation by including the other tokens of each
word. We further plan to evaluate other downstream tasks as well as cased and newer BERT
variants such as ELECTRA or DistilBERT. Finally, we plan to add a CRF layer on top of the BERT
output instead of a linear layer, as has recently been done in the literature.


6. Acknowledgements
Our research is supported by Cloud TPUs from Google's TPU Research Cloud (TRC), enabling us to
achieve state-of-the-art results. We also thank Stefan Schweter for providing us with the
BERTurk corpus, which we used for pretraining our models so that their performance could be
compared fairly to the original BERTurk model.
  We thank the HuggingFace team for providing the libraries to generate custom WordPiece
vocabularies, implement WordPiece tokenization, and fine-tune our pretrained models for the
named entity recognition task. Furthermore, we thank the Zemberek team for the normalization,
morphological analysis, and disambiguation tools.


References
 [1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [2] S. Schweter, Berturk - bert models for turkish, 2020. URL: https://doi.org/10.5281/zenodo.
     3770924. doi:10.5281/zenodo.3770924.
 [3] X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, H. Ji, Cross-lingual name tag-
     ging and linking for 282 languages, in: Proceedings of the 55th Annual Meeting
     of the Association for Computational Linguistics (Volume 1: Long Papers), Associ-
     ation for Computational Linguistics, Vancouver, Canada, 2017, pp. 1946–1958. URL:
     https://www.aclweb.org/anthology/P17-1178. doi:10.18653/v1/P17-1178.
 [4] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,
     K. Macherey, et al., Google’s neural machine translation system: Bridging the gap between
     human and machine translation, arXiv preprint arXiv:1609.08144 (2016).
 [5] T. Kudo, Subword regularization: Improving neural network translation models with
     multiple subword candidates, arXiv preprint arXiv:1804.10959 (2018).
 [6] E. Salesky, A. Runge, A. Coda, J. Niehues, G. Neubig, Optimizing segmentation granularity
     for neural machine translation, Machine Translation 34 (2020) 41–59. URL: http://dx.doi.
     org/10.1007/s10590-019-09243-8. doi:10.1007/s10590-019-09243-8.
 [7] S. Lee, H. Jang, Y. Baik, S. Park, H. Shin, Kr-bert: A small-scale korean-specific language
     model, arXiv preprint arXiv:2008.03979 (2020).
 [8] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, S. Pyysalo,
     Multilingual is not enough: Bert for finnish, arXiv preprint arXiv:1912.07076 (2019).
 [9] A. Seker, E. Bandel, D. Bareket, I. Brusilovsky, R. S. Greenfeld, R. Tsarfaty, Alephbert: A
     hebrew large pre-trained language model to start-off your hebrew nlp application with,
     arXiv preprint arXiv:2104.04052 (2021).
[10] T. Gowda, J. May, Finding the optimal vocabulary size for neural machine translation,
     arXiv preprint arXiv:2004.02334 (2020).
[11] K. Park, J. Lee, S. Jang, D. Jung, An empirical study of tokenization strategies for various
     korean nlp tasks, 2020. arXiv:2010.02534.
[12] W. Chen, Y. Su, Y. Shen, Z. Chen, X. Yan, W. Wang, How large a vocabulary does
     text classification need? a variational approach to vocabulary selection, arXiv preprint
     arXiv:1902.10339 (2019).
[13] O. Güngör, S. Üsküdarlı, T. Güngör, Recurrent neural networks for turkish named entity
     recognition, in: 2018 26th Signal Processing and Communications Applications Conference
     (SIU), IEEE, 2018, pp. 1–4.
[14] R. Yeniterzi, Exploiting morphology in turkish named entity recognition system, in:
     Proceedings of the ACL 2011 Student Session, 2011, pp. 105–110.
[15] G. A. Şeker, G. Eryiğit, Initial explorations on using crfs for turkish named entity recogni-
     tion, in: Proceedings of COLING 2012, 2012, pp. 2459–2474.
[16] S. Cucerzan, D. Yarowsky, Language independent named entity recognition combining
     morphological and contextual evidence, in: 1999 joint SIGDAT conference on empirical
     methods in natural language processing and very large corpora, 1999.
[17] A. A. Akın, M. D. Akın, Zemberek, an open source nlp framework for turkic languages,
     Structure 10 (2007) 1–5.
[18] E. F. Sang, F. De Meulder, Introduction to the conll-2003 shared task: Language-independent
     named entity recognition, arXiv preprint cs/0306050 (2003).
[19] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning
     books and movies: Towards story-like visual explanations by watching movies and reading
     books, in: Proceedings of the IEEE international conference on computer vision, 2015, pp.
     19–27.
[20] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, et al., Huggingface’s transformers: State-of-the-art natural language pro-
     cessing, arXiv preprint arXiv:1910.03771 (2019).