Accuracy of the Uzbek Stop Words Detection: a Case Study on
“School Corpus”
Khabibulla Madatov 1, Shukurla Bekchanov1 and Jernej Vičič 2,3
1
  Urgench state university, 14, Kh. Alimdjan str, Urgench city, 220100, Uzbekistan
2
  Research Centre of the Slovenian Academy of Sciences and Arts, The Fran Ramovš Institute, Novi trg 2, 1000
  Ljubljana, Slovenija
3
  University of Primorska, FAMNIT, Glagoljaska 8, 6000 Koper, Slovenia


                Abstract
                Stop words are very important for information retrieval and text analysis investigation tasks
                of natural language processing. Current work presents a method to evaluate the quality of a
                list of stop words aimed at automatically creating techniques. Although the method proposed
                in this paper was tested on an automatically-generated list of stop words for the Uzbek
                language, it can be, with some modifications, applied to similar languages either from the
                same family or the ones that have an agglutinative nature. Since the Uzbek language belongs
                to the family of agglutinative languages, it can be explained that the automatic detection of
                stop words in the language is a more complex process than in inflected languages. Moreover,
                we integrated our previous work on stop words detection in the example of the “School
                corpus” by investigating how to automatically analyse the detection of stop words in Uzbek
                texts. This work is devoted to answering whether there is a good way of evaluating available
                stop words for Uzbek texts, or whether it is possible to determine what part of the Uzbek
                sentence contains the majority of the stop words by studying the numerical characteristics of
                the probability of unique words. The results show acceptable accuracy of the stop words lists.

                Keywords 1
                stop word detection, Uzbek language, accuracy, agglutinative language

1. Introduction
   The application of Natural Language Processing (NLP) tasks in real-life scenarios are getting more
frequent than ever before, and there is huge research getting involved with different approaches to
enhance the quality of such tasks. An important aspect of many NLP tasks that make use of tasks,
such as information retrieval, text summarization, context-embedding, etc., relies on a task of
removing unimportant tokens and words from the context under focus. Such data are known as stop
words. Therefore, it is desired that some automatic method should be developed to identify stop
words that either make no change in the meaning of the context (or do very little) and remove them.
from the context.
   In this work, we are addressing the problem of automatic detection of stop words for the low-
resource agglutinative Uzbek language, and evaluate the proposed methods. The existing literature
that deal with stop words removal task for the Uzbek language [7] [8] [10] focus on the creation
process, the importance, as well as the availability of the proposed data, leaving a gap for further
investigation, which we discuss in this paper.
   The scientific term "stop words" is popular in the field of natural language processing, and its
definition we focus in this work is as follows: If the removal of those words from the text not only

1
 The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing
(ALTNLP), June 7-8, 2022, Koper, Slovenia
EMAIL: habi1972@mail.ru (A. 1); shukurla15@gmail.com (A. 2); jernej.vicic@upr.si (A. 3)
ORCID: 0000-0002-3664-4954 (A. 1); 0000-0001-9505-5781 (A. 2); 0000-0002-7876-5009 (A. 3)
             2020 Copyright for this paper by its authors.
           Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
           CEUR Workshop Proceedings (CEUR-WS.org)
does not change the context meaning but also leaves the minimum number of words possible that can
still hold the meaning of the context, then such words can be called stop words for this work.
    For instance, the following examples are shown to better explain what words would be considered
in given sentences, and what the final context would become after removing those stop words:
      ● “Men bu maqolani qiynalib yozdim”. (I wrote this article with difficulty). After removing
         the stop words (“men”, “bu”, “qiynalib”) the context becomes: “Maqolani yozdim”.(I wrote
         the article.);
      ● “Har bir inson baxtli bo’lishga haqlidir” (Every person has the right to be happy). After
         removing the stop words (“har ”, “bir”), the context becomes: “Inson baxtli bo’lishga
         haqlidir” (Person has right to be happy).

    Such definition is an extension of the traditional definition of stop words by including more words
than the actual expectations but still including the traditional stop words.
    The Term Frequency - Inverse Document Frequency (TF-IDF) method [15] was used to detect
stop words in Uzbek texts. TF-IDF is a numerical statistic that is intended to reflect how important a
word is to a document in a corpus, the method acknowledges words with the lowest TF-IDF values as
less important to the semantic meaning of the document and proposes these words as stop word
candidates.
         In our previous work[8], we discuss the methods and algorithms for automatic detection and
extraction of Uzbek stop words from previously collected text forming a new corpus called the
“School corpus”. The stop words detection method based on TF-IDF was applied to the
aforementioned corpus collected from 25 textbooks used for teaching at primary schools of
Uzbekistan, consisting of 731,156 words, of which 47,165 are unique words. To perform our
technique, for each word from the set of unique words, its frequency was determined (the number of
occurrences in the texts of the School corpus), and the inverse document frequency IDF(word) =
ln(n/m) where n = 25 – number of documents and m is the number of documents, containing the
unique word among 25 documents.
         The existing fundamental papers that deal with stop words in general, let alone for the Uzbek
language, barely address the quality of the automatically detected list of stop words. This statement
also applies to our previous work, where a preliminary manual expert observation of a part of the lists
(only unigrams) was done. To the authors‟ knowledge, there was no in-depth observation of the
accuracy of the automatically constructed lists of stop words for agglutinative languages. For
instance, [7][8][9][10] are mostly focusing on Uzbek texts‟ stop words and methods for automatic
extraction of stop words. But none of them discusses the accuracy of the presented methods. The
article is devoted to answering whether there is a good way of evaluating available stop words for
Uzbek texts, or whether it is possible to determine what part of the Uzbek sentence contains the
majority of the stop words by studying the numerical characteristics of the probability of unique
words.
    The words were sorted by the TF-IDF value in descending order and the lowest 5 percent of them
were tagged as stop words. We used this method to automatically detect stop words in the corpus [8].
Using this information, the article focuses on the followings:
     ● To create a probability distributions model of the TF-IDF of unique words in order to
         determine the position of stop words along with the corpus;
     ● To establish the accuracy of the detection method for stop words;
     ● To conclude on automatic position detection of stop words for the given text.

The rest of the paper is structured as follows: We start by explaining the related works in the field of
stop word removal, as well as the Uzbek language itself in Section 2, followed by the main
methodology of the paper in Section 3, which includes the creation of probability distribution law of
TF-IDF of unique words (Section 3.1), the numerical characteristics of the probability of unique
words (Section 3.2), and the evaluation of the created method using a small selected chunk
(Section3.3). The accuracy of the method for automatic detection of stop words in Uzbek texts, which
is based on TF-IDF, is presented in Section 4. The last section of the paper presents conclusions and
future work (Section 5).
2. Related works
Uzbek language belongs to the family of Turkic languages. There has been some research on the
Uzbek language mostly in the last few years. Most of the research done on Turkic languages can be
applied to the Uzbek language as well, using cross-lingual learning and mapping approaches,
alongside some language-specific additions. The paper [1] presents a viability study of established
techniques to align monolingual embedding spaces for Turkish, Uzbek, Azeri, Kazakh, and Kyrgyz,
members of the Turkic family which is heavily affected by the low-resource constraint.
Several authors present experiment and propose techniques for stopwords extraction from text for
agglutinative languages such as [2] that bases the stopword detection problem as a binary
classification problem and the evaluation shows that classification methods improve stopword
detection with respect to frequency-based methods for agglutinative languages but fails for English.
Ladani and Desai [5] present an overview of stopwords removal techniques for Indian and Non-Indian
Languages. Jayaweera et al. [2] proposes a dynamic approach to find Sinhala stopwords, the cutoff
point is subjective to the dataset. Wijeratne and de Silva [17] collected the data from patent
documents and listed the stopwords using term frequency. Rakholia et al. [14] proposed a rule-based
approach to detect stopwords for the Gujarati language dynamically. They developed 11 static rules
and used them to generate a stopword list at runtime. Fayaza et al. [1] presents a list of stopwords for
Tamil language and reports improvement in text clustering using removal.
   The paper Ошибка! Источник ссылки не найден. provides the first annotated corpus for
polarity classification for the Uzbek language. Three lists of stop words for the Uzbek language are
presented in [7] that were constructed using automatic detection of stop words by applying algorithms
and methods presented in [8]. Paper [9] focuses on the automatic discovery of stop words in the
Uzbek language and its importance. Articles [12] and [13] are also mainly concentrated on the
creation of stop words in Uzbek.
   Matlatipov et. al [10] propose the first electronic dictionary of Uzbek words‟ endings invariants for
morphological segmentation pre-processing useful for neural machine translation.
   The article [11] presents the algorithm of cosine similarity of Uzbek texts, based on TF-IDF to
determine similarity. Another work on similarity in Uzbek, but this time on semantic similarity of
words, a decent amount of work went on the creation and evaluation of a semantic evaluation dataset
that possesses both similarity and relatedness scores Ошибка! Источник ссылки не найден..


3. Methodology
The scientific novelty of the methodology used in this work can be shown as follows:
   ● The creation of probability distributions law based on TF-IDF scores of unique words;
   ● Thorough investigation of numerical characteristics of the probability of unique words;
   ● Better evaluation of the stop words detection method‟s accuracy;
Summarising the automatic detection of the position of stop words in given Uzbek texts.
   In our previous work[8], we proposed the usage of TF-IDF [15] to automatically extract stop
words from a corpus of documents. The stop words are discovered based on the Term Frequency
Inverse Document Frequency – TF-IDF. The number of times a word occurs in a text is defined by
Term Frequency -- TF. Inverse Document Frequency -- IDF is defined as the number of texts
(documents) being viewed and the presence of a given word in chosen texts (documents). TF-IDF is
one of the popular methods of knowledge discovery.
   Madatov et. al [8] propose the usage of TF-IDF [15] to automatically extract stop words from a
corpus of documents. The stop words are discovered based on the frequency of the word and the
frequency of the inverse document Term Frequency – Inverse Document Frequency – TF-IDF. The
number of times a word occurs in a text is defined by Term Frequency -- TF. Inverse Document
Frequency -- IDF is defined as the number of texts (documents) being viewed and the presence of a
given word in chosen texts (documents). TF-IDF is one of the popular methods of knowledge
discovery.
3.1.     Probability distribution
In order to determine the position of the stop words throughout the school corpus, we investigate the
probability distribution law of TF-IDF scores of stop words.

          Word weight and its probability. Select a word                    from the set of unique
words extracted from a corpus. For future references these two assumptions are valid: a word
represents a unique word from a corpus and a corpus represents the “School corpus” presented in our
previous work [4]. For every calculate average TF-IDF( ), called the weight of      and denoted as
  . It is known that   is not the probability of the word .

The probability       of                can be calculated using the following formula:
∑         . We match      for each word. Now ∑                .
    The probability density function. Suppose unique words are distributed independently in the
total corpus. In that case, word can be applied multiple times. In order to escape repeating the word
    We consider only the first appearance of this word. For each word           observe i as a random
variable. As the probability density function of the unique words, we get the following function:

f(i) can be considered as the probability density function of word .
In the Cartesian coordinate plane, observe i on the OX axis and observe along the OY axis. Figure
1 presents the described observations extracted from the “School corpus”. We need it to observe the
position of stop words along with the corpus.


   Figure 1. The probability density function of unique words. The X-axis represents the index number of words, while the
Y-axis shows the probability score.


3.2.     Numerical characteristics of the probability
   This section presents numerical characteristics of the probability of unique words. They are
calculated by the following formulas:
        ∑              the mathematical expectation of the unique words
        ∑                   – dispersion of the unique words
        √ – standard deviation of the unique words
         ∑                                        of the unique words
                                      third central moment of the unique words
                       The asymmetry of the theoretical distribution

    The described values extracted from the corpus are presented in Table 1.

    Table 1: Basic statistical properties extracted from the corpus.


23310,74    23310,74     13623,72     2,52864E+12   23310,74     728996416,52   25687931167881,50   41266663785,91   0,163


   The variety of words increases gradually with grades in the school literature. It means that the
probability density function of unique words is not symmetrical. One may predict it without a
mathematical way. However, mathematically, the data in Table 1, especially,          , confirms that
the probability density function is asymmetric.


    Figure 2. The probability density function of unique words with stop words. The orange dots indicate the positions of
stop words along with the corpus.


The stop words are distributed along the axis (not grouped at one part of the axis); represented by
orange dots in Figure 2.


3.3.       Evaluation using a sub-corpus
This section presents the probability density function of unique words of selected work from the
corpus. Each book from the corpus is devoted to one topic.
The prediction: Every book consists of the culmination part of the topic, the rest can be stop words.
That is why we investigated just one book.
A random book was selected from the range of 25 books (in the corpus): 11th class literature. The
book consists of 12837 unique words. The same process that was presented in Section 3.2 was applied
to just the selected part of the corpus in order to create the probability density function of unique
words. Figure 3 shows the probability density function of 11th class literature unique words.
   Figure 3: probability density function of 11th class literature unique words


   Mathematical analysis of the distribution is presented in Table 2.
   Table 2: Distribution analysis of the selected single book


7076,623   11981425      3461,41     414472396507     7076,623     602060020      598084106956   -10667328016   -0,251


   Figure 4:Unique words from part of the corpus sorted by probability, lowest 5% are candidates for stop words

   We obtain Figure 4 by the rule of stop words detection method, as mentioned in [4].


           means that the probability density function is asymmetric.
   The values were sorted in descending order and the lowest 5 percent of them are candidates for
stop words. Figure 4 graphically represents the process, words with probability less than           are
candidates to be a stop word (     = 0,00001034371184).
   The number of these candidates is 642. 85,8% of these words is located outside of the interval
                . On the left side of the interval there are 545 stop words and on the right side are 6
stop words. The same facts can be observed graphically on Figure 5 (Taking into the account the
numerical characteristics of 5% words of selected work and comparing Figure 3 and figure 4 we
detected their position along with the text).
   Figure 5: 85,8% of the stop word candidates are indeed located outside of the(E-σ,E+σ) interval


4. Evaluation results
The accuracy of the presented method if confirmed using the following reasoning:
Let suppose hypothesis
H0: Stop words of the selected document (11th class literature) are located outside of the interval
   (E-σ,E+σ);
and alternative hypothesis
H1: Stop words of the selected document (11th class literature) are located inside of the interval (E-
σ,E+σ).

The critical value – Z (Z-score or Standard score) is obtained using this Equation:
              .; where N=12837, =6419, =7076.62,                      .
           √
In the presented task |Z|≈21,526. Z is located on the left side of E-σ, meaning there is no reason to
reject the null hypothesis.
    This is the basis for rejecting the H1 hypothesis.


5. Conclusions and further work
Throughout the work performed in this paper, we presented a natural extension of the already
presented previous research of the automatic detection of stop words in the Uzbek language [4] and
the main focus of the analysis was twofold: a) a probability distributions model of the observed text
and b) the accuracy of the detection method for stop words.
From all theoretical investigations from previous sections, it can be concluded that, for a single genre,
the majority of stopwords have the following nature:
if       , are located at the beginning parts of the text;
   if        , are located at the ending of the text;
   if        , are located at the beginning at the ending part of the text.
   In future works, we would like to use the results of this article as the basis for automatically
extracting keywords and automatically extracting the abstract of a given text.


6. Acknowledgements
   The authors gratefully acknowledge the European Commission for funding the InnoRenew CoE
project (Grant Agreement $\#$739574) under the Horizon2020 Widespread-Teaming program and the
Republic of Slovenia (Investment funding of the Republic of Slovenia and the European Union of the
European Regional Development Fund).

7. Conclusion
The paper presents a natural extension of the already presented research of automatic detection of stop
words in Uzbek language[8] and presents two goals: a) a probability distributions model of the
observed text and b) the accuracy of the detection method for stop words.
    a) The probability density is defined and later used to observe the accuracy of the automatic
       method for extraction of stop words of Uzbek language.
    b) The accuracy of the method that is presented in Section Ошибка! Источник ссылки не
       найден..

   From this fact it can be concluded that, for a single genre, more of the stop words for texts:
   if       , are located at the beginning parts of the text;
   if       , are located at the ending of the text;
   if       , are located at the beginning at the ending part of the text.
Further we use this result in the process of automatically extracting keywords from the given text and
automatically extracting the annotation of the given text.

8. References
[1] F. Fayaza, F. Farhath. "Towards stop words identification in Tamil text clustering.", (IJACSA)
    International Journal of Advanced Computer Science and Applications, Vol. 12, No. 12, (2021).
[2] A. A. V. A. Jayaweera, Y. N. Senanayake, P. S. Haddela, "Dynamic Stopword Removal for
    Sinhala Language," 2019 Natl. Inf. Technol. Conf. NITC 2019, pp. 8–10, 2019, doi:
    10.1109/NITC48475.2019.9114476.
[3] M. Kumova, B. Karaoğlan. "Stop word detection as a binary classification problem." Anadolu
    University Journal of Science and Technology A-Applied Sciences and Engineering 18, no. 2
    (2017): 346-359.
[4] E. Kuriyozov, Y. Doval, C. Gomez-Rodriguez. “Cross-Lingual Word Embeddings for Turkic
    Languages”, Proceedings of The 12th Language Resources and Evaluation Conference, pp4054--
    4062, 2020
[5] Kuriyozov, E., Matlatipov, S., Alonso, M.A. and Gómez-Rodríguez, C., 2022. Construction and
    Evaluation of Sentiment Datasets for Low-Resource Languages: The Case of Uzbek. In
    Language and Technology Conference (pp. 232-243). Springer, Cham.
[6] D. J. Ladani, N. P. Desai, "Stopword Identification and Removal Techniques on TC and IR
    applications: A Survey," 2020 6th Int. Conf. Adv. Comput. Commun. Syst. ICACCS 2020, pp.
    466–472, (2020), doi: 10.1109/ICACCS48705.2020.9074166.
[7] K. Madatov, S. Bekchanov, J. Vičič. “Lists of Uzbek Stopwords”, Zenodo, (2021), doi:
    10.5281/zenodo.6319953
[8] K. Madatov, S. Bekchanov, J. Vičič. “Automatic Detection of Stop Words for Texts in the Uzbek
    Language”, Preprints, MDPI, 2022
[9] K. Madatov, M. Sharipov, S. Bekchanov. O „zbek Tili Matnlaridaginomuhim so „zlar //Computer
    Linguistics: Problems, Solutions, Prospects. – 2021. – Т. 1. – nr. 1.
[10] S. Matlatipov, U. Tukeyev, M. Aripov. “Towards the Uzbek Language Endings as a Language
     Resource”, In: Advances in Computational Collective Intelligence. ICCCI 2020.
     Communications in Computer and Information Science, vol 1287. Springer, Cham., (2020)
[11] S. Matlatipov. "Cosine Similarity and its Implementation to Uzbek Language Data," Central
     Asian Problems of Modern Science and Education: Vol. 2020 : Iss. 4 , Article 8, (2020).
[12] I. Rabbimov, S. Kobilov, I. Mporas. Uzbek News Categorization using Word Embeddings and
     Convolutional Neural Networks. 2020 IEEE 14th International Conference on Application of
     Information     and    Communication       Technologies      (AICT).     pp      1-5,   (2020),
     doi:10.1109/AICT50176.2020.9368822
[13] I. Rabbimov, S. Kobilov. “Multi-Class Text Classification of Uzbek News Articles using
     Machine Learning”. Journal of Physics: Conference Series. (2020), doi: 10.1088/1742-
     6596/1546/1/012097
[14] R. M. Rakholia, J. R. Saini, "A Rule-Based Approach to Identify Stop Words for Gujarati
     Language," In Proceedings of the 5th International Conference on Frontiers in Intelligent
     Computing: Theory and Applications, pp. 797-806, (2017)
[15] C. Sammut, G. Webb, eds. “Encyclopedia of machine learning”. Springer Science & Business
     Media, (2011)
[16] Salaev, Ulugbek, Elmurod, Kuriyozov, and Carlos, Gomez-Rodriguez. "SimRelUz: Similarity
     and Relatedness scores as a Semantic Evaluation dataset for Uzbek language". In Proceedings of
     the the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced
     Languages (pp. 199–206). European Language Resources Association, 2022.
[17] Y. Wijeratne, N. de Silva, "Sinhala Language Corpora and Stopwords from a Decade of Sri
     Lankan Facebook," arXiv, 2020, doi: 10.2139/ssrn.3650976.