<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>implemented into machine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Żaneta Pawelec</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44-100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Language recognition algorithms play a pivotal role in various domains, offering applications ranging from automatically detecting the language of textual data to powering multilingual customer support systems. As the foundation of modern technologies like Artificial Intelligence, these algorithms enable content localization, facilitate language translation services, and drive personalized marketing strategies by analyzing linguistic patterns in customer feedback and social media interactions. This project compares five machine learning algorithms for language recognition, focusing on Bayesian classifiers and K-Nearest Neighbors (KNN). Through experimentation with different variations of these algorithms, including custom implementations, the project evaluates their effectiveness in recognizing 17 foreign languages. Methodologically, the project explores the nuances of each algorithm, discussing their underlying principles and implementation details. Experimental results reveal insights into the performance of each algorithm, providing valuable considerations for practical applications. Additionally, the project discusses the significance of precision, recall, F1-score, and accuracy metrics in assessing algorithm performance. Overall, this study contributes to advancing language recognition technology, offering valuable insights into algorithmic approaches and their real-world implications.</p>
      </abstract>
      <kwd-group>
        <kwd>language recognition</kwd>
        <kwd>knn</kwd>
        <kwd>clustering</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>Bayesian classifier</kwd>
        <kwd>K-Nearest Neighbors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Our program compares five machine learning algorithms. Each algorithm is evaluated on its
effectiveness in recognizing 17 foreign languages, using different variations of the Bayesian
classifier [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and the K-Nearest Neighbours classifier [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The calculations are based on sentences of varying length
retrieved from a database.
      </p>
      <p>To take a closer look at the applied classifiers, the following paragraphs briefly describe
them and illustrate how these calculation methods differ from each other.</p>
      <p>The Naive Bayes classifier is a probabilistic machine learning model based on Bayes’ theorem,
which calculates the probability of a certain class given a set of features. It assumes that the
features are conditionally independent, hence "naive." It’s widely used for classification tasks,
especially in text classification and spam filtering.</p>
      <p>K-Nearest Neighbors (KNN) is a non-parametric supervised learning algorithm used for
classification and regression tasks. In KNN, the class of a new data point is determined by the
majority class among its k nearest neighbors in the feature space. It’s simple to implement and
understand but can be computationally expensive for large datasets (like the one we are using), as
it requires storing all training data and computing distances for each prediction.</p>
      <p>Both algorithms have different time requirements, with KNN being more computationally
expensive due to its need to calculate distances for each prediction. Let us now briefly explain
each of the applied algorithms and the reasoning behind their selection. The first classifier is
the Bayesian classifier from the library, which provides the most effective results and thus serves
as the benchmark that we tried to match with the other algorithms. Next, we independently created
a second Bayesian classifier aiming to mimic the version from the library. The third classifier is
also a modified Bayesian classifier, which determines the language by the probability of neighboring
letters. In designing this algorithm, we assumed that each language has recurring sequences of
letters, so a given sentence can be assigned to the language in which its letter sequences most
commonly occur. We derived an appropriate formula that allowed us to implement this idea in the
program. The fourth classifier is the K-Nearest Neighbours classifier from the library, but with
different distance calculation methods, which we adjusted to our specific database. The fifth
classifier is again the K-Nearest Neighbours algorithm, this time written by us. It was created
following open-access models with the intent of achieving accuracy as high as that of the imported
KNN classifier; reaching a satisfying outcome required many adjustments to the distance calculation
method. After performing the calculations, each algorithm displays a table with the results of its
effectiveness in identifying each language.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>Data from the set is divided into subsets X, containing texts in various languages, and Y,
containing the language classes of the texts from set X. The initial two Bayes classifiers and
both KNN algorithms operate on a dataset converted into a matrix of token counts using the
CountVectorizer class from the sklearn library. Each text sequence from set X is represented by a
one-dimensional vector whose length equals the size of the dictionary containing all the words from
the dataset: the words occurring in the sequence are represented by the number of their occurrences
at the appropriate position, and the remaining positions are filled with zeros.</p>
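      <p>As an illustration, a minimal sketch of this vectorization step; the variable names texts and labels are placeholders for the loaded sets X and Y, not names taken from our program:</p>
      <preformat>
# A minimal sketch of the vectorization step, assuming the dataset has been
# loaded into the lists `texts` (set X) and `labels` (set Y).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(texts)  # one row per text, one column per dictionary word
print(x_counts.shape)                       # (number of texts, size of the dictionary)
      </preformat>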
      <p>First, we used the MultinomialNB class contained in the sklearn library. For calculations, it
uses the formula:</p>
      <disp-formula><tex-math>\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}</tex-math></disp-formula>
      <p>where θ_yi is the probability P(x_i | y) of feature x_i appearing in a sample belonging to
class y; N_yi is the count of occurrences of feature x_i in class y in the training set, while N_y
is the total count of all features in class y; α is the smoothing prior, which in this case is
Laplace smoothing, α = 1; and n is the number of features, i.e. the size of the dictionary.</p>
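      <p>A minimal sketch of how this classifier can be applied, assuming the token-count matrix x_counts and the labels from the previous step:</p>
      <preformat>
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# The 70:30 split described in Section 3.
x_train, x_test, y_train, y_test = train_test_split(x_counts, labels, test_size=0.3)

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to Laplace smoothing
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
      </preformat>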
      <p>Next, we attempted to replicate the function contained in the library, aiming to obtain similar
results. However, in our version of the algorithm, we did not consider the smoothing parameter.</p>
      <p>Algorithm 1: Method ’OwnMNB.fit’ training the algorithm</p>
      <preformat>
Data: sets x_train and y_train
Result: None
C := set of values of y_train;
P := empty dictionary;
foreach c ∈ C do
    V := vectors of x_train belonging to class c;
    VSum := sum of vectors in V;
    P[c] := VSum / length of V;
      </preformat>
      <p>Algorithm 2: Method ’OwnMNB.predict’ performing calculations</p>
      <preformat>
Data: set x_test
Result: list y_pred
y_pred := empty list;
foreach x ∈ x_test do
    D := empty dictionary;
    foreach c ∈ C do
        s := sum of the vector x * P[c];
        D[c] := s;
    append to y_pred the class with the biggest value in dictionary D;
      </preformat>
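      <p>A possible Python realization of Algorithms 1 and 2 could look as follows; this is a sketch under the assumption that x_train and x_test are dense NumPy arrays (e.g. x_counts.toarray()), not the exact code of our program:</p>
      <preformat>
import numpy as np

class OwnMNB:
    """Sketch of the self-implemented Bayes classifier (no smoothing)."""

    def fit(self, x_train, y_train):
        y = np.asarray(y_train)
        self.classes = sorted(set(y_train))
        self.p = {}                              # per-class mean count vectors
        for c in self.classes:
            v = x_train[y == c]                  # vectors of x_train belonging to class c
            self.p[c] = v.sum(axis=0) / len(v)   # VSum / length of V

    def predict(self, x_test):
        y_pred = []
        for x in x_test:
            d = {c: float(np.dot(x, self.p[c])) for c in self.classes}
            y_pred.append(max(d, key=d.get))     # class with the biggest value in D
        return y_pred
      </preformat>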
      <p>In the third Bayes classifier, we changed the approach to the dataset. We utilized
dependencies specific to the construction of each language - the probability of one letter
occurring after another. The formula in this case takes the form:</p>
      <disp-formula><tex-math>P(c \mid s) = \prod_{i=2}^{n} P(l_i \mid l_{i-1}, c)</tex-math></disp-formula>
      <p>where P(c | s) is the probability of class c for the text sequence s; P(l_i | l_{i-1}, c) is
the probability of the occurrence of letter l_i after l_{i-1} in class c; and n is the number of
letters in the considered text sequence.</p>
      <p>This time, the methods are given the raw training sets X and Y and a test set X. The ’fit’ method
is responsible for creating ’neighborhood tables’ of all the letters present in the training set X,
divided by language classes. These tables contain the probabilities of the occurrence of a given
pair of letters one after the other. The ’predict’ method determines, for each element of the test
set, its membership in a class based on the probabilities from the ’neighborhood tables’.</p>
      <p>Algorithm 3: Method ’LetterProb.fit’ training the algorithm</p>
      <preformat>
Data: sets x_train and y_train
Result: None
foreach text s of class c in the training set do
    foreach pair of adjacent letters a, b in s do
        if a + b ∈ T[c] then
            T[c][a + b] += 1;
        else
            T[c][a + b] := 1;
foreach c do
    foreach k ∈ Keys(T[c]) do
        T[c][k] := T[c][k] / N;    // N: number of letter pairs counted in class c
      </preformat>
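      <p>A minimal sketch of how such a ’neighborhood table’ can be built and used; the handling of letter pairs unseen in training (the floor constant) is our assumption for illustration, as the pseudocode above does not specify it:</p>
      <preformat>
from collections import defaultdict

class LetterProb:
    """Sketch of the letter-proximity Bayes classifier."""

    def fit(self, x_train, y_train):
        counts = defaultdict(lambda: defaultdict(int))  # counts[c][pair]
        totals = defaultdict(int)                       # letter pairs seen per class
        for text, c in zip(x_train, y_train):
            for a, b in zip(text, text[1:]):            # adjacent letter pairs
                counts[c][a + b] += 1
                totals[c] += 1
        self.t = {c: {pair: n / totals[c] for pair, n in pairs.items()}
                  for c, pairs in counts.items()}       # pair probabilities per class

    def predict(self, x_test, floor=1e-10):
        y_pred = []
        for text in x_test:
            best_c, best_p = None, -1.0
            for c in sorted(self.t):                    # alphabetical tie-breaking
                p = 1.0
                for a, b in zip(text, text[1:]):
                    p *= self.t[c].get(a + b, floor)    # small floor for unseen pairs
                if p > best_p:
                    best_c, best_p = c, p
            y_pred.append(best_c)
        return y_pred
      </preformat>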
      <p>In K-Nearest Neighbors from the library, we use the scalar (dot) product of vectors to
calculate distances. We multiply this value by -1, so that the most similar vectors have the
smallest distance and there is no need to compute the k farthest neighbors instead:</p>
      <disp-formula><tex-math>d(a, b) = -\, a \cdot b = -\sum_{i} a_i b_i</tex-math></disp-formula>
      <p>where a, b are vectors.</p>
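      <p>A sketch of how such a metric can be passed to the library classifier, assuming x_train and x_test are the sparse count matrices from the earlier split; a callable metric requires the brute-force mode, and whether negative "distances" are accepted may depend on the sklearn version, so this is an illustration rather than our exact setup:</p>
      <preformat>
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def neg_dot(a, b):
    return -np.dot(a, b)       # larger dot product means a smaller distance

# Callable metrics work only with the brute-force algorithm and dense input.
knn = KNeighborsClassifier(n_neighbors=9, algorithm='brute', metric=neg_dot)
knn.fit(x_train.toarray(), y_train)
y_pred = knn.predict(x_test.toarray())
      </preformat>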
      <p>In our k-NN, we used the same formula for calculating distances as in the library algorithm,
but additionally we incorporated weighted computation of the k nearest neighbors.</p>
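      <p>A sketch of this weighted variant, assuming dense NumPy arrays; the rank-based weighting table (weights k, k-1, ..., 1) is our assumption for illustration, as the exact weights are not given above:</p>
      <preformat>
import numpy as np

def own_knn_predict(x_train, y_train, x_test, k=10):
    """Sketch of the self-written kNN with weighted voting."""
    y = np.asarray(y_train)
    y_pred = []
    for x in x_test:
        dists = -(x_train @ x)                 # negated dot products, as above
        nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
        votes = {}
        for rank, idx in enumerate(nearest):
            w = k - rank                       # assumed weighting table
            votes[y[idx]] = votes.get(y[idx], 0) + w
        y_pred.append(max(votes, key=votes.get))
    return y_pred
      </preformat>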
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>To compare the performance parameters of the applied algorithms, we utilized the metrics
module from the sklearn library. To improve the reliability of the results, each algorithm was
executed 10 times, and the final value is the average of all trials. The dataset, containing texts
in 17 languages with a total of 10,337 records, was divided into training and testing sets in a
70:30 ratio. For each algorithm, we compared the following parameters (a sketch of the computation
follows this list):
• precision - the ratio of correctly predicted elements of a class to all elements marked as
that class, TP / (TP + FP)
• recall - a measure of how many elements from a given class were correctly recognized,
TP / (TP + FN)
• f1-score - the harmonic mean of precision and recall
• support - the number of occurrences of each class in the dataset
• accuracy - the ratio of correctly classified samples to all cases in the test set
Meaning of labels:
• TP - true positive - cases that were correctly classified as positive by the classifier
• TN - true negative - cases that were correctly classified as negative by the classifier
• FP - false positive - an error where the test result incorrectly indicates the presence of a
condition when it is not present
• FN - false negative - an error where the test result incorrectly indicates the absence of a
condition when it is actually present</p>
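      <p>A minimal sketch of how these values are obtained with the metrics module, assuming y_test holds the true classes and y_pred the predictions of one of the five classifiers:</p>
      <preformat>
from sklearn import metrics

# Per-class precision, recall, f1-score and support in one table.
print(metrics.classification_report(y_test, y_pred))
# Overall ratio of correctly classified samples.
print(metrics.accuracy_score(y_test, y_pred))
      </preformat>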
      <sec id="sec-3-1">
        <title>3.1. The Bayesian algorithm from the sklearn library</title>
        <p>Analyzing the results shown in the table, we can observe that the algorithm matches most
languages with an accuracy ranging from 98% to 100% (see Tab. 1). The exception is English, with an
accuracy of only 89%, which may be due to the fact that many English words are borrowed from other
languages. Over the entire dataset, the method has an accuracy of 98%, making it the most accurate
of all the solutions we have used (Fig. 1a).</p>
        <p>Figure 1: The effectiveness results for (a) the Bayesian algorithm from the sklearn library,
(b) the self-implemented Bayesian algorithm, (c) the custom-written Bayesian algorithm for letter
proximity, (d) our k-nearest neighbors (kNN), and (e) k-nearest neighbors (kNN) from the library.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Self-implemented Bayesian algorithm</title>
        <p>During the construction of this algorithm, our goal was to achieve results similar to the
algorithm from the sklearn library. As observed, our algorithm performs worse with languages that
use specific alphabets (e.g., Arabic, Hindi) and struggles more with recognizing languages belonging
to the same family, due to similarities in words stemming from the shared ancestry of these
languages. This is particularly evident in the Germanic languages (Dutch - German, Danish - Swedish)
and the Romance languages (Spanish, French, and Portuguese). The issues with languages using
specific alphabets, and the overall decrease in accuracy for the other languages, result from the
lack of a smoothing parameter in the computational algorithm. Ultimately, we achieved an algorithm
accuracy of approximately 93%. It’s the slowest among all the algorithms but has average accuracy
(see Tab. 2 and Fig. 1b).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Custom Bayesian algorithm for letter proximity</title>
        <p>The algorithm, thanks to a completely different approach to the dataset, achieved results
different from the rest. As the measurements show, unlike the previous one, it performs best with
languages using specific alphabets. However, it struggles more with languages belonging to the same
families, for example the Germanic languages (Danish, Swedish, and Dutch) and some Romance languages
(Italian, Spanish, and Portuguese). This is due to the similar structure of these languages,
associated with their common ancestry. If more than one language had the same probability (taking
into account the rounding error of floating-point numbers), the algorithm chose the first one in
alphabetical order, hence the lower accuracy of Danish compared to Dutch, and of Dutch compared to
Swedish; similarly for the Romance languages. Ultimately, this algorithm has the lowest overall
accuracy of the tested trio, at around 89% (Tab. 3 and Fig. 1c). However, this result exceeded our
initial expectations for the algorithm.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. KNN algorithms</title>
        <p>The first test we conducted for the KNN algorithm was to assess its effectiveness for different
values of k ranging from 1 to 9. As shown in Tab. 4, the algorithm exhibited different effectiveness
across the different values of k. Therefore, we chose k=9 for the algorithm from the library
and k=10 for our own algorithm. As we can also observe, for small values of k, our algorithm has
higher effectiveness, which may be related to the use of a weighting table. As k increases, the
difference in effectiveness decreases, until eventually the algorithm from the library starts to
exhibit greater effectiveness.</p>
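        <p>A sketch of such a sweep over k for the library classifier; our own algorithm can be substituted in the same loop:</p>
        <preformat>
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# Evaluate the library kNN for k = 1..9, as in Tab. 4.
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, knn.predict(x_test))
    print(f"k={k}: accuracy={acc:.3f}")
        </preformat>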
      </sec>
      <sec id="sec-3-5">
        <title>3.5. KNN without library</title>
        <p>To shorten the execution time of the algorithm and increase its effectiveness from around
60% using the Euclidean metric, we decided to calculate the distance as the dot product of
vectors. This allowed us to save some time and increase the effectiveness to 90%. The results
indicate a strong performance of the algorithm across multiple languages. High precision
and recall in languages like Arabic, Greek, Kannada, and Tamil show that the algorithm is
particularly effective for these languages, achieving near-perfect scores. However, there are
areas for improvement, notably in Spanish, which has a lower precision (0.61) and F1 score
(0.72), indicating potential difficulties in accurately classifying this language (Tab. 5 and Fig. 1d).</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. KNN with library</title>
        <p>The algorithm from the library shows similar results for individual languages. Some of them
achieved higher scores, while others had lower ones. However, the overall accuracy remained
unchanged at 90%. The Spanish language, which our algorithm struggled with, still has a much
weaker performance compared to the rest, but this result has slightly improved (Tab. 6 and Fig. 1e).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Based on our results, the Bayes algorithm from the sklearn library performs the best, achieving
98% accuracy. Our version of this algorithm ranks second with 93% accuracy. Both KNN-based
algorithms and our Bayes classifier based on letter-pair probabilities performed the worst, though
still achieving relatively high scores of 90% and 89% accuracy, respectively. Although the KNN
algorithms handle language classification tasks well, their use in this form is not optimal in
terms of either time or memory efficiency. To achieve results similar to our Bayes algorithms, they
require almost two orders of magnitude more time. The difference between the two types of
algorithms is similarly significant in terms of the computational resources of the test platform.
The KNN classifier from the library performs calculations faster than the one we created, thanks
to its use of multi-threaded processing, while our KNN classifier performs calculations using only
a single CPU core. However, this impacts memory usage: during tests, the KNN from the library used
over 9.5 GB of the available RAM on the test platform, while our KNN algorithm required
approximately 5 GB of memory. In contrast, the Bayes algorithms did not require more than 1 GB of
RAM and, despite running on a single CPU thread, did not fully load it. None of our developed
algorithms came close to 100%. One possible future improvement would be to combine our two Bayes
classifiers to eliminate their separate weak points. An algorithm created this way would come much
closer to 100% accuracy with only slightly lower time efficiency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Obi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Claudio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Budiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Achmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurniawan</surname>
          </string-name>
          ,
          <article-title>Sign language recognition system for communicating to people with disabilities</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>216</volume>
          (
          <year>2023</year>
          )
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mengliev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Barakhnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdurakhmonova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eshkulov</surname>
          </string-name>
          ,
          <article-title>Developing named entity recognition algorithms for uzbek: Dataset insights and implementation</article-title>
          ,
          <source>Data in Brief</source>
          (
          <year>2024</year>
          )
          <fpage>110413</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaitkevičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taroza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Blažauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Damaševičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Maskeliūnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>Recognition of american sign language gestures in a virtual reality using leap motion</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>9</volume>
          (
          <year>2019</year>
          )
          <fpage>445</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nallakaruppan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Gadekallu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polap</surname>
          </string-name>
          ,
          <article-title>Child tracking and prediction of violence on children in social media using natural language processing and machine learning</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence and Soft Computing</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>560</fpage>
          -
          <lpage>569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Sustainable marketing and the role of social media: an experimental study using natural language processing (NLP)</article-title>
          ,
          <source>Sustainability</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>5443</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Langley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Iba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thompson</surname>
          </string-name>
          , et al.,
          <article-title>An analysis of bayesian classifiers</article-title>
          ,
          <source>in: AAAI</source>
          , volume
          <volume>90</volume>
          ,
          <publisher-name>Citeseer</publisher-name>
          ,
          <year>1992</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Prokop</surname>
          </string-name>
          ,
          <article-title>Grey wolf optimizer combined with k-nn algorithm for clustering problem</article-title>
          ,
          <source>in: IVUS 2022: 27th International Conference on Information Technology</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Greer</surname>
          </string-name>
          ,
          <article-title>Knn model-based approach in classification</article-title>
          ,
          <source>in: On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003, Proceedings</source>
          , Springer,
          <year>2003</year>
          , pp.
          <fpage>986</fpage>
          -
          <lpage>996</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>