<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Language recognition implemented into machine learning algorithms *</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Żaneta</forename><surname>Pawelec</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Grzegorz</forename><surname>Grochowski</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aleksandra</forename><surname>Starowicz</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Language recognition implemented into machine learning algorithms *</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">753DCF424E58961EF267800D3112A49D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>language recognition</term>
					<term>knn</term>
					<term>clustering</term>
					<term>artificial intelligence</term>
					<term>Bayesian classifier</term>
					<term>K-Nearest Neighbors</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Language recognition algorithms play a pivotal role in various domains, offering applications ranging from automatically detecting the language of textual data to powering multilingual customer support systems. As the foundation of modern technologies like Artificial Intelligence, these algorithms enable content localization, facilitate language translation services, and drive personalized marketing strategies by analyzing linguistic patterns in customer feedback and social media interactions. This project compares five machine learning algorithms for language recognition, focusing on Bayesian classifiers and K-Nearest Neighbors (KNN). Through experimentation with different variations of these algorithms, including custom implementations, the project evaluates their effectiveness in recognizing 17 foreign languages. Methodologically, the project explores the nuances of each algorithm, discussing their underlying principles and implementation details. Experimental results reveal insights into the performance of each algorithm, providing valuable considerations for practical applications. Additionally, the project discusses the significance of precision, recall, F1-score, and accuracy metrics in assessing algorithm performance. Overall, this study contributes to advancing language recognition technology, offering valuable insights into algorithmic approaches and their real-world implications.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Language recognition algorithms offer numerous applications across various domains <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. From automatically detecting the language of textual data to increasing performance of spam filtering and powering multilingual customer support systems. The importance of these algorithms enhances every day and becomes the crucial foundation for developing modern technologies such as Artificial Intelligence. Furthermore, they enable content localization, facilitate language translation services, and drive personalized marketing strategies <ref type="bibr" target="#b4">[5]</ref> by analyzing linguistic patterns in customer feedback and social media interactions. We can easily spot them in our daily lives, using social media, web browsers and so on, that is why their accuracy and efficiency need to be constantly improved in order to make things easier. Moreover, language recognition algorithms underpin voice assistants and speech recognition systems, contributing to seamless user experiences. With their ability to discern linguistic nuances and patterns, language recognition algorithms continue to fuel innovation and efficiency across a wide array of real-life problems.</p><p>Our program aims to compare five machine learning algorithms. All algorithms calculate the effectiveness of recognizing 17 foreign languages using different variations of the Bayesian classifier <ref type="bibr" target="#b5">[6]</ref> and K-Nearest Neighbours classifier <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>. The calculations are based on a longer or shorter sentence retrieved from a database.</p><p>To get a closer look into the applied classifiers, the following paragraphs will briefly describe them to illustrate how different these calculation methods are from each other.</p><p>The Naive Bayes classifier is a probabilistic machine learning model based on Bayes' theorem, which calculates the probability of a certain class given a set of features. It assumes that the features are conditionally independent, hence "naive." It's widely used for classification tasks, especially in text classification and spam filtering.</p><p>K-Nearest Neighbors (KNN) is a non-parametric supervised learning algorithm used for classification and regression tasks. In KNN, the class of a new data point is determined by the majority class among its k nearest neighbors in the feature space. It's simple to implement and understand but can be computationally expensive for large datasets (like the one we are using), as it requires storing all training data and computing distances for each prediction.</p><p>Both algorithms have varying time consumption, with KNN being more computationally expensive due to its need to calculate distances for each prediction. Now, let's delve into a brief explanation of each of the applied algorithms and the underlying thought process behind their selection. The first classifier is the Bayesian classifier from the library, which provides the most effective results and thus serves as the main benchmark that we tried to achieve in the other algorithms. Next, we independently create a second Bayesian classifier aiming to mimic the version from the library. 
The third classifier is also a modified Bayesian classifier, which determines the language from the probabilities of neighboring letters. In designing this algorithm we assumed that each language has recurring sequences of letters, so a given sentence can be assigned to the language in which its sequences most commonly occur; we derived an appropriate formula that let us implement this idea in the program. The fourth classifier is the K-Nearest Neighbours classifier from the library, but with different distance-calculation methods that we adjusted to our specific database. The fifth classifier is also a K-Nearest Neighbours algorithm, in this instance written by ourselves. It was created following open-access models with the intent of achieving accuracy as high as that of the imported KNN classifier; reaching a satisfying outcome required many adjustments to the distance-calculation method. After performing the calculations, each algorithm displays a table with the effectiveness of identifying each language (a sketch of this comparison loop is given below).</p></div>
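<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a sketch, the comparison can be organized as a single loop over the five classifiers, assuming each exposes a scikit-learn style fit/predict interface. OwnMNB and LetterProb are the method names used in Algorithms 1-3 later in this paper (Python renderings of them appear there as well); OwnKNN is a hypothetical name for our self-written k-NN, and x_train, x_test, y_train, y_test are the prepared splits.</p><code lang="python">
# Hypothetical comparison harness: every classifier is assumed to
# expose fit(x, y) and predict(x) like a scikit-learn estimator.
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

classifiers = {
    "Bayes (sklearn)": MultinomialNB(),
    "Bayes (ours)": OwnMNB(),                     # Algorithms 1-2 below
    "Bayes (letter proximity)": LetterProb(),     # Algorithm 3 below
    "KNN (sklearn)": KNeighborsClassifier(n_neighbors=9),
    "KNN (ours)": OwnKNN(k=10),                   # hypothetical class name
}

for name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(name)
    print(classification_report(y_test, y_pred))  # per-language results table
</code></div>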
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>Data from the set is divided into subsets X, containing texts in various languages, and Y containing the language classes of the texts from set X. The initial two Bayes classifiers and both KNN algorithms operate on a dataset converted into a matrix of token counts using the CountVectorizer class from the sklearn library. This is a one-dimensional matrix of the length of the dictionary containing all the words from the dataset. Each text sequence from set X is represented by such a matrix, where the words occurring in this sequence are represented by the number of their occurrences in the appropriate matrix position and the rest are filled with zeros.</p><p>First, we used the MultinomialNB class contained in the sklearn library. For calculations, it uses the formula:</p><p>where: 𝜃 𝑦𝑖 is the probability P(𝑥 𝑖 | 𝑦) of feature 𝑥 𝑖 appearing in a sample belonging to class 𝑦.</p><p>is the count of occurrences of parameter 𝑖 in class 𝑦 in the training set, while is the number of all parameters in set 𝑦. 𝛼 is the smoothing prior, which in this case is Laplace smoothing -𝛼 = 1. 𝑛 is the number of classes in set Y.</p><p>Next, we attempted to replicate the function contained in the library, aiming to obtain similar results. However, in our version of the algorithm, we did not consider the smoothing parameter. In the third Bayes classifier, we changed the approach to the dataset. We utilized individual dependencies on the construction of each language -the probability of one letter occurring after another. The formula in this case takes the form: This time, the methods are given raw training sets X and Y, and a test set X. The 'fit' method is responsible for creating a 'neighborhood table' of all the letters present in the training set X divided by language classes. They contain the probabilities of the occurrence of a given pair of letters one after the other. The 'predict' method for the test set determines membership in a class based on the probabilities from the 'neighborhood tables'. In K-Nearest Neighbors from the library, we use scalar vector multiplication to calculate distances. We multiply this value by -1 to avoid the need to compute the k-farthest neighbors further.</p><p>where a,b are vectors</p><p>In our k-NN, we used the same formula for calculating distances as in the library algorithm, but additionally, we incorporated weighted computation of the 𝑘 nearest neighbors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>To compare the different performance parameters of the used algorithms, we utilized the metrics module from the sklearn library. To improve the accuracy of the results, each algorithm was executed 10 times, and the final value is the average of all trials. The dataset containing texts in 17 languages with a total length of 10,337 records was divided into training and testing sets in a 70:30 ratio. For each algorithm, we compared parameters such as:</p><p>• precision -it is a measure that determines the ratio of correctly predicted class elements to all those marked as the given class</p><p>• recall -a measure informing how many elements from a given class were correctly recognized</p><p>• f1-score -it is the harmonic mean between precision and recall</p><p>• support-a measure of the occurrences of each class in the dataset • accuracy -it is the ratio of correctly classified samples to all cases in the test set Meaning of labels:</p><p>• TP -true positive -cases that were correctly classified as positive by the classifier • TN -true negative -cases that were correctly classified as negative by the classifier • FP -false positive -an error where the test result incorrectly indicates the presence of a condition when it is not present • FN -false negative -an error where the test result incorrectly indicates the absence of a condition when it is actually present</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">The Bayesian algorithm from the sklearn library</head><p>Analyzing the results shown in the above table, we can observe that the algorithm matches most languages with an accuracy ranging from 98-100% (see Tab. 1). The exception is the English language, which has an accuracy of only 89%, which may be due to the fact that English words are borrowed from other languages. The method for the entire dataset has an accuracy of 98%, making it the most accurate of all the solutions we have used (Fig. <ref type="figure" target="#fig_4">1a</ref>).   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Self-implemented Bayesian algorithm</head><p>During the construction of this algorithm, our goal was to achieve results similar to the algorithm from the sklearn library. As observed, our algorithm performs worse with languages that use specific alphabets (e.g., Arabic, Hindi) and struggles more with recognizing languages belonging to the same family due to similarities in words stemming from the shared ancestry of these languages. This is particularly evident in Germanic languages: Dutch -German, Danish -Swedish, and Romance languages: Spanish, French, and Portuguese. However, the issues with languages using specific alphabets and the overall decrease in accuracy of other languages result from the lack of a smoothing parameter in the computational algorithm. Ultimately, though, in general terms, we achieved an algorithm accuracy of approximately 93%. It's the slowest among all algorithms but has average accuracy (see Tab. 2 and Fig. <ref type="figure" target="#fig_4">1b</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Custom Bayesian algorithm for letter proximity</head><p>The algorithm, thanks to a completely different approach to the dataset, achieved results different from the rest. As the measurements show, unlike the previous one, it performs best with languages using specific alphabets. However, it struggles more with languages belonging to the same families. For example, with Germanic languages (Danish, Swedish, and Dutch) and some Romance languages (Italian, Spanish, and Portuguese). This is due to the similar structure of these languages associated with their common ancestry. If more than one language had the same probability (taking into account the rounding error of floating-point numbers), the algorithm chose the first one in alphabetical order, hence the lower accuracy of Danish compared to Dutch, and Dutch compared to Swedish. Similarly for Romance languages. Ultimately, this algorithm has the lowest overall accuracy of the tested trio, at around 89% (Tab. 3 and Fig. <ref type="figure" target="#fig_4">1c</ref>). However, this result exceeded our initial expectations for the algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">KNN algoritms</head><p>The first test we conducted for the KNN algorithm was to assess its effectiveness for different values of k ranging from 1 to 9. As shown in Tab. 4, the algorithm exhibited different effectiveness across the different values of k. Therefore, we choose k=9, for the algorithm from the library and k=10 for our algorithm. As we can also observe, for small values of k, our algorithm has higher effectiveness, which may be related to the use of a weighting table. As k increases, the difference in effectiveness decreases, until eventually, the algorithm from the library starts to exhibit greater effectiveness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">KNN without library</head><p>To shorten the execution time of the algorithm and increase its effectiveness from around 60% using the Euclidean metric, we decided to calculate the distance as the dot product of vectors. This allowed us to save some time and increase the effectiveness to 90%. The results</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Algorithm 1 : 5 𝑉Algorithm 2 : 6 Append</head><label>1526</label><figDesc>Method 'OwnMNB.fit' training the algorithm Data: sets x_train and y_train Result: None 1 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 := set of values of 𝑦_𝑡𝑟𝑎𝑖𝑛; 2 𝑡𝑜𝑘𝑒𝑛𝑠 := empty dictionary; 3 foreach 𝑐 ∈ 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 do 4 𝑥 𝑐 := 𝑥_𝑡𝑟𝑎𝑖𝑛 ∈ 𝑐; _𝑆𝑢𝑚 := sum of vectors in 𝑥 𝑐 ; 6 𝑡𝑜𝑘𝑒𝑛𝑠[𝑐] := VSum / length of 𝑥 𝑐 ; Method 'OwnMNB.predict' performing calculations Data: 𝑥_𝑡𝑒𝑠𝑡 Result: list 𝑦_𝑝𝑟𝑒𝑑 1 𝑦_𝑝𝑟𝑒𝑑 := empty list; 2 foreach 𝑥_𝑟𝑜𝑤 ∈ 𝑥_𝑡𝑒𝑠𝑡 do 3 𝑝𝑟𝑜𝑏 := empty dictionary; 4 foreach 𝑐 ∈ 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 do 5 𝑝𝑟𝑎𝑤𝑑 := vector sum of 𝑥 * 𝑡𝑜𝑘𝑒𝑛𝑠[𝑐] ; 𝑝𝑟𝑎𝑤𝑑 to 𝑝𝑟𝑜𝑏[𝑐]; 7 Append to 𝑦_𝑝𝑟𝑒𝑑 class with the biggest value from dictionary 𝑝𝑟𝑜𝑏;</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>where: 𝜃 𝑦𝑗 is the probability P(𝑥 𝑗 | 𝑦) for 𝑥 𝑗 contained in the same class 𝑦 𝑗 . 𝑃 (𝑥 , 𝑗𝑖−1 𝑥 , 𝑗𝑖 | 𝑦) is the probability of the occurrence of letter 𝑥 𝑖 after 𝑥 𝑖−1 in class 𝑦 𝑗 . 𝑛 is the number of letters in the considered text sequence.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Algorithm 3 :</head><label>3</label><figDesc>Method 'LetterProb.fit' training the algorithm Data: Sets 𝑥_𝑡𝑟𝑎𝑖𝑛 and 𝑦_𝑡𝑟𝑎𝑖𝑛 Result: None 1 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠 := empty dictionary; 2 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 := set of values of 𝑦_𝑡𝑟𝑎𝑖𝑛; 3 foreach 𝑐 ∈ 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 do := 𝑥_𝑡𝑟𝑎𝑖𝑛 ∈ 𝑐; 𝑙𝑒𝑡𝑡𝑒𝑟𝐶𝑜𝑢𝑛𝑡 := 0; 𝑙𝑎𝑠𝑡 = ' '; 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠[𝑐] := empty dictionary; foreach 𝑟𝑜𝑤 ∈ 𝑥 𝑐 do foreach 𝑙𝑒𝑡𝑡𝑒𝑟 ∈ 𝑟𝑜𝑤 do 𝑙𝑒𝑡𝑡𝑒𝑟𝐶𝑜𝑢𝑛𝑡+ = 1; if 𝑙𝑎𝑠𝑡 + 𝑙𝑒𝑡𝑡𝑒𝑟 ∈ 𝑑𝑖 𝑐 𝑡 𝑖 𝑜𝑛𝑎𝑟 𝑖 𝑒 𝑠[𝑐] then 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠[𝑐][𝑙𝑎𝑠𝑡 + 𝑙 𝑒 𝑡 𝑡 𝑒 𝑟]+ = 1; end else 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠[𝑐][ 𝑙𝑎𝑠𝑡 + 𝑙 𝑒 𝑡 𝑡 𝑒 𝑟] := 1; end 𝑙𝑎𝑠𝑡 := 𝑙𝑒𝑡𝑡𝑒𝑟; end end foreach 𝑧 ∈ Keys 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠[𝑐] do 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠[𝑐][𝑧] = 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠[𝑐][𝑧]/𝑙𝑒𝑡𝑡𝑒𝑟𝐶𝑜𝑢𝑛𝑡; end</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) The effectiveness results for the Bayesian algorithm from the sklearn library (c) The effectiveness results for a customwritten Bayesian algorithm for letter proximity (b) The effectiveness results for the selfimplemented Bayesian algorithm (d) The effectiveness results for our k-nearest neighbors (kNN) (e) The effectiveness results for k-nearest neigh-bors (kNN) from the library</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Comparison of effectiveness results for different algorithms</figDesc><graphic coords="6,205.35,483.10,185.45,160.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>The effectiveness results for the Bayesian algorithm from the sklearn library</figDesc><table><row><cell></cell><cell cols="4">precision recall f1-score support</cell></row><row><cell>Arabic</cell><cell>1.0</cell><cell>0.97</cell><cell>0.98</cell><cell>774.0</cell></row><row><cell>Danish</cell><cell>0.99</cell><cell>0.94</cell><cell>0.97</cell><cell>621.0</cell></row><row><cell>Dutch</cell><cell>1.0</cell><cell>0.97</cell><cell>0.99</cell><cell>832.0</cell></row><row><cell>English</cell><cell>0.89</cell><cell>1.0</cell><cell>0.94</cell><cell>2131.0</cell></row><row><cell>French</cell><cell>0.98</cell><cell>0.99</cell><cell>0.98</cell><cell>1503.0</cell></row><row><cell>German</cell><cell>1.0</cell><cell>0.98</cell><cell>0.99</cell><cell>692.0</cell></row><row><cell>Greek</cell><cell>1.0</cell><cell>0.99</cell><cell>0.99</cell><cell>556.0</cell></row><row><cell>Hindi</cell><cell>1.0</cell><cell>0.97</cell><cell>0.99</cell><cell>113.0</cell></row><row><cell>Italian</cell><cell>1.0</cell><cell>0.98</cell><cell>0.99</cell><cell>1052.0</cell></row><row><cell>Kannada</cell><cell>1.0</cell><cell>0.96</cell><cell>0.98</cell><cell>551.0</cell></row><row><cell>Malayalam</cell><cell>0.99</cell><cell>0.98</cell><cell>0.99</cell><cell>881.0</cell></row><row><cell>Portuguese</cell><cell>0.99</cell><cell>0.99</cell><cell>0.99</cell><cell>1078.0</cell></row><row><cell>Russian</cell><cell>1.0</cell><cell>0.97</cell><cell>0.98</cell><cell>1054.0</cell></row><row><cell>Spanish</cell><cell>0.99</cell><cell>0.98</cell><cell>0.98</cell><cell>1248.0</cell></row><row><cell>Sweedish</cell><cell>0.99</cell><cell>0.98</cell><cell>0.98</cell><cell>1016.0</cell></row><row><cell>Tamil</cell><cell>1.0</cell><cell>0.98</cell><cell>0.99</cell><cell>670.0</cell></row><row><cell>Turkish</cell><cell>1.0</cell><cell>0.92</cell><cell>0.96</cell><cell>738.0</cell></row><row><cell>accuracy</cell><cell></cell><cell></cell><cell>0.98</cell><cell>15510.0</cell></row><row><cell>macro avg</cell><cell>0.99</cell><cell>0.97</cell><cell>0.98</cell><cell>15510.0</cell></row><row><cell>weighted avg</cell><cell>0.98</cell><cell>0.98</cell><cell>0.98</cell><cell>15510.0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>The effectiveness results for the self-implemented Bayesian algorithm</figDesc><table><row><cell></cell><cell cols="4">precision recall f1-score support</cell></row><row><cell>Arabic</cell><cell>0.79</cell><cell>1.0</cell><cell>0.88</cell><cell>774.0</cell></row><row><cell>Danish</cell><cell>0.8</cell><cell>0.91</cell><cell>0.85</cell><cell>621.0</cell></row><row><cell>Dutch</cell><cell>0.91</cell><cell>0.84</cell><cell>0.87</cell><cell>832.0</cell></row><row><cell>English</cell><cell>0.97</cell><cell>0.98</cell><cell>0.98</cell><cell>2131.0</cell></row><row><cell>French</cell><cell>0.96</cell><cell>0.9</cell><cell>0.93</cell><cell>1503.0</cell></row><row><cell>German</cell><cell>0.99</cell><cell>0.88</cell><cell>0.93</cell><cell>692.0</cell></row><row><cell>Greek</cell><cell>1.0</cell><cell>0.99</cell><cell>0.99</cell><cell>556.0</cell></row><row><cell>Hindi</cell><cell>0.73</cell><cell>0.98</cell><cell>0.84</cell><cell>113.0</cell></row><row><cell>Italian</cell><cell>0.99</cell><cell>0.95</cell><cell>0.97</cell><cell>1052.0</cell></row><row><cell>Kannada</cell><cell>1.0</cell><cell>0.96</cell><cell>0.98</cell><cell>551.0</cell></row><row><cell>Malayalam</cell><cell>1.0</cell><cell>0.98</cell><cell>0.99</cell><cell>881.0</cell></row><row><cell>Portugeese</cell><cell>0.97</cell><cell>0.91</cell><cell>0.94</cell><cell>1078.0</cell></row><row><cell>Russian</cell><cell>1.0</cell><cell>0.93</cell><cell>0.96</cell><cell>1054.0</cell></row><row><cell>Spanish</cell><cell>0.73</cell><cell>0.95</cell><cell>0.83</cell><cell>1248.0</cell></row><row><cell>Sweedish</cell><cell>0.98</cell><cell>0.88</cell><cell>0.93</cell><cell>1016.0</cell></row><row><cell>Tamil</cell><cell>1.0</cell><cell>0.98</cell><cell>0.99</cell><cell>670.0</cell></row><row><cell>Turkish</cell><cell>0.99</cell><cell>0.8</cell><cell>0.89</cell><cell>738.0</cell></row><row><cell>accuracy</cell><cell></cell><cell></cell><cell>0.93</cell><cell>15510.0</cell></row><row><cell>macro avg</cell><cell>0.93</cell><cell>0.93</cell><cell>0.93</cell><cell>15510.0</cell></row><row><cell>weighted avg</cell><cell>0.94</cell><cell>0.93</cell><cell>0.93</cell><cell>15510.0</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>indicate a strong performance of the algorithm across multiple languages. High precision and recall in languages like Arabic, Greek, Kannada, and Tamil show that the algorithm is particularly effective for these languages, achieving near-perfect scores. However, there are areas for improvement, notably in Spanish, which has a lower precision (0.61) and F1 score (0.72), indicating potential difficulties in accurately classifying this language (Tab. 5 and Fig. <ref type="figure">1d</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">KNN with library</head><p>The algorithm from the library shows similar results for individual languages. Some of them achieved higher scores, while others had lower ones. However, the overall accuracy remained unchanged at 90%. The Spanish language, which our algorithm struggled with, still has a much weaker performance compared to the rest, but this result has slightly improved (Tab. 6 and Fig. <ref type="figure">1e</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>Based on our results, the Bayes algorithm from the sklearn library performs the best, achieving 98% accuracy. Our version of this algorithm ranks second with 93% accuracy. However, both KNNbased algorithms and our Bayes classifier based on letter pair probabilities performed the worst among all, still achieving relatively high scores of 90% and 89% accuracy, respectively. Although KNN algorithms handle language classification tasks well, their use in this form is not optimal in terms of both time or memory efficiency. Achieving results similar to our Bayes algorithms, they require almost two orders of magnitude more time. Similarly, in the case of the computational resources of the test platform, difference between both types of algorithms is significant. The KNN classifier from the library performs calculations faster than the one we created, thanks to the use of multi-threaded processing, while our KNN classifier performs calculations using only a single CPU core. However, this impacts memory usage. During tests, the KNN from the library used over 9.5GB of available RAM on the test platform, while our KNN algorithm required approximately 5GB of memory. In contrast, the Bayes algorithms did not require more than 1GB of RAM and, despite running on a single CPU thread, did not fully load it. None of our developed algorithms came up close to 100%. One of the possible future improvements would be to combine together both our Bayes classifiers, to eliminate their separate weak points. An algorithm created this way would be much closer to 100% accuracy with only slightly lower time efficiency.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Sign language recognition system for communicating to people with disabilities</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Obi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Claudio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">M</forename><surname>Budiman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Achmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kurniawan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procedia Computer Science</title>
		<imprint>
			<biblScope unit="volume">216</biblScope>
			<biblScope unit="page" from="13" to="20" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Developing named entity recognition algorithms for uzbek: Dataset insights and implementation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mengliev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Barakhnin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Abdurakhmonova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Eshkulov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data in Brief</title>
		<imprint>
			<biblScope unit="page">110413</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Recognition of american sign language gestures in a virtual reality using leap motion</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaitkevičius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Taroza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Blažauskas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Damaševičius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Maskeliu¯nas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Woźniak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">445</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Child tracking and prediction of violence on children in social media using natural language processing and machine learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nallakaruppan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">R</forename><surname>Gadekallu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Krishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Polap</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Soft Computing</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="560" to="569" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Sustainable marketing and the role of social media: an experimental study using natural language processing (nlp)</title>
		<author>
			<persName><forename type="first">G</forename><surname>Dash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sharma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sustainability</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page">5443</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An analysis of bayesian classifiers</title>
		<author>
			<persName><forename type="first">P</forename><surname>Langley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Iba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Thompson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Aaai</title>
		<imprint>
			<biblScope unit="volume">90</biblScope>
			<biblScope unit="page" from="223" to="228" />
			<date type="published" when="1992">1992</date>
			<publisher>Citeseer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Grey wolf optimizer combined with k-nn algorithm for clustering problem</title>
		<author>
			<persName><forename type="first">K</forename><surname>Prokop</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IVUS 2022: 27th International Conference on Information Technology</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Knn model-based approach in classification</title>
		<author>
			<persName><forename type="first">G</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Greer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003</title>
				<meeting><address><addrLine>Catania, Sicily, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2003">November 3-7, 2003. 2003</date>
			<biblScope unit="page" from="986" to="996" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
