<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>implemented into machine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Żaneta Pawelec</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44-100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Language recognition algorithms play a pivotal role in various domains, offering applications ranging from automatically detecting the language of textual data to powering multilingual customer support systems. As the foundation of modern technologies like Artificial Intelligence, these algorithms enable content localization, facilitate language translation services, and drive personalized marketing strategies by analyzing linguistic patterns in customer feedback and social media interactions. This project compares five machine learning algorithms for language recognition, focusing on Bayesian classifiers and K-Nearest Neighbors (KNN). Through experimentation with different variations of these algorithms, including custom implementations, the project evaluates their effectiveness in recognizing 17 foreign languages. Methodologically, the project explores the nuances of each algorithm, discussing their underlying principles and implementation details. Experimental results reveal insights into the performance of each algorithm, providing valuable considerations for practical applications. Additionally, the project discusses the significance of precision, recall, F1-score, and accuracy metrics in assessing algorithm performance. Overall, this study contributes to advancing language recognition technology, offering valuable insights into algorithmic approaches and their real-world implications.</p>
      </abstract>
      <kwd-group>
        <kwd>language recognition</kwd>
        <kwd>knn</kwd>
        <kwd>clustering</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>Bayesian classifier</kwd>
        <kwd>K-Nearest Neighbors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Our program compares five machine learning algorithms. Each algorithm is evaluated on its
effectiveness in recognizing 17 foreign languages, using different variations of the Bayesian
classifier [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and the K-Nearest Neighbours classifier [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The calculations are based on sentences of varying length
retrieved from a database.
      </p>
      <p>To take a closer look at the applied classifiers, the following paragraphs briefly describe
them and illustrate how these calculation methods differ from each other.</p>
      <p>The Naive Bayes classifier is a probabilistic machine learning model based on Bayes’ theorem,
which calculates the probability of a certain class given a set of features. It assumes that the
features are conditionally independent, hence "naive." It’s widely used for classification tasks,
especially in text classification and spam filtering.</p>
      <p>K-Nearest Neighbors (KNN) is a non-parametric supervised learning algorithm used for
classification and regression tasks. In KNN, the class of a new data point is determined by the
majority class among its k nearest neighbors in the feature space. It’s simple to implement and
understand but can be computationally expensive for large datasets (like the one we are using), as
it requires storing all training data and computing distances for each prediction.</p>
      <p>Both algorithms have different time requirements, with KNN being more computationally
expensive due to its need to calculate distances for each prediction. Let us now briefly explain
each of the applied algorithms and the reasoning behind their selection. The first classifier is
the Bayesian classifier from the library, which provides the most effective results and thus serves
as the benchmark that we tried to match with the other algorithms. Next, we independently created
a second Bayesian classifier aiming to mimic the version from the library. The third classifier is
also a modified Bayesian classifier, which determines the language by the probability of neighboring
letters. In designing this algorithm, we assumed that each language has recurring sequences of
letters, so a given sentence can be assigned to the language in which its letter sequences most
commonly occur. We derived an appropriate formula that allowed us to implement this idea in the
program. The fourth classifier is the K-Nearest Neighbours classifier from the library, but with
different distance calculation methods, which we adjusted to our specific database. The fifth
classifier is again the K-Nearest Neighbours algorithm, this time written by us. It was created
following open-access models with the intent of achieving accuracy as high as that of the imported
KNN classifier; reaching a satisfying outcome required many adjustments to the distance calculation
method. After performing the calculations, each algorithm displays a table with the results of its
effectiveness in identifying each language.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>Data from the set is divided into subsets X, containing texts in various languages, and Y,
containing the language classes of the texts from set X. The initial two Bayes classifiers and
both KNN algorithms operate on a dataset converted into a matrix of token counts using the
CountVectorizer class from the sklearn library. Each text sequence from set X is represented by a
one-dimensional vector whose length equals the size of the dictionary containing all the words from
the dataset: the words occurring in the sequence are represented by the number of their occurrences
at the appropriate position, and the remaining positions are filled with zeros.</p>
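      <p>As an illustration, a minimal sketch of this vectorization step; the variable names texts and labels are placeholders for the loaded sets X and Y, not names taken from our program:</p>
      <preformat>
# A minimal sketch of the vectorization step, assuming the dataset has been
# loaded into the lists `texts` (set X) and `labels` (set Y).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(texts)  # one row per text, one column per dictionary word
print(x_counts.shape)                       # (number of texts, size of the dictionary)
      </preformat>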
      <p>First, we used the MultinomialNB class contained in the sklearn library. For calculations, it
uses the formula:</p>
      <disp-formula><tex-math>\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}</tex-math></disp-formula>
      <p>where θ_yi is the probability P(x_i | y) of feature x_i appearing in a sample belonging to
class y; N_yi is the count of occurrences of feature x_i in class y in the training set, while N_y
is the total count of all features in class y; α is the smoothing prior, which in this case is
Laplace smoothing, α = 1; and n is the number of features, i.e. the size of the dictionary.</p>
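      <p>A minimal sketch of how this classifier can be applied, assuming the token-count matrix x_counts and the labels from the previous step:</p>
      <preformat>
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# The 70:30 split described in Section 3.
x_train, x_test, y_train, y_test = train_test_split(x_counts, labels, test_size=0.3)

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to Laplace smoothing
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
      </preformat>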
      <p>Next, we attempted to replicate the function contained in the library, aiming to obtain similar
results. However, in our version of the algorithm, we did not consider the smoothing parameter.</p>
      <p>Algorithm 1: Method ’OwnMNB.fit’ training the algorithm</p>
      <preformat>
Data: sets x_train and y_train
Result: None
C := set of values of y_train;
P := empty dictionary;
foreach c ∈ C do
    V := vectors of x_train belonging to class c;
    VSum := sum of vectors in V;
    P[c] := VSum / length of V;
      </preformat>
      <p>Algorithm 2: Method ’OwnMNB.predict’ performing calculations</p>
      <preformat>
Data: set x_test
Result: list y_pred
y_pred := empty list;
foreach x ∈ x_test do
    D := empty dictionary;
    foreach c ∈ C do
        s := sum of the vector x * P[c];
        D[c] := s;
    append to y_pred the class with the biggest value in dictionary D;
      </preformat>
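      <p>A possible Python realization of Algorithms 1 and 2 could look as follows; this is a sketch under the assumption that x_train and x_test are dense NumPy arrays (e.g. x_counts.toarray()), not the exact code of our program:</p>
      <preformat>
import numpy as np

class OwnMNB:
    """Sketch of the self-implemented Bayes classifier (no smoothing)."""

    def fit(self, x_train, y_train):
        y = np.asarray(y_train)
        self.classes = sorted(set(y_train))
        self.p = {}                              # per-class mean count vectors
        for c in self.classes:
            v = x_train[y == c]                  # vectors of x_train belonging to class c
            self.p[c] = v.sum(axis=0) / len(v)   # VSum / length of V

    def predict(self, x_test):
        y_pred = []
        for x in x_test:
            d = {c: float(np.dot(x, self.p[c])) for c in self.classes}
            y_pred.append(max(d, key=d.get))     # class with the biggest value in D
        return y_pred
      </preformat>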
      <p>In the third Bayes classifier, we changed the approach to the dataset. We utilized
dependencies specific to the construction of each language - the probability of one letter
occurring after another. The formula in this case takes the form:</p>
      <disp-formula><tex-math>P(c \mid s) = \prod_{i=2}^{n} P(l_i \mid l_{i-1}, c)</tex-math></disp-formula>
      <p>where P(c | s) is the probability of class c for the text sequence s; P(l_i | l_{i-1}, c) is
the probability of the occurrence of letter l_i after l_{i-1} in class c; and n is the number of
letters in the considered text sequence.</p>
      <p>This time, the methods are given the raw training sets X and Y and a test set X. The ’fit’ method
is responsible for creating ’neighborhood tables’ of all the letters present in the training set X,
divided by language classes. These tables contain the probabilities of the occurrence of a given
pair of letters one after the other. The ’predict’ method determines, for each element of the test
set, its membership in a class based on the probabilities from the ’neighborhood tables’.</p>
      <p>Algorithm 3: Method ’LetterProb.fit’ training the algorithm</p>
      <preformat>
Data: sets x_train and y_train
Result: None
foreach text s of class c in the training set do
    foreach pair of adjacent letters a, b in s do
        if a + b ∈ T[c] then
            T[c][a + b] += 1;
        else
            T[c][a + b] := 1;
foreach c do
    foreach k ∈ Keys(T[c]) do
        T[c][k] := T[c][k] / N;    // N: number of letter pairs counted in class c
      </preformat>
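      <p>A minimal sketch of how such a ’neighborhood table’ can be built and used; the handling of letter pairs unseen in training (the floor constant) is our assumption for illustration, as the pseudocode above does not specify it:</p>
      <preformat>
from collections import defaultdict

class LetterProb:
    """Sketch of the letter-proximity Bayes classifier."""

    def fit(self, x_train, y_train):
        counts = defaultdict(lambda: defaultdict(int))  # counts[c][pair]
        totals = defaultdict(int)                       # letter pairs seen per class
        for text, c in zip(x_train, y_train):
            for a, b in zip(text, text[1:]):            # adjacent letter pairs
                counts[c][a + b] += 1
                totals[c] += 1
        self.t = {c: {pair: n / totals[c] for pair, n in pairs.items()}
                  for c, pairs in counts.items()}       # pair probabilities per class

    def predict(self, x_test, floor=1e-10):
        y_pred = []
        for text in x_test:
            best_c, best_p = None, -1.0
            for c in sorted(self.t):                    # alphabetical tie-breaking
                p = 1.0
                for a, b in zip(text, text[1:]):
                    p *= self.t[c].get(a + b, floor)    # small floor for unseen pairs
                if p > best_p:
                    best_c, best_p = c, p
            y_pred.append(best_c)
        return y_pred
      </preformat>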
      <p>In K-Nearest Neighbors from the library, we use the scalar (dot) product of vectors to
calculate distances. We multiply this value by -1, so that the most similar vectors have the
smallest distance and there is no need to compute the k farthest neighbors instead:</p>
      <disp-formula><tex-math>d(a, b) = -\, a \cdot b = -\sum_{i} a_i b_i</tex-math></disp-formula>
      <p>where a, b are vectors.</p>
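      <p>A sketch of how such a metric can be passed to the library classifier, assuming x_train and x_test are the sparse count matrices from the earlier split; a callable metric requires the brute-force mode, and whether negative "distances" are accepted may depend on the sklearn version, so this is an illustration rather than our exact setup:</p>
      <preformat>
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def neg_dot(a, b):
    return -np.dot(a, b)       # larger dot product means a smaller distance

# Callable metrics work only with the brute-force algorithm and dense input.
knn = KNeighborsClassifier(n_neighbors=9, algorithm='brute', metric=neg_dot)
knn.fit(x_train.toarray(), y_train)
y_pred = knn.predict(x_test.toarray())
      </preformat>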
      <p>In our k-NN, we used the same formula for calculating distances as in the library algorithm,
but additionally we incorporated weighted computation of the k nearest neighbors.</p>
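      <p>A sketch of this weighted variant, assuming dense NumPy arrays; the rank-based weighting table (weights k, k-1, ..., 1) is our assumption for illustration, as the exact weights are not given above:</p>
      <preformat>
import numpy as np

def own_knn_predict(x_train, y_train, x_test, k=10):
    """Sketch of the self-written kNN with weighted voting."""
    y = np.asarray(y_train)
    y_pred = []
    for x in x_test:
        dists = -(x_train @ x)                 # negated dot products, as above
        nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
        votes = {}
        for rank, idx in enumerate(nearest):
            w = k - rank                       # assumed weighting table
            votes[y[idx]] = votes.get(y[idx], 0) + w
        y_pred.append(max(votes, key=votes.get))
    return y_pred
      </preformat>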
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>To compare the performance parameters of the applied algorithms, we utilized the metrics
module from the sklearn library. To improve the reliability of the results, each algorithm was
executed 10 times, and the final value is the average of all trials. The dataset, containing texts
in 17 languages with a total of 10,337 records, was divided into training and testing sets in a
70:30 ratio. For each algorithm, we compared the following parameters (a sketch of the computation
follows this list):
• precision - the ratio of correctly predicted elements of a class to all elements marked as
that class, TP / (TP + FP)
• recall - a measure of how many elements from a given class were correctly recognized,
TP / (TP + FN)
• f1-score - the harmonic mean of precision and recall
• support - the number of occurrences of each class in the dataset
• accuracy - the ratio of correctly classified samples to all cases in the test set
Meaning of labels:
• TP - true positive - cases that were correctly classified as positive by the classifier
• TN - true negative - cases that were correctly classified as negative by the classifier
• FP - false positive - an error where the test result incorrectly indicates the presence of a
condition when it is not present
• FN - false negative - an error where the test result incorrectly indicates the absence of a
condition when it is actually present</p>
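      <p>A minimal sketch of how these values are obtained with the metrics module, assuming y_test holds the true classes and y_pred the predictions of one of the five classifiers:</p>
      <preformat>
from sklearn import metrics

# Per-class precision, recall, f1-score and support in one table.
print(metrics.classification_report(y_test, y_pred))
# Overall ratio of correctly classified samples.
print(metrics.accuracy_score(y_test, y_pred))
      </preformat>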
      <sec id="sec-3-1">
        <title>3.1. The Bayesian algorithm from the sklearn library</title>
        <p>Analyzing the results shown in the table, we can observe that the algorithm matches most
languages with an accuracy ranging from 98% to 100% (see Tab. 1). The exception is English, with an
accuracy of only 89%, which may be due to the fact that many English words are borrowed from other
languages. Over the entire dataset, the method has an accuracy of 98%, making it the most accurate
of all the solutions we have used (Fig. 1a).</p>
        <p>Figure 1: The effectiveness results for (a) the Bayesian algorithm from the sklearn library,
(b) the self-implemented Bayesian algorithm, (c) the custom-written Bayesian algorithm for letter
proximity, (d) our k-nearest neighbors (kNN), and (e) k-nearest neighbors (kNN) from the library.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Self-implemented Bayesian algorithm</title>
        <p>During the construction of this algorithm, our goal was to achieve results similar to the
algorithm from the sklearn library. As observed, our algorithm performs worse with languages that
use specific alphabets (e.g., Arabic, Hindi) and struggles more with recognizing languages belonging
to the same family, due to similarities in words stemming from the shared ancestry of these
languages. This is particularly evident in the Germanic languages (Dutch - German, Danish - Swedish)
and the Romance languages (Spanish, French, and Portuguese). The issues with languages using
specific alphabets, and the overall decrease in accuracy for the other languages, result from the
lack of a smoothing parameter in the computational algorithm. Ultimately, we achieved an algorithm
accuracy of approximately 93%. It’s the slowest among all the algorithms but has average accuracy
(see Tab. 2 and Fig. 1b).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Custom Bayesian algorithm for letter proximity</title>
        <p>The algorithm, thanks to a completely different approach to the dataset, achieved results
different from the rest. As the measurements show, unlike the previous one, it performs best with
languages using specific alphabets. However, it struggles more with languages belonging to the same
families, for example the Germanic languages (Danish, Swedish, and Dutch) and some Romance languages
(Italian, Spanish, and Portuguese). This is due to the similar structure of these languages,
associated with their common ancestry. If more than one language had the same probability (taking
into account the rounding error of floating-point numbers), the algorithm chose the first one in
alphabetical order, hence the lower accuracy of Danish compared to Dutch, and of Dutch compared to
Swedish; similarly for the Romance languages. Ultimately, this algorithm has the lowest overall
accuracy of the tested trio, at around 89% (Tab. 3 and Fig. 1c). However, this result exceeded our
initial expectations for the algorithm.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. KNN algorithms</title>
        <p>The first test we conducted for the KNN algorithm was to assess its effectiveness for different
values of k ranging from 1 to 9. As shown in Tab. 4, the algorithm exhibited different effectiveness
across the different values of k. Therefore, we chose k=9 for the algorithm from the library
and k=10 for our own algorithm. As we can also observe, for small values of k, our algorithm has
higher effectiveness, which may be related to the use of a weighting table. As k increases, the
difference in effectiveness decreases, until eventually the algorithm from the library starts to
exhibit greater effectiveness.</p>
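        <p>A sketch of such a sweep over k for the library classifier; our own algorithm can be substituted in the same loop:</p>
        <preformat>
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# Evaluate the library kNN for k = 1..9, as in Tab. 4.
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, knn.predict(x_test))
    print(f"k={k}: accuracy={acc:.3f}")
        </preformat>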
      </sec>
      <sec id="sec-3-5">
        <title>3.5. KNN without library</title>
        <p>To shorten the execution time of the algorithm and increase its effectiveness from around
60% using the Euclidean metric, we decided to calculate the distance as the dot product of
vectors. This allowed us to save some time and increase the effectiveness to 90%. The results
indicate a strong performance of the algorithm across multiple languages. High precision
and recall in languages like Arabic, Greek, Kannada, and Tamil show that the algorithm is
particularly effective for these languages, achieving near-perfect scores. However, there are
areas for improvement, notably in Spanish, which has a lower precision (0.61) and F1 score
(0.72), indicating potential difficulties in accurately classifying this language (Tab. 5 and Fig. 1d).</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. KNN with library</title>
        <p>The algorithm from the library shows similar results for individual languages. Some of them
achieved higher scores, while others had lower ones. However, the overall accuracy remained
unchanged at 90%. The Spanish language, which our algorithm struggled with, still has a much
weaker performance compared to the rest, but this result has slightly improved (Tab. 6 and Fig. 1e).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Based on our results, the Bayes algorithm from the sklearn library performs the best, achieving
98% accuracy. Our version of this algorithm ranks second with 93% accuracy. Both KNN-based
algorithms and our Bayes classifier based on letter-pair probabilities performed the worst, though
still achieving relatively high scores of 90% and 89% accuracy, respectively. Although the KNN
algorithms handle language classification tasks well, their use in this form is not optimal in
terms of either time or memory efficiency. To achieve results similar to our Bayes algorithms, they
require almost two orders of magnitude more time. The difference between the two types of
algorithms is similarly significant in terms of the computational resources of the test platform.
The KNN classifier from the library performs calculations faster than the one we created, thanks
to its use of multi-threaded processing, while our KNN classifier performs calculations using only
a single CPU core. However, this impacts memory usage: during tests, the KNN from the library used
over 9.5 GB of the available RAM on the test platform, while our KNN algorithm required
approximately 5 GB of memory. In contrast, the Bayes algorithms did not require more than 1 GB of
RAM and, despite running on a single CPU thread, did not fully load it. None of our developed
algorithms came close to 100%. One possible future improvement would be to combine our two Bayes
classifiers to eliminate their separate weak points. An algorithm created this way would come much
closer to 100% accuracy with only slightly lower time efficiency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Obi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Claudio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Budiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Achmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurniawan</surname>
          </string-name>
          ,
          <article-title>Sign language recognition system for communicating to people with disabilities</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>216</volume>
          (
          <year>2023</year>
          )
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mengliev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Barakhnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdurakhmonova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eshkulov</surname>
          </string-name>
          ,
          <article-title>Developing named entity recognition algorithms for uzbek: Dataset insights and implementation</article-title>
          ,
          <source>Data in Brief</source>
          (
          <year>2024</year>
          )
          <fpage>110413</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaitkevičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taroza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Blažauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Damaševičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Maskeliūnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>Recognition of american sign language gestures in a virtual reality using leap motion</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>9</volume>
          (
          <year>2019</year>
          )
          <fpage>445</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nallakaruppan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Gadekallu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polap</surname>
          </string-name>
          ,
          <article-title>Child tracking and prediction of violence on children in social media using natural language processing and machine learning</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence and Soft Computing</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>560</fpage>
          -
          <lpage>569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Sustainable marketing and the role of social media: an experimental study using natural language processing (NLP)</article-title>
          ,
          <source>Sustainability</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>5443</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Langley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Iba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thompson</surname>
          </string-name>
          , et al.,
          <article-title>An analysis of bayesian classifiers</article-title>
          ,
          <source>in: AAAI</source>
          , volume
          <volume>90</volume>
          ,
          <publisher-name>Citeseer</publisher-name>
          ,
          <year>1992</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Prokop</surname>
          </string-name>
          ,
          <article-title>Grey wolf optimizer combined with k-nn algorithm for clustering problem</article-title>
          ,
          <source>in: IVUS 2022: 27th International Conference on Information Technology</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Greer</surname>
          </string-name>
          ,
          <article-title>Knn model-based approach in classification</article-title>
          ,
          <source>in: On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003, Proceedings</source>
          , Springer,
          <year>2003</year>
          , pp.
          <fpage>986</fpage>
          -
          <lpage>996</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>