Development of a Chatbot Using Machine Learning Algorithms
 to Automate Educational Processes
 Dmitry Alekseev 1, Polina Shagalova 1 and Eleonora Sokolova 11
 1
     NNSTU n.a. R. E. Alekseev, Minina str., 24 Nizhny Novgorod, 603950, Russia

                 Abstract
                 The use of chatbots in educational processes is relevant, where point communication with each
                 student on common issues is required. A chatbot with artificial intelligence has been developed
                 to automate educational processes. The cross-platform Telegram messenger is used to interact
                 with the user. To increase the efficiency of creating a dataset, a graphical application interface
                 in Python has been developed. Using libraries for creating graphical interfaces based on the
                 Qt5 platform allows you to quickly navigate the intents, requests, responses that are already in
                 the dataset. At the stage of developing the model structure, various vectorizers with different
                 parameters were tested. To determine the intentions of users, a machine learning model was
                 developed and implemented. The accuracy of the classification of user requests after training
                 the model was 97%. An additionally developed algorithm based on the Levenshtein distance
                 increased the classification accuracy. If the user's intent is not defined, a “stub” is triggered: “I
                 did not understand the meaning of your question. Please rephrase it.” Besides, the chatbot
                 implements voice message recognition. As a result of the chatbot's interaction with users,
                 statistics on requests are collected and all events occurring in the program are recorded. All
                 information is presented graphically. After authentication, the user gets access to all statistics
                 and can send messages on behalf of the bot, so the teacher can give a detailed answer. The
                 architecture of the chatbot model allows it to be used on datasets of any educational process.

                 Keywords
                 Chatbot, machine learning, natural language processing, graphical interfaces, educational
                 process

1. Introduction

   Currently, artificial intelligence systems are developing, where one of the directions in the
development of machine programs, chatbots that have artificial intelligence and interact with many
users. The use of these technologies has great potential in the field of education, for example, in
processes where the teacher spends a lot of time consulting students on typical issues. Chatbot
development technologies are used to develop assistants for entering a higher educational institution, in
online courses, in organizing students ' time, etc. [1]. At the same time, chatbots perform, as a rule,
elementary functions, reducing time and routine work.
   The paper presents a chatbot with artificial intelligence, trained on the developed dataset, which
allows automating the process of passing the norm control by students-answers questions, sends the
necessary documents to fill out, can connect a teacher to send more answers-consultations, provides
statistics, and registers all events that occur. The chatbot selects the answer based on a given list of
possible answers, using ranking technology.

GraphiCon 2021: 31st International Conference on Computer Graphics and Vision, September 27-30, 2021, Nizhny Novgorod, Russia
EMAIL: ada4667@yandex.ru (D. Alekseev); polli-shagalova@yandex.ru (P. Shagalova); essokolowa@gmail.com (E. Sokolova)
ORCID: 0000-0001-7826-483X (D. Alekseev); 0000-0002-6676-4228 (P. Shagalova); 0000-0003-0860-2463 (E. Sokolova)
              ©️ 2021 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)
2. Creating a dataset

   For the process of passing the norm control, a dataset has been developed that represents a set of
intentions of users – intents. Each intent includes examples of questions that users can ask, and the
chatbot's answers to the questions asked (Figure 1). The implementation of the dataset includes 18
intents, the chatbot's questions, and answers were compiled on the basis of regulatory documents of the
NNSTU n.a. R.E. Alekseev [2][3][4] for the implementation of the WRC. Each intent includes up to
20-30 questions.


Figure 1: Example of an intent

   Experience in the field of natural language processing has shown that it is inconvenient and
inefficient to navigate a dataset and fill it out in a text editor. To increase the efficiency of work, a
special application was created – a text editor. Its graphical interface allows you to quickly navigate in
intents, requests and responses, that are already in the dataset (Figure 2).
   To create the application, PyQt5 was used – a set of Python libraries for creating graphical interfaces
based on the Qt5 platform [5]. The created dataset was used for training the model and the operation of
an algorithm based on the Levenshtein distance.
Figure 2: Application for parsing a dataset

3. Developing a machine learning model

   To determine the intentions of users, a machine learning model has been developed that classifies
user messages. Here, the class is the intent from the dataset. The stages of the classification algorithm
are shown in Figure 3.


Figure 3: Classification algorithm


3.1 Text Preprocessing

   Text preprocessing is a mandatory step in solving the problem of natural language processing, which
allows increasing the accuracy of the classification of user intentions. It includes reducing words to
lowercase, lemmatization, removing noise (non-letter characters), correcting grammatical errors in user
queries.
   In order for the machine learning model to perceive the same words written using different registers
as the same user's intention, all words are reduced to lower case in the developed model. For
lemmatization, the pymorphy2 library was chosen, the use of which showed the best results for a small
amount of data at the input. As a result of lemmatization, word forms are reduced to a normal
(dictionary) form [6] for their subsequent analysis. Noise removal consists in removing non-letter
characters – numbers, punctuation marks, extra spaces, special characters, or, for example, html tags.
Removing noise and reducing words to lowercase are implemented using the string library. To correct
grammatical errors in user requests, the pyaspeller library is used, whose tools, in case of an incorrect
word (grammatical error), replace it with the closest correct form of the word.

3.2 Vectorization

   Vectorization is the process of converting text into a numeric vector. To create a chatbot, an analysis
of existing algorithms for creating vector representations of texts was performed, the following
vectorization algorithms were selected and investigated [7]: CountVectorizer, TfidfVectorizer,
HashingVectorizer. As a result of the research, the optimal values of the vectorizer parameters are
found, presented in Table 1.

Table 1
Parameters of vectorizers
       Vectorizer                                                  Parameters of vectorizers
    CountVectorizer
                              creating 2-3       the threshold for the frequency of ignoring terms
                                                 when building a dictionary is 0.85 (if the threshold is
                                                 exceeded, the terms are ignored)

    TfidfVectorizer        character n-grams     - the threshold for the frequency of ignoring terms
                                                 when building a dictionary is 0.85;
                                                 - linear scaling


  HashingVectorizer         to be extracted         not


3.3 Classification

    To develop the classifier of intents, classical machine learning methods were considered [8]: the
kNearest Neighbors method; the Decision Tree Classifier; the Naive Bayes method; Logistic
Regression; the Support Vector Machines method.
    All the parameters of the classification algorithms that were used in training are shown in
Table 2.
    For machine learning, the dataset was divided into 2 parts: 2/3 of the sample was used for training
the model, 1/3 was used for testing. The data was divided into samples using stratification, which made
it possible to increase the classification accuracy for intents with an unequal number of examples in the
dataset. To select the optimal parameters of each classification algorithm, a cross-validation mechanism
was used, implemented using GridSearch, a tool of the sklearn library for automatically selecting the
best hyperparameters of the model in a fixed grid of possible values.
    To assess the effectiveness of the classification algorithms, the following metrics were considered
[9]: the proportion of correct answers of the classification algorithm (accuracy); accuracy (precision) –
a number of intents that are really objects of this class among all intents assigned by the chatbot to this
class; completeness (recall)-the proportion of found intents of the class among all intents of the class.
Table 2
Configuring classifier parameters
          Classifier                   Parameters          Initial value     Final value         Step
     K-Nearest Neighbors         number of                       2               10                1
                                 "neighbors"

   Decision Tree Classifier      maximum tree depth             2                 10               1
                                 a function for           - entropy (the more homogeneous the set,
                                 dividing data into       the less entropy)
                                 subclasses               - error of the 1st kind (the frequency of a
                                                          randomly selected example of a training
                                                          sample will be classified incorrectly, gini)

   Naive Bayes                   a priori probabilities   they are selected automatically
                                 of classes
   Logistic Regression           regularization           methods are investigated: Lasso regression,
                                 method                   Ridge regression, Elastic-net

                                 class weights            - balanced (inversely proportional to the
                                                          frequencies of classes in the input data),
                                                          - they are selected automatically
                                 maximum number of                              100
                                 training iterations


  Support Vector Machines        regularization                1,0               2,0             0,25
                                 parameter
                                 (selection of
                                 significant features)
                                 class weights             - balanced (inversely proportional to the
                                                          frequencies of classes in the input data),
                                                          - they are selected automatically
                                 maximum number of                             100
                                 training iterations


3.4 The results obtained

   Based on the data obtained, it is concluded that the best options for the vectorizer and classifier will
be the CountVectorizer and the support vector method (Figure 4). The accuracy of the classification of
user requests by the model was 97%.
Figure 4: The result of training the model

4. An algorithm based on the Levenshtein distance

    To improve the accuracy of intent recognition and reduce the number of situations when the chatbot
will not be able to determine the user's intention and answer his question (in this case, a "stub" is
triggered), a modified Levenshtein algorithm was developed and applied [10], based on the calculation
of the "editorial distance" metric – the difference between two sequences of characters. The value of
the Levenshtein metric is determined by the minimum number of operations of replacing, inserting,
deleting one character when converting one string (word) to another. This algorithm is included in the
processing of intents if the machine learning model cannot classify the user's intention. In this case, the
user's message is compared with messages from the dataset and the Levenshtein distance is calculated.
The ratio distances of the Levenshtein to the length of the message from the dataset is taken as a
configurable parameter of the modified algorithm:

                                      𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑢𝑠𝑒𝑟 𝑟𝑒𝑞𝑢𝑒𝑠𝑡)
                            𝑙𝑒𝑛𝑔𝑡ℎ(𝑒𝑥𝑎𝑚𝑝𝑙𝑒 𝑜𝑓 𝑎 𝑟𝑒𝑞𝑢𝑒𝑠𝑡 𝑓𝑟𝑜𝑚 𝑎 𝑑𝑎𝑡𝑎𝑠𝑒𝑡)
                                                                        <0,2,                       (1)

where distance is the Levenshtein distance; length is a function that calculates the number of characters
in a string.
    The threshold value is empirically determined to be 0.2. If the ratio value is less than the threshold,
then the user's intention coincides with the intention to which the example from the dataset is attributed.

5. Creating a Telegram сhatbot

    To interact with users, the free cross-platform messenger Telegram was used [11], which allows
users to exchange text, voice messages, as well as media files of various formats. The Python-
telegrambot library, created for the development of bots for Telegram, provides the ability to add
various functionalities for bots, for example, sending messages, files, processing commands (a line
starting with a slash character "/", including up to 32 characters of the Latin alphabet, numbers, and
underscores), etc. [12].
    The developed machine learning model and an algorithm based on the Levenshtein distance were
integrated into a chatbot in Telegram. To start working with a chatbot, you need to type the name of the
bot — Normobot in the Telegram search. When you enter the start command, a welcome message
appears with an explanation of the chatbot's operation (Figure 5).
     At the request of the user, the bot can send the necessary regulatory documents. So, in Figure 6,
 under the message from the bot with an explanation of what the norm control is, there are two buttons,
 "Regulation on the verification procedure", "Regulation by type of activity". When the button is
 clicked, the corresponding document will be sent to the user.
     For the convenience of the user when communicating with the chatbot, voice message recognition
 has been added. The algorithm voices for processing messages includes, obtaining the id of a file with
 a voice message; downloading this files with the OGG extension; converting it to a file with the WAV
 extension; reading a file with a message, and converting it into a text form; transmitting a text message
 to a machine learning model. Converting a file from the OGG extension to the WAV extension is
 implemented using the pydub library. The audio recording is converted to text using the Speech
 Recognition library.
     Figure 7 shows an example of a voice request. During the recording of the message, the question
 was asked, "what to do after passing the norm control?".
     There are two buttons under the input line, "Checklist" and "Main errors". When you click on the
 "Checklist" button, the user will receive a list of documents necessary for passing the norm control
 procedure. After clicking the "Basic errors" button, the user will receive information about the mistakes
 that students make most often when preparing documents for standard control.
     To work with the chatbot the administrator mode is implemented. After entering the password, the
 user gets additional features, such as viewing the program logs, viewing statistics on requests to the
 chatbot and sending messages on behalf of the chatbot (Figure 8). The added convenient function of
 sending messages on behalf of the chatbot allows the teacher to give an extended answer to the
 questions asked.


Figure 5: Getting started with a chatbot
Figure 6: Getting started with a chatbot


Figure 7: Example of a voice request

   The Logging library is used for logging, i.e. writing data about the program's operation to a file on
the disk, which is called a log or log. The logging data is displayed in the console and saved in a file
(Figure 9).
   Information about user requests is saved in a file with the CSV extension. This functionality was
implemented using the built-in CSV library. A screenshot of the file with user requests is shown in
Figure 10. The password required to enter the administrator mode is located in a separate file. This
allows you to change the password without reassembling the project.
Figure 8: Administrator Mode


Figure 9: Logging
Figure 10: Request file

6. Conclusion

   As a result of the research, a chatbot with artificial intelligence has been developed to automate the
process of standard control of the WRC, which is able to send documents to the student for standard
control, give advice on the design of an explanatory note and other regulatory documents, check the
correctness of the design of documents, and enable teachers to give advice to students on behalf of the
bot. An extension of this project is the development of an algorithm for automatically filling out the
documents necessary for passing the standard control and automating the process of checking the WRC
for anti-plagiarism, as well as using it in other educational processes on the corresponding datasets.

7. References

[1] O. A. Yudin, I. A. Yudin, Writing a chatbot assistant for entering a higher educational institution.
     Modern science: Actual problems of theory and practice, Series: natural and technical sciences
     6(2) (2019) 117–122.
[2] National standard GOST R 7.0.100–2018. Bibliographic record. Bibliographic description.
     General requirements and rules of compilation. Moscow: Standartinform, 2018. 128 p. URL:
     http://www.skunb.ru/data/upload/documents/files/ibo/GOST_new.pdfD0%A2_%D0%A0_7_0_1
     00_2018_1204.pdf.
[3] NNSTU LDPE 11.2/34-18. Position by type of activity. About the final qualifying work on
     educational programs of higher education of NNSTU-Nizhny Novgorod: NNSTU named after
     R.E. Alekseev, 2018, 38 p. URL: https://www.nntu.ru/frontend/web/ngtu/files/org_structura/
     upravleniya/umu/docs/norm_docs_ngt u/pologenie_vipysk_rab_opop.pdf?23-04.
[4] NNSTU-LDPE-11.3-04-17. Regulations on the procedure for checking final qualifying works for
     the amount of borrowing and their placement in the electronic library system of NNSTU-Nizhny
     Novgorod: NNSTU named                 after      R. E. Alekseev,  2017,      12      p.      URL:
     https://www.nntu.ru/frontend/web/ngtu/files/org_structura/upravleniya/umu/docs/norm_docs_ngt
     u/polog_o_poryadke_proverki_vkr.pdf.
[5] PyQt5, Python Package Index, 2021. URL: https://pypi.org/project/PyQt5.
[6] I. Akhmetov, A. Krassovitsky, I. Ualiyeva, R. Mussabayev, A. Gelbukh, Lemmatization of russian
     language by tree regression models, Research in Computing Science 149(3) (2020) 147-153.
[7] 4 methods of text vectorization, Python School, 2020. URL: https://python-
     school.ru/nlpvectorization-methods/
[8] Overview of classification methods in machine Learning using Scikit-Learn, Tproger, 2019. URL:
     https://tproger.ru/translations/scikit-learn-in-python/
[9] Metrics in machine learning tasks, Habr, 2017. URL: https://habr.com/ru/company/ods/
     blog/328372/
[10] J. McConnell. Analysis of algorithms. Active learning approach: a textbook-Moscow:
     Technosphere, 2018, 416 p.
[11] 6 indisputable advantages of Telegram bots over mobile applications, sites and groups in social
     networks, Hab, 2015. URL: https://habr.com/ru/post/296388/
[12] Library      in      Python      python-telegram-bot,       DOCS    Python3,      2021.      URL:
     https://docspython.ru/packages/biblioteka-python-telegram-bot-python/.