Speech-To-Text Software Design for the High Education Learning
Elena Yashina, Tetiana Rubanik and Andriy Chukhray
     National Aerospace University «Kharkiv Aviation Institute», Chkalova St., 17, 61070 Kharkiv, Ukraine


                 Abstract
                 The article is dedicated to the application of Speech-to-text technologies and software in
                 higher education. Computer-aided and learning assistive technologies play an important role
                 in the educational process of a modern university, both in online and face-to-face learning. The
                 article considers the use of Automatic speech recognition and Speech-to-text technology and
                 software in education and reviews the available Speech-to-text services. An architecture design
                 for an application for preparing lecture notes is proposed. The need to integrate the
                 application with other learning assistive technologies (text editors, graphical editors, Image-
                 to-text and others) is noted. Difficulties in applying Speech-to-text technologies and
                 development perspectives are considered.

                 Keywords
                 Speech-to-text, computer-aided learning, learning assistive technologies.

1. Introduction
    Modern universities actively use various e-learning and learning assistive technologies [1]. One of
them is Automatic speech recognition and Speech-to-text technology, which allows one to quickly obtain
a transcript of a lecture or to transcribe a pre-made audio recording. Initially, these technologies were used
in teaching special categories of learners: hearing-impaired or foreign students. However, the progress
of technology, growth in the speed and accuracy of conversion, as well as the emergence of accessible
services, have made it possible to significantly expand the scope of their application. Providing
information in different forms (audial, visual, textual) makes it possible to better meet the needs of
students with different learning styles.
    The COVID-19 pandemic has caused a massive shift to online learning [2]. The role of digital
learning platforms has dramatically increased. Various resources, including lecture recordings, should
be available to students at any time.
    Even more significant changes occurred after the beginning of the war in Ukraine in 2022.
Thousands of students lost access to the classrooms and libraries of their universities. Synchronous online
sessions are interrupted by air raid alerts and blackouts. The asynchronous learning mode has become a
priority. This places high requirements on the completeness and availability of educational resources. The
combined impact of the pandemic and the war has been destructive for all areas of life [3]. Overcoming
the encountered problems is impossible without the digitalization of various areas, first of all education.
    Usually the lecture is delivered in the form of a video recording. However, the presence of a text
transcript provides additional opportunities. The text is easy to edit, supplement and correct; it is easy
to give a text document a structure and a table of contents; it is possible to search by keywords, etc.
At the same time, lectures on technical and natural sciences are full of illustrative material: formulas,
drawings, diagrams, etc. Turning a text transcript into full-fledged lecture notes requires additional
editing tools.


2nd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2022), December 2-4, 2022, Łódź, Poland
EMAIL: o.yashina@khai.edu (E. Yashina); t.m.rubanik@student.khai.edu (T. Rubanik); achukhray@gmail.com (A. Chukhray)
ORCID: 0000-0003-2459-1151 (E. Yashina); 0000-0002-8277-1575 (T. Rubanik); 0000-0002-8075-3664 (A. Chukhray)
              ©️ 2022 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)
   The aim of this article is to improve the availability of educational resources, such as lecture
recordings, through the development and implementation of Speech-to-text software.

2. Automatic speech recognition and Speech-to-text technology in
   education
   Speech-to-text (STT) technology was previously used mainly to help certain groups of students (i.e.,
students with learning or physical disabilities or foreign students) in order to guarantee them equal
access to learning. However, over time the range of target users involved in research on STT technology
has broadened. Nowadays, STT technology is adopted to assist not only students with special needs but
also the general population of students, for wider educational purposes such as enhancing students'
understanding of the presented learning content during and after academic activities.
   The paper [4] reviewed the literature from 1999 to 2014 inclusive on how Speech-to-Text recognition
technology has been applied to enhance learning. Four main areas of STT use in teaching were
identified.
   1. Students with cognitive or physical disabilities. It is extremely difficult for these students to
        focus their visual attention on note-taking and the instructor (or interpreter) simultaneously.
        Therefore, it was suggested to apply assistive technologies, such as a speech-to-text support
        service, to enhance computer-assisted learning for students with different types of disabilities.
   2. Non-native speakers and foreign students. The speech-to-text recognition technology is a
        potentially reliable tool for non-native speaker students to better understand a speech given in a
        foreign language.
   3. Online students. Network traffic congestion can cause poor quality of audio communication in
        a synchronous cyber classroom. Under such condition, students are not able to hear a speaker
        clearly. This issue was viewed as one technological challenge. It negatively affects online
        teaching and learning activities as it hinders students’ understanding of a delivered speech, and
        it also hampers students from engaging in classroom participation and interaction. STT
        technology allows you to create real-time transcripts or subtitles that are simultaneously
         displayed to students on their computer screens. In this way students could listen to the speaker
        and read transcripts at the same time. More importantly, the text generated by STT was saved
        for further revision to fix some recognition errors, and students could get a near verbatim
        transcript to review after class.
   4. Students in a traditional learning environment. The adoption of STT technology in a traditional
        learning environment has several benefits. One of them is to improve teaching methods and to
        enhance learning opportunities. For example, by using the STT, teachers can take a proactive,
        rather than a reactive approach to teach students with different learning styles. It provides
        educators with a practical means of making their teaching accessible and improves the quality
        of instruction in the process.
   Two different methods of STT-mediated lecture delivery are most commonly used: real-time
subtitles and transcription after the lecture. When lecture transcripts were available, students were
able to pay more attention to the instructor instead of focusing on recording complete class notes, and
with the lecture transcripts they could review the lecture material several times. Besides, students
were able to take notes, make comments and remarks, and look for specific text by searching keywords
and time periods. The students who had access to post-lecture transcriptions received higher scores.
However, most students claimed that the accuracy of STT technology was not high enough, and
text generated with many errors could distract their attention from the lecture.
   Review [4] shows that participants in most studies on STT, no matter what category of users they
belong to and no matter what learning environment they learn in, had positive perceptions toward
usefulness of STT transcripts for learning.
   According to the survey [5], 20% of British higher education students used various assistive
technologies in 2020, and 9% of students used dictation (speech-to-text) technology. 51% of students
said their organization had offered them support in using assistive technologies. Approximately half of
the students (54%) said they enjoyed trying out new and innovative technologies, and less than half (43%)
said they were comfortable using mainstream technologies. 89% of higher education students had access
to online course materials, e-books and journals, and recorded lectures at their organization whenever
they needed them.
   The 2020/21 survey [6] was taken at a time when students, faculty and colleges continued to
experience disruptions caused by the COVID-19 pandemic. Colleges had to respond quickly to a
changing environment in order to maintain and reimagine the training and support they were able to
offer, and also to solve many other operational aspects of delivery. 58% of students rated the quality of
online learning materials as well designed, although less than half of learners (41%) agreed that the online
learning materials were engaging and motivating. Two thirds of learners (66%) had accessed course
materials and notes. Substantial numbers (63%) had also submitted coursework and taken part in live
online lectures/teaching sessions.
   Learners were asked what they thought were the most positive and negative aspects of online
learning. Their responses reveal that learning preferences are very individual: what some learners
really like, others do not. Many learners highlighted lecture recordings, which is interesting, as lecture
recordings have not traditionally been so widely available or used. Students also noted easy and
convenient access to learning resources, materials and information.
   The study [7] shows that students' attendance and engagement dropped significantly during live
online delivery due to the impact of the COVID-19 pandemic. However, given the way technology has
been used during the pandemic to deliver learning through recorded lectures and seminars on a virtual
platform, attendance and engagement in higher education seem to lose their importance, since students
do not have to attend classes to get access to the course material.
   Recording lectures and seminars became the norm in higher education for synchronous and
asynchronous teaching during COVID-19. The students who graduated in 2020-21 are much better
prepared than the previous year's graduating students. These new cohorts of students have a very clear
understanding of the role and implications of online recorded lectures and seminars for their learning
and knowledge.

3. Speech-to-text methods and technologies
    When sounds come out of someone's mouth to form words, they create a series of vibrations. Speech-
to-text technology works by picking up these vibrations and translating them into digital form using
an analog-to-digital converter. An analog-to-digital converter takes sounds from an audio file, measures
the waves in great detail, and filters them to distinguish the corresponding sounds. Sounds are broken
down into hundredths or thousandths of a second and then matched with phonemes. A phoneme is a
sound unit that distinguishes one word from another in any given language. For example, there are
approximately 40 phonemes in the English language. Once broken down, the phonemes are run through
a mathematical model that compares them to known sentences, words and phrases. The result is
provided as text based on the most likely version of the audio.
    A computer program uses linguistic algorithms, such as Automatic speech recognition (ASR), to
sort audio signals from spoken words and translate those signals into text. Speech-to-text works using
a complex machine learning model that involves several steps [8].

    3.1.        Automatic speech recognition methods

    The function of an ASR is to take a sound wave as input and convert the spoken speech into text
form; the input can be either captured directly using a microphone or taken from an audio file [9]. The
problem can be stated as follows: for a given input sequence $X = (x_1, x_2, \dots, x_n)$, where $n$ is
the length of the input sequence, the function of an ASR is to find the corresponding output word
sequence $W = (w_1, w_2, \dots, w_m)$, where $m$ is the length of the output sequence, that has the
highest posterior probability $P(W|X)$:

                                $W^{*} = \arg\max_{W} \dfrac{P(W)\,P(X|W)}{P(X)},$                   (1)

where $P(W)$ is the probability of the occurrence of the word sequence $W$, $P(X)$ is the probability
of the acoustic signal $X$, and $P(X|W)$ is the probability of the acoustic signal $X$ given the word
sequence $W$.
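    Formula (1) is simply Bayes' rule applied to the posterior $P(W|X)$; since $P(X)$ does not depend
on the word sequence $W$, it can be dropped during the maximization:

        $W^{*} = \arg\max_{W} P(W|X) = \arg\max_{W} \dfrac{P(W)\,P(X|W)}{P(X)} = \arg\max_{W} P(W)\,P(X|W).$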
    An ASR can generally be divided into four modules: a pre-processing module, a feature extraction
module, a classification model, and a language model. Usually the input given to an ASR is captured
using a microphone. This implies that noise may also be carried alongside the audio. The goal of
preprocessing the audio is to improve the signal-to-noise ratio. There are different filters and methods
that can be applied to a sound signal to reduce the associated noise. Framing, normalization, end-point
detection and pre-emphasis are some of the frequently used methods to reduce noise in a signal.
Preprocessing methods also vary based on the algorithm being used for feature extraction. Certain
feature extraction algorithms require a specific type of pre-processing method to be applied to its input
signal. After pre-processing, the clean speech signal is then passed through the feature extraction
module. The performance and efficiency of the classification module are highly dependent upon the
extracted features.
    There are different methods of extracting features from speech signals. Features are usually the
predefined number of coefficients or values that are obtained by applying various methods on the input
speech signal. The feature extraction module should be robust to different factors, such as noise and
echo effect. The most commonly used feature extraction methods are Mel-frequency cepstral coefficients,
linear predictive coding, and discrete wavelet transform.
    The third module is the classification model; this model is used to predict the text
corresponding to the input speech signal. The classification models take input of the features extracted
from the previous stage to predict the text. Like the feature extraction module, there are different types
of approaches that can be applied to perform the task of speech recognition.
    The first type of approach uses joint probability distribution formed using the training dataset, and
that joint probability distribution is used to predict the future output. This approach is called a generative
approach; hidden Markov model and Gaussian mixture models are the most commonly used models
based on this approach.
    The second approach calculates a parametric model using a training set of input vectors and their
corresponding output vectors. This approach is called the discriminative approach; support vector
machines and artificial neural network are its most common examples. Hybrid approaches can also be
used for classification purposes; one example of such a hybrid model is that of a hidden Markov model
and artificial neural network. The language model is the last module of the ASR; it consists of various
types of rules and semantics of a language. The language model helps to recognize the phonemes
predicted by the classifier and is also used to form trigrams, words or sentences from all the predicted
phonemes of a given input. Most modern ASRs are also designed to work without a language model;
such ASRs can still predict the words and sentences spoken in the given input, but their efficiency can
be increased significantly by using a language model.

    3.2.         ASR accuracy evaluation
    Evaluation is one of the most important aspects of any research; because of its importance, this
section explains in detail the different metrics that can be used to evaluate the performance of an ASR.
The performance of a speech recognition system usually depends on two factors: the accuracy of the
produced output and the processing speed of the ASR.
    The following methods can be used to measure the accuracy of an ASR. The accuracy of an ASR is
hard to calculate as the output produced by the ASR may not have the same length as the ground truth.
Word error rate (WER) is the commonly used metric to estimate the performance of an ASR, as it
calculates error on word level rather than phoneme level [10]. The WER can be calculated using the
following formula:

                                        $WER = \dfrac{S + D + I}{N},$                                 (2)

where S is the number of substitutions performed in the output text as compared to the ground truth, D
is the number of deletions performed, and I is the number of insertions performed; N is the total number
of words in the ground truth. Word Recognition Rate (WRR) is a variation of WER that can also be
used to evaluate the performance of an ASR. It can be calculated using the following formula:

                                        $WRR = 1 - WER.$                                              (3)

   Another metric is the Match Error Rate (MER):

                                   $MER = \dfrac{S + D + I}{H + S + D + I},$                          (4)
                                   $H = N - (S + D),$                                                 (5)

where parameters H, S, D and I correspond to the total number of word hits, substitutions, deletions and
insertions [11].
   Word Information Lost (WIL) and Word Information Preserved (WIP) are calculated by the formulas:

                                   $WIP = \dfrac{H}{N_1} \cdot \dfrac{H}{N_2},$                       (6)
                                   $WIL = 1 - WIP,$                                                   (7)

where $N_1$ and $N_2$ are the number of words in the ground-truth text and in the output transcript,
respectively.
    The lower the WER, MER and WIL, the better the performance [12].
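    As an illustration, the following minimal PHP sketch (the function name and the sample sentences
are hypothetical, introduced only for this example) aligns a reference and a hypothesis at the word level
with a Levenshtein-style dynamic program, counts S, D and I, and then applies formulas (2), (4) and (6):

<?php
// Hypothetical helper: aligns reference and hypothesis word sequences and counts
// substitutions (S), deletions (D) and insertions (I) with respect to the reference.
function wordErrorCounts(array $ref, array $hyp): array
{
    $n = count($ref);
    $m = count($hyp);
    $d = [];                                    // $d[$i][$j] = minimal edit cost
    for ($i = 0; $i <= $n; $i++) { $d[$i][0] = $i; }
    for ($j = 0; $j <= $m; $j++) { $d[0][$j] = $j; }
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $sub = $d[$i - 1][$j - 1] + ($ref[$i - 1] === $hyp[$j - 1] ? 0 : 1);
            $del = $d[$i - 1][$j] + 1;          // reference word missing from hypothesis
            $ins = $d[$i][$j - 1] + 1;          // extra word in hypothesis
            $d[$i][$j] = min($sub, $del, $ins);
        }
    }
    // Backtrack to count the individual operation types.
    $S = $D = $I = 0;
    $i = $n; $j = $m;
    while ($i > 0 || $j > 0) {
        if ($i > 0 && $j > 0 && $d[$i][$j] === $d[$i - 1][$j - 1] + ($ref[$i - 1] === $hyp[$j - 1] ? 0 : 1)) {
            if ($ref[$i - 1] !== $hyp[$j - 1]) { $S++; }
            $i--; $j--;
        } elseif ($i > 0 && $d[$i][$j] === $d[$i - 1][$j] + 1) {
            $D++; $i--;
        } else {
            $I++; $j--;
        }
    }
    return ['S' => $S, 'D' => $D, 'I' => $I, 'N' => $n];
}

$ref = explode(' ', 'the lecture covers discrete wavelet transform');
$hyp = explode(' ', 'the lecture covers this great wavelet transform');
['S' => $S, 'D' => $D, 'I' => $I, 'N' => $N] = wordErrorCounts($ref, $hyp);

$H   = $N - ($S + $D);                          // word hits, formula (5)
$WER = ($S + $D + $I) / $N;                     // formula (2)
$MER = ($S + $D + $I) / ($H + $S + $D + $I);    // formula (4)
$WIP = ($H / $N) * ($H / count($hyp));          // formula (6), N1 = |reference|, N2 = |hypothesis|
printf("WER=%.3f MER=%.3f WIL=%.3f\n", $WER, $MER, 1 - $WIP);

    For the sample pair above the sketch reports WER ≈ 0.33, MER ≈ 0.29 and WIL ≈ 0.40, illustrating
how the three metrics weigh the same recognition errors differently.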
    Different from WER, BLEU can evaluate whether the transcription maintains the context and
organization of the sentence. BLEU was originally proposed for neural machine translation, and it is
claimed to be highly correlated with human assessment. BLEU is based on the precision of n-grams,
which compares the n-grams of the reference text $T^*$ with the n-grams of its transcription $T$. Let
$NG(n, T)$ be the set of n-grams of text $T$; the n-gram precision $P_n$ (formula (8)) between texts
$T^*$ and $T$ is:

                                 $P_n = \dfrac{|NG(n, T^*) \cap NG(n, T)|}{|NG(n, T)|}.$              (8)

   BLEU is calculated as the geometric mean of $P_n$, for n = 1, 2, 3, 4, multiplied by a factor that
penalizes transcriptions shorter than the reference text. The $BLEU_{penalty}$ factor is 1 if $|T| > |T^*|$
and $e^{1 - |T^*|/|T|}$ otherwise. BLEU is defined by the formula:

                           $BLEU = \sqrt[4]{P_1 P_2 P_3 P_4} \times BLEU_{penalty}.$                  (9)
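    A small PHP sketch of this computation (a simplified, set-based variant that follows formulas (8)
and (9) literally, not the clipped-count BLEU used in machine translation toolkits; helper names are
illustrative):

<?php
// Illustrative helpers: set-based n-gram precision between the reference text T*
// and the transcription T, combined into BLEU with the brevity penalty above.
function ngramSet(array $words, int $n): array
{
    $set = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $set[implode(' ', array_slice($words, $i, $n))] = true;
    }
    return $set;
}

function bleu(array $reference, array $transcription): float
{
    $product = 1.0;
    for ($n = 1; $n <= 4; $n++) {
        $refSet = ngramSet($reference, $n);
        $hypSet = ngramSet($transcription, $n);
        $common = count(array_intersect_key($refSet, $hypSet));
        if ($common === 0 || count($hypSet) === 0) {
            return 0.0;                        // any zero precision makes BLEU zero
        }
        $product *= $common / count($hypSet);  // Pn, formula (8)
    }
    // Penalty: 1 if the transcription is longer than the reference, e^(1 - |T*|/|T|) otherwise.
    $penalty = count($transcription) > count($reference)
        ? 1.0
        : exp(1 - count($reference) / count($transcription));
    return pow($product, 0.25) * $penalty;     // formula (9)
}

// Example: an exact match yields the maximal score of 1.
echo bleu(explode(' ', 'a phoneme is a sound unit'), explode(' ', 'a phoneme is a sound unit'));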

   METEOR was proposed to fix limitations of BLEU, such as the fact that BLEU does not require
explicit word-to-word matching. Another limitation is that the BLEU score becomes zero whenever one
of the n-gram precisions is zero, which means the score at sentence level can be meaningless. METEOR
is based on the harmonic mean of unigram precision and recall, multiplied by a penalty factor. METEOR
is defined by the formulas:

                                 $R_n = \dfrac{|NG(n, T^*) \cap NG(n, T)|}{|NG(n, T^*)|},$            (10)

                          $METEOR = \dfrac{10\,P_1 R_1}{R_1 + 9\,P_1} \cdot METEOR_{penalty}.$        (11)
   To calculate the $METEOR_{penalty}$, the unigrams in $NG(1, T^*) \cap NG(1, T)$ are grouped into
chunks, such that each chunk contains the maximum number of unigrams in adjacent positions in both
$T^*$ and $T$. The fewer the chunks, the better the system transcription matches the reference
transcription [13]:

                        $METEOR_{penalty} = \dfrac{1}{2} \cdot \dfrac{\#chunks}{|NG(1, T^*) \cap NG(1, T)|}.$   (12)

   These metrics allow you to evaluate various aspects of the ASR accuracy and performance.

    3.3.         Speech-to-text tools
    Speech-to-text software recognizes and translates spoken language into text using ASR and other
computational linguistics methods. This technology is directly related to computer speech recognition
(voice recognition). Certain applications, tools, and devices can transcribe audio streams in real time to
display and interact with text using Unicode characters. Existing speech recognition platforms provide
APIs to developers and execute received requests on their own servers, allowing them to be used as
"black boxes".
    Speech-to-Text interprets the user's words as audio and converts them to written text by utilizing
deep learning techniques. Many applications with automated speech recognition or Speech-to-Text have
been developed over the past few years. Thanks to the substantial development of deep neural networks,
the performance of STT has drastically improved.
    The widely used commercial ASR online services are: Google Cloud Speech-to-Text, which is
integrated into the widely used Google Cloud platform; IBM Watson Speech-to-Text; Microsoft Azure
Cognitive Speech Services; Amazon Transcribe; and others.
    In the study [12], the accuracy and efficiency of the widely used commercial ASR online services
was investigated with the metrics WER (formula (2)), MER (formula (4)) and WIP (formula (6)), and in
the study [13] their accuracy and efficiency were evaluated with the metrics WER (formula (2)), BLEU
(formula (9)) and METEOR (formula (11)). The main method of evaluating ASR engines is finding their
Word Error Rate. Generally speaking, the lower the WER, the more accurate the speech recognition.
According to these studies, the accuracy of the models by WER is:
    1. Google Cloud Speech-to-Text: 11.58– 20.00%
    2. IBM Watson Speech-to-Text: 14.81% – 28.57%
    3. Amazon Transcribe: 10.27 – 20.00%
    4. Microsoft Azure Cognitive Speech Services: 8.14 – 11.11%
    Accuracy depends on the dataset used, the languages and the presence of noise. The models have
comparable accuracy. The choice of service is based on the availability of the API, its functionality, the
possibility of integration with other services, etc. The Amazon Transcribe service was chosen for
developing the system.
    Amazon Transcribe is part of the Cloud Computing Services from Amazon Web Services (AWS).
Amazon Transcribe uses a deep learning model to perform ASR to quickly and accurately convert
speech to text. For this conversion, the data first needs to be uploaded to the Amazon Simple Storage
Service (Amazon S3); Transcribe then reads the objects from S3 for transcription. Transcribe jobs can
also be processed in batch mode (up to 100 parallel jobs). The provided model automatically adds
punctuation and number formatting, so that the output closely matches the quality of manual
transcription at a fraction of the time and expense. Numbers are also transcribed into digits ("normal
form") instead of words.

4. Proposed application design
   4.1.    Application architecture and tools

  The application will have two main functions: speech-to-text translation and file editing. In order to
improve the understanding of the material after conversion into text, it must be edited:
      • make corrections to sentences that the system recognized incorrectly;
        • break down the material by structure - separate sections and subtopics, etc.;
        • add formatting;
        • select and add graphic materials to illustrate difficult points, etc.
    Software with editor functions of the corresponding file type is suitable for performing the specified
steps. Some platforms provide their own Application Program Interfaces (API), Software Developer
Kits (SDK), and other ways to integrate ready-made solutions into a new project. The SDK provides a
comprehensive collection of tools to create a flexible, intuitive interface.
    Amazon Transcribe (aws.amazon.com/transcribe/) is an automatic speech recognition service
that makes it easy to add speech-to-text capabilities to any application. The platform supports speech-
to-text conversion both in real time and from pre-recorded audio files, speaker recognition,
identification of the language in which the recording is made, etc. It provides an API, integration with
S3 storage, and SDKs for multiple programming languages.
    In order to interact with the Amazon Transcribe API, it is convenient to use the appropriate client
classes from the SDK package.
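    A minimal sketch of such an interaction, assuming the AWS SDK for PHP v3 clients and hypothetical
bucket and file names (the audio is first uploaded to S3, since Transcribe reads its input from there):

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\TranscribeService\TranscribeServiceClient;

$region = 'eu-central-1';                       // assumption: any region supported by Transcribe
$bucket = 'lecture-audio-bucket';               // hypothetical S3 bucket

// Upload the recorded lecture to S3, because Transcribe reads its input from S3.
$s3 = new S3Client(['version' => 'latest', 'region' => $region]);
$s3->putObject([
    'Bucket'     => $bucket,
    'Key'        => 'lectures/lecture-01.mp3',
    'SourceFile' => '/tmp/lecture-01.mp3',
]);

// Start an asynchronous transcription job for the uploaded file.
$transcribe = new TranscribeServiceClient(['version' => 'latest', 'region' => $region]);
$transcribe->startTranscriptionJob([
    'TranscriptionJobName' => 'lecture-01-' . time(),
    'LanguageCode'         => 'en-US',
    'Media'                => ['MediaFileUri' => "s3://$bucket/lectures/lecture-01.mp3"],
    'OutputBucketName'     => $bucket,          // where the JSON transcript will be written
]);

    Because startTranscriptionJob only starts the job and returns immediately, the application must later
poll its status, which is why the asynchronous mechanism described below is needed.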
    Speech-to-text conversion takes time. Because of this, a call to the API in synchronous mode will stop
the entire application for at least half a minute, which makes such an approach inappropriate. The
"queue" mechanism was chosen as a means of asynchronous work. Thus, the application will send the
request asynchronously, notify the user that their audio has been accepted for processing, and return the
control flow.
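    A possible shape of this dispatch, sketched with Symfony Messenger (class, property and identifier
names here are illustrative, not the final design):

<?php
use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\HttpFoundation\JsonResponse;
use Symfony\Component\Messenger\MessageBusInterface;

// Illustrative message carrying only the data the asynchronous worker needs.
final class StartTranscriptionMessage
{
    public function __construct(
        public int $audioFileId,
        public string $s3Uri
    ) {
    }
}

final class TranscriptionController extends AbstractController
{
    // Dispatch the message to the queue and return immediately, so the user is
    // only notified that the audio was accepted for processing.
    public function transcribe(MessageBusInterface $bus): JsonResponse
    {
        $bus->dispatch(new StartTranscriptionMessage(42, 's3://lecture-audio-bucket/lectures/lecture-01.mp3'));

        return $this->json(['status' => 'accepted']);
    }
}

    With the message class routed to an asynchronous transport (for example, the Redis transport listed
below) in messenger.yaml, the controller returns at once and a queue worker performs the actual call to
Amazon.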
    Figure 1 shows the algorithm for working with Amazon Transcribe using classes from the SDK.




Figure 1: The algorithm for interacting with AWS to extract text from audio

   User data is forwarded to the server, processed and stored in a MySQL database. The system uses
asynchronous requests that will check the status of TranscriptionJob with a time interval of 10-20
seconds. Thus, the system will not waste resources on maintaining the connection and will be able to
provide the user with the result with minimal delay. Figure 2 shows the interaction of the developed
system with the Amazon service.
Figure 2: Interaction of the application under development with Amazon services

   The tools to be used during the further development of the application were selected, in particular:
      • Symfony as a PHP framework (symfony.com);
      • Vue as a JavaScript framework (vuejs.org);
      • Docker as an environment modeling tool: application server, database server, etc.
           (www.docker.com);
      • MySQL as a database (www.mysql.com);
      • Redis as a database for queue management (redis.com);
      • Supervisord as a client/server system that allows you to monitor a number of processes in
           UNIX-like operating systems (supervisord.org);
      • AWS SDK PHP as a set of tools for interacting with Amazon (github.com/aws/aws-sdk-
           php);

    4.2.         Application structure
   Symfony was chosen as the framework for application development. This dictates the choice of a
pattern for interacting with the database: a layer of repository classes based on
ServiceEntityRepository. Entity classes will accordingly be collected in a separate group,
Entity.
   Services will be the layer containing business logic (Figure 3). This separates the application
logic from the models and controllers (the thin controller pattern). Algorithms to be executed
asynchronously will be implemented as handler classes that implement the MessageHandlerInterface
provided by Symfony. Messenger centers around two different classes: the message class that contains
the data, and the handler class that will be called when that message is dispatched. The handler class
reads the message class and performs one or more tasks. The system under development will have two
handlers: one for the request to start a transcription and one for the request for the status of a started
transcription (Figure 4).
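    A hedged sketch of the second handler (class names are illustrative): it asks Amazon for the job
status and, while the job is still in progress, re-dispatches itself with a delay instead of keeping a
connection open:

<?php
use Aws\TranscribeService\TranscribeServiceClient;
use Symfony\Component\Messenger\Handler\MessageHandlerInterface;
use Symfony\Component\Messenger\MessageBusInterface;
use Symfony\Component\Messenger\Stamp\DelayStamp;

// Illustrative message: identifies the transcription job to poll.
final class CheckTranscriptionStatusMessage
{
    public function __construct(public string $jobName)
    {
    }
}

final class CheckTranscriptionStatusHandler implements MessageHandlerInterface
{
    public function __construct(
        private TranscribeServiceClient $transcribe,
        private MessageBusInterface $bus
    ) {
    }

    public function __invoke(CheckTranscriptionStatusMessage $message): void
    {
        $result = $this->transcribe->getTranscriptionJob([
            'TranscriptionJobName' => $message->jobName,
        ]);
        $status = $result['TranscriptionJob']['TranscriptionJobStatus'];

        if ($status === 'COMPLETED' || $status === 'FAILED') {
            // Placeholder: here the application would store the result (or the error)
            // in the database and notify the user.
            return;
        }

        // Still in progress: check again in 15 seconds instead of holding a connection.
        $this->bus->dispatch($message, [new DelayStamp(15000)]);
    }
}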
Figure 3: System services for the implementation of the functionality of the speech-to-text
application




Figure 4: Handler classes, corresponding storage classes, and validator classes

   Access to the system and access to functions according to the user's role will be managed by classes
from the Security group. Symfony is delivered with many authenticators, and third-party packages
also cover more complex cases such as JWT and OAuth 2.0. The application will use JWTAuthenticator
and a separate authenticator for the first login and for obtaining the access token. The Voter component
provided by the framework allows access restrictions to be implemented. Depending on the entities and
the roles of the account, the user will be blocked (an error with a reference to the required access level
is returned) from some functions of the controllers (viewing accounts, downloading the transcription
result, etc.).
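    A possible Voter for the download restriction, with illustrative attribute and entity names
(Transcription and User stand for the corresponding Entity classes):

<?php
use App\Entity\Transcription;   // hypothetical entity classes from the Entity group
use App\Entity\User;
use Symfony\Component\Security\Core\Authentication\Token\TokenInterface;
use Symfony\Component\Security\Core\Authorization\Voter\Voter;

// Illustrative voter: only the owner of a transcription (or an administrator)
// may download its result.
final class TranscriptionVoter extends Voter
{
    public const DOWNLOAD = 'TRANSCRIPTION_DOWNLOAD';

    protected function supports(string $attribute, $subject): bool
    {
        return $attribute === self::DOWNLOAD && $subject instanceof Transcription;
    }

    protected function voteOnAttribute(string $attribute, $subject, TokenInterface $token): bool
    {
        $user = $token->getUser();
        if (!$user instanceof User) {
            return false;   // anonymous users are always denied
        }

        /** @var Transcription $subject */
        return $subject->getOwner() === $user || in_array('ROLE_ADMIN', $user->getRoles(), true);
    }
}

    A controller action can then call $this->denyAccessUnlessGranted(TranscriptionVoter::DOWNLOAD,
$transcription) before streaming the file to the user.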
    4.3.        User interface design
   The layout of the application interface has been developed.
   After logging into the account, the user should see the transcription creation page. The next step will
be viewing the list of audio files with existing transcriptions (Figure 5).




Figure 5: A list of transcriptions

   The user interface for the lecture file editing section will be built using the SDK. The PDFTron editor
(www.pdftron.com) and related products were chosen. The editing window will correspond to the demo
displayed on the official page of the tool (Figure 6).




Figure 6: Transcription editing

   The user can edit the transcription text by adding headings, links and graphic elements (Figure 7).
   The prepared text can be converted to a convenient format (for example, PDF) and placed on the
learning management system (Figure 8).

5. Conclusions and perspectives
   Educational assistive technologies significantly improve the quality and effectiveness of learning,
both online and face-to-face. STT makes it possible to speed up the preparation and delivery of
educational resources to the learning management system. Lecture transcription allows students to better
understand the topic.
   However, you need to keep in mind the disadvantages of Speech-to-text:
        • insufficient performance of existing systems;
        • the transcript contains some errors and inaccuracies;
        • the quality of transcription depends on the clarity of pronunciation, requires a good
           microphone and the absence of external noise;
       •   the best transcription quality is provided for English, the recognition of Ukrainian is much
           worse;
       •   transcription of a discussion with several participants can be worse than a lecture by a single
           teacher.




Figure 7: Editing the text page




Figure 8: Document prepared for publication
    Promising areas of research and development are related to the improvement of the technologies, as
well as to the development of means for integrating various tools. The proposed system will be used in
conjunction with other available software tutoring tools [14, 15].
    The integration of Speech-to-Text, Text-to-Speech and Image-to-Text will significantly improve the
efficiency of e-learning technology [16]. In turn, the integration of assistive technologies based on a
unified digital platform will significantly expand the possibilities of the learning management system.
    However, it must be taken into account that training in an e-learning environment happens
differently than in the traditional classroom and can present new challenges to teachers and students in
this online learning environment [17]. The introduction of tools requires not only the availability of
hardware and software, but also significant changes in the culture and style of teaching, enhancing the
digital experience of teachers and students.

6. References

[1] D.A. Huffaker, S.L. Calvert., The new science of learning: Active learning, metacognition, and
     transfer of knowledge in e-learning applications, Journal of Educational Computing Research 29,
     no. 3 (2003): 325-334. doi: 10.2190/4T89-30W2-DHTM-RTQ2
[2] V.J. García-Morales, A. Garrido-Moreno, R. Martín-Rojas, The transformation of higher
     education after the COVID disruption: Emerging challenges in an online learning scenario,
     Frontiers in Psychology 12 (2021): 616059. doi: 10.3389/fpsyg.2021.616059
[3] D. Chumachenko, P. Pyrohov, I. Meniailov, T. Chumachenko, Impact of war on COVID-19
     pandemic in Ukraine: the simulation study, Radioelectronic and Computer Systems 2 (2022): 6-
     23. doi: 10.32620/reks.2022.2.01
[4] R. Shadiev, W.Y. Hwang, N.S. Chen, Y.M. Huang, Review of speech-to-text recognition
     technology for enhancing learning, Journal of Educational Technology & Society 17, no. 4 (2014):
     65-84. URL: https://www.jstor.org/stable/jeductechsoci.17.4.65
[5] M. Langer-Crame, C. Killen, H. Beetham, Student digital experience insights survey 2020:
     question by question analysis of findings from students in UK further and higher education, (2020).
     URL: https://www.voced.edu.au/content/ngv:91499
[6] Student digital experience insights survey 2020/21 [findings from pulse 1: October-December
     2020], Bristol, England: JISC, 2021. URL: https://www.voced.edu.au/content/ngv:91498
[7] S. Ghosh, Y. Liang, Recorded teaching materials and their impact on students attendance,
     engagement and performance during Covid-19, Academy of Marketing Conference 2021:
     Reframing Marketing Priorities, 05-07 July 2021, Virtual (2021). URL:
     https://eprints.bournemouth.ac.uk/35731/
[8] M. Anniss, How Does Voice Recognition Work?, The Rosen Publishing Group, 2013.
[9] M. Malik, M.K. Malik, K. Mehmood, I. Makhdoom, Automatic speech recognition: a survey,
     Multimedia Tools and Applications 80, no. 6 (2021): 9411-9457. doi: 10.1007/s11042-020-10073-
     7
[10] B. Favre, K. Cheung, S. Kazemian, A. Lee, Y. Liu, C. Munteanu, A. Nenkova et al, Automatic
     human utility evaluation of ASR systems: Does WER really predict performance?,
     INTERSPEECH, pp. 3463-3467, 2013. URL:
     https://pageperso.lis-lab.fr/~benoit.favre/papers/favre_interspeech2013.pdf
[11] A.C. Morris, V. Maier, P. Green, From WER and RIL to MER and WIL: improved evaluation
     measures for connected speech recognition, Eighth International Conference on Spoken Language
     Processing, 2004. URL:
     https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2004/i04_2765.pdf
[12] B. Xu, C. Tao, Z. Feng, Y. Raqui, S. Ranwez, A benchmarking on cloud based speech-to-text
     services for french speech and background noise effect, APIA 2021 - Conférence Nationale sur les
     Applications Pratiques de l’Intelligence Artificielle (événement affilié à PFIA 2021), Jun 2021,
     Bordeaux, France. p. 102-107. URL: https://hal.mines-ales.fr/hal-03277773
[13] R.P. Magalhães, D.J.R. Vasconcelos, G.S. Fernandes, L.A. Cruz, M.X. Sampaio, J.A.F. de
     Macêdo, T.L.C. da Silva, Evaluation of Automatic Speech Recognition Approaches, Journal of
     Information and Data Management 13, no. 3 (2022). doi: 10.5753/jidm.2022.2514
[14] A. Chukhray, O. Havrylenko, The engineering skills training process modeling using dynamic
     bayesian nets, Radioelectronic and Computer Systems 2 (2021): 87-96. doi:
     10.32620/reks.2021.2.08.
[15] A. Chukhray, E. Yashina, Models and Software for Intelligent Web-Based Testing System in
     Mathematics, CEUR Workshop Proceedings 3003 (2021): 1-10. 2021. URL: https://ceur-
     ws.org/Vol-3003/paper1.pdf
[16] A.S. Deshpande, S.V. Shreyas, P.B. Swami, P.R. Jaiswal, Integration of Speech, Image & Text
     Processing Technologies (2017): 251-254. URL:
     https://www.academia.edu/download/53485551/IRJET-V4I450.pdf
[17] A.M. Tirziu, C. Vrabie, Education 2.0: E-learning methods, Procedia-Social and
     Behavioral Sciences 186 (2015): 376-380. doi: 10.1016/j.sbspro.2015.04.213