Research and development of a subtitle management system using artificial intelligence

Andrii M. Striuk1,2,3, Vladyslav V. Hordiienko1
1 Kryvyi Rih National University, 11 Vitalii Matusevych Str., Kryvyi Rih, 50027, Ukraine
2 Kryvyi Rih State Pedagogical University, 54 Universytetskyi Ave., Kryvyi Rih, 50086, Ukraine
3 Academy of Cognitive and Natural Sciences, 54 Universytetskyi Ave., Kryvyi Rih, 50086, Ukraine

Abstract
Subtitles play a vital role in making video content accessible to a wider audience, including individuals with hearing impairments and those who do not understand the spoken language. However, the manual creation of subtitles is a time-consuming and labor-intensive process. This paper proposes an AI-powered subtitle management system that automates the generation and management of subtitles for video content. The system leverages state-of-the-art automatic speech recognition (ASR) and machine translation (MT) technologies to generate accurate and synchronized subtitles in multiple languages. The proposed system architecture consists of a speech recognition module, a machine translation module, a subtitle segmentation and formatting module, and a user-friendly interface. The paper provides a comprehensive literature review of related work in the field of AI-based subtitle generation, covering key aspects such as speech recognition techniques, machine translation approaches, multimodal methods, and evaluation methodologies. The implications of the proposed system for subtitle generation pipelines are discussed, highlighting its potential to enhance efficiency, scalability, and accessibility. The limitations of the current system and directions for future research are also outlined. This research contributes to the advancement of AI-powered subtitle generation and aims to make video content more inclusive and accessible to a global audience.

Keywords
subtitles, artificial intelligence, speech recognition, machine translation, video accessibility

1. Introduction

1.1. Background and motivation

In today's digital age, video content has become an integral part of communication, education, and entertainment. However, the accessibility of video content remains a challenge for individuals with hearing impairments or those who do not understand the spoken language. Subtitles play a crucial role in making video content more inclusive and accessible to a wider audience [1].

Despite the importance of subtitles, the process of manually creating them is time-consuming, labor-intensive, and prone to errors [2]. It requires skilled human translators to listen to the audio, transcribe the dialogue, and synchronize the subtitles with the video timestamps. This manual process often results in delays in the availability of subtitles and limits the scalability of subtitle generation for large volumes of video content.

Advancements in artificial intelligence (AI) technologies, particularly in the fields of automatic speech recognition (ASR) and machine translation (MT), have opened up new possibilities for automating the subtitle generation process [3, 4]. AI-powered systems can significantly reduce the time and effort required for subtitle creation while maintaining high levels of accuracy and quality.

CS&SE@SW 2024: 7th Workshop for Young Scientists in Computer Science & Software Engineering, December 27, 2024, Kryvyi Rih, Ukraine
andrey.n.stryuk@gmail.com (A. M. Striuk)
http://mpz.knu.edu.ua/andrij-stryuk/ (A. M. Striuk)
ORCID: 0000-0001-9240-1976 (A. M. Striuk)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1.2. Research objectives

The primary objective of this research is to develop an AI-powered subtitle management system that automates the generation and management of subtitles for video content. The proposed system aims to leverage state-of-the-art ASR and MT technologies to generate accurate and synchronized subtitles in multiple languages.

Furthermore, this research aims to provide a comprehensive literature review of the existing techniques, applications, and evaluation methodologies in the field of AI-based subtitle generation. By synthesizing the current state of knowledge, we aim to identify the challenges, opportunities, and future research directions in this domain.

1.3. Paper contributions and organization

The main contributions of this paper are as follows:

• We propose an AI-powered subtitle management system that automates the generation and management of subtitles for video content, leveraging state-of-the-art ASR and MT technologies.
• We provide an extensive literature review of related work in the field of AI-based subtitle generation, covering key aspects such as speech recognition, machine translation, multimodal approaches, and evaluation methodologies.
• We present the architecture and key components of the proposed subtitle management system, including the speech recognition module, machine translation module, subtitle segmentation and formatting module, and user interface.

The remainder of this paper is organized as follows: section 2 provides a comprehensive overview of the related work in the field of AI-based subtitle generation. Section 3 describes the proposed AI-powered subtitle management system, including its architecture, components, and functionalities. Finally, section 4 concludes the paper and summarizes the key findings and contributions.

2. Related work

Extensive research has been conducted in the field of AI-based subtitle generation, spanning various techniques, applications, and evaluation methodologies. This section provides a comprehensive overview of the related work, focusing on key aspects such as speech recognition, machine translation, multimodal approaches, and subtitle evaluation metrics.

2.1. Speech recognition for subtitle generation

Automatic speech recognition (ASR) plays a crucial role in the subtitle generation pipeline by converting spoken audio into textual transcripts. Researchers have explored various ASR techniques to improve the accuracy and efficiency of subtitle generation.

Radha and Pradeep [5] proposed an automated subtitle generation system using hidden Markov models (HMMs) for speech recognition. They demonstrated the effectiveness of their approach on English-language videos and highlighted the importance of accurate speech recognition for subtitle quality.

Convolutional neural networks (CNNs) [6] have also been employed for ASR in subtitle generation tasks. Ramani et al. [7] developed an automatic subtitle generation system using CNNs for speech recognition, achieving promising results on real-time video subtitling. They emphasized the significance of audio preprocessing techniques and the choice of media player for seamless subtitle integration.
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM) networks, have gained popularity in ASR for subtitle generation. Kiran et al. [8] proposed a subtitle generation system using sequence-to-sequence RNNs for speech recognition and video scene indexing. Their approach demonstrated improved accuracy and the ability to handle longer video sequences compared to traditional methods.

The application of ASR techniques to specific domains, such as lecture videos, has also been explored. Che et al. [9] developed an automatic lecture subtitle generation system using ASR and evaluated its performance against manual subtitling. They found that the ASR-generated subtitles significantly reduced the time and effort required for subtitle creation while maintaining comparable quality. Similarly, Sridhar et al. [10] proposed a hybrid approach combining acoustic and linguistic features for subtitle generation in computer science lecture videos, achieving improved accuracy in detecting discourse boundaries.

2.2. Machine translation for multilingual subtitles

Machine translation (MT) is essential for generating subtitles in multiple languages, enabling video content to reach a wider global audience. Researchers have investigated various MT approaches, including statistical and neural models, to improve the quality and efficiency of multilingual subtitle generation.

Karakanta et al. [11] conducted a comparative study of different MT approaches for subtitle generation, including phrase-based statistical MT and neural MT. They evaluated the performance of these approaches on a multilingual subtitle corpus and highlighted the challenges in preserving linguistic and cultural nuances in translated subtitles.

Neural MT architectures, such as sequence-to-sequence models with attention mechanisms, have shown promising results in subtitle translation tasks. Du and Lu [4] proposed a neural MT system specifically designed for subtitle translation, incorporating features such as character-level encoding and domain adaptation. Their system achieved significant improvements in translation quality compared to traditional MT approaches.

The quality of AI-generated subtitles compared to human translations has also been a focus of research. Calvo-Ferrer [12] conducted a study comparing the quality of subtitles generated by machine translation systems with those created by human translators. They found that while MT systems have made significant progress, human translators still outperform them in terms of accuracy and contextual understanding.

2.3. Multimodal and end-to-end subtitle generation

Multimodal approaches that leverage both visual and linguistic information have emerged as promising directions for subtitle generation. These approaches aim to capture the contextual and visual cues present in the video to enhance the accuracy and coherence of the generated subtitles.

Shanmugam et al. [13] proposed a multimodal subtitle generation system that combines visual features extracted from the video frames with linguistic information from the audio transcripts. Their approach demonstrated improved synchronization and contextual relevance of the generated subtitles compared to unimodal methods.

Martín et al. [14] developed a multimodal subtitle generation framework that incorporates visual, acoustic, and linguistic features using deep learning techniques.
They evaluated their system on a dataset of educational videos and showed significant improvements in subtitle quality and alignment.

End-to-end subtitle generation, where the entire process from speech recognition to subtitle generation is performed by a single model, has also gained attention. Valor Miró et al. [15] proposed an end-to-end subtitle generation system that directly translates speech into subtitles in multiple languages. Their approach achieved comparable performance to pipeline-based methods while reducing the complexity and error propagation.

Hotta et al. [16] developed an end-to-end speech-to-text translation system specifically designed for subtitle generation. Their system incorporated techniques such as attention mechanisms and beam search to improve the quality and fluency of the generated subtitles.

2.4. Subtitle evaluation metrics and methodologies

Evaluating the quality and effectiveness of AI-generated subtitles is crucial for assessing their usability and acceptability. Researchers have proposed various evaluation metrics and methodologies to measure the performance of subtitle generation systems.

Automatic evaluation metrics, such as word error rate (WER) and bilingual evaluation understudy (BLEU), have been widely used to assess the accuracy and fluency of generated subtitles. Ramani et al. [7] employed WER as a metric to evaluate the performance of their CNN-based subtitle generation system, demonstrating its effectiveness in measuring transcript accuracy.

Kaulage et al. [17] utilized the BLEU score to evaluate the quality of machine-translated subtitles in their multilingual subtitle generation system. They highlighted the importance of considering both the accuracy and fluency of the translations when assessing subtitle quality.

Human evaluation methodologies have also been employed to assess the subjective quality and user experience of AI-generated subtitles. Al Sawi and Allam [18] conducted a comparative analysis of human-generated and AI-generated Arabic subtitles, using qualitative and quantitative approaches to evaluate the subtitle quality and viewer comprehension.

Kuroiwa et al. [19] proposed a human-in-the-loop approach for subtitle generation, combining AI techniques with human intervention to improve the accuracy and cultural appropriateness of the generated subtitles. They emphasized the importance of human expertise in overcoming the limitations of AI systems in understanding cultural nuances.
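For orientation, the sketch below shows how the automatic metrics mentioned above could be computed in practice. It assumes the third-party Python packages jiwer (for WER) and sacrebleu (for BLEU) are available; the reference and hypothesis strings are invented placeholders rather than data from any of the cited studies.

```python
# Illustrative only: computing the automatic subtitle metrics discussed above.
# Assumes the third-party packages `jiwer` and `sacrebleu` are installed;
# all strings are invented placeholders.
import jiwer
import sacrebleu

# Word error rate between a reference transcript and an ASR hypothesis.
reference_transcript = "subtitles make video content accessible to a wider audience"
asr_hypothesis = "subtitle makes video content accessible to a wider audience"
wer = jiwer.wer(reference_transcript, asr_hypothesis)
print(f"WER: {wer:.2%}")

# Corpus-level BLEU between machine-translated subtitles and reference translations.
mt_output = ["Субтитри роблять відео доступним для всіх."]
references = [["Субтитри роблять відеоконтент доступним для всіх."]]
bleu = sacrebleu.corpus_bleu(mt_output, references)
print(f"BLEU: {bleu.score:.1f}")
```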
2.5. Applications in education and entertainment

AI-based subtitle generation has found significant applications in various domains, particularly in education and entertainment. Researchers have explored the benefits and challenges of deploying AI-powered subtitle systems in these contexts.

In the educational domain, Qiu [20] proposed an automatic subtitle generation system for teaching videos using cloud computing techniques. They demonstrated the effectiveness of their approach in reducing the time and effort required for subtitle creation, thereby enhancing the accessibility of educational content.

Martín et al. [14] developed an automatic subtitle generation system specifically tailored for educational videos produced by the Government of La Rioja, Spain. Their system aimed to improve the accessibility of important educational content for individuals with hearing impairments.

Malakul and Park [1] investigated the effects of using an auto-subtitle system in educational videos on learning comprehension, cognitive load, and satisfaction. They found that AI-generated subtitles significantly improved the learning experience for non-native speakers and individuals with hearing impairments.

In the entertainment industry, Kuroiwa et al. [19] explored the challenges and opportunities of AI-based subtitle generation for anime content. They proposed a hybrid approach combining AI techniques with human intervention to improve the accuracy and cultural appropriateness of the generated subtitles, highlighting the importance of human expertise in overcoming linguistic and cultural barriers.

The related work discussed in this section highlights the diverse techniques, applications, and evaluation methodologies in the field of AI-based subtitle generation. Researchers have made significant strides in developing effective speech recognition, machine translation, and multimodal approaches for subtitle generation. However, challenges remain in terms of improving the accuracy, fluency, and contextual understanding of AI-generated subtitles, particularly in handling linguistic and cultural nuances. The evaluation of subtitle quality using both automatic metrics and human assessment is crucial for ensuring the usability and acceptability of AI-generated subtitles in real-world applications.

As the demand for accessible and multilingual video content continues to grow, AI-based subtitle generation systems are expected to play an increasingly important role in facilitating the creation and dissemination of subtitles. Future research directions include the development of more advanced and integrated AI techniques, the incorporation of domain-specific knowledge, and the exploration of user-centric evaluation methodologies to ensure the effectiveness and user satisfaction of AI-generated subtitles.

3. Proposed subtitle management system

This section presents the proposed AI-powered subtitle management system, which aims to automate the generation and management of subtitles for video content. The system leverages state-of-the-art speech recognition and machine translation technologies to generate accurate and synchronized subtitles in multiple languages. The proposed system architecture, key components, and functionalities are described in detail.

3.1. System architecture overview

The proposed subtitle management system follows a modular architecture, consisting of several interconnected components that work together to achieve automated subtitle generation and management. Figure 1 provides an overview of the system architecture, highlighting the main modules and their interactions.

Figure 1: Proposed subtitle management system architecture.

The system architecture consists of the following main components:

• Speech recognition module: responsible for converting the audio content of the video into textual transcripts. It employs advanced acoustic and language models to accurately recognize speech and generate time-aligned transcriptions.
• Machine translation module: takes the transcripts generated by the speech recognition module and translates them into the desired target languages. It utilizes state-of-the-art neural machine translation techniques to produce high-quality translations while preserving the context and meaning of the original content.
• Subtitle segmentation and formatting module: handles the segmentation of the translated text into appropriate subtitle blocks and applies proper formatting and styling to ensure readability and compliance with subtitle standards.
• User interface: a user-friendly interface that allows users to upload videos, select target languages, and manage generated subtitles.
• Database: stores video metadata, transcripts, translations, and subtitle files for efficient retrieval and management.

The modular architecture of the proposed system enables flexibility, scalability, and ease of maintenance. Each component can be independently developed, tested, and updated, allowing for continuous improvement and adaptation to advancements in speech recognition and machine translation technologies.
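To make the data flow between these components concrete, the following sketch outlines one way the pipeline could be wired together in Python. It is a minimal illustration of the modular design described above rather than the authors' implementation; the SubtitleLine dataclass, the callable interfaces, and the function names are assumptions introduced here.

```python
# A minimal sketch of the modular pipeline described in section 3.1.
# The dataclass and interfaces are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SubtitleLine:
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # subtitle text

# Each stage is modelled as a plain callable so that modules can be developed,
# tested, and replaced independently, as the modular architecture requires.
Transcriber = Callable[[str], List[SubtitleLine]]                      # audio path -> time-aligned source lines
Translator = Callable[[List[SubtitleLine], str], List[SubtitleLine]]   # lines, target language -> translated lines
Formatter = Callable[[List[SubtitleLine]], str]                        # lines -> SRT/WebVTT document

def generate_subtitles(video_audio_path: str,
                       target_language: str,
                       transcribe: Transcriber,
                       translate: Translator,
                       format_subtitles: Formatter) -> str:
    """Run the ASR -> MT -> segmentation/formatting chain for one video."""
    source_lines = transcribe(video_audio_path)
    translated_lines = translate(source_lines, target_language)
    return format_subtitles(translated_lines)
```

Concrete ASR, MT, and formatting implementations, such as the sketches in the following subsections, could be adapted to fill these callable roles.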
3.2. Speech recognition module

The speech recognition module plays a crucial role in the subtitle generation pipeline by accurately converting the spoken audio content into textual transcripts. Figure 2 illustrates the workflow of the speech recognition module.

Figure 2: Speech recognition module workflow (audio input → audio preprocessing → acoustic model, language model and speaker diarization → decoding → transcript output).

The speech recognition module incorporates the following key components and techniques:

• Acoustic model: trained on a large dataset of speech samples and their corresponding transcriptions. It learns the relationship between audio features and phonemes, enabling it to recognize speech patterns and map them to textual representations.
• Language model: captures the statistical properties of the target language, including word sequences and grammar. It helps in improving the accuracy of speech recognition by providing contextual information and constraining the search space of possible transcriptions.
• Speaker diarization: the process of segmenting the audio stream into speaker-specific segments. It allows the system to identify and differentiate between multiple speakers in the video, enabling accurate attribution of subtitles to the corresponding speakers.
• Audio preprocessing: before feeding the audio content into the speech recognition module, various preprocessing techniques are applied to enhance the quality and remove noise. These techniques include audio normalization, noise reduction, and speaker adaptation.

The speech recognition module employs state-of-the-art deep learning architectures, such as CNNs and RNNs, to achieve high accuracy in transcribing speech. The module is trained on diverse speech datasets, including various accents, dialects, and languages, to ensure robustness and generalization capabilities.
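As one possible realisation of this module, the sketch below obtains a time-aligned transcript with the open-source Whisper model (openai-whisper package). It is illustrative rather than the system's actual implementation, and the audio file name is a placeholder.

```python
# Illustrative ASR step: obtain a time-aligned transcript with openai-whisper.
# One possible realisation of the speech recognition module, not the system's
# actual implementation; "lecture.mp3" is a placeholder file name.
import whisper

def transcribe_with_timestamps(audio_path: str, model_name: str = "base"):
    """Return a list of (start, end, text) segments for the given audio file."""
    model = whisper.load_model(model_name)    # downloads the model on first use
    result = model.transcribe(audio_path)     # language is detected automatically
    return [(seg["start"], seg["end"], seg["text"].strip())
            for seg in result["segments"]]

if __name__ == "__main__":
    for start, end, text in transcribe_with_timestamps("lecture.mp3"):
        print(f"[{start:7.2f} -> {end:7.2f}] {text}")
```

Speaker diarization and audio preprocessing would run before or alongside this step; libraries such as pyannote.audio are commonly used for the diarization part.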
3.3. Machine translation module

The machine translation module is responsible for translating the transcripts generated by the speech recognition module into the desired target languages. Figure 3 depicts the architecture of the machine translation module.

Figure 3: Machine translation module architecture (embedding layer, encoder and decoder with an attention mechanism, domain adaptation and named entity handling, trained on parallel data).

The machine translation module utilizes an encoder-decoder architecture, which has become the de facto standard in neural machine translation. The key components of the machine translation module are as follows:

• The encoder takes the source language transcript as input and converts it into a sequence of vector representations. It employs techniques such as word embeddings and recurrent neural networks to capture the semantic and syntactic information of the input sequence.
• The decoder takes the encoded representation produced by the encoder and generates the target language translation. It uses attention mechanisms to selectively focus on relevant parts of the input sequence during the decoding process, enabling the generation of accurate and fluent translations.
• The machine translation module incorporates techniques to handle out-of-vocabulary words and named entities. This includes subword tokenization, which breaks down rare words into smaller units, and named entity recognition, which identifies and preserves named entities during the translation process.
• To improve translation quality for specific domains, such as educational or entertainment content, the machine translation module can be fine-tuned on domain-specific parallel corpora. This allows the module to learn domain-specific terminology and style, resulting in more accurate and contextually relevant translations.

The machine translation module is trained on large-scale parallel corpora, consisting of sentence pairs in the source and target languages. Advanced training techniques, such as teacher forcing and back-translation, are employed to improve the quality and fluency of the generated translations.
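For concreteness, the sketch below produces translated subtitle text with a pretrained Marian encoder-decoder model from the Hugging Face transformers library. The checkpoint Helsinki-NLP/opus-mt-en-uk (English to Ukrainian) is only an example choice; a production module would substitute whichever model, fine-tuning, and named-entity handling the deployment requires.

```python
# Illustrative MT step using a pretrained Marian encoder-decoder model.
# The checkpoint name is an example choice, not the system's actual model.
from transformers import pipeline

def translate_lines(lines, model_name="Helsinki-NLP/opus-mt-en-uk"):
    """Translate a list of subtitle lines, preserving their order."""
    translator = pipeline("translation", model=model_name)
    outputs = translator(lines)               # subword tokenization handled internally
    return [out["translation_text"] for out in outputs]

if __name__ == "__main__":
    source = ["Welcome to the lecture.", "Today we discuss neural machine translation."]
    for src, tgt in zip(source, translate_lines(source)):
        print(f"{src} -> {tgt}")
```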
3.4. Subtitle segmentation and formatting

The subtitle segmentation and formatting module takes the translated text and performs the necessary segmentation and formatting to generate properly structured subtitle files. Figure 4 illustrates the process of subtitle segmentation and formatting.

Figure 4: Subtitle segmentation and formatting process (translated text → text segmentation → timing synchronization → formatting and styling → SRT/WebVTT subtitle file).

The subtitle segmentation and formatting module incorporates the following key steps:

1. Text segmentation: the translated text is segmented into appropriate subtitle blocks based on factors such as sentence boundaries, dialogue turns, and reading speed. The segmentation ensures that each subtitle block is concise, readable, and synchronized with the audio.
2. Timing synchronization: the segmented subtitle blocks are aligned with the corresponding timestamps in the video. The start and end times of each subtitle block are taken into account, ensuring that the subtitles appear at the appropriate moments and remain synchronized with the audio.
3. Formatting and styling: proper formatting and styling are applied to the subtitle text, following established subtitle standards and guidelines. This includes setting font properties, such as size and color, and applying text formatting, such as italics or bold, to emphasize specific words or phrases.
4. Subtitle file generation: the segmented and formatted subtitle blocks are combined to generate standard subtitle file formats, such as SubRip Text (SRT) or Web Video Text Tracks (WebVTT). These subtitle files can be easily integrated with video players and streaming platforms.

The subtitle segmentation and formatting module ensures that the generated subtitles adhere to industry standards and best practices, enhancing the readability and usability of the subtitles for viewers.
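As an illustration of the final step, the sketch below renders time-aligned, translated lines as a SubRip (SRT) document. The SubtitleLine structure mirrors the one assumed in the pipeline sketch in section 3.1, and the simple length-based line wrapping stands in for the reading-speed and sentence-boundary rules described above.

```python
# Illustrative formatting step: write time-aligned lines to SubRip (SRT).
# The SubtitleLine structure and the naive length-based wrapping are
# simplified assumptions, not the system's actual segmentation rules.
import textwrap
from dataclasses import dataclass
from typing import List

@dataclass
class SubtitleLine:
    start: float   # seconds
    end: float     # seconds
    text: str

def _srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(lines: List[SubtitleLine], max_chars: int = 42) -> str:
    """Render subtitle blocks as an SRT document, wrapping long lines."""
    blocks = []
    for i, line in enumerate(lines, start=1):
        wrapped = "\n".join(textwrap.wrap(line.text, width=max_chars)) or line.text
        blocks.append(f"{i}\n{_srt_timestamp(line.start)} --> "
                      f"{_srt_timestamp(line.end)}\n{wrapped}\n")
    return "\n".join(blocks)

if __name__ == "__main__":
    demo = [SubtitleLine(0.0, 2.5, "Ласкаво просимо до лекції."),
            SubtitleLine(2.5, 6.0, "Сьогодні ми розглянемо автоматичне створення субтитрів.")]
    print(to_srt(demo))
```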
3.5. User interface and interaction design

The proposed subtitle management system includes a user-friendly interface that allows users to seamlessly interact with the system and manage the subtitle generation process. Figure 5 presents a high-level overview of the user interface and interaction design.

Figure 5: User interface and interaction design.

The user interface incorporates the following key features and functionalities:

• Video upload: users can easily upload their video files to the system through a simple and intuitive interface. The system supports various video formats and provides options for selecting the desired target languages for subtitle generation.
• Language selection: the interface allows users to choose the target languages for subtitle generation. Users can select multiple languages simultaneously, enabling the creation of multilingual subtitles for their videos.
• Subtitle preview and editing: the system provides a subtitle preview feature that allows users to review the generated subtitles alongside the video. Users can make necessary edits and adjustments to the subtitles, ensuring their accuracy and synchronization with the video content.
• Subtitle download and integration: once the subtitles are generated and reviewed, users can easily download the subtitle files in standard formats. The interface provides instructions and guides on how to integrate the subtitle files with popular video players and platforms.
• Subtitle management: the system offers a centralized subtitle management feature, allowing users to organize, search, and manage their generated subtitles. Users can view their subtitle history, update existing subtitles, and delete unwanted subtitle files.

The user interface is designed with usability and accessibility in mind, ensuring that users with varying technical backgrounds can easily navigate and utilize the subtitle management system. The interface incorporates responsive design principles, enabling access from different devices, including desktops, laptops, tablets, and smartphones.

4. Conclusion and future work

4.1. Summary of findings

This research presented an AI-powered subtitle management system that automates the generation and management of subtitles for video content. The proposed system leveraged state-of-the-art speech recognition and machine translation technologies to generate accurate and synchronized subtitles in multiple languages.

The system architecture was designed to be modular, scalable, and adaptable to advancements in AI technologies. It consisted of key components such as the speech recognition module, machine translation module, subtitle segmentation and formatting module, and user interface.

The speech recognition module utilized advanced acoustic and language models, along with techniques like speaker diarization and audio preprocessing, to accurately convert spoken audio into textual transcripts. The machine translation module employed an encoder-decoder architecture with attention mechanisms to translate the transcripts into desired target languages while preserving context and meaning.

The subtitle segmentation and formatting module ensured that the translated text was properly segmented, synchronized, and formatted according to subtitle standards and guidelines. The user interface provided a user-friendly and intuitive platform for users to upload videos, select target languages, preview and edit subtitles, and manage their subtitle files.

4.2. Implications for subtitle generation pipelines

The proposed AI-powered subtitle management system has significant implications for the efficiency and scalability of subtitle generation pipelines. By automating the process of speech recognition, translation, and subtitle formatting, the system can greatly reduce the time and effort required for manual subtitle creation.

The modular architecture of the system allows for easy integration with existing video platforms and workflows. It enables content creators, educational institutions, and entertainment providers to generate high-quality subtitles for their video content quickly and cost-effectively.

The system's ability to generate subtitles in multiple languages opens up new opportunities for content localization and global accessibility. It facilitates the dissemination of educational and entertainment content to a wider audience, breaking down language barriers and promoting inclusivity.

4.3. Limitations and directions for further research

While the proposed subtitle management system demonstrates promising results, there are certain limitations and areas for further research:

• Language coverage: the current system focuses on a limited set of languages for subtitle generation. Expanding the language coverage to include more diverse and low-resource languages would enhance the system's applicability and reach.
• Domain adaptation: the performance of the speech recognition and machine translation modules can be further improved by fine-tuning them on domain-specific datasets. Investigating techniques for domain adaptation, such as transfer learning and unsupervised adaptation, would enhance the system's effectiveness in various domains like education, entertainment, and specialized fields.
• Contextual understanding: although the system incorporates techniques to handle named entities and preserve context during translation, there is room for improvement in capturing and conveying subtle nuances, idiomatic expressions, and cultural references. Exploring advanced natural language processing techniques, such as contextual embeddings and knowledge graphs, could enhance the system's ability to generate more contextually accurate and culturally appropriate subtitles.
• User feedback: incorporating user feedback and interaction mechanisms into the system could greatly improve its usability and adaptability. Allowing users to provide feedback on generated subtitles, suggest corrections, and contribute to the system's learning process would lead to continuous improvement in subtitle quality and user satisfaction.
• Multimodal cues: exploring the integration of visual and acoustic cues from the video content, such as scene changes, speaker identification, and emotion recognition, could further enhance the accuracy and synchronization of the generated subtitles.

Future research directions could focus on addressing these limitations and expanding the capabilities of the AI-powered subtitle management system. Collaborations between researchers, language experts, and industry stakeholders would be crucial in driving innovation and advancing the state of the art in automated subtitle generation.

Declaration on Generative AI: During the preparation of this work, the authors used Claude 3 Opus to draft content and generate the literature review. After using this service, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.
References

[1] S. Malakul, I. Park, The effects of using an auto-subtitle system in educational videos to facilitate learning for secondary school students: learning comprehension, cognitive load, and satisfaction, Smart Learning Environments 10 (2023) 4. doi:10.1186/s40561-023-00224-2.
[2] A. Mathur, T. Saxena, R. Krishnamurthi, Generating subtitles automatically using audio extraction and speech recognition, in: Proceedings - 2015 IEEE International Conference on Computational Intelligence and Communication Technology, CICT 2015, Institute of Electrical and Electronics Engineers Inc., 2015, pp. 621–626. doi:10.1109/CICT.2015.46.
[3] V. B. Aswin, M. Javed, P. Parihar, K. Aswanth, C. R. Druval, A. Dagar, C. V. Aravinda, NLP-Driven Ensemble-Based Automatic Subtitle Generation and Semantic Video Summarization Technique, in: N. N. Chiplunkar, T. Fukao (Eds.), Advances in Artificial Intelligence and Data Engineering, volume 1133 of Advances in Intelligent Systems and Computing, Springer Nature Singapore, Singapore, 2021, pp. 3–13. doi:10.1007/978-981-15-3514-7_1.
[4] J. Du, J. Lu, A Comparative Study on the Translation Quality between Human and Machine-Generated Subtitles, in: 2024 6th International Conference on Natural Language Processing, ICNLP 2024, Institute of Electrical and Electronics Engineers Inc., 2024, pp. 62–66. doi:10.1109/ICNLP60986.2024.10692675.
[5] N. Radha, R. Pradeep, Automated subtitle generation, International Journal of Applied Engineering Research 10 (2015) 24741–24746.
[6] V. Mukovoz, T. Vakaliuk, S. Semerikov, Road Sign Recognition Using Convolutional Neural Networks, in: E. Faure, Y. Tryus, T. Vartiainen, O. Danchenko, M. Bondarenko, C. Bazilo, G. Zaspa (Eds.), Information Technology for Education, Science, and Technics, volume 222 of Lecture Notes on Data Engineering and Communications Technologies, Springer Nature Switzerland, Cham, 2024, pp. 172–188. doi:10.1007/978-3-031-71804-5_12.
[7] A. Ramani, A. Rao, V. Vidya, V. R. B. Prasad, Automatic Subtitle Generation for Videos, in: 2020 6th International Conference on Advanced Computing and Communication Systems, ICACCS 2020, Institute of Electrical and Electronics Engineers Inc., 2020, pp. 132–135. doi:10.1109/ICACCS48705.2020.9074180.
[8] S. Kiran, U. Patil, P. S. Shankar, P. Ghuli, Subtitle Generation and Video Scene Indexing using Recurrent Neural Networks, in: Proceedings of the 3rd International Conference on Inventive Research in Computing Applications, ICIRCA 2021, Institute of Electrical and Electronics Engineers Inc., 2021, pp. 847–854. doi:10.1109/ICIRCA51532.2021.9544837.
[9] X. Che, S. Luo, H. Yang, C. Meinel, Automatic Lecture Subtitle Generation and How It Helps, in: R. Huang, R. Vasiu, Kinshuk, D. G. Sampson, N.-S. Chen, M. Chang (Eds.), Proceedings - IEEE 17th International Conference on Advanced Learning Technologies, ICALT 2017, Institute of Electrical and Electronics Engineers Inc., 2017, pp. 34–38. doi:10.1109/ICALT.2017.11.
[10] R. Sridhar, S. Aravind, H. Muneerulhudhakalvathi, M. Sibi Senthur, A hybrid approach for Discourse Segment Detection in the automatic subtitle generation of computer science lecture videos, in: D. E. Comer, P. Mueller, B. Mallick, S. Mukherjea, S. M. Thampi, D. Krishnaswamy, A. Sikora (Eds.), Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2014, Institute of Electrical and Electronics Engineers Inc., 2014, pp. 284–287. doi:10.1109/ICACCI.2014.6968422.
[11] A. Karakanta, F. Buet, M. Cettolo, F. Yvon, Evaluating Subtitle Segmentation for End-to-end Generation Systems, in: N. Calzolari, F. Bechet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.), 2022 Language Resources and Evaluation Conference, LREC 2022, European Language Resources Association (ELRA), 2022, pp. 3069–3078. URL: https://aclanthology.org/2022.lrec-1.328/.
[12] J. R. Calvo-Ferrer, Can you tell the difference? A study of human vs machine-translated subtitles, Perspectives: Studies in Translation Theory and Practice 32 (2024) 1115–1132. doi:10.1080/0907676X.2023.2268149.
[13] D. D. Shanmugam, S. F. Syed, S. Dinesh, S. Chitrakala, VAR: An Efficient Silent Video to Speech System with Subtitle Generation using Visual Audio Recall, in: Proceedings of the 5th International Conference on Inventive Research in Computing Applications, ICIRCA 2023, Institute of Electrical and Electronics Engineers Inc., 2023, pp. 814–821. doi:10.1109/ICIRCA57980.2023.10220944.
[14] M. S. Martín, J. Heras, G. Mata, Automatic Generation of Subtitles for Videos of the Government of La Rioja, in: B. Dorronsoro, F. Chicano, G. Danoy, E.-G. Talbi (Eds.), Optimization and Learning, volume 1824 of Communications in Computer and Information Science, Springer Nature Switzerland, Cham, 2023, pp. 393–402. doi:10.1007/978-3-031-34020-8_30.
[15] J. D. Valor Miró, J. A. Silvestre-Cerdà, J. Civera, C. Turró, A. Juan, Efficient Generation of High-Quality Multilingual Subtitles for Video Lecture Repositories, in: G. Conole, T. Klobučar, C. Rensing, J. Konert, E. Lavoué (Eds.), Design for Teaching and Learning in a Networked World, volume 9307, Springer International Publishing, Cham, 2015, pp. 485–490. doi:10.1007/978-3-319-24258-3_44.
[16] M. Hotta, C. S. Leow, N. Kitaoka, H. Nishizaki, Evaluation of Speech Translation Subtitles Generated by ASR with Unnecessary Word Detection, in: GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, Institute of Electrical and Electronics Engineers Inc., 2024, pp. 815–819. doi:10.1109/GCCE62371.2024.10760522.
[17] A. Kaulage, A. Walunj, A. Bhandari, A. Dighe, A. Sagri, Edu-lingo: A Unified NLP Video System with Comprehensive Multilingual Subtitles, in: 2nd IEEE International Conference on Data Science and Information System, ICDSIS 2024, Institute of Electrical and Electronics Engineers Inc., 2024. doi:10.1109/ICDSIS61070.2024.10594128.
[18] I. Al Sawi, R. Allam, Exploring challenges in audiovisual translation: A comparative analysis of human- and AI-generated Arabic subtitles in Birdman, PLoS ONE 19 (2024). doi:10.1371/journal.pone.0311020.
[19] S. Kuroiwa, C. Oshima, T. Koita, Exploring a Hybrid System Combining AI and Human Intervention for Subtitle Creation in Entertainment Content, in: N. C. Callaos, E. Gaile-Sarkane, N. Lace, B. Sanchez, M. Savoie (Eds.), Proceedings of World Multi-Conference on Systemics, Cybernetics and Informatics, WMSCI, volume 2024-September, International Institute of Informatics and Cybernetics, 2024, pp. 72–73. doi:10.54808/WMSCI2024.01.72.
[20] X. Qiu, Study on Automatic Generation of Teaching Video Subtitles Based on Cloud Computing, Smart Innovation, Systems and Technologies 156 (2020) 309–314. doi:10.1007/978-981-13-9714-1_34.