=Paper= {{Paper |id=Vol-3777/short4 |storemode=property |title=Artificial Intelligence-Driven Text-to-Tactile Graphics Generation for Visual Impaired People |pdfUrl=https://ceur-ws.org/Vol-3777/short4.pdf |volume=Vol-3777 |authors=Yehor Dzhurynskyi,Volodymyr Mayik,Lyudmyla Mayik |dblpUrl=https://dblp.org/rec/conf/profitai/DzhurynskyiMM24 }} ==Artificial Intelligence-Driven Text-to-Tactile Graphics Generation for Visual Impaired People== https://ceur-ws.org/Vol-3777/short4.pdf
                                Artificial Intelligence-Driven Text-to-Tactile Graphics
                                Generation for Visual Impaired People
                                Yehor Dzhurynskyi¹, Volodymyr Mayik² and Lyudmyla Mayik²
                                ¹ Ukrainian Academy of Printing, 19, Pid Holoskom Str., Lviv, 79020, Ukraine
                                ² Lviv Polytechnic National University, 28a, Stepan Bandera Str., Lviv, 79013, Ukraine

                                                                    Abstract
                                                                    This research presents the development of a text-conditional tactile graphics generation model using the
                                                                    Bidirectional and Auto-Regressive Transformer (BART) and Vector Quantized Variational Auto-Encoder
                                                                    (VQ-VAE). The model leverages a modified organization of the latent space, divided into two independent
                                                                    components: textual and graphic. The study addresses the challenge of the limited availability of tactile
                                                                    graphics samples by expanding the training dataset with custom samples, enhancing the model's capability
                                                                    to convert textual information into graphical representations. The proposed method improves the creation
                                                                    of tactile graphics for visually impaired individuals, offering increased variability, controllability, and
                                                                    quality in synthesized tactile graphics. This advancement enhances both the technical and economic aspects
                                                                    of the production process for inclusive educational materials.

                                                                    Keywords
                                                                    Artificial intelligence, tactile graphics, visual impairment, natural language processing, model, machine
                                                                    learning


                                1. Introduction
                                The dynamics of modern inclusive society development emphasize the need to integrate people with
                                visual impairments into active social life. The problem of socializing individuals with visual
                                impairments involves various aspects that complicate their education, training, and full participation
                                in society [1]. Specifically, people with visual impairments have limited access to information, as
                                many materials are produced only in the usual printed or digital formats. This issue is further
                                exacerbated by the increasing prevalence of information in graphic form, designed for more effective
                                perception by readers. The aforementioned problems hinder the ability of individuals with visual
                                impairments to receive quality education and professional development [2, 3, 4, 5, 7].
                                   An analysis [6] of the activities of publishing and printing industry enterprises that produce
                                educational and methodological literature (textbooks, manuals, etc.) for people with visual
                                impairments revealed problems related to the creation or adaptation of images and illustrative
                                materials, which are particularly crucial for this type of publication. When creating or adapting
                                graphic materials, enterprises encounter the following issues: an insufficient number of trained
                                specialists with specific competencies related to the technical implementation of tactile graphics;
                                additional time and financial costs for training specialists; and the high labor intensity and cost of
                                the process of creating or adapting tactile graphics. Consequently, the production issues surrounding
                                tactile graphics remain one of the primary factors contributing to the low level of access to graphical
                                information for people with visual impairments.




                                ProfIT AI 2024: 4th International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2024), September 25–27,
                                2024, Cambridge, MA, USA
                                   y.a.dzhurynskyi@gmail.com (Y. Dzhurynskyi); vol.mayik.2015@gmail.com (V. Mayik); ludmyla.maik@gmail.com (L.
                                Mayik)
                                   0000-0002-6650-2703 (V. Mayik); 0000-0001-8552-0942 (L. Mayik)
                                                               © 2024 Copyright for this paper by its authors.
                                                               Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Related Work
Scientists are working to solve the problem of producing tactile graphics by developing models for
the automatic generation of tactile images [8, 9, 10, 11, 12, 13]. The task of most existing models is to
transform the content of a photo image into a tactile one.
    Models [8, 9, 10, 11] that attempt to directly convert the content of an image into a tactile format
usually utilize computer vision and have the following disadvantages: they violate the requirements
for tactile graphics [14, 15]; they display redundant elements of the image that are difficult to read
and interfere with the overall interpretation of the graphic material.
    Models [12, 13] detect and recognize individual image elements and replace them with tactile
representations drawn from a limited sample. As a result, the synthesized images lack variability:
new samples cannot be synthesized, which makes the output less attractive to people with visual
impairments. Despite this drawback, the method conveys the content of the original photo well and
complies with the requirements for tactile graphics.
    Additionally, such methods require supplementary source graphic information (e.g.,
photographs), the search for or creation of which slows down the process of preparing material for
the production of tactile images.
    The development of information technologies, particularly in the field of deep machine learning,
has opened new opportunities for addressing the aforementioned problems. Recently, significant
advancements have been demonstrated by information technologies based on artificial intelligence
[16, 17, 18], which enable the generation of images based on user text prompts. However, according
to the analysis [19, 20], confirmed by a series of experiments, the information technologies built upon
these mathematical models have proven ineffective for creating tactile graphics. Despite this, the
concept of text-guided image generation was chosen as the foundation for this work.

3. Text-conditional tactile graphics generation model
The text-conditional tactile graphics generation model is built upon the Bidirectional and
Auto-Regressive Transformer (BART) [21] and the Vector Quantized Variational Auto-Encoder (VQ-VAE)
[22]. It models the process of converting text information into graphic information. To this end,
the embedding space of the transformer, which was formed during language modeling on the
pretraining task, was divided into two independent embedding spaces, text and graphics, instead of
a single shared one. The graphic embedding space was parameterized so that its number of entries
equals the size of the "codebook" [22] and the dimensionality of its vectors equals the
dimensionality of the latent space vectors of the variational image synthesis model. The parameters
of the text embedding space remained the same as during language modeling.
    Before text tokens are obtained with the BPE [22, 23] tokenization model, the original text
is normalized to a uniform format: uppercase letters are converted to lowercase letters.
    Formally, the process of converting text tokens into graphic tokens using a text-conditional tactile
graphics generation model is described in successive stages.
    The first step is to generate a bounded sequence of text tokens based on a text prompt:
$\bar{t} = \{t_i \in V\}_{i=1}^{\mathit{Seq}_{\max,t}}$, where $\bar{t}$ is a sequence of text
tokens of length $\mathit{Seq}_{\max,t} = 64$ and $V$ is the dictionary of tokens. If the generated
sequence of text tokens is longer than $\mathit{Seq}_{\max,t}$, it is truncated to the maximum
length and the excess tokens are discarded. If it is shorter than $\mathit{Seq}_{\max,t}$, it is
extended to the maximum length by adding utility tokens $\langle PAD \rangle$ that do not affect
the modeling result.
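    The normalization, truncation, and padding described above can be sketched as follows. This is
a minimal illustration rather than the authors' implementation: the whitespace split is only a
stand-in for the BPE tokenizer, and the `<PAD>` spelling, the right-side padding, and all helper
names are assumptions.

```python
SEQ_MAX_T = 64   # maximum text sequence length stated in the paper
PAD = "<PAD>"    # assumed spelling of the padding token


def normalize(prompt: str) -> str:
    """Bring the prompt to a uniform format by lowercasing it."""
    return prompt.lower()


def toy_tokenize(text: str) -> list[str]:
    """Whitespace split, used here only as a stand-in for the BPE tokenizer."""
    return text.split()


def fit_to_length(tokens: list[str], length: int = SEQ_MAX_T) -> list[str]:
    """Discard excess tokens, or right-pad with PAD, so that len(result) == length."""
    clipped = tokens[:length]
    return clipped + [PAD] * (length - len(clipped))


tokens = fit_to_length(toy_tokenize(normalize("A Spotted Cow")))
print(tokens[:4], len(tokens))  # ['a', 'spotted', 'cow', '<PAD>'] 64
```

Padding on the right keeps the meaningful tokens at fixed leading positions; a real pipeline would
typically also pass an attention mask so that the padding positions are ignored.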
   In the next step, the text tokens that form the sequence $\bar{t}$ are mapped to the vectors
$e_k^t$ of the text embedding space, forming a subset of it:

      $$\bar{e}^t = \{e_k^t \in E^t \mid k = t_i \in \bar{t}\}_{i=1}^{\mathit{Seq}_{\max,t}}; \quad \bar{e}^t \subseteq E^t, \tag{1}$$

   where $\bar{t}$ is the sequence of text tokens; $E^t$ is the text embedding space; $e_k^t$ are
elements of the text embedding space. The elements of $\bar{e}^t$ reflect the semantic meaning of
the text tokens in the embedding space.
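   The mapping in (1) amounts to a table lookup: every token identifier selects one row of the
text embedding matrix. The sketch below uses toy sizes and random values in place of learned
embeddings; the token identifiers and the names `E_t` and `embed` are hypothetical.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 10, 4  # toy sizes; Table 1 lists 8192 and 512

# E_t holds one vector per token id (random stand-in for learned embeddings).
E_t = [[random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Select the vector e_k of E_t for every token id k in the sequence."""
    return [E_t[k] for k in token_ids]


t_bar = [3, 7, 7, 0]   # hypothetical token ids
e_bar = embed(t_bar)
print(len(e_bar), len(e_bar[0]))  # 4 4
```

Repeated tokens map to the same embedding vector, so the resulting sequence is a subset of the
embedding space in the sense of (1).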
   Next, the vectors of the text embedding space $E^t$ are transformed by the transformer's
bidirectional encoder, which consists of several layers, into hidden states $\bar{h}^t$. The
bidirectionality of the encoder means that it analyzes the full context of an individual vector of
the embedding space, considering both the previous and the following elements of the sequence:

      $$\bar{h}^t = \mathit{Encode}(\bar{e}^t); \quad \bar{h}^t \subseteq E^t, \tag{2}$$

   where $\bar{h}^t$ is the hidden state of the encoder; $\mathit{Encode}(\cdot)$ is the
transformer's encoding operation defined in [21].
   The hidden state of the encoder $\bar{h}^t$ is then converted by linear layers and a nonlinear
activation function into the hidden state of the decoder (i.e., graphic information), forming a
subset of the graphic embedding space $E^g$:

      $$\bar{h}^g = \mathit{Linear}_2 \circ \mathit{ReLU} \circ \mathit{Linear}_1(\bar{h}^t), \tag{3}$$

   where $\bar{h}^t \subseteq E^t$ is the hidden state of the encoder; $\bar{h}^g \subseteq E^g$ is
the hidden state of the decoder; $\mathit{Linear}_i$ is a linear layer;
$\mathit{ReLU}(x) \overset{\text{def}}{=} \max(0, x)$ is a nonlinear activation function.
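   The composition in (3) can be illustrated in a few lines of plain Python. The weights below
are hand-picked toy values, not trained parameters, and the function names are assumptions made
for this sketch.

```python
def linear(x, W, b):
    """Affine map of one vector: y_j = sum_i x_i * W[i][j] + b_j."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in x]


def bridge(h_t, W1, b1, W2, b2):
    """Apply Linear_2(ReLU(Linear_1(h))) to every vector of the encoder state."""
    return [linear(relu(linear(h, W1, b1)), W2, b2) for h in h_t]


# Toy weights mapping 2 -> 3 -> 2 dimensions.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

h_g = bridge([[1.0, 1.0]], W1, b1, W2, b2)
print(h_g)  # [[1.5, 0.5]]
```

In this toy run the third hidden unit is negative before the activation and is therefore zeroed
by ReLU, which is what makes the mapping nonlinear.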
   At the next stage, an autoregressive [25, 26] transformer decoder is used. This means that the
decoder generates one graphics token per iteration, considering the context of the previously
generated graphics tokens. Thus, during the decoding process, the model performs calculations based
on the hidden state $\bar{h}^t$ and the previously generated elements of the vector sequence of the
graphic embedding space $e_k^g$, or $\bar{h}^g$:

      $$e_i^g = \mathit{Decode}(\bar{h}^t, e_1^g, e_2^g, \ldots, e_{i-1}^g); \quad i \le d_z, \tag{4}$$

   where $e_i^g$ is the $i$-th element of the vector sequence of the graphic embedding space $E^g$;
$e_j^g$, $j < i$, are previously generated vectors of the graphic embedding space; $\bar{h}^g$ is
the hidden state of the decoder obtained from $\bar{h}^t$ by (3); $d_z$ is the size of the final
sequence $\bar{e}^g \subseteq E^g$; $\mathit{Decode}(\cdot)$ is the transformer's decoding
operation defined in [21].
   Decoding proceeds iteratively until the size of the sequence $\bar{e}^g$ is equal to $d_z$
(i.e., the size of the latent space vector of the VQ-VAE model).
   Once the decoding is complete, the resulting sequence of vectors of the graphic embedding space
$\bar{e}^g$ is converted by a linear layer and the $\mathit{Softmax}$ function into a sequence of
probability distributions. From each distribution the element with the highest probability is
selected, which determines the graphics token:

      $$\bar{g} = \{g_i\}_{i=1}^{d_z}; \quad g_i = \arg\max(\mathit{Softmax} \circ \mathit{Linear}(e_i^g)), \tag{5}$$

   where $\bar{g}$ is the generated sequence of graphic tokens of size $d_z$; $e_i^g \in \bar{e}^g$
is an element of the vector sequence of the graphic embedding space $E^g$.
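   The autoregressive loop and the selection rule (5) can be illustrated with a small greedy
decoder. The sketch below is a toy under stated assumptions: `toy_step` replaces the linear layer
applied to the transformer decoder output, tokens are indexed from 0 for convenience, the token
is selected inside the loop rather than after it, and the SOS identifier is chosen as K.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def greedy_decode(step_logits, d_z, sos):
    """Generate d_z graphic tokens, one per iteration, each conditioned on the prefix.

    step_logits(prefix) returns logits over the graphic tokens and stands in
    for Linear(Decode(...)).
    """
    prefix = [sos]
    for _ in range(d_z):
        probs = softmax(step_logits(prefix))
        g_i = max(range(len(probs)), key=probs.__getitem__)  # arg max
        prefix.append(g_i)
    return prefix[1:]  # drop the SOS token


K = 5  # toy number of graphic tokens


def toy_step(prefix):
    """Hypothetical scorer: favours the token that follows the previous one, modulo K."""
    target = (prefix[-1] + 1) % K
    return [2.0 if k == target else 0.0 for k in range(K)]


g_bar = greedy_decode(toy_step, d_z=8, sos=K)
print(g_bar)  # [1, 2, 3, 4, 0, 1, 2, 3]
```

Taking the arg max at every step is the simplest decoding strategy; sampling from the
distribution instead would trade determinism for more varied outputs.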
    In the next step, the graphic tokens (5) are used to form a sequence of latent quantized
vectors $z_q$, which is defined by formula (6). Each graphic token $g_i$, $1 \le g_i \le K$,
$i = 1, \ldots, d_z$, is the positional number of a quantized vector in the "codebook" of the
VQ-VAE model:

      $$z_q = \{e_k \in Z \mid k = g_i \in \bar{g}\}_{i=1}^{d_z}; \quad z_q \subseteq Z, \tag{6}$$

   where $Z$ is the set of latent quantized vectors, or "codebook"; $z_q \subseteq Z$ is a sequence
of latent quantized vectors; $g_i \in \bar{g}$ is a graphic token; $d_z$ is the size of the
sequence of latent quantized vectors.
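   Formula (6) is an indexing operation into the codebook. The following sketch uses a dummy
codebook whose entry k is filled with the value k, so that the lookup is easy to verify; the
vector dimensionality of 4 and the function name are assumptions, and K is taken to be the
codebook size of 512 from Table 2.

```python
K = 512          # codebook size (Table 2)
LATENT_DIM = 4   # toy dimensionality of one codebook vector (assumption)

# Dummy codebook Z: the entry with positional number k is filled with the value k.
Z = [[float(k)] * LATENT_DIM for k in range(1, K + 1)]


def quantized_sequence(g_bar):
    """Return z_q, the codebook vector for every 1-based graphic token g_i."""
    for g in g_bar:
        if not 1 <= g <= K:
            raise ValueError(f"graphic token {g} outside 1..{K}")
    return [Z[g - 1] for g in g_bar]  # shift to 0-based list indexing


z_q = quantized_sequence([1, 512, 7])
print([v[0] for v in z_q])  # [1.0, 512.0, 7.0]
```

The explicit range check mirrors the constraint on the graphic tokens and catches the
off-by-one errors that easily arise when 1-based positions meet 0-based arrays.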
   The final step is the synthesis of tactile graphics from the sequence (6) using the decoder of
the variational image synthesis model:

      $$Y = \mathit{ImDecode}(z_q), \tag{7}$$

   where $z_q$ is the sequence of latent quantized vectors; $\mathit{ImDecode}(\cdot)$ is an image
decoding operation based on the latent representation defined in [22]; $Y$ is the generated
tactile image.
The diagram of the text-conditional tactile graphics generation model is shown in Figure 1.




Figure 1: Structural and functional diagram of the text-conditional tactile graphics generation
model

4. Experiment
In this experiment, the proposed model was trained using the parameters presented in Tables 1 and
2 for the BART and VQ-VAE models, respectively. Note that the size of the decoder's dictionary and
the length of its sequence are each increased by one compared to the original values. This
adjustment is necessary to introduce an additional image service token (i.e., the SOS token),
which is added at the beginning of the sequence to facilitate autoregressive image generation.
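
The effect of the SOS token on the decoder sizes can be checked with a short sketch. The codebook
size of 512 is taken from Table 2; the value of 64 graphic tokens per image is inferred here as
65 - 1 from Table 1, and giving SOS the first identifier past the codebook (with 0-based token
ids) is an assumption.

```python
CODEBOOK_SIZE = 512   # Table 2
D_Z = 64              # graphic tokens per image, inferred as 65 - 1 (assumption)

SOS = CODEBOOK_SIZE                  # assumed id: first value past the 0-based codebook ids
DECODER_VOCAB = CODEBOOK_SIZE + 1    # 513, the decoder dictionary size in Table 1
DECODER_SEQ = D_Z + 1                # 65, the decoder sequence size in Table 1


def with_sos(graphic_tokens):
    """Prepend SOS so that the decoder can predict token i from the tokens before it."""
    if len(graphic_tokens) != D_Z:
        raise ValueError(f"expected {D_Z} graphic tokens, got {len(graphic_tokens)}")
    return [SOS] + list(graphic_tokens)


seq = with_sos([0] * D_Z)
print(DECODER_VOCAB, len(seq), seq[0])  # 513 65 512
```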

Table 1
BART's parameters
 Parameter                                            Encoder's value       Decoder's value
 Dictionary size                                      8192                  513
 Sequence size                                        64                    65
 Number of layers                                     3                     3
 Layer dimension                                      512                   512
 FFN dimension                                        1024                  1024
 Number of attention heads                            8                     8
    Language modeling was performed on the BrUK corpus [27], which consists of Ukrainian texts
from different sources. Unlike textual datasets (i.e., corpora), which are widely accessible,
tactile graphics samples are much less common. A significant obstacle to modeling tactile graphics
generation with machine learning is the insufficient number of publicly available samples, since
the production of tactile graphics is far less widespread than the production of conventional
graphics.
    Nevertheless, a collection of plant and animal images stored in the APH Tactile Graphics Library
[28] was chosen as the original set of images for the model to learn to reproduce. Additionally, the
training dataset was expanded with 41 custom tactile image samples, increasing the total number of
samples to 179. The custom samples are simple images of animals that were used at an institution in
Ukraine that provides preschool education for children with visual impairments.

Table 2
VQ-VAE's parameters
 Parameter                                         Value
 Image dimension                                   256 × 256 × 1
 “Codebook” size                                   512
 Latent vectors size                               16
 Number of hidden layers                           5
 Hidden layers dimension                           16

   The results of the experiment include samples of generated tactile graphics images based on
various types of text prompts, such as monosyllabic prompts, prompts with numerals, and prompts
with epithets. These samples are presented in Figure 2.




Figure 2: Samples of generated images, each determined by the text prompt given below the
corresponding image. Prompts, row by row: “a daisy”, “a cow”, “Top view of butterfly”, “a tree”;
“three daisies”, “a spotted cow”, “a butterfly (side view)”, “a naked tree”; “a leaf”, “a dog”,
“a turkey”, “a deer (side view)”.
   The model's performance was evaluated separately for each component: BART and VQ-VAE. The
results of this evaluation are presented in Table 3. The Cross-Entropy metric reflects how well the
model converts text prompts into appropriate graphic tokens, and Perplexity represents the
uncertainty in the model's predictions. Lower values indicate better performance, meaning the model
is more confident in its generation process. For tactile graphics, FID measures how similar the
generated tactile images are to real ones in the latent space of the model. A lower FID score indicates
that the generated tactile graphics are closer to real tactile images in terms of visual and tactile
features.
   Additionally, the overall performance of the model was evaluated using the CLIP Score metric
[29], which reflects the model's capability in converting textual information into graphical
information. The average CLIP Score of the developed model is 23.7.

Table 3
Evaluation results
 Component                            Metric                                   Value
 BART                                 Cross-Entropy                            5.709
                                      Perplexity                               301.662
 VQ-VAE                               MSE (image space)                        0.0144
                                      MSE (latent space)                       0.0058
                                      FID                                      0.242
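
The two BART metrics in Table 3 are linked: perplexity is the exponential of the cross-entropy.
The sketch below checks the reported values against this relation, assuming the cross-entropy is
measured in nats (natural logarithm); the small residual is consistent with rounding of the
reported cross-entropy.

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity as the exponential of a cross-entropy given in nats."""
    return math.exp(cross_entropy_nats)


reported_ce = 5.709       # Table 3
reported_ppl = 301.662    # Table 3

ppl = perplexity(reported_ce)
print(round(ppl, 1))                          # 301.6
print(abs(ppl - reported_ppl) < 0.5)          # True
```

A perplexity near 300 over a decoder dictionary of 513 entries means that, on average, the model
is about as uncertain as a uniform choice among roughly 300 graphic tokens.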

5. Limitations
The current dataset used for training includes relatively simple images (e.g., animals, plants, basic
objects). One limitation of the model is its potential difficulty in scaling to more complex images,
such as those with intricate details (e.g., architectural blueprints, detailed scientific diagrams). The
model's ability to capture fine details may be limited by the size of the latent space and the number
of hidden layers used in the VQ-VAE model. Complex tactile graphics might require a more
fine-grained representation, which could lead to inefficiencies or inaccuracies in generation if the
model architecture remains unchanged.
   In addition, while the model performs well on simpler prompts (e.g., "a cow", "a tree"), more
complex and nuanced prompts (e.g., "a group of children playing soccer with a spotted ball") might
pose challenges. This is because the Transformer's encoding of textual information becomes more
demanding as the semantic richness and length of the prompt increase. The model may struggle to
disentangle and appropriately represent all components of a complex scene in tactile graphics form,
leading to loss of information or oversimplification.
   Regarding computational requirements, the training process of the proposed model, which
integrates both the BART Transformer and the VQ-VAE, requires significant computational
resources. Due to the autoregressive nature of the model and the need to process both textual and
graphical latent spaces, training is computationally expensive. It requires powerful GPUs or TPUs,
large memory capacity, and extended training time, particularly as the dataset grows. This makes
scaling to larger datasets or higher-dimensional image outputs challenging without access to
advanced computing infrastructure.
   One of the key ethical concerns in the development of tactile graphics is ensuring that the
generated images do not misrepresent the information. For visually impaired users, the tactile
graphic is a primary means of understanding visual content, and any distortion or inaccuracy could
lead to misunderstandings. For example, if a generated tactile graphic oversimplifies or omits
important details, users might receive an incomplete or misleading representation of the intended
information. To mitigate this risk, it's important to validate the model outputs rigorously against
established standards for tactile graphics and seek feedback from visually impaired users to ensure
that the tactile representations are both accurate and understandable.
6. Conclusion
As a result of this research, a text-conditional tactile graphics generation model was developed using
BART and VQ-VAE. The model employs a modified organization of the latent space, divided into
two independent components: textual and graphic.
   The method of creating tactile graphics for publications aimed at individuals with visual
impairments has been improved. This enhancement increases the variability, controllability, and
quality of synthesized tactile graphics, thereby improving the technical and economic aspects of the
production process.
   This technology can bridge the gap in access to educational materials, allowing visually impaired
individuals to better engage with subjects that rely heavily on visual content, such as science,
mathematics, and geography. The availability of automated tactile graphics can facilitate greater
independence in learning and enhance participation in inclusive classrooms and professional
environments.
   An important direction for further research is to increase the size and diversity of the training
dataset in order to improve the model's ability to generalize and to ensure its stable operation in
various scenarios.

References
[1] GBD 2019 Blindness and Vision Impairment Collaborators; Vision Loss Expert Group of the
     Global Burden of Disease Study. Trends in prevalence of blindness and distance and near vision
     impairment over 30 years: an analysis for the Global Burden of Disease Study, Lancet Glob
     Health (2021) e130-e143. doi: 10.1016/S2214-109X(20)30425-3.
[2] P. Ackland, Serge Resnikoff, R. Bourne, World blindness and visual impairment: Despite many
     successes, the problem is growing, Community Eye Health Journal (2018) 71–73. PMID:
     29483748.
 [3] K. Zebehazy and A. Wilton, "Graphic Reading Performance of Students with Visual Impairments
     and Its Implication for Instruction and Assessment," Journal of Visual Impairment & Blindness,
     vol. 115, pp. 215-227, 2021.
[4] M. Mukhiddinov and K. Soon-Young, "A Systematic Literature Review on the Automatic
     Creation of Tactile Graphics for the Blind and Visually Impaired," 2021.
[5] F. Bara, "The Effect of Tactile Illustrations on Comprehension of Storybooks by Three Children
     with Visual Impairments: An Exploratory Study," Journal of Visual Impairment & Blindness,
     vol. 112, pp. 759-765, 2018.
[6] V. Mayik, T. Dudok, L. Mayik, N. Lotoshynska, I. Izonin, J. Kusmierczyk, An Approach Towards
     Vacuum Forming Process Using PostScript for Making Braille, in: Advances in Computer
     Science for Engineering and Manufacturing, Springer International Publishing, 2022, pp. 38–48.
     doi: 10.1007/978-3-031-03877-8_4.
[7] Y. Dzhurynskyi and V. Mayik, "Analysis of the process of preparing illustrations for inclusive
     literature," Qualilogy of the book, vol. 41, pp. 7-15, 2022.
[8] T. Way and K. Barner, "Towards Automatic Generation of Tactile Graphics," Rehabilitation
     Engineering and Assistive Technology Society of North America, pp. 161-163, 1996.
[9] T. Way and K. Barner, "Automatic visual to tactile translation - Part I: Human factors, access
     methods, and image manipulation," IEEE Transactions on Rehabilitation Engineering, pp. 81-94,
     1997.
[10] T. Way and K. Barner, "Automatic visual to tactile translation. II. Evaluation of the TACTile
     image creation system," IEEE Transactions on Rehabilitation Engineering, pp. 95-105, 1997.
[11] T. Ferro and D. Pawluk, "Automatic image conversion to tactile graphic," in Proceedings of the
     15th International ACM SIGACCESS Conference on Computers and Accessibility, Bellevue
     Washington, 2013.
[12] K. Pakėnaitė, P. Nedelev, E. Kamperou, M. Proulx and P. Hall, "Communicating Photograph
     Content Through Tactile Images to People With Visual Impairments," Frontiers in Computer
     Science, vol. 3, 2022.
[13] K. Pakenaite, E. Kamperou, M. J. Proulx, A. Sharma and P. Hall, "Pic2Tac: Creating Accessible
     Tactile Images using Semantic Information from Photographs," in Proceedings of the Eighteenth
     International Conference on Tangible, Embedded, and Embodied Interaction, Cork, 2024.
[14] Polish association of the blind, "Instructions for creating and adapting illustrations and
     typhlographic materials for blind students," 2016.
[15] Braille Authority of North America & Canadian Braille Authority, "Guidelines and Standards
     for Tactile Graphics," 2022. [Online]. Available:
     https://www.brailleauthority.org/guidelines-and-standards-tactile-graphics. [Accessed 20 April 2024].
[16] J. Oppenlaender, "The Creativity of Text-to-Image Generation," in Academic Mindtrek '22:
     Proceedings of the 25th International Academic Mindtrek Conference, New York, 2022.
[17] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu and M. Chen, "Hierarchical Text-Conditional Image
     Generation with CLIP Latents," ArXiv, vol. abs/2204.06125, 2022.
[18] R. Rombach, A. Blattmann, D. Lorenz, P. Esser and B. Ommer, "High-Resolution Image Synthesis
     with Latent Diffusion Models," in 2022 IEEE/CVF Conference on Computer Vision and Pattern
     Recognition (CVPR), New Orleans, Louisiana, 2022.
[19] Y. Dzhurynskyi and V. Mayik, "Preparation of illustrations for inclusive literature using artificial
     intelligence models of image synthesis from text," Proceedings, vol. 66, no. 1, pp. 155-163, 2023.
[20] Y. Dzhurynskyi, "Generation of illustrations for inclusive literature using Midjourney artificial
     intelligence model," in «Scientific method: reality and future trends of researching»: collection
     of scientific papers «SCIENTIA» with Proceedings of the II International Scientific and
     Theoretical Conference, Zagreb, 2023.
[21] Y. Yingchen, Z. Fangneng, W. Rongliang, P. Jianxiong, C. Kaiwen , L. Shijian, M. Feiying, X.
     Xuansong and M. Chunyan, "Diverse Image Inpainting with Bidirectional and Autoregressive
     Transformers," arXiv, vol. abs/2104.12335, 2021.
[22] A. van den Oord, O. Vinyals and K. Kavukcuoglu, "Neural Discrete Representation Learning,"
     CoRR, vol. abs/1711.00937, 2017.
[23] Kulchytska, Kh., Semeniv, M., Kovalskyi, B., Pysanchyn, N., Selmenska, Z.: Influence of
     Hadamard matrices canonicity on image processing. In: Hu, Z., Petoukhov, S., Yanovsky, F., He,
     M. (eds.) ISEM ’21, LNCS, vol. 463, pp. 329–338. Springer, Cham (2022)
     doi:10.1007/978-3-031-03877-8_29
[24] V. Zouhar, C. Meister, J. Luis Gastaldi, L. Du, T. Vieira, M. Sachan and R. Cotterell, "A Formal
     Perspective on Byte-Pair Encoding," in Findings of the Association for Computational
     Linguistics: ACL 2023, Toronto, 2023.
[25] M. Dalal, A. C. Li and R. Taori, "Autoregressive Models: What Are They Good For?," CoRR, vol.
     abs/1910.07737, 2019.
[26] A. Graves, "Generating Sequences With Recurrent Neural Networks," CoRR, vol. abs/1308.0850,
     2013.
[27] A. Rysin, "LanguageTool API NLP UK," 2022. [Online]. Available:
     https://github.com/brown-uk/nlp_uk. [Accessed 21 April 2024].
[28] American Printing House, "Tactile Graphic Image Library," [Online]. Available:
     https://imagelibrary.aph.org/portals/aphb/#page/welcome. [Accessed 21 April 2024].
[29] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras and Y. Choi, "CLIPScore: A Reference-free
     Evaluation Metric for Image Captioning," CoRR, vol. abs/2104.08718, 2021.