<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Artificial Intelligence-Driven Text-to-Tactile Graphics Generation for Visual Impaired People</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yehor</forename><surname>Dzhurynskyi</surname></persName>
							<email>y.a.dzhurynskyi@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Ukrainian Academy of Printing</orgName>
								<address>
									<addrLine>19, Pid Holoskom Str</addrLine>
									<postCode>79020</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Volodymyr</forename><surname>Mayik</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>28a, Stepan Bandera Str</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lyudmyla</forename><surname>Mayik</surname></persName>
							<email>ludmyla.maik@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>28a, Stepan Bandera Str</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Artificial Intelligence-Driven Text-to-Tactile Graphics Generation for Visual Impaired People</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9B8B0E4139998848A27B3B10D0FEF927</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Artificial intelligence</term>
					<term>tactile graphics</term>
					<term>visual impairment</term>
					<term>natural language processing</term>
					<term>model</term>
					<term>machine learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This research presents the development of a text-conditional tactile graphics generation model using the Bidirectional and Auto-Regressive Transformer (BART) and Vector Quantized Variational Auto-Encoder (VQ-VAE). The model leverages a modified organization of the latent space, divided into two independent components: textual and graphic. The study addresses the challenge of the limited availability of tactile graphics samples by expanding the training dataset with custom samples, enhancing the model's capability to convert textual information into graphical representations. The proposed method improves the creation of tactile graphics for visually impaired individuals, offering increased variability, controllability, and quality in synthesized tactile graphics. This advancement enhances both the technical and economic aspects of the production process for inclusive educational materials.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The dynamics of modern inclusive society development emphasize the need to integrate people with visual impairments into active social life. The problem of socializing individuals with visual impairments involves various aspects that complicate their education, training, and full participation in society <ref type="bibr" target="#b0">[1]</ref>. Specifically, people with visual impairments have limited access to information, as many materials are produced only in the usual printed or digital formats. This issue is further exacerbated by the increasing prevalence of information in graphic form, designed for more effective perception by readers. The aforementioned problems hinder the ability of individuals with visual impairments to receive quality education and professional development <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b6">7</ref>].</p><p>An analysis <ref type="bibr" target="#b5">[6]</ref> of the activities of publishing and printing industry enterprises that produce educational and methodological literature (textbooks, manuals, etc.) for people with visual impairments revealed problems related to the creation or adaptation of images and illustrative materials, which are particularly crucial for this type of publication. When creating or adapting graphic materials, enterprises encounter the following issues: an insufficient number of trained specialists with specific competencies related to the technical implementation of tactile graphics; additional time and financial costs for training specialists; and the high labor intensity and cost of the process of creating or adapting tactile graphics. 
Consequently, the production issues surrounding tactile graphics remain one of the primary factors contributing to the low level of access to graphical information for people with visual impairments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Researchers are addressing the problem of producing tactile graphics by developing models for the automatic generation of tactile images <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. Most existing models aim to transform the content of a photographic image into a tactile one.</p><p>Models <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref> that attempt to convert the content of an image directly into a tactile format usually rely on computer vision and have the following disadvantages: they violate the requirements for tactile graphics <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>, and they render redundant image elements that are difficult to read and interfere with the overall interpretation of the graphic material.</p><p>Models <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref> detect and recognize individual image elements and replace them with tactile representations drawn from a limited set; as a result, the synthesized images lack variability (new samples cannot be synthesized, and the attractiveness of the synthesized images for people with visual impairments decreases). 
Despite the mentioned drawback, it should be noted that the method effectively conveys the content of the original photo at a high level in compliance with the requirements for tactile graphics.</p><p>Additionally, such methods require supplementary source graphic information (e.g., photographs), the search for or creation of which slows down the process of preparing material for the production of tactile images.</p><p>The development of information technologies, particularly in the field of deep machine learning, has opened new opportunities for addressing the aforementioned problems. Recently, significant advancements have been demonstrated by information technologies based on artificial intelligence <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref>, which enable the generation of images based on user text prompts. However, according to the analysis <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref>, confirmed by a series of experiments, the information technologies built upon these mathematical models have proven ineffective for creating tactile graphics. Despite this, the concept of text-guided image generation was chosen as the foundation for this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Text-conditional tactile graphics generation model</head><p>The text-conditional tactile graphics generation model is built upon the Bidirectional and Auto-Regressive Transformer (BART) <ref type="bibr" target="#b20">[21]</ref> and the Vector Quantized Variational Auto-Encoder (VQ-VAE) <ref type="bibr" target="#b21">[22]</ref>. It models the process of converting textual information into graphical information. To this end, the embedding space of the transformer, formed during language modeling on the pretraining task, was divided into two independent embedding spaces, textual and graphic, instead of a single shared one. The parameters of the graphic embedding space were adjusted so that its dimension equals the size of the "codebook" <ref type="bibr" target="#b21">[22]</ref> and the dimensionality of its vectors equals the dimensionality of the latent space vectors of the variational image synthesis model. The parameters of the text embedding space remained the same as during language modeling.</p><p>Before text tokens are obtained with the BPE <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref> tokenization model, the original text components are normalized to a uniform format (uppercase letters are converted to lowercase).</p><p>Formally, the process of converting text tokens into graphic tokens with the text-conditional tactile graphics generation model is described in successive stages.</p><p>The first step is to generate a bounded sequence of text tokens from a text prompt:</p><formula xml:id="formula_0">\bar{t} = \{ t_i \in V \}_{i=1}^{Seq_{max,t}},</formula><p>where \bar{t} is a sequence of text tokens of length Seq_{max,t} = 64; V is a dictionary of tokens. 
If the generated sequence of text tokens exceeds this length, it is truncated to the maximum value by discarding the excess tokens. If it is shorter, it is padded to the maximum value with utility tokens \langle PAD \rangle that do not affect the modeling result.</p><p>In the next step, the text tokens forming the sequence \bar{t} are mapped to vectors of the text embedding space, forming a subset of it:</p><formula xml:id="formula_1">\bar{e}^{t} = \{ e_k^{t} \in E^{t} \mid k = t_i \in \bar{t} \}_{i=1}^{Seq_{max,t}}; \quad \bar{e}^{t} \subseteq E^{t},<label>(1)</label></formula><p>where \bar{t} is the sequence of text tokens; E^{t} is the text embedding space; e_k^{t} are elements of the text embedding space. The elements \bar{e}^{t} reflect the semantic meaning of the text tokens in the embedding space.</p><p>Next, the vectors of the text embedding space E^{t} are transformed by the transformer's bidirectional encoder, which consists of several layers, forming hidden states \bar{h}^{t}. The bidirectionality of the encoder means that it analyzes the full context of each vector of the embedding space, considering both the preceding and the following elements of the sequence:</p><formula xml:id="formula_2">\bar{h}^{t} = Encode(\bar{e}^{t}); \quad \bar{h}^{t} \subseteq E^{t},<label>(2)</label></formula><p>where \bar{h}^{t} is the hidden state of the encoder; Encode(\cdot) is the transformer's encoding operation defined within <ref type="bibr" target="#b20">[21]</ref>.</p><p>The hidden state of the encoder \bar{h}^{t} is then converted by linear layers and a nonlinear activation function into the hidden state of the decoder (i.e., graphic information), forming a subset of the graphic embedding space E^{g}:</p><formula xml:id="formula_3">\bar{h}^{g} = Linear_2 \circ ReLU \circ Linear_1(\bar{h}^{t}),<label>(3)</label></formula><p>where \bar{h}^{t} \subseteq E^{t} is the hidden state of the encoder; \bar{h}^{g} \subseteq E^{g} is the hidden state of the decoder; Linear_i is a linear layer; ReLU(x) = \max(0, x) is a nonlinear activation function. At the next stage, an autoregressive <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26]</ref> transformer decoder is used: the decoder generates one graphic token per iteration, considering the context of the previously generated graphic tokens. Thus, during decoding the model performs calculations based on the hidden state \bar{h}^{t} and the previously generated elements of the vector sequence of the graphic embedding space:</p><formula xml:id="formula_4">e_i^{g} = Decode(\bar{h}^{t}, e_1^{g}, e_2^{g}, \ldots, e_{i-1}^{g}); \quad i \le d_z,<label>(4)</label></formula><p>where e_i^{g} is the i-th element of the vector sequence of the graphic embedding space E^{g}; e_j^{g}, j &lt; i, are previously generated vectors of the graphic embedding space; \bar{h}^{g} is the hidden state of the decoder; d_z is the size of the final sequence \bar{e}^{g} \subseteq E^{g}; Decode(\cdot) is the transformer's decoding operation defined within <ref type="bibr" target="#b20">[21]</ref>.</p><p>Decoding proceeds iteratively until the sequence \bar{e}^{g} reaches size d_z (i.e., the size of the latent vector sequence of the VQ-VAE model). Once decoding is complete, the resulting sequence of vectors of the graphic embedding space \bar{e}^{g} is converted by a linear layer and a Softmax function into a sequence of probability distributions, from which the element with the highest probability is selected, determining the chosen graphic token:</p><formula xml:id="formula_5">\bar{g} = \{ g_i \}_{i=1}^{d_z}; \quad g_i = \arg\max(Softmax \circ Linear(e_i^{g})),<label>(5)</label></formula><p>where \bar{g} is the generated sequence of graphic tokens of size d_z; e_i^{g} \in \bar{e}^{g} is an element of the vector sequence of the graphic embedding space E^{g}. 
In the next step, on the basis of the graphic tokens (5), a sequence of latent quantized vectors \bar{z} is formed, defined by formula (6). Each graphic token g_i, 1 \le g_i \le K, i = 1 \ldots d_z, is the positional index of a quantized vector in the "codebook" of the VQ-VAE model, where Z is the set of latent quantized vectors, or "codebook"; \bar{z} \subseteq Z is a sequence of latent quantized vectors; g_i \in \bar{g} is a graphic token; d_z is the size of the sequence of latent quantized vectors.</p><p>The final step is the synthesis of tactile graphics by the decoder of the variational image synthesis model applied to the sequence (6):</p><formula xml:id="formula_7">Y = ImDecode(\bar{z}),<label>(7)</label></formula><p>where \bar{z} is the sequence of latent quantized vectors; ImDecode(\cdot) is the image decoding operation from a latent representation defined within <ref type="bibr" target="#b21">[22]</ref>; Y is the generated tactile image. The diagram of the text-conditional tactile graphics generation model is shown in Figure <ref type="figure" target="#fig_0">1</ref>. </p></div>
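The staged conversion in formulas (1)-(7) can be sketched end to end. The following minimal NumPy illustration uses the dimensions reported in the paper (text sequence length 64, codebook size 512, latent vector dimension 16) but random weights, a toy tokenizer, and a non-autoregressive stand-in for the encoder and decoder; the names pad_or_truncate and bridge, and the assumption of 64 graphic tokens, are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_T, D_MODEL, K, D_Z = 64, 512, 512, 64   # Seq_max,t; layer dim; codebook size; graphic tokens
PAD = 0

E_t = rng.normal(size=(8192, D_MODEL))      # text embedding space E^t (Table 1 dictionary size)
codebook = rng.normal(size=(K, 16))         # VQ-VAE "codebook" Z (Table 2)

def pad_or_truncate(tokens, size=SEQ_T, pad=PAD):
    """Clamp the text-token sequence to a fixed length, as described for Seq_max,t."""
    return (list(tokens)[:size] + [pad] * size)[:size]

def bridge(h_t, W1, W2):
    """Linear -> ReLU -> Linear mapping into the graphic embedding space, formula (3)."""
    return np.maximum(h_t @ W1, 0) @ W2

W1 = 0.01 * rng.normal(size=(D_MODEL, 1024))
W2 = 0.01 * rng.normal(size=(1024, D_MODEL))
W_out = 0.01 * rng.normal(size=(D_MODEL, K))  # linear layer before Softmax in formula (5)

tokens = pad_or_truncate([17, 42, 9])         # toy text-token ids
h_t = E_t[tokens]                             # stand-in for Encode(.), formula (2)
h_g = bridge(h_t, W1, W2)                     # hidden state of the decoder

# Greedy token choice of formula (5); a real decoder would generate these
# autoregressively, one token at a time, as in formula (4).
g_bar = (h_g[:D_Z] @ W_out).argmax(axis=-1)   # graphic tokens, ids in [0, K)

z_bar = codebook[g_bar]                       # formula (6): codebook lookup
print(g_bar.shape, z_bar.shape)               # (64,) (64, 16)
```

In the actual model, z_bar would then be passed to the VQ-VAE image decoder to produce the tactile graphic Y of formula (7).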
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment</head><p>In this experiment, the proposed model was trained using the parameters presented in Tables <ref type="table" target="#tab_1">1 and 2</ref> for the BART and VQ-VAE models, respectively. Note that the size of the decoder's dictionary and the length of its sequence are each increased by one compared to the original values. This adjustment introduces an additional image service token (an SOS token), which is prepended to the sequence to enable autoregressive image generation. Language modeling was performed on the BrUK corpus <ref type="bibr" target="#b26">[27]</ref>, which consists of Ukrainian texts from various sources. Unlike textual datasets (i.e., corpora), which are widely accessible, tactile graphics samples are much less common. A significant obstacle to modeling tactile graphics generation with machine learning is the insufficient number of publicly available samples, as the tactile graphics production industry is far smaller than traditional printing.</p><p>Nevertheless, a collection of plant and animal images stored in the APH Tactile Graphics Library <ref type="bibr" target="#b27">[28]</ref> was chosen as the original set of images for the model to learn to reproduce. Additionally, the training dataset was expanded with 41 custom tactile image samples, increasing the total number of samples to 179. The custom samples were created from simple images of animals and have been used at a Ukrainian institution that provides preschool education for children with visual impairments. The results of the experiment include samples of generated tactile graphics based on various types of text prompts, such as monosyllabic prompts, prompts with numerals, and prompts with epithets. These samples are presented in Figure <ref type="figure" target="#fig_1">2</ref>.</p></div>
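The one-unit increase in the decoder's dictionary and sequence sizes can be made concrete with a small sketch. Using the codebook size itself as the id of the SOS token is an assumption for illustration; the paper only states that a service token is prepended to the sequence.

```python
K = 512                 # VQ-VAE codebook size (Table 2)
SOS = K                 # one extra service id -> decoder dictionary size 513 (Table 1)

def prepare_decoder_input(graphic_tokens):
    """Prepend the SOS token so autoregressive generation has a fixed start symbol."""
    tokens = list(graphic_tokens)
    assert all(0 <= g < K for g in tokens)   # real tokens stay within the codebook
    return [SOS] + tokens

seq = prepare_decoder_input(range(64))
print(len(seq), seq[0])   # 65 512 -> matches the decoder sequence size in Table 1
```

During training, the decoder then learns to predict token i from the SOS token and tokens 1..i-1, exactly the conditioning described in formula (4).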
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure 2 shows samples generated for the prompts "a daisy", "a cow", "Top view of butterfly", "a tree", "three daisies", "a spotted cow", "a butterfly (side view)", "a naked tree", "a leaf", "a dog", "a turkey", and "a deer (side view)".</p><p>The model's performance was evaluated separately for each component: BART and VQ-VAE. The results of this evaluation are presented in Table <ref type="table" target="#tab_2">3</ref>. The Cross-Entropy metric reflects how well the model converts text prompts into appropriate graphic tokens, and Perplexity represents the uncertainty in the model's predictions. Lower values indicate better performance, meaning the model is more confident in its generation process. For tactile graphics, FID measures how similar the generated tactile images are to real ones in the latent space of the model. A lower FID score indicates that the generated tactile graphics are closer to real tactile images in terms of visual and tactile features.</p><p>Additionally, the overall performance of the model was evaluated using the CLIP Score metric <ref type="bibr" target="#b28">[29]</ref>, which reflects the model's capability to convert textual information into graphical information. The average CLIP Score of the developed model is 23.7. </p></div>
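The Cross-Entropy and Perplexity values in Table 3 are mutually consistent under the standard relation Perplexity = exp(Cross-Entropy), which can be verified directly:

```python
import math

ce = 5.709            # Cross-Entropy reported in Table 3
ppl = math.exp(ce)    # perplexity as the exponential of cross-entropy
print(round(ppl, 1))  # ~301.6, matching the reported 301.662 up to rounding
```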
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>The current dataset used for training includes relatively simple images (e.g., animals, plants, basic objects). One limitation of the model is its potential difficulty in scaling to more complex images, such as those with intricate details (e.g., architectural blueprints, detailed scientific diagrams). The model's ability to capture fine details may be limited by the size of the latent space and the number of hidden layers used in the VQ-VAE model. Complex tactile graphics might require a more fine-grained representation, which could lead to inefficiencies or inaccuracies in generation if the model architecture remains unchanged.</p><p>Furthermore, while the model performs well on simpler prompts (e.g., "a cow," "a tree"), more complex and nuanced prompts (e.g., "a group of children playing soccer with a spotted ball") might pose challenges. This is because the Transformer's encoding of textual information becomes more demanding as the semantic richness and length of the prompt increase. The model may struggle to disentangle and appropriately represent all components of a complex scene in tactile graphics form, leading to loss of information or oversimplification.</p><p>Regarding computational requirements, the training process of the proposed model, which integrates both the BART Transformer and the VQ-VAE, requires significant computational resources. Due to the autoregressive nature of the model and the need to process both textual and graphical latent spaces, training is computationally expensive. It requires powerful GPUs or TPUs, large memory capacity, and extended training time, particularly as the dataset grows. 
This makes scaling to larger datasets or higher-dimensional image outputs challenging without access to advanced computing infrastructure.</p><p>One of the key ethical concerns in the development of tactile graphics is ensuring that the generated images do not misrepresent the information. For visually impaired users, the tactile graphic is a primary means of understanding visual content, and any distortion or inaccuracy could lead to misunderstandings. For example, if a generated tactile graphic oversimplifies or omits important details, users might receive an incomplete or misleading representation of the intended information. To mitigate this risk, it's important to validate the model outputs rigorously against established standards for tactile graphics and seek feedback from visually impaired users to ensure that the tactile representations are both accurate and understandable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>As a result of this research, a text-conditional tactile graphics generation model was developed using BART and VQ-VAE. The model employs a modified organization of the latent space, divided into two independent components: textual and graphic.</p><p>The method of creating tactile graphics for publications aimed at individuals with visual impairments has been improved. This enhancement increases the variability, controllability, and quality of synthesized tactile graphics, thereby improving the technical and economic aspects of the production process.</p><p>This technology can bridge the gap in access to educational materials, allowing visually impaired individuals to better engage with subjects that rely heavily on visual content, such as science, mathematics, and geography. The availability of automated tactile graphics can facilitate greater independence in learning and enhance participation in inclusive classrooms and professional environments.</p><p>An important direction of further research is to increase the size and diversity of the training sample to improve the general ability of the model to generalize and ensure its stable operation in various scenarios.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Structural and functional diagram of the text-conditional tactile graphics generation model</figDesc><graphic coords="4,81.15,220.87,441.08,243.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Samples of generated images determined by a text prompt given below the corresponding image</figDesc><graphic coords="5,81.12,633.60,88.80,88.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>BART's parameters</cell><cell></cell><cell></cell></row><row><cell>Parameter</cell><cell>Encoder's value</cell><cell>Decoder's value</cell></row><row><cell>Dictionary size</cell><cell>8192</cell><cell>513</cell></row><row><cell>Sequence size</cell><cell>64</cell><cell>65</cell></row><row><cell>Number of layers</cell><cell>3</cell><cell>3</cell></row><row><cell>Layer dimension</cell><cell>512</cell><cell>512</cell></row><row><cell>FFN dimension</cell><cell>1024</cell><cell>1024</cell></row><row><cell>Number of attention heads</cell><cell>8</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 VQ-VAE's parameters</head><label>2</label><figDesc></figDesc><table><row><cell>Parameter</cell><cell>Value</cell></row><row><cell>Image dimension</cell><cell>256 × 256 × 1</cell></row><row><cell>"Codebook" size</cell><cell>512</cell></row><row><cell>Latent vectors size</cell><cell>16</cell></row><row><cell>Number of hidden layers</cell><cell>5</cell></row><row><cell>Hidden layers dimension</cell><cell>16</cell></row></table></figure>
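The role of the "codebook" in Table 2 can be illustrated with the vector-quantization step of VQ-VAE: each encoder output vector is replaced by its nearest codebook entry, and that entry's index serves as a discrete graphic token. A minimal sketch with a random (untrained) codebook; the 8x8 latent grid is an assumed shape:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(512, 16))   # Table 2: "codebook" size 512, latent vector size 16

def quantize(latents):
    """Map each latent vector to the index and value of its nearest codebook entry."""
    # pairwise squared Euclidean distances, shape (n_latents, 512)
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)              # discrete graphic-token ids
    return idx, codebook[idx]           # quantized vectors z_q

latents = rng.normal(size=(64, 16))     # e.g. a flattened 8x8 latent grid (assumption)
idx, zq = quantize(latents)
print(idx.shape, zq.shape)              # (64,) (64, 16)
```

Because quantization is a lookup, the generation model only needs to predict the integer ids; the image decoder reconstructs the tactile graphic from the corresponding codebook vectors.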
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc></figDesc><table><row><cell>Evaluation results</cell><cell></cell><cell></cell></row><row><cell>Component</cell><cell>Metric</cell><cell>Value</cell></row><row><cell>BART</cell><cell>Cross-Entropy</cell><cell>5.709</cell></row><row><cell></cell><cell>Perplexity</cell><cell>301.662</cell></row><row><cell>VQ-VAE</cell><cell>MSE (image space)</cell><cell>0.0144</cell></row><row><cell></cell><cell>MSE (latent space)</cell><cell>0.0058</cell></row><row><cell></cell><cell>FID</cell><cell>0.242</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">\bar{z} = \{ e_k \in Z \mid k = g_i \in \bar{g} \}_{i=1}^{d_z}; \quad \bar{z} \subseteq Z, (6)</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Blindness and Vision Impairment Collaborators; Vision Loss Expert Group of the Global Burden of Disease Study. Trends in prevalence of blindness and distance and near vision impairment over 30 years: an analysis for the Global Burden of Disease Study</title>
		<author>
			<persName><surname>Gbd</surname></persName>
		</author>
		<idno type="DOI">10.1016/S2214-109X(20)30425-3</idno>
	</analytic>
	<monogr>
		<title level="j">Lancet Glob Health</title>
		<imprint>
			<biblScope unit="page" from="e130" to="e143" />
			<date type="published" when="2019">2019. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">World blindness and visual impairment: Despite many successes, the problem is growing</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ackland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Serge</forename><surname>Resnikoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bourne</surname></persName>
		</author>
		<idno type="PMID">29483748</idno>
	</analytic>
	<monogr>
		<title level="j">Community Eye Health Journal</title>
		<imprint>
			<biblScope unit="page" from="71" to="73" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Graphic Reading Performance of Students with Visual Impairments and Its Implication for Instruction and Assessment</title>
		<author>
			<persName><forename type="first">K</forename><surname>Zebehazy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wilton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Visual Impairment &amp; Blindness</title>
		<imprint>
			<biblScope unit="volume">115</biblScope>
			<biblScope unit="page" from="215" to="227" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A Systematic Literature Review on the Automatic Creation of Tactile Graphics for the Blind and Visually Impaired</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mukhiddinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Soon-Young</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The Effect of Tactile Illustrations on Comprehension of Storybooks by Three Children with Visual Impairments: An Exploratory Study</title>
		<author>
			<persName><forename type="first">F</forename><surname>Bara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Visual Impairment &amp; Blindness</title>
		<imprint>
			<biblScope unit="volume">112</biblScope>
			<biblScope unit="page" from="759" to="765" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An Approach Towards Vacuum Forming Process Using PostScript for Making Braille</title>
		<author>
			<persName><forename type="first">V</forename><surname>Mayik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dudok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Mayik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lotoshynska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Izonin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kusmierczyk</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-03877-8_4</idno>
	</analytic>
	<monogr>
		<title level="m">Advances in Computer Science for Engineering and Manufacturing</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="38" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Analysis of the process of preparing illustrations for inclusive literature</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Dzhurynskyi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mayik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Qualilogy of the book</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="7" to="15" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Towards Automatic Generation of Tactile Graphics</title>
		<author>
			<persName><forename type="first">T</forename><surname>Way</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Barner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Rehabilitation Engineering and Assistive Technology Society of North America</title>
		<imprint>
			<biblScope unit="page" from="161" to="163" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Automatic visual to tactile translation -Part I: Human factors, access methods, and image manipulation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Way</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Barner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Rehabilitation Engineering</title>
		<imprint>
			<biblScope unit="page" from="81" to="94" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Automatic visual to tactile translation. II. Evaluation of the TACTile image creation system</title>
		<author>
			<persName><forename type="first">T</forename><surname>Way</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Barner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Rehabilitation Engineering</title>
		<imprint>
			<biblScope unit="page" from="95" to="105" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Automatic image conversion to tactile graphic</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ferro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pawluk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility</title>
				<meeting>the 15th International ACM SIGACCESS Conference on Computers and Accessibility<address><addrLine>Bellevue Washington</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Communicating Photograph Content Through Tactile Images to People With Visual Impairments</title>
		<author>
			<persName><forename type="first">K</forename><surname>Pakėnaitė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nedelev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamperou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Proulx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Computer Science</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Pic2Tac: Creating Accessible Tactile Images using Semantic Information from Photographs</title>
		<author>
			<persName><forename type="first">K</forename><surname>Pakėnaitė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamperou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Proulx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighteenth International Conference on Tangible, Embedded, and Embodied Interaction</title>
				<meeting>the Eighteenth International Conference on Tangible, Embedded, and Embodied Interaction<address><addrLine>Cork</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Instructions for creating and adapting illustrations and typhlographic materials for blind students</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>Polish association of the blind</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Guidelines and Standards for Tactile Graphics</title>
		<ptr target="https://www.brailleauthority.org/guidelines-and-standards-tactile-graphics" />
		<imprint>
			<date type="published" when="2022">2022, accessed 20 April 2024</date>
			<publisher>Braille Authority of North America &amp; Canadian Braille Authority</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The Creativity of Text-to-Image Generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Oppenlaender</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th International Academic Mindtrek Conference</title>
				<meeting>the 25th International Academic Mindtrek Conference<address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note>Academic Mindtrek &apos;22</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Hierarchical Text-Conditional Image Generation with CLIP Latents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nichol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2204.06125</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">High-Resolution Image Synthesis with Latent Diffusion Models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rombach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Blattmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lorenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Esser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ommer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting><address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Preparation of illustrations for inclusive literature using artificial intelligence models of image synthesis from text</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Dzhurynskyi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mayik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="155" to="163" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Generation of illustrations for inclusive literature using Midjourney artificial intelligence model</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Dzhurynskyi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the II International Scientific and Theoretical Conference (SCIENTIA)</title>
				<meeting><address><addrLine>Zagreb</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>collection of scientific papers</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Diverse Image Inpainting with Bidirectional and Autoregressive Transformers</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Miao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.12335</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Neural Discrete Representation Learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Van Den Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.00937</idno>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Influence of Hadamard matrices canonicity on image processing</title>
		<author>
			<persName><forename type="first">Kh</forename><surname>Kulchytska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Semeniv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kovalskyi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Pysanchyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Selmenska</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-03877-8_29</idno>
	</analytic>
	<monogr>
		<title level="m">ISEM &apos;21</title>
				<editor>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Petoukhov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Yanovsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>He</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">463</biblScope>
			<biblScope unit="page" from="329" to="338" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A Formal Perspective on Byte-Pair Encoding</title>
		<author>
			<persName><forename type="first">V</forename><surname>Zouhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Meister</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Gastaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Vieira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sachan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cotterell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023</title>
				<meeting><address><addrLine>Toronto</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Autoregressive Models: What Are They Good For?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dalal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taori</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.07737</idno>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Generating Sequences With Recurrent Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1308.0850</idno>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">LanguageTool API NLP UK</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rysin</surname></persName>
		</author>
		<ptr target="https://github.com/brown-uk/nlp_uk" />
		<imprint>
			<date type="published" when="2022">2022, accessed 21 April 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<ptr target="https://imagelibrary.aph.org/portals/aphb/#page/welcome" />
		<title level="m">Tactile Graphic Image Library</title>
				<imprint>
			<publisher>American Printing House</publisher>
			<date type="accessed" when="2024-04-21">Accessed 21 April 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">CLIPScore: A Reference-free Evaluation Metric for Image Captioning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hessel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Forbes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Le</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08718</idno>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
