                                Enhancing adapted print publication accessibility via
                                text-to-image synthesis
                                Rostyslav Zatserkovnyi1,∗,†, Petro Kutsyk1,†, Roksoliana Zatserkovna2,†, Volodymyr
                                Maik2,† and Peter T. Popov3,†
                                1 Lviv University of Trade and Economics, 10 Tuhan-Baranovskyi Str., 79008 Lviv, Ukraine
                                2 Ukrainian Academy of Printing, 19 Pid Goloskom Str., 79020 Lviv, Ukraine
                                3 City University of London, Northampton Square, London, EC1V 0HB, United Kingdom




Abstract
One of the most pressing concerns in the field of adapted print publications – that is, publications with additional supporting features that make them easily accessible to a wide variety of audiences – is preparing illustrations that clearly convey visual information to the reader. These illustrations need to be created while accounting for the needs of a diverse, inclusive audience whose requirements may be affected by disabilities such as visual impairment. At present, only a limited number of illustrators can produce large volumes of illustrations that satisfy these requirements; as a result, illustrating print publications is a time-consuming and expensive process for the non-profit organizations responsible for their production.
This article proposes a method for enhancing the illustrations within a print publication, given its source file (such as a PDF file). Based on modern text-to-image generators, the method extracts all illustrations from the publication; converts them into textual prompts; and, finally, produces a series of adapted alternatives for each of the chosen illustrations from those prompts. This allows publishers to obtain accessible illustrations for their publication in a matter of minutes, speeding up the adaptation process and improving the publication's accessibility.

Keywords
Accessible publishing, artificial intelligence, image synthesis, information technologies.



                                1. Introduction
                                In recent years, assistive technologies have enabled diverse audiences of readers to freely access
                                written information, allowing them to work, study, and participate in civil society. One of the
most essential tools for this category of readers is the accessible print publication. While people
with disabilities, such as visual impairment, often rely on e-readers, these may not always be
                                available or preferred. A recent study suggests that 65% of surveyed Americans have recently



                                ICyberPhyS-2024: 1st International Workshop on Intelligent & CyberPhysical Systems, June 28, 2024, Khmelnytskyi,
                                Ukraine
                                ∗ Corresponding author.
                                † These authors contributed equally.

                                   zatserkovnyi.rostyslav@gmail.com (R. Zatserkovnyi); kutsykpetro@gmail.com (P. Kutsyk);
                                zatserkovna.r@gmail.com (R. Zatserkovna); vol_maik@meta.ua (V. Maik); p.t.popov@city.ac.uk (P. Popov)
0000-0001-6991-2866 (R. Zatserkovnyi); 0000-0001-5795-9704 (P. Kutsyk); 0000-0003-1011-053X (R.
Zatserkovna); 0000-0002-6650-2703 (V. Maik); 0000-0002-3434-5272 (P. T. Popov)
                                           © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




read a print book, while only 30% have read an e-book, indicating that print books still remain
the most popular format for general readers [1]. In a classroom setting in particular, print books
provide unique benefits such as ease of use and improved note-taking [2]. Thus, by providing
access to appropriately adapted books, educational institutions and independent organizations
can help ensure that visually impaired readers succeed.
   Incorrectly adapted printed books can pose various issues for readers with visual
impairment, such as small text size, poor font styling, insufficient contrast, and poorly adapted
graphics and tables, among other design factors. Unlike e-books, where these parameters can be custom-tailored
to the needs of a particular reader, printed publications are inflexible in their design,
exacerbating these issues [3]. This article focuses on improving the accessibility of illustrations
in particular, proposing a multi-step method to enhance the accessibility of all illustrations
within a given publication:

    1.   Extract all illustrations from the source file of a future print publication (a minimal extraction sketch is shown after this list).
    2.   Convert all illustrations into prompts for a text-to-image synthesizer.
    3.   Synthesize new, adapted illustrations based on modified versions of these prompts.
    4.   Review adapted illustrations and re-introduce them into the source file.
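   As an illustration of step 1, the sketch below shows one way to pull embedded images out of a PDF source file with PyMuPDF, the library used in our implementation (Section 4); the file name and output directory are placeholders rather than part of the actual toolchain.

    import fitz  # PyMuPDF
    from pathlib import Path

    Path("extracted").mkdir(exist_ok=True)

    doc = fitz.open("publication.pdf")  # placeholder path to the publication's source file
    for page_index, page in enumerate(doc):
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:           # CMYK or similar: convert to RGB before saving
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.save(f"extracted/page{page_index}_img{img_index}.png")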

2. Related works
Existing studies in the field most often focus on e-book accessibility. Since e-book formatting is
not rigid and preset by the publisher, but rather determined by the e-reader, the focus on e-
book accessibility lies primarily in correctly marking important parts of the publication (such
as chapters or specific phrases which link to footnotes), as well as developing robust, flexible e-
reader software that allows users to adjust fonts, text sizes, and other parameters [4].
    However, print books still have notable advantages, especially in the field of education.
Research suggests that readers tend to understand text slightly better when it’s printed rather
than viewed on-screen [5], and one study suggests that the haptic feedback of a touch screen (or PC
monitor) differs from that of a paper book, providing a less immersive experience [6].
    When it comes to adapting illustrations within print books, existing research focuses on
multimodal illustrations – for instance, tactile illustrations, which combine visual illustrations
with tactile Braille overlays, an approach that is useful for readers with legal blindness [7, 8]. Still, there
remains the issue of creating effective adapted illustrations before their tactile component is
factored in. In recent years, AI algorithms, such as machine learning algorithms and neural
networks, have been trained to produce a variety of media content – and creating such
illustrations from scratch is one potential application of these generative methods [9].

3. Proposed methodology
3.1. Converting images to text prompts
In order to synthesize new adapted illustrations for a print publication, our method must first
obtain the inputs – often known as “prompts” – for a well-known image generator such as
Stable Diffusion or Midjourney. Although illustrations in technical publications often come
with captions describing the image, these captions are not always present, and they often offer
only brief descriptions that do not capture the full nuance of the captioned image. Thus, the images
must be converted to textual prompts using an AI model known as an image captioner. Acting
as the inverse of common image generators, image captioners are feature-extraction
models that convert images into text, operating at the crossroads between computer vision
and natural language processing [10].
     Our chosen image captioner model is the CLIP Interrogator [11], based on Salesforce’s BLIP
model. This baseline is a multi-task model capable of both vision-language understanding and
generation tasks, and it can operate in three possible modes: an unimodal encoder, an image-
grounded text encoder, and an image-grounded text decoder.
     Specifically, the model's captioner is an image-grounded text decoder. Its intent is to generate
synthetic captions T_s given training images I_w collected from web datasets. In the BLIP learning
framework, this is combined with the filter, an image-grounded text encoder which removes
texts that are predicted not to match a given image; the filter is applied to both the synthetic
captions T_s and the real captions T_w found inside training datasets. The result is combined with
a set {(I_h, T_h)} of human-annotated images and texts to produce a robust training dataset for an
ML algorithm [12]:

                          D = {(I_w, T_w)} + {(I_w, T_s)} + {(I_h, T_h)}        (1)
   Our method works primarily with this model’s captioner. We opt to use the BLIP-Image-
Captioning-Large sub-model, pre-trained on the COCO dataset to produce human-readable
captions for input images. However, human-readable captions are not the most effective way
to produce the inputs for an image generator. An image generator’s input text, known as a
“prompt”, needs to be detailed and include multiple keywords pertaining to the image: the
subject of the image itself (which is typically produced by an image captioner out of the box),
as well as the style, resolution, color, lighting and other details.
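   For reference, the baseline human-readable caption described above can be reproduced with the publicly released BLIP checkpoint through the Hugging Face transformers API; this is a minimal sketch rather than our exact notebook code, and the input file name and generation length are assumptions.

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    # Load the image-grounded text decoder (captioning) checkpoint.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

    # Placeholder path to an illustration extracted from the publication.
    image = Image.open("extracted/page0_img0.png").convert("RGB")

    inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(processor.decode(output_ids[0], skip_special_tokens=True))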
   To that end, we use the CLIP Interrogator to first generate a baseline caption using BLIP,
and then simplify the caption while adding additional keywords which most closely match the
target image. These come from a predefined dataset known as “flavors”, and include keywords
such as “highly detailed”, “sharp focus”, “intricate”, “digital painting” as well as phrases
referring to specific objects and entities located within an image. The Interrogator selects the
most appropriate keywords and phrases from the “flavors” dataset by measuring the CLIP
embedding distance between the target image and each candidate phrase.
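   In code, this prompt-extraction step reduces to a single call to the clip-interrogator library; the sketch below is a minimal example, and the CLIP backbone name and file path are assumptions rather than our exact configuration.

    from PIL import Image
    from clip_interrogator import Config, Interrogator

    # The backbone choice is an assumption; the library supports several CLIP models.
    ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

    image = Image.open("extracted/page0_img0.png").convert("RGB")

    # Internally generates a BLIP caption, then appends the closest-matching
    # "flavor" keywords ranked by image-text similarity.
    prompt = ci.interrogate(image)
    print(prompt)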

3.2. Synthesizing images from text prompts
After the images within an adapted print publication are converted to text prompts, the next
step is to transfer them to an AI image generator such as Stable Diffusion to obtain a new set of
images designed with accessibility in mind. Since the CLIP Interrogator’s prompt generator
targets Stable Diffusion, that model's newest stable version, SDXL 1.0, has been
selected as the image generator of choice [13]. Its open-source nature means that it can be
deployed on any local machine as part of our overall method.
   This image generator is a latent diffusion model, an improvement on traditional diffusion
models. Traditional diffusion models work by first “corrupting” training data, such as images,
by adding noise to their inputs in a step-by-step process. At each time step, Gaussian noise is
added to a data distribution x_0 ~ q(x_0) with variance β_t ∈ (0, 1), resulting in the following
iterative process over the distribution of the variable:

                          q(x_1, …, x_T | x_0) = ∏_{t=1}^{T} q(x_t | x_{t−1})                 (2)

                          q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)                (3)
   This process is called forward diffusion, and it concludes once the distribution q is sufficiently
similar to pure Gaussian noise.
   Reverse diffusion is the process of recovering the original image from the resulting noise; once
trained, a diffusion model generates new images by applying this reverse process to random
noise as input. Latent diffusion models perform this process within a latent space – a compressed
mathematical representation of data in which similar items are grouped close together.
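   For context, the learned reverse step is commonly parameterized as a Gaussian whose mean and covariance are predicted by a neural network; this is the standard denoising diffusion formulation from the literature rather than a contribution of this work:

                          p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))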
   To a shortened version of the CLIP Interrogator’s keyword-based output, we append
several keywords intended to steer the result towards a clearer, more simplified
style, such as “illustration for children”, “monochrome”, “very low detail” and “no shading”.
This ensures that, while the objects and entities denoted by the keywords are included in
the final image, it remains simplified, without valuable information being obscured by noisy
elements. The keywords can be adjusted as needed – for instance, “monochrome” may be
removed should a full-color illustration be required.
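   A minimal sketch of this generation step, using the diffusers implementation of SDXL 1.0, is shown below; the prompt mirrors the keywords described above and the four variants per image match our pipeline, while the precision and device settings are illustrative assumptions.

    import torch
    from diffusers import StableDiffusionXLPipeline

    # Load the open-source SDXL 1.0 base model; half precision keeps VRAM usage manageable.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Shortened Interrogator output plus the simplification keywords.
    prompt = (
        "two girls running in the park together, "
        "illustration for children, monochrome, very low detail, no shading, simple line art"
    )

    # Generate several candidate adaptations for the editor to review.
    for i, img in enumerate(pipe(prompt=prompt, num_images_per_prompt=4).images):
        img.save(f"adapted_variant_{i}.png")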
   The complete workflow of our adaptation method can be seen in Figure 1.




Figure 1: The illustration adaptation method’s overall workflow.

4. Results and Discussion
In Figure 2, we can see the results of the image-to-text conversion applied to one of the
illustrations from a selected Ukrainian adapted textbook. Initially, the caption generated by the
model is a verbose human-readable description; the CLIP Interrogator then converts it into an
image-generator prompt based on matching keywords from a preset list, which is less
human-readable but more effective as an input for the subsequent image generation step.
   The number of keywords within the text prompt can be modified at will by restricting the
number of phrases returned by the Interrogator, prioritizing those which most closely
match the image. While a single image-to-text conversion is shown in Figure 2, our method is
a batch process. This means that after all images have been extracted from a printed
publication’s source file, they are converted into textual prompts with no additional human
intervention, significantly speeding up the captioning process compared to manual
captioning.




Figure 2: An example of image-to-text generation with both human-readable captions
(intended for “visual question answering” tasks), as well as prompts for image generators.

   An example of the generated images, which are the final output of our method, can be seen in
Figure 3. Within this example, the leftmost image, also shown separately as Figure 4, is
particularly appropriate for accessible publications, as it renders the scene in a simplified cartoon
style with minimal shading that still keeps its core elements (the girls in the foreground and the
trees in the background) legible. Such an image can be translated directly into a mixed-format
illustration which combines visual elements with Braille-like tactile dots.
Figure 3: An example of text-to-image generation, and the model’s final result. Prompt: “There
are two girls that are running in the park together, girl running, girl is running, little kid,
illustration for children, monochrome, very low detail, no shading, simple line art”.




Figure 4: The generated image chosen as the final adapted illustration.

    The software implementation of our method uses the Python-based Jupyter Notebook
environment to integrate the steps of the process: the PyMuPDF library is used to extract
images from a printed publication’s source file; the CLIP Interrogator (internally based on the
PyTorch library and its torchvision extension) extracts prompts from these images; and the
SDXL 1.0 generator is used to create the new adapted illustrations. On an RTX 4090 GPU, our
pipeline takes ~22 seconds to convert an original illustrated image into four generated variants
ready to be reviewed by a publication’s editor, meaning that the entirety of a print publication’s
illustrations can be regenerated within hours.
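    A condensed sketch of how these components can be wired together as a batch process is shown below; it reuses the pieces illustrated in Section 3, and the directory layout, keyword string and backbone names are assumptions rather than a verbatim copy of our notebook.

    import torch
    from pathlib import Path
    from PIL import Image
    from clip_interrogator import Config, Interrogator
    from diffusers import StableDiffusionXLPipeline

    SIMPLIFY = ", illustration for children, monochrome, very low detail, no shading"

    ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    Path("adapted").mkdir(exist_ok=True)

    # Batch step: caption every extracted illustration and regenerate adapted variants.
    for path in sorted(Path("extracted").glob("*.png")):
        image = Image.open(path).convert("RGB")
        prompt = ci.interrogate(image) + SIMPLIFY
        variants = pipe(prompt=prompt, num_images_per_prompt=4).images
        for i, variant in enumerate(variants):
            variant.save(f"adapted/{path.stem}_variant{i}.png")  # reviewed by the editor afterwards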

5. Conclusions
This article describes a method for adapting illustrations within the source file of a print publication,
with minimal or no human supervision, which can significantly speed up the process of making
a publication more accessible to a wide audience of readers, such as people with vision
impairment. This method enables both educational and volunteer organizations to produce
high-quality illustrations for adapted print publications.
    Potential areas for further research include improvements to the prompts used within our
image generation step – for instance, adding support for multi-colored, yet clean and simplified
illustrations. The image generation model itself also has potential for improvement; as AI image
generation is a rapidly developing field of research, our machine learning pipeline can be
periodically revisited to make use of the newest models and techniques.

References
[1] M. Faverio, A. Perrin, Three-in-ten Americans now read e-books, Pew Research Center,
     2022. URL: https://www.pewresearch.org/short-reads/2022/01/06/three-in-ten-americans-
     now-read-e-books/.
[2] A. Amirtharaj, D. Raghavan, J. Arulappan, Preferences for printed books versus e-books
     among university students in a Middle Eastern country, Heliyon (2023) e16776. doi:
     10.1016/j.heliyon.2023.e16776.
[3] R. Zatserkovnyi et al., Application for Determining the Usability of Adapted Textbooks by
     People with Low Vision, in: Proceedings of the 2023 IEEE 18th International Conference
     on Computer Science and Information Technologies (CSIT), Lviv, Ukraine, 19–21 October
     2023. doi: 10.1109/csit61576.2023.10324055.
[4] Enhancing Reach: The Fundamentals of eBook Accessibility, Ingram Content Group, 2024.
     URL: https://www.ingramcontent.com/publishers-blog/fundamentals-of-ebook-accessibility.
[5] L. Salmerón et al., Reading comprehension on handheld devices versus on paper: A
     narrative review and meta-analysis of the medium effect and its moderators, Journal of
     Educational Psychology (2023).
[6] A. Mangen, A. Weel, The evolution of reading in the age of digitisation: an integrative
     framework for reading research, Literacy (2016) Vol. 50, no. 3, pp. 116–124.
     doi: 10.1111/lit.12086.
[7] D. Valente et al., Comprehension of a multimodal book by children with visual
     impairments, British Journal of Visual Impairment (2023) 42(2), 026461962311720. doi:
     10.1177/02646196231172071.
[8] K. Zebehazy, A. Wilton, Graphic Reading Performance of Students with Visual
     Impairments and Its Implication for Instruction and Assessment, Journal of Visual
     Impairment & Blindness (2021) Vol. 115, no. 3, pp. 215–227.
     doi: 10.1177/0145482x211016918.
[9] A. Kuzmin, O. Pavlova, Analysis of Artificial Intelligence Based Systems for Automated
     Generation of Digital Content, Computer Systems and Information Technologies (2024) no.
     1, pp. 82–88. doi: 10.31891/csit-2024-1-10.
[10] H. Udo, T. Koshinaka, Image Captioners Sometimes Tell More Than Images They See,
     arXiv.org, 2023. URL: https://arxiv.org/abs/2305.02932.
[11] pharmapsychotic/clip-interrogator: Image to prompt with BLIP and CLIP, GitHub, 2024.
     URL: https://github.com/pharmapsychotic/clip-interrogator.
[12] J. Li et al., BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language
     Understanding and Generation, arXiv.org, 2022. URL: https://arxiv.org/abs/2201.12086.
[13] Stable Diffusion XL - SDXL 1.0 Model, 2024. URL: https://stablediffusionxl.com/.