<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sumin Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taesup Moon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Computer Engineering, Seoul National University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IPAI / ASRI / INMC, Seoul National University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity (adjusting guidance strength based on the prompt) and selectivity (targeting only unsafe regions of the image). Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation. WARNING: This paper contains AI-generated images that may be offensive. Sensitive contents are masked.</p>
      </abstract>
      <kwd-group>
<kwd>Risk Management for Trustworthy AI</kwd>
        <kwd>Safe Generative AI</kwd>
<kwd>Text-to-Image Diffusion Model</kwd>
        <kwd>Safe Image Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
The rapid advancements in text-to-image (T2I) diffusion models [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] have enabled the generation of
high-quality images based on textual inputs. However, the extensive training data often contain unsafe
content and inherent biases [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ], posing significant risks of generating unexpected unsafe images [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
There are also concerns about malicious users exploiting model vulnerabilities to create harmful images
by crafting attack prompts [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ].
      </p>
      <p>
        To mitigate these risks, existing defenses fall into two main categories. Detection-based methods
[
        <xref ref-type="bibr" rid="ref1 ref7 ref8">8, 1, 7</xref>
        ] attempt to identify harmful images, but often suffer from false positives that block benign
content [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Removal-based methods intervene before or during generation by adjusting the diffusion
process at inference time [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], editing model weights [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], or optimizing prompts [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Most existing
methods struggle to handle multiple harmful concepts simultaneously. Weight-editing and
prompt-based approaches require retraining for new unsafe concepts. In contrast, inference-time methods
enable safe generation through lightweight manipulations. One notable approach in this category is
Safe Latent Diffusion (SLD) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which utilizes classifier-free guidance to adjust noise estimates away
from unsafe concept directions, even when multiple concepts are present. Despite its effectiveness, we
observe that SLD sometimes fails to remove harmfulness from images, even under maximum guidance
(SLD-max). Moreover, its guidance is applied inconsistently across prompts, i.e., some prompts become
sufficiently safe while others remain unsafe with the same configuration (see Fig. 1).
      </p>
      <p>In light of these limitations, we propose SP-Guard, an inference-time method emphasizing the
importance of selective and prompt-adaptive safe guidance to prevent unsafe image generation. We
suspect that the reason SLD fails is that it does not reflect how unsafe the generated image will be.
Therefore, before presenting our method in detail, we underscore the importance of adapting safety
guidance (i.e., unsafe concept removal) to each individual prompt. We demonstrate this in Section 2.2
through a comparative analysis of images generated in a straightforward experiment. SP-Guard, detailed
in Section 2.3, is based on the intuition that the similarity between the noise predictions conditioned on
the prompt and those conditioned on unsafe concepts can serve as a proxy for estimating the unsafe
degree of the generated image. Specifically, SP-Guard proactively estimates the unsafe degree of a
prompt and provides safe guidance during inference. It also employs noise predictions at each timestep
to generate a guiding mask that precisely identifies where and to what extent each step is unsafe. Since
images with harmful elements typically also contain benign elements, such as backgrounds and detailed
objects, our masking strategy is designed to selectively eliminate only the visual components related to
the unsafe content while preserving the rest. The effectiveness of SP-Guard is shown in Fig. 1. While
SLD yields inconsistent results under the same guidance level (i.e., SLD-medium in the second row), due
to the varying degrees of prompt harmfulness, SP-Guard consistently produces safe images regardless of
the initial prompt harmfulness. Moreover, SP-Guard employs a precise masking strategy that selectively
captures regions associated with unsafe concepts, whereas SLD often applies guidance more broadly,
affecting unrelated areas and resulting in images that diverge from the original intent (third row).</p>
      <p>In Section 3, we conduct both quantitative and qualitative evaluations of SP-Guard on four
benchmark datasets. Our findings indicate that SP-Guard achieves a lower detection rate of unsafe content,
while effectively preserving the integrity of the original image content. Qualitative analyses further verify
that SP-Guard consistently converts unsafe elements into safe content, regardless of the potential harm
of the prompt. Furthermore, SP-Guard maintains image fidelity and text alignment comparable to
SD, supporting its practicality. The underlying idea of prompt-adaptive and selective guidance opens
up new opportunities for broader applications in safe and controllable image generation. Moreover,
by concretely analyzing the limitations of previous approaches and highlighting the importance of
estimating the prompt-specific potential harmfulness, SP-Guard contributes to the transparency and
controllability of safe image generation. We further elaborate on these aspects in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Preliminaries</title>
        <p>
          Diffusion-based T2I and Classifier-Free Guidance. Diffusion models [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] are generative models that
create samples from Gaussian noise, progressively denoising based on a learned data distribution. The
model iteratively predicts an estimate of the noise to be removed. For text-based image generation [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ],
the estimated noises are conditioned on the text prompt. Classifier-free guidance approach [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] allows
conditioning without an additional pre-trained classifier, by randomly training the model with or without text
prompts so that it handles both conditional and unconditional generation. During inference, the model
uses noise estimates $\tilde{\epsilon}_\theta$ at each step, formulated as follows:
        </p>
        <p>$\tilde{\epsilon}_\theta(z_t, c_p) := \epsilon_\theta(z_t) + s_g \big(\epsilon_\theta(z_t, c_p) - \epsilon_\theta(z_t)\big), \quad (1)$
where $z_t$ is the latent variable at timestep $t$, and $c_p$ is the text embedding for the prompt $p$. $s_g$ is the
guidance scale that controls the strength of conditioning.</p>
        <p>Figure 2: Limitations of SLD in safety guidance. (a) Examples generated by the original SD and by SLD (SLD-weak, SLD-medium, SLD-strong, SLD-max) for prompts (a) to (e). (b) Visualization of the element-wise guidance weight $\mu$ of SLD for applying safe guidance, from $t=T$ to $t=0$. See Section 2.2 for details.</p>
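        <p>For concreteness, the classifier-free guidance update in Eq. (1) amounts to a single tensor operation. The following is a minimal sketch, not the authors' released code; the tensors eps_uncond and eps_cond (unconditional and prompt-conditioned noise estimates) and the guidance scale s_g are assumed inputs.</p>
        <preformat>
import torch

def cfg_noise(eps_uncond, eps_cond, s_g):
    """Classifier-free guidance, Eq. (1): start from the unconditional noise estimate
    and push it toward the prompt-conditioned estimate with strength s_g."""
    return eps_uncond + s_g * (eps_cond - eps_uncond)

# usage sketch: both estimates have shape (batch, channels, height, width)
# eps_tilde = cfg_noise(eps_uncond, eps_cond, s_g=7.5)
</preformat>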
        <p>
          Semantic and Safe Guidance at Inference. Controlling T2I models to faithfully reflect user intentions
in generated images remains a challenging task. One line of research addresses this challenge by
extending classifier-free guidance to enable semantic control at inference time [
          <xref ref-type="bibr" rid="ref10 ref16">16, 10</xref>
          ]. This approach introduces
a semantic guidance term $\gamma(z_t, c_p, c_e)$ into Eq. (1), where $e$ is a concept capturing the user's intent.
This results in
$\tilde{\epsilon}_\theta(z_t, c_p, c_e) = \epsilon_\theta(z_t) + s_g \big(\epsilon_\theta(z_t, c_p) - \epsilon_\theta(z_t) + \gamma(z_t, c_p, c_e)\big). \quad (2)$
To reflect the concept $e$ in the image, $\gamma$ applies positive guidance in the direction of $\epsilon_\theta(z_t, c_e)$.
Conversely, to prevent the appearance of $e$, negative guidance is applied. In the realm of safe T2I, where the
objective is to exclude unsafe concepts $c_S$ from the generated image, $\gamma$ is specifically defined as
$\gamma(z_t, c_p, c_S) = -\mu(c_p, c_S) \cdot \big(\epsilon_\theta(z_t, c_S) - \epsilon_\theta(z_t)\big), \quad (3)$
where $\mu$ adjusts the guidance strength to avoid generating unsafe content. Schramowski et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
propose SLD, which defines $\mu$ as follows:
$\mu(c_p, c_S) = \begin{cases} \min(1, |\phi|), &amp; \text{if } \epsilon_\theta(z_t, c_p) \ominus \epsilon_\theta(z_t, c_S) &lt; \lambda \\ 0, &amp; \text{otherwise} \end{cases}, \qquad \phi = s_S \big(\epsilon_\theta(z_t, c_p) - \epsilon_\theta(z_t, c_S)\big). \quad (4)$
$\phi$ represents the scaled difference between the noise estimates conditioned on the prompt and those
conditioned on the unsafe concept. It is used to modulate $\mu$ based on a predefined threshold $\lambda$.
Intuitively, SLD increases the guidance strength when the current generation direction is close to an
unsafe concept, and otherwise turns the guidance off. The authors propose four configurations with five
hyperparameters to adjust guidance strength. More details of Eq. (4) will be discussed in Section 2.2.
        </p>
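        <p>As a rough illustration of Eqs. (3) and (4), the SLD-style safety term can be sketched as below. This is an illustrative, hedged rendering rather than the released SLD implementation: the $\ominus$ condition is read as an element-wise difference, and the values of s_s and lam are placeholders, not the published configurations.</p>
        <preformat>
import torch

def sld_safety_term(eps_uncond, eps_prompt, eps_unsafe, s_s=1000.0, lam=0.025):
    """SLD-style safety guidance (Eqs. 3-4), sketched element-wise.
    s_s and lam are placeholder hyperparameters, not the published configs."""
    phi = s_s * (eps_prompt - eps_unsafe)
    # guidance is active only where the prompt- and unsafe-conditioned estimates are close
    active = (eps_prompt - eps_unsafe) &lt; lam
    # element-wise weight, clipped to 1 as discussed in Section 2.2
    mu = torch.where(active, torch.clamp(phi.abs(), max=1.0), torch.zeros_like(phi))
    # Eq. (3): negative guidance along the unsafe-concept direction
    return -mu * (eps_unsafe - eps_uncond)
</preformat>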
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Safety Considerations in T2I models</title>
        <p>
          This section delineates critical safety considerations for ensuring safe image generation in T2I models,
particularly highlighting the limitations inherent in the SLD framework evidenced by Eq. (4). First, the
guidance scale in SLD is clipped to 1, which restricts the model’s ability to provide adequate guidance
for highly unsafe prompts, even at its maximum strength setting. As shown in Fig. 2a (d) and (e), even
with maximal guidance (i.e., SLD-max), the generated images retain unsafe content. Moreover, as seen
in (e), while the images diverge significantly from the standard SD outputs, similar unsafe concepts
persist. This issue stems from the masking condition specified in Eq. (4), which permits guidance on
regions not closely related to the unsafe concepts. This effect is visible in the mask visualizations shown
in Fig. 2b, where increasing safe guidance strength spreads its influence across a broader area instead
of focusing on the precise regions associated with the unsafe concepts. Under SLD-max settings, this
dispersion can result in the generation of different yet equally unsafe images. In addition, $\phi$ increases
with the difference in noise estimates between the input prompt and the unsafe concept, resulting in
stronger guidance where they diverge. This contradicts the intuition that guidance should be stronger
in regions where the noise contains unsafe signals. Moreover, the method is applied inconsistently,
failing to adapt to the safety requirements of each prompt. As shown in Fig. 2a (a)-(d), the effectiveness
of safety configurations varies significantly across prompts. Based on these observations, we argue that
effective safety control requires evaluating the potential risk of unsafe content from the input prompt
and applying targeted guidance to relevant regions accordingly. To our knowledge, our work provides
the first in-depth analysis of the SLD framework, identifying why its performance, reported in previous
studies [
          <xref ref-type="bibr" rid="ref11 ref12 ref17">11, 17, 12</xref>
          ], frequently fails to address safety concerns adequately. Notably, no previous work
has investigated this limitation. In the following section, we introduce SP-Guard, which enables safe
image generation by applying precise, prompt-specific guidance at inference time.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. SP-Guard: Selective Prompt-adaptive Guidance</title>
        <p>
          To ensure safe image generation, it is crucial to estimate whether a prompt is likely to produce unsafe
content and to what extent before the final image is generated. We estimate the prompt’s potential
to produce unsafe content using a proxy derived from noise estimates during denoising. Since noise
estimates conditioned on texts contain semantic information [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], they are pivotal for safety assessments.
Prior works have successfully leveraged noise estimates for semantic control [
          <xref ref-type="bibr" rid="ref10 ref16 ref18 ref19">18, 16, 10, 19</xref>
          ]. Building
on this, we define the noise direction $\Delta_{c_p, t}$ for a given text prompt $p$ and timestep $t$ as follows:
$\Delta_{c_p, t} = \epsilon_\theta(z_t, c_p) - \epsilon_\theta(z_t, c_\varnothing), \quad (5)$
where $c_\varnothing$ is a null-text embedding. Intuitively, Eq. (5) tells us which direction the prompt is pushing the
image toward in semantic space. Then, we compute the cosine similarity between the noise direction
of a given text prompt $p$ and that of an unsafe concept $u$, i.e., $\mathrm{Sim}(\Delta_{c_p, t}, \Delta_{c_u, t})$ for each timestep $t$,
where $\mathrm{Sim}(\cdot, \cdot)$ denotes the cosine similarity between two vectors. This similarity measure serves as a
proxy for identifying potential unsafety in the generated images.
        </p>
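        <p>The noise direction of Eq. (5) and its cosine similarity to an unsafe concept's direction can be computed directly from the model's noise estimates. A minimal sketch, assuming the conditioned and null-text estimates are already available as tensors of shape (batch, channels, height, width):</p>
        <preformat>
import torch
import torch.nn.functional as F

def noise_direction(eps_cond, eps_null):
    """Eq. (5): semantic direction induced by a text condition, relative to the null text."""
    return eps_cond - eps_null

def direction_similarity(delta_prompt, delta_unsafe):
    """Cosine similarity between flattened prompt and unsafe-concept directions, per sample."""
    return F.cosine_similarity(delta_prompt.flatten(1), delta_unsafe.flatten(1), dim=1)
</preformat>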
        <p>We propose SP-Guard, which uses the similarity between noise estimates in the early diffusion steps
as a proxy for the prompt's unsafety level. Since different prompts can lead to varying degrees of unsafe
content, the guidance scale should be adjusted to reflect the severity of each prompt. Given an
unsafe-concept set $S = \{c_1, \dots, c_K\}$, SP-Guard first estimates the proxy value $\beta(c_p, c_S)$, which represents the
prompt-specific unsafe degree, during the earlier $T_e$ timesteps.</p>
        <p>
$\beta(c_p, c_S) = \max_{i \in \{1, \dots, K\}} \Big\{ \frac{1}{T - T_e + 1} \sum_{t = T_e}^{T} \mathrm{Sim}\big(\Delta_{c_p, t}, \Delta_{c_i, t}\big) \Big\}, \quad (6)$
where $\Delta_{c, t}$ is the noise direction introduced in Eq. (5), and $\mathrm{Sim}(\cdot, \cdot)$ is the cosine similarity. By estimating
how similar the direction of the given prompt $p$ is to that of the harmful concept set $S$, Eq. (6) serves
as a risk score that predicts how harmful a prompt is before the final image is generated. After the
initial $T_e$ timesteps, SP-Guard incorporates this risk score through a new guidance weight $\mu(c_p, c_S)$,
which is used in $\gamma(z_t, c_p, c_S)$ defined in Eq. (3). This weight controls both the strength and the spatial
positioning of the guidance at each timestep:
$\mu(c_p, c_S) = s(t) \cdot \beta^{+}(c_p, c_S) \cdot M(z_t, c_p, c_S), \quad (7)$
where $\beta^{+}(c_p, c_S) = \max(0, \beta(c_p, c_S))$ to ensure only non-negative contributions influence the
guidance. $s(t)$ is a pre-defined function of the timestep $t$, detailed later in this section. $M(z_t, c_p, c_S)$ acts as
a mask that selectively applies guidance to regions likely to contain unsafe content, thereby promoting
safe image generation. To elaborate on $M(z_t, c_p, c_S)$, we scale each mask value based on the pixel-wise
proxy value, similar to the $\mathrm{Sim}(\cdot, \cdot)$ function used in Eq. (6). Each pixel value of $M(z_t, c_p, c_S)$ is defined
as follows:
$M(z_t, c_p, c_S)[i, j, k] = \begin{cases} 1 + \max(0, |\psi|) &amp; \text{if } |\Delta_{c_S, t}[i, j, k]| &gt; \eta_{\lambda}(|\Delta_{c_S, t}|) \\ 0 &amp; \text{otherwise} \end{cases}, \quad \text{with } \psi = \mathrm{Sim}\big(\Delta_{c_p, t}[i, j, :], \Delta_{c_S, t}[i, j, :]\big), \quad (8)$
where $\eta_{\lambda}(|\Delta_{c_S, t}|)$ denotes the $\lambda$-percentile of $|\Delta_{c_S, t}|$.
        </p>
        <p>
          The masking condition is motivated by Brack et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], who showed that the noise space consists of
semantic concepts, with each concept concentrated in the upper and lower tails of the noise distribution.
Accordingly, we mask the top $\lambda$-percentile elements and compute cosine similarity for the corresponding
pixels. In Fig. 3, we visualize $M(z_t, c_p, c_S)$, showcasing how our mask design strategically applies safe
guidance to specific regions, such as nude body parts. In contrast, the masking process of SLD in Fig. 2b
spreads the guidance over unrelated areas of the image, often altering benign content unnecessarily.
This difference stems from the novelty of SP-Guard, which combines the prompt-adaptive risk score in
Eq. (6) with the selective masking in Eq. (8), enabling more precise and targeted guidance.
        </p>
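        <p>To make Eqs. (6) through (8) concrete, the sketch below computes the prompt-adaptive risk score, the selective mask, and the resulting guidance weight. It is an illustrative reading of the equations under the notation above (per-position cosine similarity over the channel dimension, a lam-percentile threshold on the unsafe direction), not the authors' released implementation.</p>
        <preformat>
import torch
import torch.nn.functional as F

def unsafe_proxy(delta_prompt_steps, delta_unsafe_steps):
    """Eq. (6): average the cosine similarity between the prompt direction and each
    unsafe-concept direction over the early steps, then take the max over concepts.
    delta_prompt_steps: list of (C, H, W) tensors, one per early timestep.
    delta_unsafe_steps: dict mapping each unsafe concept to a matching list of tensors."""
    scores = []
    for concept_deltas in delta_unsafe_steps.values():
        sims = [F.cosine_similarity(dp.flatten(), du.flatten(), dim=0)
                for dp, du in zip(delta_prompt_steps, concept_deltas)]
        scores.append(torch.stack(sims).mean())
    return torch.stack(scores).max()

def selective_mask(delta_prompt, delta_unsafe, lam=0.9):
    """Eq. (8): keep only elements where |delta_unsafe| exceeds its lam-percentile, and
    scale them by 1 + max(0, |psi|), the per-position prompt/unsafe similarity."""
    psi = F.cosine_similarity(delta_prompt, delta_unsafe, dim=0)   # shape (H, W)
    thresh = torch.quantile(delta_unsafe.abs(), lam)               # lam-percentile of |delta_unsafe|
    active = (delta_unsafe.abs() &gt; thresh).float()                 # shape (C, H, W)
    scale = 1.0 + psi.abs()                                        # 1 + max(0, |psi|)
    return active * scale.unsqueeze(0)                             # broadcast scale over channels

def sp_guard_weight(s_t, beta, mask):
    """Eq. (7): schedule s(t) times the non-negative risk score times the selective mask;
    this weight replaces mu in the safety term of Eq. (3)."""
    return s_t * torch.clamp(beta, min=0.0) * mask
</preformat>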
        <p>
          Lastly, we use a step function as the default form of $s(t)$ in Eq. (7). Specifically, after $T_e$ timesteps, $s(t)$
is set to $s_{\max}$ and subsequently reduced to 1.0 in the later steps. The reduction is essential to avoid
visual artifacts, as prior work [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] shows that no such artifacts or distortion occur when the guidance
scale is capped at 1.0. Furthermore, Yi et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and Balaji et al. [21] observe that text prompts primarily
influence the early diffusion steps, while the later stages focus on denoising and completing details
using the latent image itself. These observations support our design: safe guidance should be prominent
in the early steps, but does not require strong influence later on, and should be limited to preserve
image quality. We validate the effectiveness of the step function in Section 3 and also explore the impact
of varying $s_{\max}$ or using alternative scheduling strategies for $s(t)$.
        </p>
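        <p>A possible rendering of the default step-function schedule is sketched below. The exact step at which $s(t)$ drops from $s_{\max}$ to 1.0 is not pinned down in the text, so t_switch is an assumed placeholder; likewise, returning 0.0 during the proxy-estimation window is an assumption.</p>
        <preformat>
def step_schedule(step_index, t_e=10, t_switch=25, s_max=4.0):
    """Sketch of the default step-function s(t), indexed by the denoising step (0 = first step).
    t_switch is an assumed placeholder; the text only states that s(t) is later reduced to 1.0."""
    if step_index &lt; t_e:
        return 0.0      # proxy-estimation window (assumption: no extra safety guidance yet)
    if step_index &lt; t_switch:
        return s_max    # strong safety guidance right after the proxy is available
    return 1.0          # capped at 1.0 later to avoid visual artifacts
</preformat>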
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Experimental Setup</title>
        <p>
          We compare SP-Guard with the original Stable Diffusion (SD) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and inference-time guiding methods:
SD with a simple negative prompt (NEG), SEGA [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and four configurations of SLD [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Since many
existing methods [
          <xref ref-type="bibr" rid="ref7">7, 22, 23, 24, 25, 26, 27, 28</xref>
          ] report their results only on a single unsafe concept or
handle different unsafe categories separately, their practical applicability is somewhat limited in
multi-concept scenarios. Therefore, we primarily evaluate SP-Guard against SLD variants, as both address
multiple unsafe concepts concurrently through a unified guidance process, enabling a fair comparison.
We evaluate safe image generation on four datasets: I2P [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], Ring-A-Bell [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], MMA-Diffusion [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
and UnlearnDiff [29]. To assess image quality, we use DrawBench [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and COCO-30k [30], which
contain benign prompts. To assess the safety of generated images, we primarily report the unsafe
content detection rate and its relative improvement over SD. We use an average score across four safety
classifiers, MHSC [31], Q16 [32], NudeNet [33], and SD’s built-in Safety-Checker [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], to provide a
balanced estimate of overall harmfulness. To assess content preservation, we use LPIPS [34], which
measures perceptual similarity between images generated by SD and each method, thereby quantifying
how well the non-unsafe regions are retained. We also report CLIP-score [35] to evaluate image-text
alignment and FID [36] to assess image fidelity. We use SD v1.4 with 50 diffusion steps and default
settings across all baselines. For our method, $s_{\max} = 4.0$, $\lambda = 0.9$, and $T_e = 10$, unless specified otherwise.
        </p>
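        <p>For clarity, the safety metrics described above reduce to simple aggregation. The sketch below shows one plausible way to compute the averaged unsafe-detection rate and the relative improvement over SD; the exact aggregation formula is an assumption, not a quote from the evaluation code.</p>
        <preformat>
def average_unsafe_rate(flags_by_classifier):
    """Average unsafe-detection rate over the four classifiers (MHSC, Q16, NudeNet,
    Safety-Checker), given per-classifier lists of binary unsafe flags."""
    rates = [sum(flags) / len(flags) for flags in flags_by_classifier.values()]
    return sum(rates) / len(rates)

def relative_improvement(rate_method, rate_sd):
    """Relative improvement over plain SD in the averaged unsafe-detection rate."""
    return (rate_sd - rate_method) / rate_sd
</preformat>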
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Qualitative analysis</title>
        <p>The effectiveness of SP-Guard is demonstrated in Fig. 4. SP-Guard consistently generates safe images
where SD fails. For example, it adds clothing in prompts involving nudity and replaces excessive
blood in violent prompts with benign red elements. Unlike SLD, which applies guidance inconsistently,
SP-Guard achieves reliable and prompt-adaptive safety through proxy-based guidance. Moreover,
SP-Guard effectively confines guidance to areas identified as unsafe.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Quantitative results &amp; Analysis</title>
        <p>Evaluation results of safe image generation and content preservation across all datasets are shown in Table 1.</p>
        <p>As shown in the table, SP-Guard achieves safety performance comparable to SLD-max, ranking among the
top inference-time guiding methods. However, it significantly outperforms SLD-max in image preservation.
Notably, the LPIPS values of SP-Guard are comparable to those of baselines that exhibit minimal safety gains,
highlighting the effectiveness of our selective masking strategy.</p>
        <p>Figure 5: Trade-off between safety improvement and content preservation. Points further to the upper right
indicate safer image generation with better content preservation. Methods are distinguished by colors
(SLD-strong, SLD-max, SP-Guard) and datasets by shapes (I2P, Ring-A-Bell, MMA-Diffusion, UnlearnDiff).</p>
        <p>We further evaluate the image quality using FID and CLIP scores on COCO-30k and DrawBench. As
shown in the right-most columns of Table 1, SP-Guard achieves FID and CLIP scores closer to those of
the original SD, while maintaining superior or comparable safe generation performance to SLD-max.
Notably, SP-Guard outperforms the other baselines, except SLD-weak, which shows considerably lower
performance in safety. To illustrate the trade-off between safety and content preservation, Fig. 5 shows
the results for the top-performing methods: SLD-strong, SLD-max, and SP-Guard. The y-axis represents
the average relative improvement over SD in unsafe detection rates, while the reversed x-axis shows
the LPIPS, indicating perceptual similarity to images generated by SD. Points closer to the upper right
indicate a better trade-off between safety and content preservation. SP-Guard consistently achieves
lower LPIPS scores than SLD-max and SLD-strong (except against SLD-strong on Ring-A-Bell), showing
that the generated images by SP-Guard remain closer to the original SD outputs while ensuring safety.</p>
        <p>We vary the maximum guidance scale $s_{\max}$ from 2.0 to 6.0 in increments of 0.5, and also evaluate a
cosine-based schedule as an alternative to the step function. Fig. 6 shows the results in the same format
as the trade-off plot. SP-Guard consistently aligns with the Pareto front, showing robust performance
across different $s_{\max}$ values and scheduling strategies.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion &amp; Conclusion</title>
      <p>This work highlights the importance of accurately estimating the potential harmfulness of generated
content. Moreover, as SP-Guard is an inference-time guiding approach, it allows flexible modification or
addition of unsafe concepts without retraining. Such adaptability enables rapid alignment with evolving
social norms and regulations [37], making the method practical for real-world moderation pipelines
and dynamic regulatory environments. Moreover, since our method relies on the general mechanism of
guidance and the similarity between the intended semantics and harmful concepts, the framework can
be naturally extended to other modalities such as video or speech generation. Beyond improving safety,
our work strengthens the trustworthiness of generative AI systems in two ways. First, by diagnosing the
failure modes of prior approaches, we emphasize the importance of carefully designing both the guidance
mechanism and the masking process. Second, by estimating the prompt-specific potential harmfulness,
SP-Guard offers transparency and controllability: users and deployers can see when and why safety
interventions are applied. These features enhance trustworthiness rather than merely increasing safety.
However, operating at inference time introduces some slowdown compared to standard SD. This could
be mitigated by integrating recent advances in accelerating diffusion models [38, 39]. Finally, although
SP-Guard reduces hyperparameter complexity compared to SLD, it still requires tuning values such as
$s(t)$ and $\lambda$. A promising future direction is to dynamically adjust $s(t)$ based on the guidance signal
at each timestep. Despite these limitations, SP-Guard provides a lightweight, adaptable, and selective
inference-time approach for safer text-to-image generation. Experiments on four unsafe-related datasets
demonstrate significant improvements in safe generation with strong content preservation, while results
on two benign datasets confirm its ability to maintain high fidelity. Looking ahead, we believe SP-Guard
can be further enhanced and integrated with advancements in diffusion models, paving the way toward
safe, responsible, and trustworthy AI.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the National Research Foundation of Korea (NRF) grant
[No.2021R1A2C2007884] and by Institute of Information &amp; communications Technology Planning
&amp; Evaluation (IITP) grants [RS-2021-II211343, RS-2021-II212068, RS-2022-II220113, RS-2022-II220959]
funded by the Korean government (MSIT).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used Grammarly in order to check grammar and spelling.
After using this tool, the author reviewed and edited the content as needed and takes full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>References (continued)</title>
      <p>
[21] Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, Q. Zhang, K. Kreis, M. Aittala, T. Aila, S. Laine,
et al., eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers, arXiv preprint
arXiv:2211.01324 (2022).
[22] A. Heng, H. Soh, Selective amnesia: A continual learning approach to forgetting in deep generative
models, Advances in Neural Information Processing Systems 36 (2023) 17170–17194.
[23] S. Kim, S. Jung, B. Kim, M. Choi, J. Shin, J. Lee, Safeguard text-to-image diffusion models with
human feedback inversion, arXiv preprint arXiv:2407.21032 (2024).
[24] S. Lu, Z. Wang, L. Li, Y. Liu, A. W.-K. Kong, Mace: Mass concept erasure in diffusion models, in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp.
6430–6440.
[25] X. Li, Y. Yang, J. Deng, C. Yan, Y. Chen, X. Ji, W. Xu, Safegen: Mitigating unsafe content generation
in text-to-image models, arXiv e-prints (2024) arXiv–2404.
[26] D. Chen, Z. Li, M. Fan, C. Chen, W. Zhou, Y. Li, Eiup: A training-free approach to erase
non-compliant concepts conditioned on implicit unsafe prompts, arXiv preprint arXiv:2408.01014
(2024).
[27] C. Gong, K. Chen, Z. Wei, J. Chen, Y.-G. Jiang, Reliable and efficient concept erasure of
text-to-image diffusion models, in: European Conference on Computer Vision, Springer, 2024, pp.
73–88.
[28] J. Yoon, S. Yu, V. Patil, H. Yao, M. Bansal, Safree: Training-free and adaptive guard for safe
text-to-image and video generation, arXiv preprint arXiv:2410.12761 (2024).
[29] Y. Zhang, J. Jia, X. Chen, A. Chen, Y. Zhang, J. Liu, K. Ding, S. Liu, To generate or not?
safety-driven unlearned diffusion models are still easy to generate unsafe images... for now, in: European
Conference on Computer Vision, 2024.
[30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft
coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
[31] Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, Y. Zhang, Unsafe diffusion: On the generation of
unsafe images and hateful memes from text-to-image models, in: Proceedings of the 2023 ACM
SIGSAC Conference on Computer and Communications Security, 2023, pp. 3403–3417.
[32] P. Schramowski, C. Tauchmann, K. Kersting, Can machines help us answering question 16 in
datasheets, and in turn reflecting on inappropriate content?, in: Proceedings of the 2022 ACM
Conference on Fairness, Accountability, and Transparency, 2022, pp. 1350–1361.
[33] P. Bedapudi, Nudenet: Neural nets for nudity classification, detection and selective censoring,
2019.
[34] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep
features as a perceptual metric, in: CVPR, 2018.
[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in:
International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[36] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two time-scale
update rule converge to a local nash equilibrium, Advances in neural information processing
systems 30 (2017).
[37] I. Solaiman, Z. Talat, W. Agnew, L. Ahmad, D. Baker, S. L. Blodgett, C. Chen, H. Daumé III, J. Dodge,
I. Duan, et al., Evaluating the social impact of generative ai systems in systems and society, arXiv
preprint arXiv:2306.05949 (2023).
[38] A. Habibian, A. Ghodrati, N. Fathima, G. Sautiere, R. Garrepalli, F. Porikli, J. Petersen, Clockwork
diffusion: Efficient generation with model-step distillation, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2024, pp. 8352–8361.
[39] Y.-H. Chen, R. Sarokin, J. Lee, J. Tang, C.-L. Chang, A. Kulik, M. Grundmann, Speed is all you need:
On-device acceleration of large difusion models via gpu-aware optimizations, in: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4651–4655.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Saharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghasemipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Gontijo</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Karagol</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          , et al.,
          <article-title>Photorealistic text-to-image diffusion models with deep language understanding</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>36479</fpage>
          -
          <lpage>36494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          , et al.,
          <article-title>Laion-5b: An open large-scale dataset for training next generation image-text models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>25278</fpage>
          -
          <lpage>25294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Birhane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. U.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          , E. Kahembwe,
          <article-title>Multimodal datasets: misogyny, pornography, and malignant stereotypes</article-title>
          ,
          <source>arXiv preprint arXiv:2110</source>
          .
          <year>01963</year>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Tsai</surname>
          </string-name>
          , C.-Y. Hsu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Lin</surname>
            ,
            <given-names>J.-Y.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>P.-Y.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-M. Yu</surname>
          </string-name>
          , C.-Y. Huang,
          <article-title>Ring-a-bell! how reliable are concept removal methods for diffusion models?</article-title>
          ,
          <source>arXiv preprint arXiv:2310.10012</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , T.-Y. Ho,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Mma-diffusion: Multimodal attack on diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>7737</fpage>
          -
          <lpage>7746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Guardt2i: Defending text-to-image models from adversarial prompts</article-title>
          ,
          <source>arXiv preprint arXiv:2403.01446</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khakzar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pizzati</surname>
          </string-name>
          ,
          <article-title>Latent guard: a safety framework for text-to-image generation</article-title>
          ,
          <source>arXiv preprint arXiv:2404.08031</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Backes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zannettou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Unsafebench:
          <article-title>Benchmarking image safety classifiers on real-world and ai-generated images</article-title>
          ,
          <source>arXiv preprint arXiv:2405.03486</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schramowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deiseroth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kersting</surname>
          </string-name>
          ,
          <article-title>Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>22522</fpage>
          -
          <lpage>22531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gandikota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Materzynska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fiotto-Kaufman</surname>
          </string-name>
          , D. Bau,
          <article-title>Erasing concepts from diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>2426</fpage>
          -
          <lpage>2436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gandikota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Orgad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Materzyńska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bau</surname>
          </string-name>
          ,
          <article-title>Unified concept editing in diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>5111</fpage>
          -
          <lpage>5120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Wang,
          <article-title>Universal prompt optimizer for safe text-to-image generation</article-title>
          ,
          <source>arXiv preprint arXiv:2402.10882</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <article-title>Generative modeling by estimating gradients of the data distribution</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          , T. Salimans,
          <article-title>Classifier-free diffusion guidance</article-title>
          ,
          <source>arXiv preprint arXiv:2207.12598</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Brack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hintersdorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Struppek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schramowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kersting</surname>
          </string-name>
          , Sega:
          <article-title>Instructing text-to-image models using semantic guidance</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>25365</fpage>
          -
          <lpage>25389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chavhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hospedales</surname>
          </string-name>
          , Conceptprune:
          <article-title>Concept editing in diffusion models via skilled neuron pruning</article-title>
          ,
          <source>arXiv preprint arXiv:2405.19237</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dalva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yanardag</surname>
          </string-name>
          ,
          <article-title>Noiseclr: A contrastive learning approach for unsupervised discovery of interpretable directions in diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>24209</fpage>
          -
          <lpage>24218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Brack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kornmeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tsaban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schramowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kersting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          , Ledits++:
          <article-title>Limitless image editing using text-to-image models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>8861</fpage>
          -
          <lpage>8870</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Towards understanding the working mechanism of text-to-image diffusion model</article-title>
          ,
          <source>arXiv preprint arXiv:2405.15330</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>