ICML PUT Workshop 2025

JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models

ETH Zürich
JEDI teaser

Abstract

We introduce JEDI, a test-time adaptation method that improves subject separation and compositional alignment in diffusion models without retraining or external supervision. JEDI minimizes semantic entanglement in attention maps using a novel Jensen–Shannon divergence-based objective. To improve efficiency, we leverage adversarial optimization, reducing the number of update steps required. JEDI is model-agnostic and applies to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. In addition, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions.

Why do modern Text-to-Image models fail?


Issues in text-to-image models: Modern text-to-image models often struggle with prompts involving multiple subjects due to entangled attention maps. These maps serve as the model’s conditioning mechanism and can be viewed as spatial probability distributions that indicate where each subject is expected to appear in the image. When these distributions overlap, the model may produce blended features (attribute mixing), omit subjects entirely, or misplace them in the scene. JEDI addresses these issues by ensuring that each subject is represented, subjects remain spatially distinct, and all attributes of a subject stay correctly grouped.
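To make the idea above concrete, a hedged sketch of a JSD-based separation objective is shown below. The exact JEDI loss is defined in the paper; here, `entanglement_loss` and its form (one minus the mean pairwise Jensen–Shannon divergence over per-subject attention maps) are illustrative assumptions, not the paper's implementation. Each attention map is normalized into a discrete spatial distribution, and pushing pairwise JSD toward its maximum drives the subjects' distributions apart:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.
    Base-2 logarithm, so the result lies in [0, 1]."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entanglement_loss(attn_maps):
    """Illustrative separation loss: 1 - mean pairwise JSD over subject
    attention maps. 0 means fully disjoint subjects, 1 means fully
    overlapping (entangled) subjects.
    attn_maps: list of (H, W) non-negative arrays, one per subject token."""
    flat = [a.ravel() for a in attn_maps]
    pairs = [(i, j) for i in range(len(flat)) for j in range(i + 1, len(flat))]
    return 1.0 - np.mean([jsd(flat[i], flat[j]) for i, j in pairs])
```

At test time, such a loss would be minimized by gradient steps on the latent (using an autograd framework such as PyTorch rather than NumPy), so that each subject's attention mass settles on a distinct spatial region. The same pairwise-JSD quantity, read out rather than optimized, can also serve as a CLIP-free disentanglement score.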

WORK IN PROGRESS🚧

The rest of the website will be live soon!

Qualitative Comparison Stable Diffusion 3.5

Qualitative Comparison Stable Diffusion 1.5


Stable Diffusion 1.5 vs. CONFORM vs. JEDI: JEDI achieves clearer subject separation and better compositional alignment than CONFORM while requiring significantly fewer optimization steps (18 vs. 69). Unlike CONFORM, whose high learning rate introduces noticeable stylistic drift, JEDI preserves the visual style of the base model and delivers faster, more stable test-time adaptation.

BibTeX

@misc{bill2025jedi,
      title={JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models}, 
      author={Eric Tillmann Bill and Enis Simsar and Thomas Hofmann},
      year={2025},
      eprint={2505.19166},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}