Oral · CVPR 2026 — CVEU Workshop

FOCUS:
Optimal Control for Multi-Entity World Modeling in Text-to-Image Generation

ETH Zürich
Teaser figure panels: Base Models; Base Models + FOCUS (Ours)

Optimal control improves coherent multi-subject generation in flow matching models. Using FOCUS at test time or via fine-tuning yields faithful compositions with correct attribute binding, minimal leakage, and no omissions, while preserving base style.

Abstract

Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-entity scenes, often exhibiting attribute leakage, identity entanglement, and subject omissions. We present a principled theoretical framework that steers sampling toward multi-subject fidelity by casting flow matching (FM) as stochastic optimal control (SOC), yielding a single-hyperparameter trade-off between fidelity and object-centric state separation / binding consistency. Within this framework, we derive two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow–diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. In addition, we introduce FOCUS (Flow Optimal Control for Unentangled Subjects), a probabilistic attention-binding objective compatible with both algorithms. Empirically, on Stable Diffusion 3.5 and FLUX.1, both algorithms consistently improve multi-subject alignment while maintaining base-model style; test-time control runs efficiently on commodity GPUs, and fine-tuned models generalize to unseen prompts.

Highlights

🧭 Principled framework

Casts flow matching as stochastic optimal control, giving a single λ that trades off fidelity vs. subject disentanglement.

⚑ Test-time controller

Training-free: perturbs the base velocity with one extra pass. Runs on commodity GPUs (≈12 GB VRAM).

🪶 Adjoint Matching

Lightweight LoRA fine-tuning (<0.1% of weights) that regresses to a backward adjoint signal — zero inference-time overhead.

🎯 FOCUS loss

A probabilistic Jensen–Shannon objective over cross-attention distributions, scaling to any number of subjects.

🧩 Model-agnostic

Works with SD 3.5, FLUX.1, SDXL; unifies prior attention heuristics under one optimization view.

πŸ† State-of-the-art

Best composite scores and highest human-preference win rates on both backbones, at test time and after fine-tuning.

Method

We view the sampler of a flow-matching model as a controlled SDE and ask: what is the minimum perturbation that fixes multi-subject generation? Solving this with stochastic optimal control yields two architecture-agnostic recipes — a test-time controller (single-pass velocity correction) and Adjoint Matching (offline LoRA fine-tuning of a control network against a backward adjoint signal). Both are driven by the same running cost f, plug into off-the-shelf samplers, and unify prior attention heuristics (Attend&Excite, CONFORM, Divide&Bind) as special cases.
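To make the idea of a velocity correction concrete, here is a minimal sketch of one possible reading: each Euler step adds a control term proportional to the negative gradient of a running cost, scaled by λ. The velocity field, the quadratic cost, and the names `base_velocity`, `cost_grad`, and `sample` are toy stand-ins chosen for illustration; the page does not give the exact form of the controller, so this should not be read as the authors' update rule.

```python
import numpy as np

def base_velocity(x, t):
    """Toy stand-in for a pretrained flow-matching velocity field (hypothetical)."""
    return -x  # pulls every state toward the origin

def cost_grad(x, target):
    """Gradient of the toy running cost f(x) = 0.5 * ||x - target||^2."""
    return x - target

def running_cost(x, target):
    """Toy running cost; a real system would use an attention-based cost instead."""
    return 0.5 * float(np.sum((x - target) ** 2))

def sample(x0, target, lam, n_steps=50):
    """Euler integration of dx/dt = v(x, t) - lam * grad f(x) from t=0 to t=1."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        control = -lam * cost_grad(x, target)  # one extra gradient evaluation per step
        x = x + dt * (base_velocity(x, t) + control)
    return x

x0 = np.array([2.0, -1.0])
target = np.array([1.0, 1.0])
x_base = sample(x0, target, lam=0.0)  # lam = 0 recovers the uncontrolled sampler
x_ctrl = sample(x0, target, lam=2.0)  # lam > 0 trades base dynamics for lower cost
print(running_cost(x_base, target), running_cost(x_ctrl, target))
```

Setting `lam=0` reproduces the base trajectory exactly, which is the sense in which a single scalar can trade fidelity to the base model against the running cost in this toy setup.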

For the running cost, we exploit the fact that cross-attention maps from image tokens to subject tokens are probability distributions over spatial locations. FOCUS therefore combines a within-subject consistency term and a between-subject separation term, both expressed as normalized Jensen–Shannon divergences. The result is a loss that scales to an arbitrary number of subjects while encouraging unimodal, localized, and non-overlapping attention.
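The two terms can be illustrated with the generalized Jensen-Shannon divergence of N distributions, which is bounded by log N and can therefore be normalized to [0, 1]. The sketch below is an assumption-laden illustration, not the FOCUS loss itself: how maps are grouped per subject, how the two terms are weighted, and the helper names `normalized_jsd` and `focus_like_cost` are all hypothetical.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return -float(np.sum(p * np.log(p)))

def normalized_jsd(dists):
    """Generalized Jensen-Shannon divergence of N distributions with uniform
    weights, divided by log(N) so the value lies in [0, 1]."""
    dists = np.asarray(dists, dtype=float)
    dists = dists / dists.sum(axis=1, keepdims=True)
    n = dists.shape[0]
    mixture = dists.mean(axis=0)
    jsd = entropy(mixture) - np.mean([entropy(p) for p in dists])
    return jsd / np.log(n)

def focus_like_cost(groups):
    """Hypothetical combination: low divergence within a subject's maps,
    high divergence between the per-subject mean maps."""
    subject_means = [np.mean(g, axis=0) for g in groups]
    separation = 1.0 - normalized_jsd(subject_means)
    within = [normalized_jsd(g) for g in groups if len(g) > 1]
    consistency = float(np.mean(within)) if within else 0.0
    return consistency + separation

# Flattened 4-pixel attention maps, two maps per subject (toy values).
dachshund = np.array([[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]])
corgi = np.array([[0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.5, 0.5]])
overlap = np.full((2, 4), 0.25)

print(focus_like_cost([dachshund, corgi]))  # disjoint, consistent maps
print(focus_like_cost([overlap, overlap]))  # fully overlapping maps
```

Because the divergence is defined for any number of distributions at once, the same function handles two subjects or ten, which matches the scaling property described above.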

Figure panels: cross-attention map for 'dachshund', cross-attention map for 'corgi', and the generated image.

Per-subject cross-attention maps extracted from FLUX.1 [dev] for the prompt “A dachshund and a corgi sitting together on a cozy rug”. FOCUS shapes these distributions to be localized and disjoint.

Test-Time Control — Qualitative Results

All heuristics share the same sampler, seed, and prompt — only the SOC running cost changes. Each method is shown at its optimal λ.

Columns: Base, Attend&Excite, CONFORM, Divide&Bind, Self-Cross Guidance, FOCUS (Ours)
Rows (backbone: prompt):
SD 3.5: “An astronaut, a violin, and a sunflower floating inside a space station”
FLUX.1: “A swan, a goose, and a duck drifting past lily pads”
SD 3.5: “A fox, a lantern, and a teapot in a misty forest clearing”
FLUX.1: “A quartz crystal, an amethyst, and a citrine displayed on black velvet”

Fine-Tuned (Adjoint Matching) — Qualitative Results

Each heuristic is instantiated as the running cost during Adjoint-Matching fine-tuning with rank-4 LoRA. At inference, the controller adds no extra cost.

Columns: Base, Attend&Excite, CONFORM, Divide&Bind, Self-Cross Guidance, FOCUS (Ours)
Rows (backbone: prompt):
SD 3.5: “A Siberian Husky, an Alaskan Malamute, and a Samoyed trotting through fresh snow”
SD 3.5: “A magician, a white rabbit, and a deck of cards on a velvet stage”
FLUX.1: “A macaw, a cockatoo, and an Amazon parrot perched on a jungle vine”
FLUX.1: “A jellyfish, a seashell, and a glass bottle drifting in turquoise water”
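For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single linear layer and one common way an adapter can end up with no inference-time cost: folding it back into the frozen weight. Only the rank of 4 comes from the text above; the layer sizes, the scaling `alpha`, and the assumption that the adapter is merged after training are illustrative choices, not details confirmed by the page.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 4, 4.0  # toy sizes; only rank = 4 is from the text

W = rng.normal(size=(d_out, d_in))             # frozen base weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, rank):
    """Base output plus the scaled low-rank correction."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

def merge(W, A, B, alpha, rank):
    """Fold the adapter into a single dense weight of the original shape."""
    return W + (alpha / rank) * (B @ A)

x = rng.normal(size=(8, d_in))

# With B = 0 the adapted layer starts out identical to the base layer.
starts_at_base = np.allclose(lora_forward(x, W, A, B, alpha, rank), x @ W.T)

# After (simulated) training, the merged weight reproduces the adapted layer.
B_trained = rng.normal(scale=0.01, size=(d_out, rank))
W_merged = merge(W, A, B_trained, alpha, rank)
merged_matches = np.allclose(
    x @ W_merged.T, lora_forward(x, W, A, B_trained, alpha, rank)
)

trainable_fraction = rank * (d_in + d_out) / (d_in * d_out)  # for this toy layer only
print(starts_at_base, merged_matches, trainable_fraction)
```

The trainable fraction printed here applies to this one toy layer; the fraction for a full model depends on the real layer widths and on which layers receive adapters.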

Quantitative Results

Mean over a 150-prompt corpus (2–4 subjects per prompt) with five seeds. Gold / silver / bronze mark the top-three values per metric. Composite = macro-average of baseline-relative gains across all metrics.

Test-time control

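| Model | Heuristic | CLIP I-T ↑ | SigLIP-2 I-T ↑ | BLIP T-T ↑ | Qwen2 T-T ↑ | PickScore ↑ | ImageReward ↑ | Composite ↑ |
|---|---|---|---|---|---|---|---|---|
| SD 3.5 | Base | 0.3474 | 0.2309 | 0.5731 | 0.6402 | 22.694 | 1.3175 | 0.000 |
| SD 3.5 | Attend&Excite | 0.3484 | 0.2326 | 0.5752 | 0.6404 | 22.695 | 1.3545 | 3.171 |
| SD 3.5 | CONFORM | 0.3481 | 0.2323 | 0.5773 | 0.6421 | 22.719 | 1.3684 | 3.434 |
| SD 3.5 | Divide&Bind | 0.3489 | 0.2316 | 0.5742 | 0.6399 | 22.678 | 1.3493 | 3.937 |
| SD 3.5 | FOCUS (Ours) | 0.3483 | 0.2344 | 0.5751 | 0.6385 | 22.750 | 1.4003 | 4.287 |
| FLUX.1 | Base | 0.3449 | 0.2271 | 0.5739 | 0.6300 | 23.423 | 1.2970 | 0.000 |
| FLUX.1 | Attend&Excite | 0.3430 | 0.2242 | 0.5716 | 0.6304 | 23.255 | 1.2494 | 1.760 |
| FLUX.1 | CONFORM | 0.3436 | 0.2252 | 0.5726 | 0.6321 | 23.357 | 1.2461 | 1.511 |
| FLUX.1 | Divide&Bind | 0.3453 | 0.2272 | 0.5722 | 0.6330 | 23.439 | 1.2939 | 1.635 |
| FLUX.1 | FOCUS (Ours) | 0.3446 | 0.2268 | 0.5741 | 0.6326 | 23.427 | 1.2913 | 1.971 |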

Fine-tuning (Adjoint Matching)

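| Model | Heuristic | CLIP I-T ↑ | SigLIP-2 I-T ↑ | BLIP T-T ↑ | Qwen2 T-T ↑ | PickScore ↑ | ImageReward ↑ | Composite ↑ |
|---|---|---|---|---|---|---|---|---|
| SD 3.5 | Base | 0.3474 | 0.2309 | 0.5731 | 0.6402 | 22.694 | 1.3175 | 0.000 |
| SD 3.5 | Attend&Excite | 0.3469 | 0.2281 | 0.5747 | 0.6425 | 22.843 | 1.4460 | 5.718 |
| SD 3.5 | CONFORM | 0.3478 | 0.2294 | 0.5646 | 0.6393 | 22.596 | 1.3782 | 3.458 |
| SD 3.5 | Divide&Bind | 0.3486 | 0.2266 | 0.5870 | 0.6358 | 22.340 | 1.3524 | 0.801 |
| SD 3.5 | FOCUS (Ours) | 0.3495 | 0.2331 | 0.5744 | 0.6383 | 22.645 | 1.4495 | 5.917 |
| FLUX.1 | Base | 0.3449 | 0.2271 | 0.5739 | 0.6300 | 23.423 | 1.2970 | 0.000 |
| FLUX.1 | Attend&Excite | 0.3468 | 0.2320 | 0.5876 | 0.6382 | 23.333 | 1.3806 | 2.348 |
| FLUX.1 | CONFORM | 0.3458 | 0.2305 | 0.5800 | 0.6369 | 23.372 | 1.3631 | 1.959 |
| FLUX.1 | Divide&Bind | 0.3445 | 0.2296 | 0.5705 | 0.6246 | 23.191 | 1.2269 | 0.200 |
| FLUX.1 | FOCUS (Ours) | 0.3468 | 0.2328 | 0.5780 | 0.6386 | 23.328 | 1.3899 | 2.588 |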

Human Preference Study

Pairwise preference study with 50 participants and 2,000 comparisons. FOCUS attains the highest win rate on both backbones in both settings. Its Elo is the highest for test-time control on FLUX.1 and for fine-tuning on SD 3.5.

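| Setting | Heuristic | SD 3.5 Win % | SD 3.5 Elo ↑ | FLUX.1 Win % | FLUX.1 Elo ↑ |
|---|---|---|---|---|---|
| Test-Time | Base | 45% | 1517 | 46% | 1464 |
| Test-Time | Attend&Excite | 53% | 1500 | 49% | 1526 |
| Test-Time | CONFORM | 42% | 1373 | 50% | 1498 |
| Test-Time | Divide&Bind | 50% | 1562 | 50% | 1450 |
| Test-Time | FOCUS (Ours) | 58% | 1548 | 54% | 1562 |
| Fine-tuning | Base | 39% | 1355 | 51% | 1462 |
| Fine-tuning | Attend&Excite | 56% | 1584 | 50% | 1476 |
| Fine-tuning | CONFORM | 49% | 1520 | 50% | 1620 |
| Fine-tuning | Divide&Bind | 48% | 1436 | 43% | 1442 |
| Fine-tuning | FOCUS (Ours) | 57% | 1605 | 54% | 1500 |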
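As background on how Elo ratings can be derived from pairwise preferences, the sketch below applies the standard Elo update to a short list of comparisons. The starting rating of 1500, the K-factor of 32, the sequential update order, and the toy comparison list are all assumptions; the page does not state how the study's ratings were computed.

```python
def expected_score(r_a, r_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise outcome; the rating change is zero-sum."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)
    ratings[winner] += delta
    ratings[loser] -= delta

# Toy comparisons as (preferred, rejected); not data from the study.
ratings = {"Base": 1500.0, "FOCUS": 1500.0}
comparisons = [("FOCUS", "Base"), ("FOCUS", "Base"), ("Base", "FOCUS")]
for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

print(ratings)
```

Elo depends on whom each method was compared against, not only on how often it won, so a method can have the highest win rate without having the highest Elo.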

Data Efficiency of Fine-Tuning

Composite scores when training FOCUS on subsets of the corpus. A single training prompt already yields strong gains — the SOC controller learns a broadly useful disentangling direction from very limited supervision.

| Model | 1 prompt | 15 prompts | 150 prompts |
|---|---|---|---|
| SD 3.5 | 5.917 | 3.191 | 1.682 |
| FLUX.1 | 2.588 | 2.457 | 1.810 |

BibTeX

@inproceedings{bill2026focus,
  title     = {FOCUS: Optimal Control for Multi-Entity World Modeling in Text-to-Image Generation},
  author    = {Eric Tillmann Bill and Enis Simsar and Thomas Hofmann},
  booktitle = {CVPR 2026 Workshop on Computer Vision for Entertainment and Universal Media (CVEU)},
  year      = {2026}
}