Decoupling image time t and text time τ turns inference into trajectory selection in a 2D generative space. Fixing one modality recovers the conditional task; evolving both yields joint sampling.
Fix τ=1 (clean text). The model evolves only the image timestep — recovering high-fidelity text-to-image generation while preserving the pretrained image prior.
Fix t=1 (clean image). A discrete Edit-Flow insertion process grows a caption token-by-token in T5 vocabulary space — no autoregressive decoder needed.
Evolve t and τ together. The same backbone produces coherent image–text pairs from pure noise — co-denoising along any trajectory in the unit square.
Restrict the insertion process to a target span. A decoder-free interface for (image, question) → answer that uses exactly the same inference procedure.
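The four modes above differ only in which path they trace through the (t, τ) unit square. A minimal sketch of that idea, assuming the convention stated above (1 = clean, 0 = pure noise) and a uniform step grid; the function name `trajectory`, the mode strings, and the diagonal path for joint sampling are illustrative choices, not the released interface:

```python
from typing import List, Tuple


def trajectory(mode: str, n_steps: int) -> List[Tuple[float, float]]:
    """Return (t, tau) waypoints; 0 = pure noise, 1 = clean."""
    if n_steps < 1:
        raise ValueError("n_steps must be >= 1")
    ramp = [i / n_steps for i in range(n_steps + 1)]  # 0.0 ... 1.0
    if mode == "text_to_image":
        # Text stays clean (tau = 1); only the image time evolves.
        return [(s, 1.0) for s in ramp]
    if mode == "image_to_text":
        # Image stays clean (t = 1); only the text time evolves.
        return [(1.0, s) for s in ramp]
    if mode == "joint":
        # Both evolve. The diagonal is an assumed example; any monotone
        # path from (0, 0) to (1, 1) would be another valid choice.
        return [(s, s) for s in ramp]
    raise ValueError(f"unknown mode: {mode}")
```

Partial-text prediction would reuse the `image_to_text` path, with insertions limited to the target span.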
Modern text-to-image diffusion models encode rich visual priors but expose them only through one-way, text-conditioned generation. Existing unified vision–language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes.
We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint sampling, and partial-text prediction with a single backbone.
On Stable Diffusion 3, with matched LoRA rank and a near-identical trainable-parameter count, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4 over a LoRA equivalent of the previous SOTA (Dual Diffusion) at matched wall-clock time. It also reduces peak VRAM from ~84 GB to ~38 GB and raises throughput by ~8×, training in under 24h on two RTX A5000 GPUs and updating only ~5% of the backbone. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation.
Convert a pretrained text-to-image rectified-flow model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads, while preserving the pretrained image prior.
Keep images in the native continuous flow, add a discrete insertion process for text, and decouple image and text timesteps so conditional, joint, and partial-text generation become different trajectories in a single (t,τ) space.
On SD3 and FLUX.1-dev, outperform a matched Dual Diffusion-LoRA baseline on text→image retention and image→text quality, and support downstream VQA, all while training on two (SD3) or three (FLUX.1-dev) commodity GPUs.
Images live in the native continuous rectified-flow latent space (image time t). Text lives in a discrete insertion process built on Edit Flows, with its own text time τ. A single shared backbone consumes both together — making the two modalities mutually informative at every forward pass.
We construct a controlled, matched baseline (DualDiff.-LoRA) on the same SD3 backbone, same LoRA rank, same data, same optimizer, and nearly identical trainable parameter count (136.8M vs. 136.1M). The two methods differ only in their cross-modal interface.
| Method | Text → Image FID ↓ | Text → Image CMMD ↓ | Image → Text CIDEr ↑ | Image → Text BERT-F1 ↑ |
|---|---|---|---|---|
| DualDiff.-LoRA | 62.65 | 0.870 | 2.0 | −0.27 |
| FullFlow step-matched | 29.32 | 0.021 | 13.4 | 0.02 |
| FullFlow time-matched | 31.57 | 0.121 | 99.4 | 0.44 |
FID and CMMD are measured against the frozen base SD3 (prior preservation). CIDEr and BERT-F1 are measured against held-out Moondream captions on a 5k validation split.
Captions generated by FullFlow-SD3 via the discrete insertion process. The whole captioning interface is added with ~136M trainable parameters on top of frozen SD3 — no autoregressive decoder, no joint pretraining.
To verify the recipe is not specific to SD3, we apply it to FLUX.1-dev with minimal architecture-specific changes, sharding the 22 GB model across three RTX A5000 GPUs. After 40k steps under the π_ind schedule followed by 20k steps under the alternating-clean schedule π_ac (~36h total), CIDEr reaches 73.6 while the text→image prior holds at FID 24.7 / CMMD 0.014.
A central worry: doesn't adding captioning break the original image generator? We design a same-noise teacher-matching objective — the frozen base model (just the LoRA-disabled student) acts as a teacher, requiring no extra memory. The result: the FullFlow finetune still generates high-quality images at identical prompts & seeds. Compare the SD3 base model vs. our FullFlow-SD3 finetune below.
Prompts: "A black cat on grass" · "A brown dog in the snow" · "A horse running in a field" · "A beautiful landscape with mountains" · "A futuristic city skyline at sunset"
Prompts: "A black cat and a brown dog playing together" · "A horse and a bear in a forest" · "A sports car driving on a mountain road" · "A kitchen with a chocolate cake on the table" · "An astronaut riding a horse on the moon"
Restricting the insertion process to a target span gives a decoder-free interface for (image, question) → answer. At 10–50× fewer trainable parameters than every baseline and at half the resolution of the strongest competitor, FullFlow attains the highest VizWiz accuracy (50.7) in our table and remains competitive on VQAv2 and COCO captioning.
| Model | Trainable | COCO ↑ | VQAv2 ↑ | VizWiz ↑ | OKVQA ↑ |
|---|---|---|---|---|---|
| CM3Leon | 7B | 61.6 | 47.6 | 37.6 | 23.8 |
| Chameleon | 7B | 18.0 | — | — | — |
| LWM | 7B | — | 55.8 | 11.6 | — |
| Show-O (256) | 1.3B | — | 64.7 | — | — |
| Show-O (512) | 1.3B | — | 69.4 | — | — |
| Transfusion | 7B | 29.0 | — | — | — |
| DualDiff (256) | 2B | — | 59.5 | 19.4 | 28.5 |
| DualDiff (512) | 2B | 56.2 | 60.1 | 29.9 | 25.3 |
| FullFlow (Ours, 256) | 130M | 54.9 | 56.5 | 50.7 | 23.3 |
Keep images in the pretrained continuous flow. Add a discrete Edit-Flow insertion process for text in T5 tokenizer space: every intermediate state is a valid token sequence that can be decoded to text and passed to the frozen CLIP encoders of SD3 / FLUX.
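To illustrate why every intermediate state of an insertion-only process is a valid token sequence, here is a toy sketch. It assumes an oracle that already knows the target caption and reveals one token per step in random order; a trained model would instead predict where to insert and which token. The function name `insertion_states` is hypothetical:

```python
import random
from typing import List


def insertion_states(target: List[str], seed: int = 0) -> List[List[str]]:
    """Grow `target` from the empty sequence, one insertion per step.

    Assumption: an oracle picks which target token to reveal next.
    This stands in for the learned insertion model.
    """
    rng = random.Random(seed)
    order = list(range(len(target)))
    rng.shuffle(order)  # random reveal order
    revealed = set()
    states: List[List[str]] = [[]]
    for idx in order:
        revealed.add(idx)
        # Revealed tokens keep their relative order, so each state is a
        # subsequence of the target and a well-formed token list.
        states.append([tok for i, tok in enumerate(target) if i in revealed])
    return states
```

Because tokens are only ever inserted, never replaced by mask symbols, each entry of the returned list can be detokenized as ordinary text.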
Text and image losses produce gradients at very different scales (text ≈ 25× image). We estimate the ratio online and update λ_txt by EMA, eliminating a fragile, backbone-specific hyperparameter at <0.1% runtime overhead.
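One way such an online balance could look is sketched below. The text only states that the ratio is estimated online and that λ_txt is updated by EMA; the use of gradient norms as the measured quantity, the decay value, and the class name `EmaLossBalancer` are assumptions made for illustration:

```python
from typing import Optional


class EmaLossBalancer:
    """Track an EMA of ||g_txt|| / ||g_img|| and return its inverse as lambda_txt."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-12) -> None:
        self.decay = decay  # assumed value, not taken from the source
        self.eps = eps
        self.ratio: Optional[float] = None

    def update(self, grad_norm_txt: float, grad_norm_img: float) -> float:
        r = grad_norm_txt / (grad_norm_img + self.eps)
        if self.ratio is None:
            self.ratio = r  # initialise the EMA with the first observation
        else:
            self.ratio = self.decay * self.ratio + (1.0 - self.decay) * r
        # Scaling the text loss by 1 / ratio brings both gradients to a
        # comparable magnitude.
        return 1.0 / (self.ratio + self.eps)
```

With a steady ratio of about 25, as quoted above, this sketch would settle near λ_txt ≈ 0.04.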
To prevent image-prior drift, the LoRA-disabled student itself acts as a teacher on the same noisy latent. No extra frozen copy needs to be held in memory, yet FID improves from 63.5 to 30.4.
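The teacher-matching idea can be sketched on a single toy linear layer. This assumes a mean-squared-error objective and plain Python lists in place of tensors; `LoraLinear` and `teacher_matching_loss` are hypothetical names, and the actual loss form is not specified here:

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoraLinear:
    """Frozen weight W plus a low-rank update B @ A that can be switched off."""

    def __init__(self, w: Matrix, a: Matrix, b: Matrix) -> None:
        self.w, self.a, self.b = w, a, b

    def forward(self, x: Vector, lora_enabled: bool = True) -> Vector:
        base = matvec(self.w, x)
        if not lora_enabled:
            return base  # behaves exactly like the pretrained layer
        delta = matvec(self.b, matvec(self.a, x))
        return [p + q for p, q in zip(base, delta)]


def teacher_matching_loss(layer: LoraLinear, noisy_latent: Vector) -> float:
    # Student and teacher see the same noisy input; the teacher is the
    # same layer with adapters off, so no second model copy is stored.
    student = layer.forward(noisy_latent, lora_enabled=True)
    teacher = layer.forward(noisy_latent, lora_enabled=False)
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
```

In a real training loop the teacher pass would be run without gradient tracking, so only the adapter weights receive updates.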
Sampling (t,τ) ∼ Unif[0,1]² exposes the model to all jointly corrupted states, training cross-modal correspondence across the full time space — then we refine with alternating-clean for deployment.
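The two training-time samplers might be sketched as follows. The independent sampler follows the Unif[0,1]² description directly. The alternating-clean sampler reflects one assumed reading of the term: one modality is held clean while the other gets a uniform time, switching roles each step. Both function names are hypothetical:

```python
import random
from typing import Tuple


def sample_independent(rng: random.Random) -> Tuple[float, float]:
    """(t, tau) ~ Unif[0,1]^2: both modalities are corrupted independently."""
    return rng.random(), rng.random()


def sample_alternating_clean(rng: random.Random, step: int) -> Tuple[float, float]:
    """Assumed interpretation: alternate which modality is clean (time = 1)."""
    if step % 2 == 0:
        return rng.random(), 1.0  # clean text, noisy image (text -> image)
    return 1.0, rng.random()      # clean image, noisy text (image -> text)
```

Under this reading, the second stage concentrates training on the two edges of the unit square that the conditional tasks use at inference time.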
Only LoRA adapters (rank 32) + lightweight text heads + a text-timestep conditioning pathway are trained: ~136M params (~5% of SD3-M), ~298M params (~2.5% of FLUX.1-dev).
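For intuition on why the trainable share stays small, the standard LoRA parameter count for one linear layer is `rank * (d_in + d_out)`. The sketch below applies it to a square layer; the hidden size of 1536 is a hypothetical example and is not taken from either backbone:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable: int, total: int) -> float:
    return trainable / total


# Hypothetical 1536 x 1536 projection with the rank-32 setting quoted above.
adapter = lora_param_count(1536, 1536, 32)      # 98304
frozen = 1536 * 1536                            # 2359296
share = trainable_fraction(adapter, frozen)     # 1/24, about 4.2%
```

The totals quoted above also include the text heads and the text-timestep conditioning pathway, which this per-layer count does not cover.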
Captions, joint samples, and VQA answers are all produced by the same insertion process applied to different masks of the token sequence. One model, one inference primitive, four tasks.
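A sketch of how one insertion primitive could serve several tasks through masks over insertion gaps. It assumes a sequence of n tokens has n + 1 gaps and that a task is defined by which gaps may receive insertions; `allowed_gaps`, `task_gaps`, and the task names are hypothetical:

```python
from typing import List


def allowed_gaps(n_tokens: int, span_start: int, span_end: int) -> List[bool]:
    """One flag per insertion gap; gap i sits just before token i.

    Only gaps in [span_start, span_end] accept insertions. Everything
    else (for example the question) stays fixed.
    """
    if not 0 <= span_start <= span_end <= n_tokens:
        raise ValueError("span must lie within the sequence")
    return [span_start <= g <= span_end for g in range(n_tokens + 1)]


def task_gaps(task: str, prompt_len: int = 0) -> List[bool]:
    if task == "caption":
        # Empty text: the single gap is open, the caption grows from nothing.
        return allowed_gaps(0, 0, 0)
    if task == "vqa":
        # The question is frozen; the answer grows after its last token.
        return allowed_gaps(prompt_len, prompt_len, prompt_len)
    raise ValueError(f"unknown task: {task}")
```

As tokens are inserted the sequence lengthens, so the mask would be recomputed after each step with the span end moved accordingly.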
@misc{bill2026fullflow,
title = {FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation},
author = {Bill, Eric Tillmann and Simsar, Enis and Tonioni, Alessio and Hofmann, Thomas},
year = {2026},
eprint = {arXiv:XXXX.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
@mastersthesis{bill2026fullflowthesis,
author = {Bill, Eric Tillmann},
title = {FullFlow: Towards Unified Text--Image Generation and Understanding},
school = {ETH Zurich},
year = {2026},
month = may,
address = {Zurich},
type = {Master Thesis},
doi = {10.3929/ethz-c-000800257},
copyright = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}