FullFlow — Bidirectional Vision-Language Generation from Pretrained Flow Models

One backbone, four generative modes

Decoupling image time t and text time τ turns inference into trajectory selection in a 2D generative space. Fixing one modality recovers the conditional task; evolving both yields joint sampling.

Text → Image

Fix τ=1 (clean text). The model evolves only the image timestep — recovering high-fidelity text-to-image generation while preserving the pretrained image prior.

Image → Text

Fix t=1 (clean image). A discrete Edit-Flow insertion process grows a caption token-by-token in T5 vocabulary space — no autoregressive decoder needed.

Joint Sampling

Evolve t and τ together. The same backbone produces coherent image–text pairs from pure noise — co-denoising along any trajectory in the unit square.

VQA / Partial Text

Restrict the insertion process to a target span. A decoder-free interface for (image, question) → answer that uses exactly the same inference procedure.

Abstract

Modern text-to-image diffusion models encode rich visual priors but expose them only through one-way, text-conditioned generation. Existing unified vision–language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes.

We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint sampling, and partial-text prediction with a single backbone.

On Stable Diffusion 3, with a near-identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4 over a LoRA equivalent of the previous SOTA (Dual Diffusion) at matched wall-clock time. It also reduces peak VRAM from ~84 GB to ~38 GB and raises throughput by ~8×, training only ~5% of the backbone in under 24h on two RTX A5000 GPUs. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation.

Contributions

C1 — Recipe

Convert a pretrained text-to-image rectified-flow model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads, while preserving the pretrained image prior.

C2 — Native-space joint flow

Keep images in the native continuous flow, add a discrete insertion process for text, and decouple image and text timesteps so conditional, joint, and partial-text generation become different trajectories in a single (t,τ) space.

C3 — Results

On SD3 and FLUX.1-dev, outperform a matched Dual Diffusion-LoRA baseline on text→image retention and image→text quality, and support downstream VQA, all while training on commodity GPUs.

The two-time generative space

Images live in the native continuous rectified-flow latent space (image time t). Text lives in a discrete insertion process built on Edit Flows, with its own text time τ. A single shared backbone consumes both together — making the two modalities mutually informative at every forward pass.

Inference = trajectory selection in (t, τ). Image time t runs along the x-axis, text time τ along the y-axis. All arrows are valid trajectories in the joint generative space. Colored paths highlight the canonical modes: text→image (top edge, clean text fixed at τ=1), image→text (right edge, clean image fixed at t=1), and joint (any diagonal from the noise corner). A single backbone handles all of them.
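As a rough sketch of what trajectory selection could look like in code, the helper below returns the sequence of (t, τ) points a sampler would visit for each mode, using the convention from the text that time 0 is pure noise and time 1 is clean. The function name `trajectory`, the mode strings, and the uniform step spacing are illustrative assumptions, not the authors' sampler.

```python
def trajectory(mode, n_steps):
    """Return the (t, tau) points visited on the way to the clean corner (1, 1).

    Convention from the text: time 0 is pure noise, time 1 is clean.
    Mode names and linear spacing are assumptions made for illustration.
    """
    if n_steps < 1:
        raise ValueError("n_steps must be >= 1")
    grid = [i / n_steps for i in range(n_steps + 1)]  # 0.0 ... 1.0
    if mode == "text_to_image":   # text held clean, only the image evolves
        return [(s, 1.0) for s in grid]
    if mode == "image_to_text":   # image held clean, only the text evolves
        return [(1.0, s) for s in grid]
    if mode == "joint":           # both modalities evolve together
        return [(s, s) for s in grid]
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, VQA reuses the `image_to_text` path; only the set of text positions that may change differs.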

Image — continuous flow. Images are corrupted by Gaussian noise along t and reconstructed with a velocity head, identical to the pretrained text-to-image backbone.
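A minimal sketch of the image side, assuming the standard rectified-flow linear interpolation and this page's convention that t=1 is the clean image (implementations of the base model may parametrize time in the opposite direction). The helper names `corrupt_image` and `euler_step` are hypothetical, and plain Python lists stand in for latent tensors.

```python
import random

def corrupt_image(clean, t, rng):
    """Interpolate linearly between Gaussian noise (t=0) and the clean latent (t=1).

    Returns the noisy latent x_t and the constant velocity target (clean - noise).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    velocity = [c - n for n, c in zip(noise, clean)]
    return x_t, velocity

def euler_step(x_t, velocity, dt):
    """Advance the latent by one Euler step along the given velocity."""
    return [x + dt * v for x, v in zip(x_t, velocity)]

rng = random.Random(0)
clean = [1.0, -2.0, 0.5]
x_t, v = corrupt_image(clean, 0.25, rng)
recovered = euler_step(x_t, v, dt=0.75)  # with the true velocity, one step reaches t=1
```

With the exact velocity a single step of size 1 − t lands on the clean latent; a trained velocity head only approximates it, which is why sampling uses several smaller steps.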

Text — discrete insertion (Edit Flow). Tokens are deleted along τ and re-inserted by lightweight count + identity heads. Every intermediate state is a valid token subsequence — decodable for SD3 / FLUX's frozen CLIP encoders.
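The deletion and re-insertion idea can be illustrated with a small sketch. It assumes each token is kept independently with probability τ, which is one simple choice and not necessarily the schedule used in the paper; `corrupt_text` and `insertion_targets` are hypothetical helper names. For every gap of the surviving subsequence, the list of missing tokens gives a count target (its length) and identity targets (its contents).

```python
import random

def corrupt_text(tokens, tau, rng):
    """Return indices of surviving tokens; each is kept with probability tau."""
    return [i for i in range(len(tokens)) if rng.random() < tau]

def insertion_targets(tokens, kept):
    """List, for each gap of the kept subsequence, the deleted tokens to re-insert.

    Gap g sits before the g-th kept token; the last gap follows the final kept token.
    """
    gaps = []
    prev = -1
    for idx in kept + [len(tokens)]:
        gaps.append(tokens[prev + 1:idx])
        prev = idx
    return gaps

tokens = ["a", "yellow", "dress", "on", "a", "hanger"]
kept = [0, 2, 5]                        # surviving subsequence: a, dress, hanger
gaps = insertion_targets(tokens, kept)  # tokens missing from each of the 4 gaps
counts = [len(g) for g in gaps]         # what a count head would regress
```

Because tokens are only ever deleted, the corrupted state is always a subsequence of the caption, so interleaving the gap contents with the kept tokens reproduces the original sequence.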

Architectural changes to the SD3 MMDiT backbone. Blue blocks are frozen pretrained components; yellow blocks carry LoRA adapters; red blocks are newly added trainable modules (text heads + τ-conditioning); green blocks are auxiliary operations. The model jointly processes noisy text tokens and image patches with modality-specific timesteps (τ, t) to predict cross-modal flows.

Headline result: FullFlow vs. Dual Diffusion

We construct a controlled, matched baseline (DualDiff.-LoRA) on the same SD3 backbone, same LoRA rank, same data, same optimizer, and nearly identical trainable parameter count (136.8M vs. 136.1M). The two methods differ only in their cross-modal interface.

Method	Text → Image FID ↓	Text → Image CMMD ↓	Image → Text CIDEr ↑	Image → Text BERT-F1 ↑
DualDiff.-LoRA	62.65	0.870	2.0	−0.27
FullFlow step-matched	29.32	0.021	13.4	0.02
FullFlow time-matched	31.57	0.121	99.4	0.44

FID and CMMD are measured against the frozen base SD3 (prior preservation). CIDEr and BERT-F1 are measured against held-out Moondream captions on a 5k validation split.

Peak VRAM: 84 → 38 GB (fits on two RTX A5000s)

Steps / second: 0.25 → 1.96 (~8× throughput)

Total training time: < 24h (on 2× RTX A5000)

Trainable params: 5% / 2.5% (SD3 / FLUX.1-dev)

Image → Text on SD3

Captions generated by FullFlow-SD3 via the discrete insertion process. The whole captioning interface is added with ~136M trainable parameters on top of frozen SD3 — no autoregressive decoder, no joint pretraining.

A yellow dress with a white collar and sleeves is displayed on a wooden hanger against a white background.

A vintage black car with red wheels is displayed against a white background.

A pair of brown leather shoes with intricate detailing and tassel is displayed against a white background.

A classic red car is parked on a street with a building in the background.

The image is a portrait of a woman with dark hair, set against a red background.

The image shows a close-up of a brown leather boot with a pointed toe.

An oil painting featuring a vibrant bouquet of flowers in a blue vase against a dark background.

A pink dress with a floral pattern displayed on a white background.

A blue bicycle with a basket and a seat is shown against a white background.

A close-up of a chocolate cake topped with strawberries on a wooden surface.

A vintage motorcycle is displayed in a workshop with tools in the background.

A landscape painting featuring rolling hills and a vibrant sunset sky.

Same recipe → FLUX.1-dev

To verify the recipe is not specific to SD3, we apply it to FLUX.1-dev with minimal architecture-specific changes, sharding the 22 GB model across three RTX A5000 GPUs. After 40k mixed-corruption (π_ind) and 20k alternating-clean (π_ac) steps (~36h total), CIDEr reaches 73.6 while the text→image prior holds at FID 24.7 / CMMD 0.014.

An oil painting depicting a vase with a bouquet of pink and purple flowers against a blue background.

An illustration of purple flowers with green leaves, set against a white background.

A collection of red roses arranged on a white background.

A close-up of a black alloy wheel with a design featuring a logo at the center.

A pair of brown leather shoes with laces is displayed against a white background.

A person is wearing a knitted hat and a scarf, against a white background.

A pair of beaded earrings arranged on a wooden surface.

A residential area with a swimming pool, plants and a building against a blue sky.

Preserving the text-to-image prior

A central worry: doesn't adding captioning break the original image generator? We design a same-noise teacher-matching objective in which the frozen base model (simply the student with LoRA disabled) acts as the teacher, so no extra model copy is held in memory. As a result, the FullFlow finetune still generates high-quality images for identical prompts and seeds. Compare the SD3 base model with our FullFlow-SD3 finetune below.

Base SD3

FullFlow finetune (same prompts & seeds)

Prompts: "A black cat on grass" · "A brown dog in the snow" · "A horse running in a field" · "A beautiful landscape with mountains" · "A futuristic city skyline at sunset"

Same prompts on FLUX.1-dev

Base FLUX.1-dev

FullFlow-FLUX finetune

Prompts: "A black cat and a brown dog playing together" · "A horse and a bear in a forest" · "A sports car driving on a mountain road" · "A kitchen with a chocolate cake on the table" · "An astronaut riding a horse on the moon"

Downstream VQA via partial-text generation

Restricting the insertion process to a target span gives a decoder-free interface for (image, question) → answer. At 10–50× fewer trainable parameters than every baseline and at half the resolution of the strongest competitor, FullFlow attains the highest VizWiz accuracy (50.7) in our table and remains competitive on VQAv2 and COCO captioning.

Model	Trainable params	COCO ↑	VQAv2 ↑	VizWiz ↑	OKVQA ↑
CM3Leon	7B	61.6	47.6	37.6	23.8
Chameleon	7B	18.0	—	—	—
LWM	7B	—	55.8	11.6	—
Show-O (256)	1.3B	—	64.7	—	—
Show-O (512)	1.3B	—	69.4	—	—
Transfusion	7B	29.0	—	—	—
DualDiff (256)	2B	—	59.5	19.4	28.5
DualDiff (512)	2B	56.2	60.1	29.9	25.3
FullFlow (Ours, 256)	130M	54.9	56.5	50.7	23.3

Q: Are both phones the same size? A: yes

Q: Is the man at home? A: no

Q: What color is the child's hat? A: black

Q: What color are the man's eyes? A: blue

Q: What animal resembles the ones pictured? A: zebra

Q: What color are the lights that are lit up? A: red

Q: What color is the cat? A: black

Q: What is the person holding? A: toothbrush

Key ingredients of the recipe

Native-space joint flow

Keep images in the pretrained continuous flow. Add a discrete Edit-Flow insertion process for text in T5 tokenizer space — every intermediate state is a valid token sequence, decodable for SD3 / FLUX's frozen CLIP encoders.

Adaptive gradient balancing

Text and image losses produce gradients at very different scales (text ≈ 25× image). We estimate the ratio online and update λ_txt by EMA — eliminating a fragile, backbone-specific hyperparameter at <0.1% runtime overhead.
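One way such an online estimate could work is sketched below. It assumes the balanced quantity is the ratio of image to text gradient norms, measured once per step, and that λ_txt simply tracks that ratio with an exponential moving average. The class name `GradBalancer`, the decay value, and the choice of norm are assumptions for illustration, not details given on this page.

```python
class GradBalancer:
    """Track lambda_txt so the weighted text gradient matches the image gradient in norm."""

    def __init__(self, decay=0.99, init=1.0, eps=1e-12):
        self.decay = decay        # EMA decay; closer to 1 means slower adaptation
        self.lambda_txt = init    # current weight on the text loss
        self.eps = eps            # guards against division by zero

    def update(self, grad_norm_img, grad_norm_txt):
        """Blend the latest norm ratio into the running weight and return it."""
        ratio = grad_norm_img / (grad_norm_txt + self.eps)
        self.lambda_txt = self.decay * self.lambda_txt + (1.0 - self.decay) * ratio
        return self.lambda_txt

balancer = GradBalancer(decay=0.9)
for _ in range(200):
    # text gradients 25x larger than image gradients, as in the scale quoted above
    weight = balancer.update(grad_norm_img=1.0, grad_norm_txt=25.0)
```

In this toy run the weight settles near 1/25, so the text loss would be scaled down until both modalities contribute gradients of similar magnitude.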

Same-noise teacher matching

To prevent image-prior drift, the LoRA-disabled student itself acts as a teacher on the same noisy latent. No extra frozen copy needs to be held in memory, yet FID improves from 63.5 to 30.4.
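The mechanism can be illustrated on a toy linear layer: the same module is evaluated twice on one input, once with the adapter switched off (teacher) and once with it on (student), and the two outputs are pulled together. `ToyLoRALinear`, its `lora_enabled` flag, and the mean-squared-error form of the loss are assumptions made for this sketch; the paper's exact objective may differ.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class ToyLoRALinear:
    """y = W x + B (A x) when the adapter is enabled, y = W x otherwise."""

    def __init__(self, w, a, b):
        self.w, self.a, self.b = w, a, b  # W is frozen; A and B would be trained
        self.lora_enabled = True

    def __call__(self, x):
        y = matvec(self.w, x)
        if self.lora_enabled:
            delta = matvec(self.b, matvec(self.a, x))
            y = [yi + di for yi, di in zip(y, delta)]
        return y

def teacher_matching_loss(model, x_noisy):
    """MSE between adapter-on and adapter-off outputs on the same noisy input."""
    model.lora_enabled = False
    teacher = model(x_noisy)   # treated as a constant target
    model.lora_enabled = True
    student = model(x_noisy)
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

layer = ToyLoRALinear(w=[[1.0, 0.0], [0.0, 1.0]], a=[[1.0, 1.0]], b=[[0.5], [0.0]])
loss = teacher_matching_loss(layer, [2.0, 3.0])
```

In a real training loop the teacher pass would run without gradient tracking, so the only extra cost is one additional forward pass rather than a second copy of the weights.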

Mixed-corruption pretraining

Sampling (t,τ) ∼ Unif[0,1]² exposes the model to all jointly corrupted states, training cross-modal correspondence across the full time space — then we refine with alternating-clean for deployment.
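The two training phases differ only in how (t, τ) is drawn for each example. The sketch below reads "alternating-clean" as holding one modality clean (time 1) while corrupting the other, switching modality by step parity. That reading, the policy names, and the function `sample_times` are assumptions; the source only states that mixed-corruption sampling is uniform on the unit square.

```python
import random

def sample_times(policy, step, rng):
    """Return (t, tau) for one training example under the given policy."""
    if policy == "independent":        # mixed corruption: both modalities noisy
        return rng.random(), rng.random()
    if policy == "alternating_clean":  # one modality clean, the other corrupted
        if step % 2 == 0:
            return rng.random(), 1.0   # clean text: text-to-image supervision
        return 1.0, rng.random()       # clean image: image-to-text supervision
    raise ValueError(f"unknown policy: {policy}")

rng = random.Random(0)
pretrain = [sample_times("independent", s, rng) for s in range(4)]
refine = [sample_times("alternating_clean", s, rng) for s in range(4)]
```

Under this reading, the refinement phase concentrates training on the two edges of the unit square that conditional inference actually visits.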

LoRA-only adaptation

Only LoRA adapters (rank 32) + lightweight text heads + a text-timestep conditioning pathway are trained: ~136M params (~5% of SD3-M), ~298M params (~2.5% of FLUX.1-dev).
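For intuition on why rank-32 adapters stay small: a LoRA pair on a linear layer with input width d_in and output width d_out adds r·(d_in + d_out) parameters, against d_in·d_out for the dense weight. The layer width of 1536 below is a hypothetical value chosen for illustration, not a figure from this page.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def dense_param_count(d_in, d_out):
    """Parameters in the frozen dense weight (bias ignored)."""
    return d_in * d_out

d = 1536  # hypothetical layer width, for illustration only
added = lora_param_count(d, d, rank=32)
full = dense_param_count(d, d)
fraction = added / full  # share of the layer's weights that is trainable
```

The overall trainable share reported above also includes the text heads and the text-timestep conditioning pathway, so it is not determined by this per-layer ratio alone.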

Decoder-free text interface

Captions, joint samples, and VQA answers are all produced by the same insertion process applied to different masks of the token sequence. One model, one inference primitive, four tasks.
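A minimal sketch of how one insertion primitive could serve different tasks through masking, assuming a sequence of n tokens has n + 1 insertion gaps and that a task is specified by which gaps are open. The helpers `allowed_gaps` and `mask_counts` and this mask representation are hypothetical, not the authors' interface.

```python
def allowed_gaps(n_tokens, span_start, span_end):
    """Mark which insertion gaps are open.

    A sequence of n tokens has n + 1 gaps; gap g sits before token g.
    Only gaps with span_start <= g <= span_end may receive insertions.
    """
    return [span_start <= g <= span_end for g in range(n_tokens + 1)]

def mask_counts(predicted_counts, mask):
    """Zero the predicted insertion count in every closed gap."""
    return [c if m else 0 for c, m in zip(predicted_counts, mask)]

question = ["what", "color", "is", "the", "cat", "?"]
vqa_mask = allowed_gaps(len(question), span_start=6, span_end=6)  # only after the question
caption_mask = allowed_gaps(0, span_start=0, span_end=0)          # empty text, single open gap
raw_counts = [1, 0, 2, 0, 0, 1, 3]                                # made-up count-head output
answer_counts = mask_counts(raw_counts, vqa_mask)
```

In this picture the question tokens stay fixed because every gap inside them is closed, while captioning starts from an empty sequence whose single gap is open.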

BibTeX

Preprint

@misc{bill2026fullflow,
      title={FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation}, 
      author={Eric Tillmann Bill and Enis Simsar and Alessio Tonioni and Thomas Hofmann},
      year={2026},
      eprint={2605.20316},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.20316}, 
}

Master's Thesis

@mastersthesis{bill2026fullflowthesis,
    author    = {Bill, Eric Tillmann},
    title     = {FullFlow: Towards Unified Text--Image Generation and Understanding},
    school    = {ETH Zurich},
    year      = {2026},
    month     = may,
    address   = {Zurich},
    type      = {Master Thesis},
    doi       = {10.3929/ethz-c-000800257},
    copyright = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}
