TL;DR: FlowBind uses bidirectional flows for efficient, any-to-any multimodal generation.
We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost.
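As a rough illustration of the single flow-matching objective mentioned above, here is a minimal sketch using a toy linear "velocity network" on a linear interpolation path. All names, shapes, and the model itself are illustrative assumptions, not the actual FlowBind implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(W, x0, x1, t):
    """Conditional flow-matching loss on a linear interpolation path.

    x_t = (1 - t) * x0 + t * x1 moves a noise sample x0 toward a data
    sample x1; the regression target is the constant velocity x1 - x0.
    W is a toy linear 'velocity network' mapping [x_t, t] -> velocity.
    """
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1   # point on the path
    target = x1 - x0                                 # ground-truth velocity
    inp = np.concatenate([xt, t[:, None]], axis=1)   # condition on time t
    pred = inp @ W                                   # predicted velocity
    return np.mean((pred - target) ** 2)

# Toy data: 8 samples in a 4-dim latent space.
x0 = rng.standard_normal((8, 4))   # noise endpoint
x1 = rng.standard_normal((8, 4))   # data endpoint
t = rng.uniform(size=8)            # random times in [0, 1]
W = np.zeros((5, 4))               # linear model params ([x_t, t] -> R^4)

loss = flow_matching_loss(W, x0, x1, t)
print(float(loss))
```

In the paper's setting, the predicted velocity field is integrated forward or backward through the invertible modality-specific flows, which is what lets the same trained components act as both encoders and decoders at inference.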
We recommend using Docker to ensure environment consistency. Alternatively, you can set up the environment manually by installing the dependencies listed in requirements.txt.
- Docker Setup

```bash
docker pull yeonwoo378/flowbind:latest
docker run -it yeonwoo378/flowbind:latest bash
```

- Clone the repository

```bash
git clone https://github.com/yeonwoo378/flowbind.git
cd flowbind
```

- Install Requirements

```bash
pip install -r requirements.txt
```

We provide a Jupyter notebook to guide you through the inference process, including loading pretrained weights and running generation tasks.
Please refer to demo.ipynb for a step-by-step tutorial.
We use the following datasets for training and evaluation.
- Text-Image:
- LAION-COCO (filtered by aesthetic score)
- Flickr30k
- Text-Audio:
- AudioCaps
- Audio-Image:
- VGGSound
Note: Due to copyright and licensing restrictions, we cannot provide the raw training datasets directly. Please download the data from the official links provided above.
Before training, you must extract features from your dataset (e.g., Flickr30k).
```bash
# Extract features for Text-to-Image (T2I) tasks
python extract_data/extract_t2i.py \
    --dataset flickr30k \
    --data_root /path/to/your/data
```

Extracted training features will be automatically saved to `./feats/{dataset_name}`.
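Before launching training, it can help to sanity-check that paired features line up. The sketch below is purely illustrative: the file names and the `.npy` format are assumptions for the example, while the actual layout under `./feats/` is whatever `extract_data/extract_t2i.py` produces.

```python
import numpy as np
from pathlib import Path

# Hypothetical layout: file names and .npy format are assumptions here;
# the real on-disk format is defined by extract_data/extract_t2i.py.
feats_dir = Path("./feats/flickr30k")
feats_dir.mkdir(parents=True, exist_ok=True)

# Stand-ins for what the extraction step would produce.
np.save(feats_dir / "image_feats.npy", np.random.randn(16, 768))
np.save(feats_dir / "text_feats.npy", np.random.randn(16, 768))

# Quick sanity check: paired modalities should have the same sample count.
image_feats = np.load(feats_dir / "image_feats.npy")
text_feats = np.load(feats_dir / "text_feats.npy")
assert image_feats.shape[0] == text_feats.shape[0], "paired features must align"
print(image_feats.shape, text_feats.shape)
```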
You can train the model using torchrun for distributed training. The command below demonstrates how to launch training on a single node with 4 GPUs.
```bash
torchrun --nnodes=1 --nproc_per_node=4 main.py \
    --exp_name flowbind_exp \
    --batch_size 256 \
    --dataset audiocaps flickr30k laion vggsound \
    --t_cond adaln \
    --hidden_dim 1152 \
    --lr 1e-4
```

Training logs are automatically synced to Weights & Biases (WandB). Please ensure you are logged in via `wandb login` before starting the training.
This repository is built upon the following open-source projects. We thank the authors for their excellent contributions.
If you find our work helpful, please cite:
```bibtex
@misc{cha2025flowbind,
  title={FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows},
  author={Cha, Yeonwoo and Kim, Semin and Kwon, Jinhyeon and Hong, Seunghoon},
  eprint={arXiv:2512.15420},
  year={2025}
}
```

For any inquiries, please contact Yeonwoo Cha at ckdusdn03@kaist.ac.kr.
