Architecture Decoupling Is Not All You Need For Unified Multimodal Model

Nov 2025
(† project leader, ✉ corresponding authors)
1 MMLab, CUHK    2 Meituan
3 USTC    4 TJU
Teaser.

Overview. In this work, we explore why architectural decoupling improves the performance of unified multimodal models and whether comparable results can be achieved without it. We use the cross-modal attention intensity at each layer to quantify the strength of cross-modal interaction for each task in that layer. Surprisingly, we find that regardless of the degree of architectural decoupling, the understanding and generation tasks exhibit negatively correlated intensities within layers, indicating that decoupling does not fundamentally resolve the conflict between the two tasks. We further examine the interaction patterns of current SOTA single-task models and observe that as architectural decoupling becomes stronger, a unified model's interaction patterns increasingly resemble those of single-task models, which explains the performance improvements. Based on this finding, we propose Attention Interaction Alignment (AIA), a method that explicitly constrains attention interaction patterns during training. We demonstrate performance improvements on Emu3 and Janus-Pro, narrowing the gap with more extensively decoupled architectures.
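The page does not give a formula for the per-layer cross-modal attention intensity, so the sketch below shows one plausible definition: the average attention mass that flows between image and text tokens in each layer. The function name `cross_modal_attention_intensity`, the tensor shapes, and the masking scheme are illustrative assumptions, not the authors' implementation.

```python
import torch

def cross_modal_attention_intensity(attn, is_image_token):
    """
    attn:           [num_layers, num_heads, seq_len, seq_len] attention weights
                    (rows are queries, columns are keys; each row sums to 1).
    is_image_token: [seq_len] boolean mask, True for image tokens, False for text.

    Returns a [num_layers] tensor with the average attention mass that flows
    across modalities (text -> image plus image -> text) in each layer.
    """
    img = is_image_token.float()                       # [S]
    txt = 1.0 - img
    # 1 where the query and key belong to different modalities, else 0.
    cross = img[:, None] * txt[None, :] + txt[:, None] * img[None, :]  # [S, S]
    # Attention mass landing on cross-modal positions, averaged over heads and queries.
    cross_mass = (attn * cross).sum(dim=-1)            # [L, H, S]
    return cross_mass.mean(dim=(1, 2))                 # [L]

# Toy usage with random attention maps.
L, H, S = 4, 8, 16
attn = torch.softmax(torch.randn(L, H, S, S), dim=-1)
is_image_token = torch.arange(S) < 6                   # pretend the first 6 tokens are image tokens
print(cross_modal_attention_intensity(attn, is_image_token))
```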

Abstract

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge lies in establishing an optimal training paradigm, because the understanding and generation tasks have inherently conflicting objectives. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., dual image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive decoupling can sacrifice interleaved generation ability, undermining the original intent of unified models. In this work, we explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward the task-specific multimodal interaction patterns seen in single-task models such as Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent this behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of the AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns but also boosts both generation and understanding performance.
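The abstract does not spell out the exact form of the AIA loss. The sketch below shows one plausible instantiation, assuming the loss pulls the unified model's layer-wise interaction profile toward that of a fixed single-task reference model via a KL term added to the task loss. The function name `aia_loss`, the KL formulation, and the 0.1 weight are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def aia_loss(model_intensity, reference_intensity):
    """
    model_intensity:     [num_layers] cross-modal attention intensity of the
                         unified model for the current task (understanding or generation).
    reference_intensity: [num_layers] intensity profile of a single-task reference
                         model (e.g., an understanding- or generation-only expert),
                         treated as the alignment target.

    Matches the unified model's layer-wise interaction pattern to the reference
    pattern by comparing their normalized distributions over layers.
    """
    p = F.log_softmax(model_intensity, dim=0)      # model pattern (log-probs)
    q = F.softmax(reference_intensity, dim=0)      # target pattern (probs)
    return F.kl_div(p, q, reduction="sum")         # KL(target || model)

# Hypothetical training step: task loss plus the alignment term.
task_loss = torch.tensor(1.0)                      # placeholder for the CE / generation loss
model_prof = torch.randn(24)
ref_prof = torch.randn(24).detach()                # reference profile is fixed
total = task_loss + 0.1 * aia_loss(model_prof, ref_prof)
print(total)
```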

System-level comparison on widely used image understanding and generation benchmarks.

BibTeX

If you find our work useful, please consider citing it:

@article{zheng2025architecture,
      title={Architecture Decoupling Is Not All You Need For Unified Multimodal Model},
      author={Zheng, Dian and Zhang, Manyuan and Li, Hongyu and Zou, Kai and Liu, Hongbo and Guo, Ziyu and Feng, Kaituo and Liu, Yexin and Luo, Ying and Feng, Yan and Pei, Peng and Cai, Xunliang and Li, Hongsheng},
      journal={arXiv preprint arXiv:2503.21755},
      year={2025}
}