DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Song, Wei; Wang, Yuran; Song, Zijia; Li, Yadong; Zhou, Zenan; Chen, Long; Xu, Jianhua; Wang, Jiaqi; Yu, Kaicheng

ICLR 2026

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Wei Song^1,2,3, Yuran Wang⁵, Zijia Song³, Yadong Li⁴, Zenan Zhou⁴, Long Chen⁶,
Jianhua Xu^4*, Jiaqi Wang^2*, Kaicheng Yu^1,2,3*

¹Zhejiang University ²Shanghai Innovation Institute ³Westlake University ⁴Baichuan Inc. ⁵Peking University ⁶Xiaomi EV

^*Corresponding Authors

Paper Code arXiv BibTeX

0.25

rFID with SigLIP-So/14-384

82.0%

ImageNet zero-shot accuracy

+5.8

Average gain on 10 visual understanding benchmarks

+19%

Improvement over TokenFlow on GenEval

DualToken teaser figure showing better reconstruction and understanding with dual visual vocabularies

DualToken separates visual appearance and semantics into two complementary vocabularies: pixel tokens preserve fine-grained details for generation, while semantic tokens retain language-aligned abstractions for understanding.

Abstract

The differing representation spaces required for visual understanding and generation pose a central challenge for unified autoregressive multimodal models. Tokenizers optimized for reconstruction preserve low-level appearance but lack language-aligned semantics, while contrastively trained vision encoders offer strong semantics yet cannot be decoded faithfully back to pixels.

DualToken bridges this gap by introducing dual visual vocabularies inside a single tokenizer: a pixel codebook for low-level appearance and a semantic codebook for high-level meaning. Rather than forcing one codebook to satisfy conflicting objectives, DualToken disentangles them hierarchically across shallow and deep transformer layers, enabling strong performance on both sides. The resulting tokenizer achieves 0.25 rFID and 82.0% zero-shot ImageNet accuracy, and the unified multimodal model outperforms VILA-U on both understanding and generation benchmarks.

Why DualToken

One tokenizer, two visual vocabularies

DualToken factorizes image representation into pixel tokens for reconstruction and semantic tokens for understanding, without increasing visual sequence length in the unified MLLM.

Conflict resolved hierarchically

Reconstruction supervision is applied to shallow layers, while semantic preservation is applied to deep layers, turning a destructive tradeoff into a complementary learning setup.

Stronger than dual-encoder combinations

Compared with directly combining a VQGAN-style encoder and a CLIP-style encoder, DualToken keeps both vocabularies inside a more compatible shared representation space.

Better understanding and generation

Pixel tokens improve downstream recognition of fine details, and semantic tokens improve autoregressive image generation by preserving semantic structure throughout decoding.

Method Overview

Motivation

Understanding and generation demand different visual representations

Directly optimizing one codebook for both semantic preservation and image reconstruction causes a strong objective conflict. In the paper, this appears as degraded semantic metrics and blurry, distorted reconstructions.

Reconstruction-only tokenizers capture texture, color, and local structure but perform poorly in multimodal reasoning.
Language-aligned vision encoders preserve semantics but struggle to decode back into high-fidelity images.
DualToken disentangles these roles rather than collapsing them into one visual vocabulary.

Tokenizer

Hierarchical supervision yields dual codebooks

DualToken analyzes the internal hierarchy of SigLIP and uses shallow features for reconstruction while preserving semantic alignment in deep features. Each feature stream is discretized with residual vector quantization to form its own visual vocabulary.

Shallow-layer features are optimized for reconstruction and quantized into pixel tokens.
Deep-layer features are constrained by semantic preservation and quantized into semantic tokens.
The resulting tokenizer reaches both strong reconstruction fidelity and strong zero-shot semantics.

Unified MLLM

Dual visual vocabularies are consumed by one autoregressive model

Inside the unified model, pixel and semantic tokens are projected into the LLM hidden space and concatenated along the embedding dimension. This preserves sequence length while giving the language model access to both low-level detail and high-level meaning.

Independent visual heads predict residual tokens for the pixel and semantic vocabularies.
Semantic tokens guide generation toward prompt-consistent structure.
Pixel tokens inject fine-grained detail that improves OCR, hallucination detection, and general VQA.

Framework comparison showing DualToken versus dual-encoder and unified alternatives

Key Results

Tokenizer quality

0.25 rFID, 28.69 PSNR, and 0.744 SSIM with the SigLIP-So/14-384 backbone, while retaining 82.0% zero-shot ImageNet accuracy.

Controlled LLaVA study

Within the LLaVA-1.5 framework, DualToken improves the average score across ten understanding benchmarks from 48.1 for VILA-U to 53.9.

Unified understanding

DualToken-3B (384px) reaches 76.2 on MMBench, 72.2 on SEED, and 1588.4 on MME, competitive with much larger unified models.

Unified generation

On GenAI-Bench advanced prompts, DualToken-3B improves the overall score from 0.60 for reproduced VILA-U 3B to 0.68, and clearly outperforms the pixel-only ablation.

Generation examples comparing DualToken with VILA-U and pixel-only baselines

Qualitative results from the paper show that DualToken generates sharper, more semantically faithful images than VILA-U and a pixel-only ablation, especially under long or compositional prompts.

Analysis

Why not combine two separate encoders?

Naive dual-encoder design creates a large representational mismatch between low-level and semantic features. DualToken avoids this by learning both vocabularies within one compatible architecture.

Why do dual tokens help each other?

Pixel tokens recover high-frequency visual clues that improve understanding, while semantic tokens act as positive semantic supervision during generation, producing outputs that better match the prompt.

BibTeX

@inproceedings{song2026dualtoken,
  title={DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies},
  author={Song, Wei and Wang, Yuran and Song, Zijia and Li, Yadong and Zhou, Zenan and Chen, Long and Xu, Jianhua and Wang, Jiaqi and Yu, Kaicheng},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}