World-model-based search and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities such as action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work unifies driving world simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on the large-scale nuPlan and NAVSIM benchmarks.
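For intuition, here is a minimal sketch of how continuous ego actions could be turned into the discrete action tokens of such a driving language. The (dx, dy, dyaw) parameterization, value ranges, and bin count below are illustrative assumptions, not the exact scheme used by DrivingGPT.

```python
# Hypothetical action tokenizer: the (dx, dy, dyaw) parameterization, value
# ranges, and bin count are illustrative assumptions, not DrivingGPT's exact scheme.
ACTION_BINS = 256              # discrete bins per action dimension
ACTION_RANGES = {              # assumed per-step value ranges
    "dx":   (-1.0, 10.0),      # longitudinal displacement (m)
    "dy":   (-3.0, 3.0),       # lateral displacement (m)
    "dyaw": (-0.5, 0.5),       # heading change (rad)
}

def tokenize_action(dx: float, dy: float, dyaw: float) -> list[int]:
    """Map one continuous driving action to three discrete action tokens."""
    tokens = []
    for value, (lo, hi) in zip((dx, dy, dyaw), ACTION_RANGES.values()):
        value = min(max(value, lo), hi)                           # clamp to range
        tokens.append(int((value - lo) / (hi - lo) * (ACTION_BINS - 1)))
    return tokens

def detokenize_action(tokens: list[int]) -> tuple[float, ...]:
    """Recover an approximate continuous action from its discrete tokens."""
    return tuple(lo + (t + 0.5) / ACTION_BINS * (hi - lo)
                 for t, (lo, hi) in zip(tokens, ACTION_RANGES.values()))

# Example: a gentle left turn becomes three integers the transformer can model.
print(tokenize_action(dx=4.2, dy=0.3, dyaw=0.05))   # [120, 140, 140]
```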
We present several 32-second, 2 fps, 1024×576 videos generated by DrivingGPT trained on the nuPlan dataset. The predicted discrete tokens are of lower resolution, but by leveraging an SVD token decoder, we can decode them into temporally consistent high-resolution outputs.
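As a rough illustration of this decoding step, the toy module below maps a low-resolution grid of predicted image tokens back to pixels via a codebook lookup followed by convolutional upsampling. It is only a generic stand-in: the actual SVD token decoder is a video diffusion model, and the codebook size, latent resolution, and layer widths here are assumptions.

```python
import torch
import torch.nn as nn

# Generic stand-in for a token decoder: codebook lookup + conv upsampling.
# The real system uses an SVD-based decoder for temporally consistent
# 1024x576 output; sizes below are assumptions for demonstration only.
class ToyTokenDecoder(nn.Module):
    def __init__(self, codebook_size: int = 16384, embed_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 128, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),   # RGB frame
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, h, w) grid of discrete image-token indices
        latents = self.codebook(token_ids)        # (batch, h, w, embed_dim)
        latents = latents.permute(0, 3, 1, 2)      # (batch, embed_dim, h, w)
        return self.upsample(latents)              # (batch, 3, 4h, 4w)

decoder = ToyTokenDecoder()
frame = decoder(torch.randint(0, 16384, (1, 18, 32)))   # low-res token grid
print(frame.shape)                                       # torch.Size([1, 3, 72, 128])
```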
Autoregressive transformers trained for next-token prediction have demonstrated remarkable capabilities across diverse domains. In this work, we harness this power for autonomous driving by combining world modeling and trajectory planning. Our approach converts both visual inputs and driving actions into a discrete driving language, enabling unified modeling through autoregressive transformers, as illustrated in the figure below.
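The sketch below shows this unified sequence-modeling setup under simple assumptions: image tokens and (offset) action tokens share one vocabulary, each timestep contributes its frame tokens followed by its action tokens, and a small decoder-only transformer is trained with a single next-token cross-entropy loss. Vocabulary layout, model size, and token counts are illustrative, not DrivingGPT's released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shared vocabulary: image tokens first, then action tokens.
IMAGE_VOCAB, ACTION_VOCAB = 16384, 3 * 256
VOCAB_SIZE = IMAGE_VOCAB + ACTION_VOCAB

def interleave(image_tokens: torch.Tensor, action_tokens: torch.Tensor) -> torch.Tensor:
    """image_tokens: (T, N_img), action_tokens: (T, N_act) -> flat interleaved sequence."""
    action_tokens = action_tokens + IMAGE_VOCAB        # offset actions into shared vocab
    return torch.cat([image_tokens, action_tokens], dim=1).reshape(-1)

class TinyDrivingGPT(nn.Module):
    """Small decoder-only transformer over the interleaved driving language."""
    def __init__(self, d_model: int = 512, n_layers: int = 8, n_heads: int = 8,
                 max_len: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        hidden = self.blocks(self.embed(tokens) + self.pos(positions), mask=causal)
        return self.head(hidden)

# One training step: predict every next token (image or action) in the stream.
model = TinyDrivingGPT()
seq = interleave(torch.randint(0, IMAGE_VOCAB, (4, 16)),      # 4 steps x 16 image tokens
                 torch.randint(0, ACTION_VOCAB, (4, 3)))       # 4 steps x 3 action tokens
seq = seq.unsqueeze(0)                                         # (1, 76)
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), seq[:, 1:].reshape(-1))
loss.backward()
```

At inference time the same model can be queried in two ways: sampling the next frame's image tokens gives an action-conditioned world model, while sampling the next action tokens gives a planner.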
We compare the long video generation performance of DrivingGPT with SVD fine-tuning methods. We showcase a 64-frame (32-second) sequence generated on the navtest dataset. (a) SVD fine-tuning methods often exhibit limitations in generating long videos, frequently repeating past content, such as indefinitely remaining at a red light. Conversely, (b) our DrivingGPT demonstrates superior performance in generating long, diverse, and visually appealing videos.
We compare the temporal consistency of DrivingGPT with SVD fine-tuning methods. Diffusion-based methods often exhibit inconsistent temporal generation, leading to object hallucination phenomena. For instance, when comparing models fine-tuned with SVD, we observe the sudden appearance (red box) and gradual disappearance (green box) of objects. In contrast, our autoregressive approach maintains better consistency.
Although the action prediction in DrivingGPT is simple, it achieves strong planning performance against competitive baselines on the NAVSIM mini benchmark and handles multiple challenging scenarios well. The sketch after the table shows how the PDM score (PDMS) combines the sub-metrics.
Method | NC ↑ | DAC ↑ | TTC ↑ | Comf. ↑ | EP ↑ | PDMS ↑ |
---|---|---|---|---|---|---|
Constant Velocity | 66.7 | 63.9 | 45.2 | 100.0 | 23.6 | 24.2 |
ResNet-50 + MLP decoder | 92.6 | 89.9 | 86.2 | 96.3 | 73.7 | 77.8 |
DrivingGPT | 98.9 | 90.7 | 94.9 | 95.6 | 79.7 | 82.4 |
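For reference, the helper below combines the NAVSIM sub-scores into the PDM score (PDMS) using the commonly described weighting (multiplicative safety terms, and a 5/2/5-weighted average of TTC, comfort, and ego progress); treat the exact weights as an assumption rather than this paper's evaluation code. PDMS is computed per scenario and then averaged, so the benchmark-level averages in the table do not multiply out directly.

```python
def pdm_score(nc: float, dac: float, ttc: float, comfort: float, ep: float) -> float:
    """Combine NAVSIM sub-scores (each in [0, 1]) into a single PDM score.

    NC = no at-fault collisions, DAC = drivable area compliance,
    TTC = time-to-collision within bound, Comf. = comfort, EP = ego progress.
    The weighting below follows the commonly described NAVSIM formulation and
    is an assumption, not this paper's exact evaluation code.
    """
    weighted = (5.0 * ttc + 2.0 * comfort + 5.0 * ep) / 12.0
    return nc * dac * weighted

# Example with made-up per-scenario sub-scores (not taken from the table above).
print(round(100 * pdm_score(nc=1.0, dac=1.0, ttc=0.9, comfort=1.0, ep=0.8), 1))  # 87.5
```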
@article{chen2024drivinggpt,
  author = {Chen, Yuntao and Wang, Yuqi and Zhang, Zhaoxiang},
  title  = {DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers},
  year   = {2024},
}