World-model-based search and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities such as action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work unifies driving world simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on the large-scale nuPlan and NAVSIM benchmarks.
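For intuition, here is a minimal sketch of how continuous ego actions could be turned into the discrete action tokens of such a driving language. The (dx, dy, dyaw) parameterization, value ranges, and bin count below are illustrative assumptions, not the exact scheme used by DrivingGPT.

```python
# Hypothetical action tokenizer: the (dx, dy, dyaw) parameterization, value
# ranges, and bin count are illustrative assumptions, not DrivingGPT's exact scheme.
ACTION_BINS = 256              # discrete bins per action dimension
ACTION_RANGES = {              # assumed per-step value ranges
    "dx":   (-1.0, 10.0),      # longitudinal displacement (m)
    "dy":   (-3.0, 3.0),       # lateral displacement (m)
    "dyaw": (-0.5, 0.5),       # heading change (rad)
}

def tokenize_action(dx: float, dy: float, dyaw: float) -> list[int]:
    """Map one continuous driving action to three discrete action tokens."""
    tokens = []
    for value, (lo, hi) in zip((dx, dy, dyaw), ACTION_RANGES.values()):
        value = min(max(value, lo), hi)                           # clamp to range
        tokens.append(int((value - lo) / (hi - lo) * (ACTION_BINS - 1)))
    return tokens

def detokenize_action(tokens: list[int]) -> tuple[float, ...]:
    """Recover an approximate continuous action from its discrete tokens."""
    return tuple(lo + (t + 0.5) / ACTION_BINS * (hi - lo)
                 for t, (lo, hi) in zip(tokens, ACTION_RANGES.values()))

# Example: a gentle left turn becomes three integers the transformer can model.
print(tokenize_action(dx=4.2, dy=0.3, dyaw=0.05))   # [120, 140, 140]
```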
We present several 32-second, 2 fps, 1024×576 videos generated by DrivingGPT trained on the nuPlan dataset. The predicted discrete tokens are of lower resolution, but by leveraging an SVD token decoder, we can decode them into temporally consistent high-resolution outputs.
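As a rough illustration of this decoding step, the toy module below maps a low-resolution grid of predicted image tokens back to pixels via a codebook lookup followed by convolutional upsampling. It is only a generic stand-in: the actual SVD token decoder is a video diffusion model, and the codebook size, latent resolution, and layer widths here are assumptions.

```python
import torch
import torch.nn as nn

# Generic stand-in for a token decoder: codebook lookup + conv upsampling.
# The real system uses an SVD-based decoder for temporally consistent
# 1024x576 output; sizes below are assumptions for demonstration only.
class ToyTokenDecoder(nn.Module):
    def __init__(self, codebook_size: int = 16384, embed_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 128, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),   # RGB frame
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, h, w) grid of discrete image-token indices
        latents = self.codebook(token_ids)        # (batch, h, w, embed_dim)
        latents = latents.permute(0, 3, 1, 2)      # (batch, embed_dim, h, w)
        return self.upsample(latents)              # (batch, 3, 4h, 4w)

decoder = ToyTokenDecoder()
frame = decoder(torch.randint(0, 16384, (1, 18, 32)))   # low-res token grid
print(frame.shape)                                       # torch.Size([1, 3, 72, 128])
```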
Autoregressive transformers trained for next-token prediction have demonstrated remarkable capabilities across diverse domains. In this work, we harness this power for autonomous driving by combining world modeling and trajectory planning. Our approach converts both visual inputs and driving actions into a discrete driving language, enabling unified modeling through autoregressive transformers, as illustrated in the figure below.
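The sketch below shows this unified sequence-modeling setup under simple assumptions: image tokens and (offset) action tokens share one vocabulary, each timestep contributes its frame tokens followed by its action tokens, and a small decoder-only transformer is trained with a single next-token cross-entropy loss. Vocabulary layout, model size, and token counts are illustrative, not DrivingGPT's released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shared vocabulary: image tokens first, then action tokens.
IMAGE_VOCAB, ACTION_VOCAB = 16384, 3 * 256
VOCAB_SIZE = IMAGE_VOCAB + ACTION_VOCAB

def interleave(image_tokens: torch.Tensor, action_tokens: torch.Tensor) -> torch.Tensor:
    """image_tokens: (T, N_img), action_tokens: (T, N_act) -> flat interleaved sequence."""
    action_tokens = action_tokens + IMAGE_VOCAB        # offset actions into shared vocab
    return torch.cat([image_tokens, action_tokens], dim=1).reshape(-1)

class TinyDrivingGPT(nn.Module):
    """Small decoder-only transformer over the interleaved driving language."""
    def __init__(self, d_model: int = 512, n_layers: int = 8, n_heads: int = 8,
                 max_len: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        hidden = self.blocks(self.embed(tokens) + self.pos(positions), mask=causal)
        return self.head(hidden)

# One training step: predict every next token (image or action) in the stream.
model = TinyDrivingGPT()
seq = interleave(torch.randint(0, IMAGE_VOCAB, (4, 16)),      # 4 steps x 16 image tokens
                 torch.randint(0, ACTION_VOCAB, (4, 3)))       # 4 steps x 3 action tokens
seq = seq.unsqueeze(0)                                         # (1, 76)
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), seq[:, 1:].reshape(-1))
loss.backward()
```

At inference time the same model can be queried in two ways: sampling the next frame's image tokens gives an action-conditioned world model, while sampling the next action tokens gives a planner.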
We compare the long video generation performance of DrivingGPT with SVD fine-tuning methods. We showcase a 64-frame (32-second) sequence generated on the navtest dataset. (a) SVD fine-tuning methods often exhibit limitations in generating long videos, frequently repeating past content, such as indefinitely remaining at a red light. Conversely, (b) our DrivingGPT demonstrates superior performance in generating long, diverse, and visually appealing videos.
We compare the temporal consistency of DrivingGPT with SVD fine-tuning methods. Diffusion-based methods often exhibit inconsistent temporal generation, leading to object hallucination phenomena. For instance, when comparing models fine-tuned with SVD, we observe the sudden appearance (red box) and gradual disappearance (green box) of objects. In contrast, our autoregressive approach maintains better consistency.
Although the action prediction in DrivingGPT is simple, it achieves strong planning performance against competitive baselines on the NAVSIM mini benchmark and handles multiple challenging scenarios well. The sketch after the table shows how the PDM score (PDMS) combines the sub-metrics.
Method | NC ↑ | DAC ↑ | TTC ↑ | Comf. ↑ | EP ↑ | PDMS ↑ |
---|---|---|---|---|---|---|
Constant Velocity | 66.7 | 63.9 | 45.2 | 100.0 | 23.6 | 24.2 |
ResNet-50 + MLP decoder | 92.6 | 89.9 | 86.2 | 96.3 | 73.7 | 77.8 |
DrivingGPT | 98.9 | 90.7 | 94.9 | 95.6 | 79.7 | 82.4 |
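For reference, the helper below combines the NAVSIM sub-scores into the PDM score (PDMS) using the commonly described weighting (multiplicative safety terms, and a 5/2/5-weighted average of TTC, comfort, and ego progress); treat the exact weights as an assumption rather than this paper's evaluation code. PDMS is computed per scenario and then averaged, so the benchmark-level averages in the table do not multiply out directly.

```python
def pdm_score(nc: float, dac: float, ttc: float, comfort: float, ep: float) -> float:
    """Combine NAVSIM sub-scores (each in [0, 1]) into a single PDM score.

    NC = no at-fault collisions, DAC = drivable area compliance,
    TTC = time-to-collision within bound, Comf. = comfort, EP = ego progress.
    The weighting below follows the commonly described NAVSIM formulation and
    is an assumption, not this paper's exact evaluation code.
    """
    weighted = (5.0 * ttc + 2.0 * comfort + 5.0 * ep) / 12.0
    return nc * dac * weighted

# Example with made-up per-scenario sub-scores (not taken from the table above).
print(round(100 * pdm_score(nc=1.0, dac=1.0, ttc=0.9, comfort=1.0, ep=0.8), 1))  # 87.5
```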
@article{chen2024drivinggpt,
  author = {Chen, Yuntao and Wang, Yuqi and Zhang, Zhaoxiang},
  title  = {DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers},
  year   = {2024},
}