A single model for diverse cloth simulation scenes — body-driven garments, robotic manipulation, and general collisions.
Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation.
We present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios (body-driven garments, robotic manipulation, and general collisions) with a single model and achieves approximately 4–9× lower error than the prior state of the art; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making the computation of temporal dynamics independent of mesh resolution; and (3) a high-fidelity, penetration-free dataset of ~493.4k frames spanning all three scenarios, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.
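To make "autoregressive sequence modeling in a learned latent space" concrete, the following is a minimal sketch of a rollout loop, not the authors' implementation. The names `rollout` and `extrapolate` and the context length of 2 are assumptions for illustration; `extrapolate` is a trivial stand-in for the learned Transformer, and encoding to and decoding from the latent space is omitted.

```python
from typing import Callable, List

Latent = List[float]


def rollout(
    initial_latents: List[Latent],
    predict_next: Callable[[List[Latent]], Latent],
    num_steps: int,
    context: int = 2,
) -> List[Latent]:
    """Autoregressively extend a latent trajectory.

    Each new latent is predicted from the most recent `context` latents,
    then appended so that it feeds into the next prediction.
    """
    history = [list(z) for z in initial_latents]
    for _ in range(num_steps):
        history.append(predict_next(history[-context:]))
    return history


def extrapolate(window: List[Latent]) -> Latent:
    """Hypothetical stand-in for the learned model: linear extrapolation."""
    prev, curr = window[-2], window[-1]
    return [2.0 * c - p for p, c in zip(prev, curr)]


# Two seed latents, then three predicted steps.
traj = rollout([[0.0, 1.0], [0.5, 1.0]], extrapolate, num_steps=3)
print(len(traj), traj[-1])
```

The point of the sketch is only the control flow: the predictor never sees mesh vertices, so the cost of each step depends on the latent size rather than on the mesh resolution.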
A single Transformer handles all three scenarios with one model, achieving approximately 4–9× lower error than the prior state of the art.
Arbitrary-resolution meshes are compressed into a fixed-size set of latent tokens — temporal computation is decoupled from mesh resolution.
A ~493.4k-frame penetration-free dataset spanning all three scenarios, enabling a differentiable Continuous Collision Detection (CCD) module that suppresses tunneling artifacts.
The architecture is scenario-agnostic: cross-attention compression handles arbitrary mesh sizes, the Transformer carries no scenario-specific priors, and collision objects are encoded as generic triangle tokens. A single model is trained jointly across all scenarios — no per-scenario adaptation.
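The idea that cross-attention compression yields a fixed-size token set for any mesh size can be sketched as follows. This is an illustrative single-head version in pure Python, assuming identity key/value projections and random "learned" queries; the function name `compress` and the sizes `DIM` and `NUM_LATENTS` are hypothetical and not taken from the paper.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(vertex_feats: List[Vector], latent_queries: List[Vector]) -> List[Vector]:
    """Cross-attend K latent queries over N vertex features.

    Returns K tokens regardless of N. Keys and values are the raw
    vertex features (identity projections, an assumption of this sketch).
    """
    dim = len(latent_queries[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for query in latent_queries:
        scores = [scale * sum(q * v for q, v in zip(query, feat)) for feat in vertex_feats]
        weights = softmax(scores)
        tokens.append(
            [sum(w * feat[d] for w, feat in zip(weights, vertex_feats)) for d in range(dim)]
        )
    return tokens


rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4  # hypothetical sizes


def random_feats(count: int) -> List[Vector]:
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


queries = random_feats(NUM_LATENTS)  # stands in for learned query tokens
coarse_tokens = compress(random_feats(50), queries)   # 50-vertex mesh
fine_tokens = compress(random_feats(800), queries)    # 800-vertex mesh
print(len(coarse_tokens), len(fine_tokens))
```

Both meshes produce the same number of tokens, so any module that operates on the tokens sees a constant-size input. Collision triangles could be fed through the same mechanism as additional feature rows, although how the paper featurizes them is not specified here.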
We construct a ~493.4k-frame penetration-free dataset spanning all three scenarios. Every sequence is verified intersection-free, enabling training with a differentiable Continuous Collision Detection loss and CCD post-processing.
We ablate three settings cumulatively: Discrete Collision Detection (DCD) loss only; adding the CCD loss during training; and additionally applying CCD post-processing at inference.
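The difference between a discrete check and a continuous one can be illustrated with a deliberately simplified case: a single vertex moving linearly along z during one time step, against a thin static plate. This is a sketch under those assumptions, not the paper's CCD module, which would have to handle moving vertex-triangle and edge-edge pairs and be differentiable. The function names are hypothetical.

```python
from typing import Optional


def dcd_inside_plate(z_end: float, plate_top: float, plate_bottom: float) -> bool:
    """Discrete check: inspects only the end-of-step position."""
    return plate_bottom <= z_end <= plate_top


def ccd_first_hit(z_start: float, z_end: float, plate_top: float) -> Optional[float]:
    """Continuous check for a vertex that starts on or above the plate.

    Returns the fraction t in [0, 1] of the step at which the vertex
    first reaches the plate's top surface, or None if it never does.
    """
    if z_start < plate_top:
        raise ValueError("this sketch assumes the vertex starts on or above the plate")
    if z_end > plate_top:
        return None  # stays above the plate for the whole step
    if z_start == z_end:
        return 0.0  # resting exactly on the surface
    return (z_start - plate_top) / (z_start - z_end)


# A fast vertex jumps from z = 1.0 to z = -1.0 across a plate spanning [-0.1, 0.0].
missed = dcd_inside_plate(-1.0, plate_top=0.0, plate_bottom=-0.1)
hit_time = ccd_first_hit(1.0, -1.0, plate_top=0.0)
print(missed, hit_time)  # False 0.5
```

The discrete check reports no contact because the end position lies below the plate, which is exactly the tunneling failure mode; the continuous check finds the crossing halfway through the step.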
@article{zhang2026clothtransformer,
  title   = {ClothTransformer: Unified Latent-Space Transformers
             for Scalable Cloth Simulation},
  author  = {Zhang, Yu and Shao, Yidi and Ouyang, Wenqi and
             Lan, Yushi and Liang, Zhexin and Wu, Chengrui and
             Xu, Xudong and Pan, Xingang},
  journal = {arXiv preprint},
  year    = {2026}
}