ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

A single model for diverse cloth simulation scenes — body-driven garments, robotic manipulation, and general collisions.

¹S-Lab, Nanyang Technological University  ·  ²Feeling AI  ·  ³University of Oxford  ·  ⁴Nanyang Technological University  ·  ⁵Shanghai AI Laboratory
Corresponding author
ClothTransformer teaser figure

ClothTransformer generalizes to unseen test cases across diverse scenarios. Left two: Diverse Object Collision — cloth falling onto unseen rigid objects (sword, character). Middle two: Human Garment — unseen body, garment, and animation combinations. Right two: Robotic Manipulation — unseen cloth meshes grasped by a robotic gripper. Notably, across all scenes, a single unified Transformer handles diverse cloth dynamics without per-scenario fine-tuning.

Abstract

Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation.

We present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios — body-driven garments, robotic manipulation, and general collisions — under a single model and achieves approximately 4–9× lower error than prior state-of-the-art; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ~493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.

Key Contributions

01 / Architecture
Unified Transformer

A single Transformer handles all three scenarios under one model, achieving approximately 4–9× lower error than prior state-of-the-art.

02 / Scalability
Latent-Space Dynamics

Arbitrary-resolution meshes are compressed into a fixed-size set of latent tokens — temporal computation is decoupled from mesh resolution.

03 / Dataset
Penetration-Free Data + CCD

A ~493.4k-frame penetration-free dataset across all three settings, enabling a differentiable Continuous Collision Detection (CCD) module that suppresses penetration artifacts.

Unified Architecture

The architecture is scenario-agnostic: cross-attention compression handles arbitrary mesh sizes, the Transformer carries no scenario-specific priors, and collision objects are encoded as generic triangle tokens. A single model is trained jointly across all scenarios — no per-scenario adaptation.
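The way cross-attention compression removes the dependence on mesh size can be illustrated with a minimal NumPy sketch. It assumes single-head attention, random untrained weights, and illustrative sizes; the names `compress_to_latents`, `latent_queries`, `w_k`, and `w_v` are hypothetical and not taken from the authors' code. A fixed set of latent queries attends over however many vertex tokens the mesh provides, so the output shape depends only on the number of latents.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_latents(vertex_tokens, latent_queries, w_k, w_v):
    """Single-head cross-attention: latent queries attend over vertex tokens.

    vertex_tokens:  (num_vertices, dim), varies with mesh resolution
    latent_queries: (num_latents, dim), fixed size
    returns:        (num_latents, dim)
    """
    keys = vertex_tokens @ w_k
    values = vertex_tokens @ w_v
    scores = latent_queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # each latent's weights over vertices sum to 1
    return weights @ values

rng = np.random.default_rng(0)
dim, num_latents = 16, 8  # illustrative sizes, not taken from the paper
latent_queries = rng.normal(size=(num_latents, dim))
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)

# Two meshes with very different vertex counts map to the same latent shape.
coarse = compress_to_latents(rng.normal(size=(500, dim)), latent_queries, w_k, w_v)
fine = compress_to_latents(rng.normal(size=(20000, dim)), latent_queries, w_k, w_v)
print(coarse.shape, fine.shape)
```

Both calls return an array of shape `(num_latents, dim)`, so any downstream temporal model sees the same token count regardless of mesh resolution.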

ClothTransformer architecture: Spatial Encoder, Temporal Transformer, Spatial Decoder
Architecture overview. The framework has three components: a Spatial Encoder (left) that compresses the history cloth mesh and lookahead collision geometry into a fixed-size set of latent tokens via cross-attention; a Temporal Transformer (middle) that evolves the latent state forward in time with block-causal masking; and a Spatial Decoder (right) that reconstructs the next-frame mesh by querying the predicted latents with rest-pose vertex tokens, followed by GNN refinement.
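Block-causal masking can be sketched as follows, assuming every frame contributes the same number of latent tokens and that a token attends fully within its own frame and to all earlier frames. The function name and the token counts are hypothetical, and the exact masking layout used in the paper may differ.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return a boolean matrix where mask[i][j] is True if token i may attend to token j.

    Tokens are ordered frame by frame. A token sees every token in its own
    frame and in earlier frames, and nothing from later frames.
    """
    total = num_frames * tokens_per_frame
    return [
        [(j // tokens_per_frame) <= (i // tokens_per_frame) for j in range(total)]
        for i in range(total)
    ]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# Prints:
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Unlike a plain token-level causal mask, the two tokens of each frame can see each other, which matches the idea that a frame's latents jointly describe one cloth state.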

High-Quality Penetration-Free Dataset

We construct a ~493.4k-frame penetration-free dataset spanning all three scenarios. Every sequence is verified intersection-free, enabling training with a differentiable Continuous Collision Detection loss and CCD post-processing.

Differentiable CCD

We ablate three settings, each building on the previous one: DCD loss only, + CCD loss during training, and + CCD post-processing at inference.
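The toy example below shows why a continuous check can catch what a discrete one misses. It assumes DCD stands for discrete collision detection, and it reduces the problem to one dimension: a point moving linearly during one time step through a thin static slab centered at height 0. The function names and numbers are hypothetical, and this is not the differentiable CCD module described here, which operates on mesh geometry.

```python
def discrete_hit(z_end, half_thickness):
    """Discrete check: is the point inside the slab at the end of the step?"""
    return abs(z_end) <= half_thickness

def continuous_hit(z_start, z_end, half_thickness):
    """Continuous check for linear motion z(t) = z_start + t * (z_end - z_start), t in [0, 1].

    Returns the earliest t at which the point touches the slab, or None.
    """
    if abs(z_start) <= half_thickness:
        return 0.0  # already in contact at the start of the step
    dz = z_end - z_start
    if dz == 0.0:
        return None  # no motion along the normal, so no new contact
    # The point first meets the slab face on the side it starts from.
    face = half_thickness if z_start > 0 else -half_thickness
    t = (face - z_start) / dz
    return t if 0.0 <= t <= 1.0 else None

# The point starts above the slab and ends below it within a single step.
z_start, z_end, half_thickness = 0.5, -0.5, 0.01
print(discrete_hit(z_end, half_thickness))             # False: the end position is outside the slab
print(continuous_hit(z_start, z_end, half_thickness))  # time of impact, roughly 0.49
```

The discrete test reports no collision because the point is outside the slab at both ends of the step, while the continuous test finds the crossing partway through the step.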

BibTeX

@article{zhang2026clothtransformer,
  title   = {ClothTransformer: Unified Latent-Space Transformers
             for Scalable Cloth Simulation},
  author  = {Zhang, Yu and Shao, Yidi and Ouyang, Wenqi and
             Lan, Yushi and Liang, Zhexin and Wu, Chengrui and
             Xu, Xudong and Pan, Xingang},
  journal = {arXiv preprint},
  year    = {2026}
}