ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

ClothTransformer generalizes to unseen test cases across diverse scenarios. Left two: Diverse Object Collision — cloth falling onto unseen rigid objects (sword, character). Middle two: Human Garment — unseen body, garment, and animation combinations. Right two: Robotic Manipulation — unseen cloth meshes grasped by a robotic gripper. Notably, across all scenes, a single unified Transformer handles diverse cloth dynamics without per-scenario fine-tuning.

Abstract

Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation.

We present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios (body-driven garments, robotic manipulation, and general collisions) under a single model and achieves approximately 4–9× lower error than prior state-of-the-art methods; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making the cost of temporal dynamics computation independent of mesh resolution; and (3) a high-fidelity, penetration-free dataset of ~493.4k frames spanning all three settings, which makes it possible to train with a differentiable Continuous Collision Detection (CCD) module that suppresses penetration artifacts.

One Model, Diverse Scenarios

A single unified Transformer handles diverse cloth dynamics without per-scenario specialization.

Scenario 1

Human Garment

Body-driven dressing on unseen body / garment / animation combinations.

Scenario 2

Robotic Manipulation

Unseen cloth meshes grasped and lifted by a robotic gripper.

Scenario 3

General Collision

Cloth colliding with unseen rigid objects, handled by continuous collision detection.

Key Contributions

01 / Architecture

Unified Transformer

A single Transformer handles all three scenarios under one model, achieving approximately 4–9× lower error than prior state-of-the-art methods.

02 / Scalability

Latent-Space Dynamics

Arbitrary-resolution meshes are compressed into a fixed-size set of latent tokens — temporal computation is decoupled from mesh resolution.
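To make the fixed-size compression concrete, here is a minimal NumPy sketch of single-head cross-attention in which a fixed set of latent queries attends over per-vertex tokens. It is an illustration under stated assumptions, not the paper's encoder: the function name `compress_to_latents`, the feature width `d`, the latent count `M`, and the random weights are all hypothetical, and the real model presumably uses learned, multi-head layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_latents(vertex_feats, latent_queries, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, not the paper's code).

    vertex_feats:   (N, d) per-vertex features; N varies with mesh resolution
    latent_queries: (M, d) query vectors; M is fixed
    returns:        (M, d) latent tokens, whose shape does not depend on N
    """
    keys = vertex_feats @ w_k                                    # (N, d)
    values = vertex_feats @ w_v                                  # (N, d)
    scores = latent_queries @ keys.T / np.sqrt(keys.shape[-1])   # (M, N)
    attn = softmax(scores, axis=-1)                              # each row sums to 1
    return attn @ values                                         # (M, d)

rng = np.random.default_rng(0)
d, M = 16, 8  # assumed sizes, chosen only for the demo
latent_queries = rng.normal(size=(M, d))
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

# Two meshes with very different vertex counts yield the same latent shape.
coarse = compress_to_latents(rng.normal(size=(500, d)), latent_queries, w_k, w_v)
fine = compress_to_latents(rng.normal(size=(20000, d)), latent_queries, w_k, w_v)
print(coarse.shape, fine.shape)  # (8, 16) (8, 16)
```

Because every later temporal step would operate only on the `(M, d)` array, its cost in this sketch depends on `M` rather than on the number of mesh vertices.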

03 / Dataset

Penetration-Free Data + CCD

A ~493.4k-frame penetration-free dataset across all three settings, enabling a differentiable Continuous Collision Detection (CCD) module that suppresses tunneling artifacts.

Unified Architecture

The architecture is scenario-agnostic: cross-attention compression handles arbitrary mesh sizes, the Transformer carries no scenario-specific priors, and collision objects are encoded as generic triangle tokens. A single model is trained jointly across all scenarios — no per-scenario adaptation.

**Architecture overview.** The framework has three components: a **Spatial Encoder** (left) that compresses the history cloth mesh and lookahead collision geometry into a fixed-size set of latent tokens via cross-attention; a **Temporal Transformer** (middle) that evolves the latent state forward in time with block-causal masking; and a **Spatial Decoder** (right) that reconstructs the next-frame mesh by querying the predicted latents with rest-pose vertex tokens, followed by GNN refinement.
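The block-causal masking mentioned above can be sketched as follows. This is a generic construction, assuming every frame contributes the same number of latent tokens and that tokens within a frame may attend to each other; the page does not specify the exact mask layout, and `block_causal_mask` is a hypothetical helper name.

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean attention mask: entry [i, j] is True if token i may attend to token j.

    Tokens attend to all tokens of their own frame and of earlier frames,
    but never to tokens of later frames.
    """
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Compared with a token-level causal mask, the block structure lets all latent tokens of one frame exchange information while still preventing any leakage from future frames.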

High-Quality Penetration-Free Dataset

We construct a ~493.4k-frame penetration-free dataset spanning all three scenarios. Every sequence is verified intersection-free, enabling training with a differentiable Continuous Collision Detection loss and CCD post-processing.

Differentiable CCD

We ablate three progressively stronger settings: training with a discrete collision detection (DCD) loss only (w/ DCD), adding the CCD loss during training (+ CCD Loss), and additionally applying CCD post-processing at inference (+ CCD Post.).

Panels shown for each case: w/ DCD, + CCD Loss, + CCD Post.
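As a rough intuition for why a continuous check can catch what a discrete one misses, the toy below moves a single point through a thin slab in one time step. It is a deliberately simplified 1D sketch, not the paper's differentiable CCD module (which operates on cloth and collision-mesh primitives): the slab geometry, the linear-motion assumption, and the helper names `inside_slab` and `first_contact_time` are all assumptions made for illustration.

```python
def inside_slab(z, thickness):
    """Discrete check: is the point inside the slab -thickness <= z <= 0?"""
    return -thickness <= z <= 0.0

def first_contact_time(z_start, z_end):
    """Continuous check against the slab's top face z = 0.

    Assumes the point moves linearly from z_start to z_end over the step and
    approaches from above. Returns the earliest t in [0, 1] with z(t) = 0,
    or None if the top face is not reached during the step.
    """
    if z_start < 0.0 or z_end > 0.0:
        return None              # starts below the face, or never reaches it
    if z_start == z_end:
        return 0.0               # both are exactly 0: already touching
    return z_start / (z_start - z_end)

thickness = 0.1
z_start, z_end = 0.5, -0.5       # one step carries the point through the slab

# The end-of-step position is already past the slab, so the discrete check sees nothing.
print(inside_slab(z_end, thickness))          # False
# The continuous check reports a contact halfway through the step.
print(first_contact_time(z_start, z_end))     # 0.5
```

In this toy, a penalty evaluated only at the end-of-step position receives no signal about the tunneling, whereas the contact time from the continuous check gives a quantity that a loss or a post-processing step could act on.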

BibTeX

@misc{zhang2026clothtransformerunifiedlatentspacetransformers,
      title={ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation},
      author={Yu Zhang and Yidi Shao and Wenqi Ouyang and Yushi Lan and Zhexin Liang and Chengrui Wu and Xudong Xu and Xingang Pan},
      year={2026},
      eprint={2605.27852},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2605.27852},
}
