Animate124: Animating One Image to 4D Dynamic Scene

Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee

1 National University of Singapore, 2 The University of Hong Kong, 3 The Hong Kong University of Science and Technology

Example text prompts for the showcased image-to-4D results:

A panda is dancing

A Clownfish is swimming

A blue flag with Chelsea Football Club logo on it, attached to a flagpole, waving with a smooth, gentle curve

A fox maneuvering a game controller with its paws

An astronaut, helmet in hand, rides a white horse

A monkey riding a bike

A space shuttle launching

A full-bodied tiger standing on its hind legs, confidently playing an acoustic guitar

Abstract

We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into a 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three distinct stages using multiple diffusion priors. In the first stage, a static model is optimized with the reference image under the guidance of 2D and 3D diffusion priors; this static model serves as the initialization for the dynamic NeRF. In the second stage, a video diffusion model is employed to learn the motion specific to the subject. However, the object in the resulting 3D video tends to drift away from the reference image over time, mainly because of the misalignment between the text prompt and the reference image in the video diffusion model. In the final stage, a personalized diffusion prior is therefore utilized to address this semantic drift. As the pioneering image-text-to-4D generation framework, our method demonstrates significant advancements over existing baselines, as evidenced by comprehensive quantitative and qualitative assessments.
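To make the three-stage schedule concrete, the sketch below is a minimal PyTorch-style illustration, not the released implementation: the prior, renderer, and model objects are placeholder stand-ins we introduce only for exposition. It shows how each stage reduces to rendering the (dynamic) NeRF and backpropagating a score-distillation loss from the relevant frozen diffusion priors.

# Minimal sketch of the three-stage schedule (placeholder modules, not the
# released implementation). Each "prior" stands in for a frozen diffusion
# model that only supplies a score-distillation (SDS) training signal.
import torch

class DummyPrior(torch.nn.Module):
    # Stand-in for a frozen diffusion prior (2D, 3D, video, or personalized).
    # A real prior would noise the rendering, predict the noise, and return the
    # SDS objective; this dummy just returns a differentiable scalar.
    def sds_loss(self, rendering):
        return (rendering ** 2).mean()

def render(nerf, time=None):
    # Stand-in renderer: a real one would volume-render the (dynamic) NeRF from
    # a sampled camera pose and, for the dynamic stages, a time stamp.
    query = torch.rand(1, 3, 64, 64)
    out = nerf(query)
    return out if time is None else out * time

nerf = torch.nn.Conv2d(3, 3, 1)                       # stand-in for the 4D grid dynamic NeRF
opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)
prior_2d, prior_3d = DummyPrior(), DummyPrior()       # image and 3D-aware diffusion priors
prior_video, prior_pers = DummyPrior(), DummyPrior()  # video diffusion and personalized priors

# Stage 1 (static): optimize a static scene from the reference image with 2D + 3D priors.
for _ in range(10):
    opt.zero_grad()
    image = render(nerf)
    (prior_2d.sds_loss(image) + prior_3d.sds_loss(image)).backward()
    opt.step()

# Stage 2 (coarse dynamic): add the time dimension and learn motion with the
# video diffusion prior, keeping the 3D prior for geometric consistency.
for _ in range(10):
    opt.zero_grad()
    frames = torch.stack([render(nerf, t) for t in torch.linspace(0.0, 1.0, 4)])
    (prior_video.sds_loss(frames) + prior_3d.sds_loss(frames[0])).backward()
    opt.step()

# Stage 3 (semantic refinement): add the personalized prior to pull drifting
# frames back toward the identity of the reference image.
for _ in range(10):
    opt.zero_grad()
    frames = torch.stack([render(nerf, t) for t in torch.linspace(0.0, 1.0, 4)])
    loss = (prior_video.sds_loss(frames) + prior_3d.sds_loss(frames[0])
            + prior_pers.sds_loss(frames))
    loss.backward()
    opt.step()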

Framework

Overall framework of Animate124. After learning the static scene (the first stage, not shown in the figure), the dynamic scene is optimized with a coarse-to-fine strategy in two stages. In the coarse stage, we optimize the dynamic NeRF with a combination of video diffusion and 3D diffusion priors. Subsequently, in the fine stage, an additional ControlNet prior is introduced to refine the details and correct semantic drift. The ControlNet condition is derived from the frozen coarse-stage model to reduce error accumulation.
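For concreteness, the self-contained sketch below illustrates the fine-stage training signal under our assumptions: a standard score-distillation (SDS) gradient from a frozen diffusion prior, with the ControlNet branch conditioned on an image produced by the frozen coarse-stage model. All modules here are tiny hypothetical stand-ins meant only to show the structure of the update, not the actual pretrained models or code.

# Hedged sketch of one fine-stage SDS step with ControlNet conditioning taken
# from the frozen coarse-stage model. All names are hypothetical placeholders.
import torch

def fake_unet(noisy, t, cond_image=None):
    # Stand-in for a frozen diffusion epsilon-predictor; with a conditioning
    # image it plays the role of a ControlNet-style prior.
    eps = torch.tanh(noisy)
    if cond_image is not None:
        eps = eps + 0.1 * cond_image
    return eps

def sds_loss(unet, x, cond_image=None):
    # SDS: noise the rendering, ask the frozen prior to denoise, and push x
    # along (eps_pred - eps). No gradient flows through the prior itself.
    t = torch.randint(20, 980, (1,))
    noise = torch.randn_like(x)
    alpha = 1.0 - t.float() / 1000.0
    noisy = alpha.sqrt() * x + (1.0 - alpha).sqrt() * noise   # forward diffusion
    with torch.no_grad():
        eps_pred = unet(noisy, t, cond_image)
    grad = (1.0 - alpha) * (eps_pred - noise)
    return (grad.detach() * x).sum()      # d(loss)/dx equals the SDS gradient

# Fine-stage step: the ControlNet condition is rendered by the FROZEN coarse
# model, so errors of the trainable model are not fed back as conditions.
coarse_frozen = torch.nn.Conv2d(3, 3, 1).requires_grad_(False)  # frozen coarse-stage stand-in
fine_model = torch.nn.Conv2d(3, 3, 1)                           # trainable dynamic NeRF stand-in
opt = torch.optim.Adam(fine_model.parameters(), lr=1e-3)

view = torch.rand(1, 3, 64, 64)           # stand-in for a sampled camera/time query
opt.zero_grad()
rendering = fine_model(view)
with torch.no_grad():
    cond = coarse_frozen(view)            # condition image from the frozen coarse stage
loss = (sds_loss(fake_unet, rendering)                          # video/3D diffusion signal
        + sds_loss(fake_unet, rendering, cond_image=cond))      # personalized ControlNet signal
loss.backward()
opt.step()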

Video

BibTeX

@article{zhao2023animate124,
  author    = {Zhao, Yuyang and Yan, Zhiwen and Xie, Enze and Hong, Lanqing and Li, Zhenguo and Lee, Gim Hee},
  title     = {Animate124: Animating One Image to 4D Dynamic Scene},
  journal   = {arXiv preprint arXiv:2311.14603},
  year      = {2023},
}