Animate124: Animating One Image to 4D Dynamic Scene

Abstract

We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three distinct stages using multiple diffusion priors. In the first stage, a static model is optimized using the reference image, guided by 2D and 3D diffusion priors; this static model serves as the initialization for the dynamic NeRF. In the second stage, a video diffusion model is employed to learn the motion specific to the subject. However, the object in the 3D videos tends to drift away from the reference image over time, mainly because of the misalignment between the text prompt and the reference image in the video diffusion model. In the final stage, a personalized diffusion prior is therefore utilized to address this semantic drift. As the pioneering image-text-to-4D generation framework, our method demonstrates significant advancements over existing baselines, evidenced by comprehensive quantitative and qualitative assessments.

Framework

Overall framework of Animate124. After the static scene is learned (the first stage, not shown in the figure), the dynamic scene is optimized with a coarse-to-fine strategy in two stages. In the coarse stage, the dynamic NeRF is optimized with a combination of video diffusion and 3D diffusion priors. In the fine stage, an additional ControlNet prior is introduced to refine details and correct semantic drift. The ControlNet condition is derived from the frozen coarse-stage model to reduce error accumulation.
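The stage schedule described above can be summarized as a small sketch of which guidance terms are active in each stage. This is an illustration only, not the authors' implementation: the names `Stage`, `STAGES`, and `combined_loss`, the prior labels, and the default weight of 1.0 are all hypothetical, and we assume (from the word "additional") that the fine stage keeps the coarse-stage priors alongside the ControlNet prior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One optimization stage and the guidance terms active in it (hypothetical)."""
    name: str
    priors: tuple


# Assumed reading of the text: each stage lists the priors that guide it.
STAGES = (
    # Static stage: reference image plus 2D and 3D diffusion priors.
    Stage("static", ("reference_image", "diffusion_2d", "diffusion_3d")),
    # Coarse dynamic stage: video diffusion and 3D diffusion priors.
    Stage("dynamic_coarse", ("video_diffusion", "diffusion_3d")),
    # Fine dynamic stage: ControlNet prior added on top of the coarse priors
    # (keeping the coarse priors is an assumption, not stated explicitly).
    Stage("dynamic_fine", ("video_diffusion", "diffusion_3d", "controlnet")),
)


def combined_loss(stage, losses, weights):
    """Weighted sum of the per-prior loss values active in `stage`.

    `losses` maps prior label -> scalar loss value.
    `weights` maps prior label -> weight; missing labels default to 1.0,
    which is a placeholder and not a value from the paper.
    """
    return sum(weights.get(p, 1.0) * losses[p] for p in stage.priors)


if __name__ == "__main__":
    # Dummy scalar losses, purely for illustration.
    dummy = {"video_diffusion": 2.0, "diffusion_3d": 1.0, "controlnet": 4.0}
    for stage in STAGES[1:]:
        print(stage.name, combined_loss(stage, dummy, {"controlnet": 0.5}))
```

In a real pipeline each loss value would come from a diffusion-guided objective evaluated on rendered views or frames; the sketch only shows how the set of active priors changes from stage to stage.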

BibTeX

@article{zhao2023animate124,
  author  = {Zhao, Yuyang and Yan, Zhiwen and Xie, Enze and Hong, Lanqing and Li, Zhenguo and Lee, Gim Hee},
  title   = {Animate124: Animating One Image to 4D Dynamic Scene},
  journal = {arXiv preprint arXiv:2311.14603},
  year    = {2023},
}
