StableV2V presents a novel paradigm for shape-consistent video editing, particularly in scenarios where user prompts cause significant shape changes to the edited contents.
Besides, StableV2V offers superior flexibility across a wide range of downstream applications, supporting user prompts from various modalities.
StableV2V builds upon first-frame-based methods, which decompose the video editing process into image editing and motion transfer. It handles the video editing task with three main components, i.e., the Prompted First-frame Editor (PFE), the Iterative Shape Aligner (ISA), and the Conditional Image-to-video Generator (CIG). Specifically, PFE edits the first video frame according to the user prompt, ISA aligns the motion cues of the source video with the shape of the edited contents, and CIG generates the final edited video conditioned on the edited first frame and the aligned motion.
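To make the three-stage decomposition concrete, the following minimal sketch illustrates how the components compose. The function names and signatures below are hypothetical placeholders for illustration only, not the actual API of this repository:

```python
# A minimal, hypothetical sketch of how StableV2V's three stages compose.
# Function names and signatures are illustrative placeholders, not this repo's API.

from typing import Any, List

Frame = Any  # stand-in for an image/tensor type


def prompted_first_frame_editor(first_frame: Frame, prompt: str) -> Frame:
    """PFE: apply an image editor to the first frame, following the user prompt."""
    raise NotImplementedError("placeholder for an image-editing model")


def iterative_shape_aligner(source_frames: List[Frame], edited_first_frame: Frame) -> List[Frame]:
    """ISA: align the source video's motion cues with the shape of the
    edited contents, producing per-frame conditions for generation."""
    raise NotImplementedError("placeholder for motion/shape alignment")


def conditional_image_to_video_generator(edited_first_frame: Frame, conditions: List[Frame]) -> List[Frame]:
    """CIG: generate the edited video from the edited first frame,
    guided by the aligned per-frame conditions."""
    raise NotImplementedError("placeholder for a conditional image-to-video model")


def stablev2v_pipeline(source_frames: List[Frame], prompt: str) -> List[Frame]:
    """Compose the three stages: edit the first frame, align motion, then generate."""
    edited_first = prompted_first_frame_editor(source_frames[0], prompt)
    conditions = iterative_shape_aligner(source_frames, edited_first)
    return conditional_image_to_video_generator(edited_first, conditions)
```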
We manually construct a testing benchmark, namely DAVIS-Edit, to offer a comprehensive evaluation of video editing methods. DAVIS-Edit contains both text-based and image-based editing tasks, and comprises two subsets that address editing scenarios with different degrees of shape difference. You may refer to this link for more details about the curated DAVIS-Edit dataset.
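As an illustration of how one might iterate over the benchmark, the sketch below assumes two subsets and two prompt modalities; the subset names and directory layout are assumptions for illustration, not the dataset's confirmed structure:

```python
# Hypothetical evaluation loop over DAVIS-Edit.
# Subset names, modality names, and directory layout are assumptions.

from itertools import product
from pathlib import Path

SUBSETS = ["DAVIS-Edit-S", "DAVIS-Edit-C"]  # similar vs. changing shapes (assumed names)
MODALITIES = ["text", "image"]              # text-based and image-based editing tasks


def collect_cases(root: Path):
    """Yield (subset, modality, video_dir) triples for every editing case."""
    for subset, modality in product(SUBSETS, MODALITIES):
        subset_dir = root / subset / modality  # assumed directory layout
        if not subset_dir.is_dir():
            continue
        for video_dir in sorted(subset_dir.iterdir()):
            if video_dir.is_dir():
                yield subset, modality, video_dir
```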