StableV2V: Stablizing Shape Consistency in Video-to-Video Editing

1University of Science and Technology of China
*Corresponding author

News

Overview

StableV2V presents a novel paradigm for shape-consistent video editing, in particular handling scenarios where user prompts cause significant shape changes to the edited contents.

In addition, StableV2V shows superior flexibility across a wide range of downstream applications, supporting user prompts from various modalities.

Method

StableV2V builds on first-frame-based methods, which decompose the entire video editing process into image editing and motion transfer. It handles the video editing task with three main components, i.e., the Prompted First-frame Editor (PFE), the Iterative Shape Aligner (ISA), and the Conditional Image-to-video Generator (CIG). Specifically, these components function as follows:

  • PFE serves as the first-frame image editor: it converts user prompts into edited contents, which are later propagated to the entire video through the subsequent procedures.
  • ISA addresses the shape-inconsistency issue. It uses depth maps as an intermediate vehicle to deliver motions, simulating and aligning them with the shapes of the edited contents, thus offering accurate guidance to the CIG.
  • CIG takes the edited first frame and the aligned depth maps as input, acting as a depth-guided image-to-video generator that produces the entire edited video.
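The data flow through the three components above can be sketched as follows. This is a minimal toy illustration of the pipeline structure only: the function names mirror the component names from the paper, but their bodies are placeholder stubs standing in for the actual diffusion-based models, and the array-based "frames" and "depth maps" are illustrative assumptions.

```python
import numpy as np

# Toy stand-ins for the three StableV2V components. The real components are
# learned models; these stubs only demonstrate how data flows between them:
# frames -> edited first frame -> aligned depth maps -> edited video.

def prompted_first_frame_editor(first_frame, prompt):
    """PFE: edit the first frame according to the user prompt (placeholder)."""
    return first_frame + 1.0  # placeholder "edit"

def iterative_shape_aligner(depth_maps, edited_first_frame):
    """ISA: align per-frame depth maps with the edited shape (placeholder)."""
    return [d * 0.5 for d in depth_maps]  # placeholder "alignment"

def conditional_image_to_video_generator(edited_first_frame, aligned_depths):
    """CIG: propagate the edit through the video under depth guidance (placeholder)."""
    return [edited_first_frame + d for d in aligned_depths]

def stablev2v_pipeline(frames, depth_maps, prompt):
    edited_first = prompted_first_frame_editor(frames[0], prompt)
    aligned = iterative_shape_aligner(depth_maps, edited_first)
    return conditional_image_to_video_generator(edited_first, aligned)

# One "frame" per time step; 4x4 arrays stand in for images and depth maps.
frames = [np.zeros((4, 4)) for _ in range(3)]
depths = [np.ones((4, 4)) for _ in range(3)]
edited_video = stablev2v_pipeline(frames, depths, "replace the dog with a cat")
assert len(edited_video) == len(frames)
```

The key structural point is that the CIG never sees the raw source video: it is conditioned only on the edited first frame and the ISA-aligned depth maps, which is what keeps the propagated edit shape-consistent.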


DAVIS-Edit: A Testing Benchmark


    We manually construct a testing benchmark, namely DAVIS-Edit, to offer a comprehensive evaluation of video editing studies. DAVIS-Edit contains both text-based and image-based editing tasks, and comprises two subsets that address editing scenarios with different degrees of shape difference. You may refer to this link for more details about the curated DAVIS-Edit dataset.
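A benchmark organized along those two axes (subset by degree of shape change, task by prompt modality) can be indexed with a small helper like the one below. Note that the directory names here are illustrative assumptions, not DAVIS-Edit's actual on-disk layout; the sketch builds a tiny fake tree to demonstrate the traversal.

```python
import tempfile
from pathlib import Path

def index_cases(root):
    """Collect (subset, task, case) triples from a DAVIS-Edit-style tree.

    Assumed (hypothetical) layout: root/<subset>/<task>/<case>/...
    """
    cases = []
    for subset_dir in sorted(Path(root).iterdir()):
        for task_dir in sorted(subset_dir.iterdir()):
            for case_dir in sorted(task_dir.iterdir()):
                cases.append((subset_dir.name, task_dir.name, case_dir.name))
    return cases

# Build a tiny fake tree: two subsets (by degree of shape change) times two
# prompt modalities, each with one example video case.
with tempfile.TemporaryDirectory() as root:
    for subset in ("similar-shape", "changed-shape"):
        for task in ("text-prompt", "image-prompt"):
            (Path(root) / subset / task / "blackswan").mkdir(parents=True)
    cases = index_cases(root)

assert len(cases) == 4
```

An evaluation loop would then iterate over `cases`, running the editor once per (subset, task, case) triple and aggregating metrics per subset.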

Results and Applications

    In the following videos, we demonstrate some editing cases performed by StableV2V, including:
  1. Text-based Editing
  2. Image-based Editing
  3. Instruction-based Editing
  4. Sketch-based Editing
  5. Video Inpainting
  6. Video Style Transfer

Text-based Editing

Image-based Editing

Instruction-based Editing

Sketch-based Editing

Video Inpainting

Video Style Transfer

Comparison with State-of-the-Art Studies

    In the following videos, we showcase the qualitative comparison of StableV2V with existing state-of-the-art studies.

Text-based Editing

Image-based Editing

More Comparison

Citation

    If you find this work helpful to your research, or use our testing benchmark DAVIS-Edit, please cite our paper:


          @misc{liu-2024-etal-stablev2v,
            title={{StableV2V: Stablizing Shape Consistency in Video-to-Video Editing}}, 
            author={Chang Liu and Rui Li and Kaidong Zhang and Yunwei Lan and Dong Liu},
            year={2024},
            eprint={2411.11045},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
          }