AutoCut: End-to-end Advertisement Video Editing Based on Multimodal Discretization and Controllable Generation
Published in: CVPR 2026
Short-form videos have become a primary medium for digital advertising, which creates demand for scalable and efficient content creation. AutoCut is an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. It uses dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize those features into tokens aligned with textual representations, yielding a shared video-audio-text token space. Building on a foundation model, AutoCut trains a multimodal large language model for video editing through a combination of multimodal alignment and supervised fine-tuning. The resulting model supports video selection and ordering, script generation, and background music selection within a single editing framework.
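To make the discretization step concrete, the following is a minimal sketch of generic residual vector quantization, not the AutoCut implementation. Each stage picks the nearest codeword to the current residual and passes the remainder to the next stage. The codebooks, the function names, and the offset scheme in `to_token_ids` (placing audio/video codes after a text vocabulary of assumed size) are all illustrative assumptions; the abstract does not specify how the shared token space is laid out.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize one feature vector; return one codeword index per stage."""
    residual = np.asarray(x, dtype=np.float64)
    indices = []
    for cb in codebooks:  # each cb has shape (codebook_size, dim)
        # Squared Euclidean distance from the residual to every codeword.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # The next stage quantizes whatever this stage failed to capture.
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

def to_token_ids(indices, codebook_size, text_vocab_size):
    """Hypothetical mapping of stage-wise indices into a shared vocabulary.

    Assumption: modality tokens are appended after the text vocabulary,
    with a separate id range per quantization stage.
    """
    return [text_vocab_size + stage * codebook_size + idx
            for stage, idx in enumerate(indices)]

# Toy example with two stages and two codewords per stage.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.5, -0.5]]),
]
feature = np.array([1.4, 0.6])
indices = rvq_encode(feature, codebooks)
reconstruction = rvq_decode(indices, codebooks)
token_ids = to_token_ids(indices, codebook_size=2, text_vocab_size=1000)
```

In practice the codebooks are learned jointly with the encoders rather than fixed by hand, and adding stages lowers reconstruction error at the cost of more tokens per feature.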
Recommended Citation: Zhou, M., Qin, S.Z., Li, Y.Z., Chen, Q., Jiang, P., 2026. End-to-end Advertisement Video Editing Based on Multimodal Discretization and Controllable Generation. CVPR 2026. https://arxiv.org/abs/2603.28366

