Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval

1 Queen Mary University of London
2 Sony AI
3 Adobe Research
4 WICT, Peking University
5 State Key Laboratory of General Artificial Intelligence, Peking University

Abstract

Video moment retrieval (VMR) aims to locate the most likely video moment(s) corresponding to a text query in untrimmed videos. Training of existing methods is limited by the lack of diverse and generalisable VMR datasets, hindering their ability to generalise moment-text associations to queries containing novel semantic concepts (unseen both visually and textually in a training source domain). For model generalisation to novel semantics, existing methods rely heavily on the assumption that both video and text sentence pairs from a target domain are available, in addition to the source-domain pair-wise training data. This is neither practical nor scalable. In this work, we introduce a more generalisable approach by assuming that only text sentences describing new semantics are available during model training, without having seen any videos from the target domain. To that end, we propose a Fine-grained Video Editing framework, termed FVE, that explores generative video diffusion to facilitate fine-grained video editing from the seen source concepts to the unseen target sentences consisting of new concepts. This enables generative hypotheses of unseen video moments corresponding to the novel concepts in the target domain. This fine-grained generative video diffusion retains the original video structure and subject specifics from the source domain while introducing the semantic distinctions of unseen novel vocabularies in the target domain. A critical challenge is how to make this generative fine-grained diffusion process meaningful for optimising VMR, rather than merely synthesising visually pleasing videos. We solve this problem by introducing a hybrid selection mechanism that integrates three quantitative metrics to selectively incorporate synthetic video moments (novel video hypotheses) as additions that enlarge the original source training data, whilst minimising detrimental noise and unnecessary repetition in the synthetic videos that would harm VMR learning. Experiments on three datasets demonstrate the effectiveness of FVE on unseen novel-semantic video moment retrieval tasks.

Our Framework



Our instance-preserving action editing model. We first treat the video as a set of images and train an image diffusion model to align a special text token with the instance shared across those frames. We then treat the frames as a sequence, freeze the layers of the image diffusion model, and append a temporal layer to capture video motion.
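A minimal PyTorch sketch of this two-stage schedule is shown below. The component and helper names (spatial_unet, temporal_layer, diffusion_loss) are hypothetical placeholders rather than the released FVE implementation: stage 1 fine-tunes the image diffusion backbone on individual frames to bind the special text token to the shared instance; stage 2 freezes those layers and trains only the appended temporal layer on the frame sequence.

# Minimal two-stage training sketch (illustrative only; placeholder modules).
import torch
import torch.nn as nn


class InstancePreservingEditor(nn.Module):
    def __init__(self, spatial_unet: nn.Module, temporal_layer: nn.Module):
        super().__init__()
        self.spatial_unet = spatial_unet      # pretrained image diffusion backbone
        self.temporal_layer = temporal_layer  # appended layer mixing across frames

    def forward(self, noisy_frames, timesteps, text_emb):
        # noisy_frames: (batch, frames, channels, height, width)
        b, f = noisy_frames.shape[:2]
        x = self.spatial_unet(noisy_frames.flatten(0, 1),
                              timesteps.repeat_interleave(f), text_emb)
        x = x.view(b, f, *x.shape[1:])
        return self.temporal_layer(x)  # captures motion across the sequence


def train_stage1(model, frames, special_token_emb, diffusion_loss, steps=1000):
    # Stage 1: frames are treated as independent images; the special text
    # token is bound to the instance shared across them.
    opt = torch.optim.AdamW(model.spatial_unet.parameters(), lr=1e-5)
    for _ in range(steps):
        loss = diffusion_loss(model.spatial_unet, frames, special_token_emb)
        opt.zero_grad()
        loss.backward()
        opt.step()


def train_stage2(model, video, text_emb, diffusion_loss, steps=500):
    # Stage 2: spatial layers are frozen; only the temporal layer is updated,
    # so the model learns motion without forgetting the instance appearance.
    for p in model.spatial_unet.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(model.temporal_layer.parameters(), lr=1e-5)
    for _ in range(steps):
        loss = diffusion_loss(model, video, text_emb)
        opt.zero_grad()
        loss.backward()
        opt.step()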

Hybrid Data Selection



Data generation and hybrid selection. For data generation, we first train the video diffusion model to align a moment with its sentence, then use an editing prompt to edit the moment. The hybrid selection strategy combines a cross-modal relevance score and a unimodal structure score to select high-quality generations, together with a model performance disparity to select data beneficial for VMR training.
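The sketch below illustrates one plausible form of the three selection scores. The helpers (clip_model.encode_image / encode_text, ssim, vmr_model.score) and the thresholds are assumptions for illustration; the exact metrics and their combination are those defined in the paper.

# Illustrative hybrid selection scores (assumed helpers and thresholds).
import torch
import torch.nn.functional as F


@torch.no_grad()
def cross_modal_relevance(clip_model, edited_frames, edit_sentence_tokens):
    # Mean cosine similarity between edited frames and the editing sentence.
    img = F.normalize(clip_model.encode_image(edited_frames), dim=-1)
    txt = F.normalize(clip_model.encode_text(edit_sentence_tokens), dim=-1)
    return (img @ txt.T).mean()


@torch.no_grad()
def unimodal_structure(ssim, edited_frames, source_frames):
    # Structural agreement with the source frames, so the edit keeps the
    # original layout and subject while changing the action semantics.
    scores = [ssim(e.unsqueeze(0), s.unsqueeze(0))
              for e, s in zip(edited_frames, source_frames)]
    return torch.stack(scores).mean()


@torch.no_grad()
def performance_disparity(vmr_model, synthetic_moment, source_moment, query):
    # Assumed proxy: how differently a source-trained VMR model scores the
    # synthetic moment versus the original one for the same query; a large
    # gap suggests the sample adds information the model has not yet learned.
    return (vmr_model.score(synthetic_moment, query)
            - vmr_model.score(source_moment, query))


def keep_sample(relevance, structure, disparity, thresholds=(0.25, 0.6, 0.0)):
    # A synthetic moment is added to VMR training only if all three scores pass.
    return (relevance >= thresholds[0] and structure >= thresholds[1]
            and disparity >= thresholds[2])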

Comparison with SOTA



Comparison with state-of-the-art methods on the VMR task.

Visualization


Examples of video editing.

Poster

BibTeX

@article{luo2024generative,
  title={Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval},
  author={Luo, Dezhao and Gong, Shaogang and Huang, Jiabo and Jin, Hailin and Liu, Yang},
  journal={arXiv preprint arXiv:2401.13329},
  year={2024}
}