HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

CVPR 2025
Overview of HOIGen-1M

HOIGen-1M contains over one million video clips for HOI video generation with multiple types of HOI videos, diverse scenarios (15,000+ objects and 7,000+ interaction types), and expressive captions.

Abstract

Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using the powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucination by individual MLLM. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation.

HOIGen-1M

The construction of a large-scale HOI dataset faces two main challenges. The first challenge is acquiring high-quality and extensive video data that includes HOI. This involves accurately sourcing videos that capture these interactions. The second challenge is obtaining high-quality captions that precisely describe the people, objects, and scenes involved. This requires accurate and detailed captions to convey the complexity of the interactions and settings depicted in the videos. To address the aforementioned challenges, we build the first large-scale and high-quality dataset for HOI video generation named HOIGen-1M. It exhibits three main features:

  1. Large scale: HOIGen-1M curates over 1M video clips, all containing manually verified HOIs, which is sufficient for training T2V models.
  2. High quality: HOIGen-1M is strictly filtered on meta attributes, aesthetics, temporal consistency, motion difference, and MLLM assessment.
  3. Expressive captions: the captions in HOIGen-1M are precise because a Mixture-of-Multimodal-Experts (MoME) strategy is employed to detect and eliminate hallucinations via cross-verification among multiple MLLMs.
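The multi-stage filtering behind the "high quality" feature can be sketched as a cascade of checks. This is an illustrative sketch only: the threshold values, score names, and `ClipStats` fields are assumptions for exposition, not the dataset's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ClipStats:
    width: int
    height: int
    aesthetic_score: float       # e.g. from an aesthetics predictor, in [0, 1]
    temporal_consistency: float  # frame-to-frame similarity, in [0, 1]
    motion_score: float          # e.g. mean optical-flow magnitude
    mllm_contains_hoi: bool      # MLLM judgment: does the clip show an HOI?

def passes_filters(clip: ClipStats) -> bool:
    """Return True only if the clip survives every filtering stage."""
    if min(clip.width, clip.height) < 720:   # meta attributes: at least 720p
        return False
    if clip.aesthetic_score < 0.5:           # aesthetics filter
        return False
    if clip.temporal_consistency < 0.8:      # reject scene cuts / flicker
        return False
    if clip.motion_score < 0.1:              # discard near-static clips
        return False
    return clip.mllm_contains_hoi            # final MLLM-based HOI assessment
```

Clips that pass every automatic stage are then handed to human annotators for the final verification described above.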

A Comparison of the existing datasets for the T2V task and Our HOIGen-1M.

Our HOIGen-1M stands out as a precise T2V dataset tailored for HOI, with excellent video quality and detailed captions.

Statistics of video clips in HOIGen-1M.

HOIGen-1M includes multiple types of HOI and spans a range of clip durations. All videos have a resolution of at least 720p and include significant motions.

Caption words statistics in HOIGen-1M.

The distribution of word numbers shows the captions are high-quality and fine-grained, with an average length of 152 words. The distribution of actions and objects in the captions further demonstrates the diversity of the dataset. There are over 15,000 objects and over 7,000 interaction action types, making it possible to train a T2V model to simulate the real world. For clarity, we have only listed the categories with the highest frequency.

Mixture-of-Multimodal-Experts

To eliminate hallucinations generated by large models in video descriptions, we propose a Mixture-of-Multimodal-Experts (MoME) strategy that detects hallucinations and then fuses the complementary strengths of different MLLMs to correct them.

An illustration of the Mixture-of-Multimodal-Experts (MoME) strategy-based caption method.

MoME first adopts two caption experts and one decision expert to detect hallucinations. Then, an additional set of decision experts and caption experts is introduced to eliminate these hallucinations.
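The cross-verification flow above can be sketched as follows. The expert callables (`expert_a`, `expert_b`, `decide`, `correct`) stand in for real MLLM queries, and the single-pass correction is a simplification of the full strategy; all interfaces here are illustrative assumptions.

```python
def mome_caption(video, expert_a, expert_b, decide, correct):
    """MoME-style captioning sketch.

    expert_a, expert_b: caption experts, callables video -> caption string
    decide:  decision expert, callable (caption_a, caption_b) -> list of
             claims flagged as hallucinated (empty list means agreement)
    correct: correction experts, callable (video, caption, issues) -> caption
    """
    cap_a = expert_a(video)
    cap_b = expert_b(video)
    issues = decide(cap_a, cap_b)  # decision expert flags disagreements
    if not issues:
        return cap_a               # experts agree: accept the caption as-is
    # additional experts revise the flagged claims
    return correct(video, cap_a, issues)
```

Because hallucinations of an individual MLLM rarely coincide across models, disagreement between the two caption experts is a cheap but effective hallucination signal.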

New Metrics for HOI

To better align with human preferences in HOI video generation, we introduce two new automatic metrics, CoarseHOIScore and FineHOIScore, to assess the visual quality of interactions.

  1. Prompt suite: we select the object and interaction types from a large-scale HOI detection dataset. Then, we manually filter out rare and unreasonable interactions and get a total of 306 prompts.
  2. CoarseHOIScore: An ideal interaction should at least involve people, objects, and corresponding actions. Therefore, we adopt a HOI detector to predict possible HOI triplets in the generated videos.
  3. FineHOIScore: It evaluates the details of HOIs from a pixel-level perspective, such as the presence or absence of contact between a person and an object and the stability of interactive actions.
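As a concrete reading of the coarse metric, one natural formulation is the fraction of generated videos in which an off-the-shelf HOI detector recovers the prompted (action, object) pair. The detector interface below is a hypothetical stand-in, not the paper's exact scoring rule.

```python
def coarse_hoi_score(videos, prompts, detect_pairs):
    """Sketch of a coarse HOI score.

    videos:       list of generated videos
    prompts:      matching list of target (action, object) pairs
    detect_pairs: HOI detector, callable video -> set of (action, object)
                  pairs found in the video (the human is implicit)
    """
    hits = 0
    for video, target in zip(videos, prompts):
        if target in detect_pairs(video):  # prompted interaction detected?
            hits += 1
    return hits / len(videos)
```

FineHOIScore then refines this pass/fail judgment with pixel-level cues such as person-object contact and the temporal stability of the interaction.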

Evaluation results with proposed HOIScores and VBench in HOI video generation.

Despite the recent significant attention on T2V benchmarks, systematic evaluation of these models on HOI video generation is still lacking. To fill this gap, we evaluate five popular commercial models, Kling 1.5, Pika, Hailuo, Dreamina, and Gen-3, as well as five representative open-source methods: OpenSora, OpenSoraPlan, CogVideoX-2B, CogVideoX-5B, and Mochi-10B.

Effectiveness of HOIGen-1M

We conduct a thorough effectiveness analysis of the proposed captioning method to demonstrate its ability to generate high-quality textual descriptions, which in turn contributes to generating videos that align with HOI.

The effect of the proposed captioning method.

We fine-tune three models on HOIGen-1M using the LoRA mechanism. The HOI scores of all three models increase significantly after fine-tuning, which directly demonstrates the effectiveness of HOIGen-1M for HOI video generation.
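For readers unfamiliar with the LoRA mechanism mentioned above, the core idea is to freeze a pretrained weight matrix W and train only a low-rank correction B @ A. The minimal NumPy sketch below illustrates this; the rank, scaling, and initialization values are generic LoRA conventions, not the exact fine-tuning configuration used in our experiments.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank LoRA update.

    Only A (rank x d_in) and B (d_out x rank) would be trained, i.e.
    rank * (d_in + d_out) parameters instead of d_out * d_in.
    """
    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0):
        d_out, d_in = weight.shape
        self.weight = weight                         # frozen pretrained weight
        self.A = np.random.randn(rank, d_in) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, rank))             # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Effective weight is W + scale * B @ A; at initialization B = 0,
        # so the layer exactly reproduces the pretrained model.
        return x @ (self.weight + self.scale * self.B @ self.A).T
```

Because B starts at zero, fine-tuning begins from the pretrained model's behavior and only gradually injects HOI-specific adaptations learned from HOIGen-1M.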

The effect of fine-tuning using HOIGen-1M.

The quality of the generated interactive videos significantly improves after fine-tuning the model on HOIGen-1M. Here are the video comparison results:

BibTeX

BibTex Code Here