The construction of a large-scale HOI dataset faces two main challenges. The first is acquiring extensive, high-quality video data that contains HOI, which requires accurately sourcing videos that capture such interactions. The second is obtaining high-quality captions that precisely describe the people, objects, and scenes involved, with enough accuracy and detail to convey the complexity of the interactions and settings depicted. To address these challenges, we build HOIGen-1M, the first large-scale, high-quality dataset for HOI video generation. It exhibits three main features:
- Large scale: HOIGen-1M curates over 1M video clips, all containing manually verified HOI, which is sufficient for training T2V models.
- High quality: HOIGen-1M is strictly filtered with respect to meta attributes, aesthetics, temporal consistency, motion difference, and MLLM assessment.
- Expressive captions: the captions in HOIGen-1M are precise because a Mixture-of-Multimodal-Experts (MoME) strategy detects and eliminates hallucinations via cross-verification among multiple MLLMs.
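The quality-filtering stages above can be sketched as a sequence of threshold checks. The following is a minimal illustrative sketch, not the authors' implementation: the score names, scales, and threshold values are all assumptions, and in practice each score would come from a dedicated model (e.g. an aesthetic predictor or an MLLM judge).

```python
def passes_filters(clip, thresholds=None):
    """Return True if a clip survives every filtering stage.

    `clip` is a dict of precomputed per-clip scores; all score names,
    scales, and default thresholds below are hypothetical.
    """
    t = thresholds or {
        "min_height": 720,           # meta attribute: resolution floor
        "aesthetic": 4.5,            # aesthetic score (assumed scale)
        "temporal_consistency": 0.9, # frame-to-frame consistency
        "motion": 0.2,               # minimum motion difference (reject near-static clips)
        "mllm": 0.5,                 # MLLM-assessed HOI relevance
    }
    if clip["height"] < t["min_height"]:
        return False
    if clip["aesthetic_score"] < t["aesthetic"]:
        return False
    if clip["temporal_consistency"] < t["temporal_consistency"]:
        return False
    if clip["motion_score"] < t["motion"]:
        return False
    if clip["mllm_score"] < t["mllm"]:
        return False
    return True
```

Ordering the stages from cheapest (metadata) to most expensive (MLLM assessment) lets most rejected clips be discarded before the costly checks run.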
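The cross-verification idea behind MoME can be illustrated as majority voting over atomic statements extracted from each expert's caption. This is a simplified sketch under our own assumptions (the paper does not specify this exact mechanism): each MLLM's output is reduced to a set of statements, and statements confirmed by too few experts are treated as likely hallucinations and dropped.

```python
from collections import Counter

def cross_verify(expert_outputs, min_votes=2):
    """Keep only statements confirmed by at least `min_votes` experts.

    `expert_outputs` is a list of sets, one per MLLM, each holding the
    atomic statements (e.g. detected objects or actions) that the model
    reported for a clip. The voting scheme is an illustrative assumption.
    """
    votes = Counter()
    for statements in expert_outputs:
        votes.update(statements)
    return {s for s, n in votes.items() if n >= min_votes}
```

For example, if one expert hallucinates an object that no other expert reports, that statement receives a single vote and is excluded from the final caption.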

A comparison of existing datasets for the T2V task and our HOIGen-1M. HOIGen-1M stands out as a precise T2V dataset tailored for HOI, with excellent video quality and detailed captions.

HOIGen-1M includes multiple types of HOI and spans a range of clip durations. All videos have a resolution of at least 720p and contain significant motion.

The distribution of word counts shows that the captions are high-quality and fine-grained, with an average length of 152 words. The distributions of actions and objects in the captions further demonstrate the dataset's diversity: it covers over 15,000 objects and over 7,000 interaction action types, making it possible to train a T2V model to simulate the real world. For clarity, only the most frequent categories are listed.