Researchers from Stanford University and FAIR Meta unveil CHOIS: a pioneering AI method for synthesizing realistic 3D human-object interactions guided by language

The problem of generating synchronized object and human motion within a 3D scene has been addressed by researchers from Stanford University and FAIR Meta, who introduce CHOIS (Controllable Human-Object Interaction Synthesis). Given a language description, the initial states of the human and the object, and a sparse set of object waypoints, the system produces realistic, controllable motion for both entities in the given 3D environment.
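The paper does not expose a public API for this, but the conditioning signal can be pictured as a simple record. The sketch below is purely illustrative: the class name, field names, and shapes are assumptions made for clarity, not the authors' code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ChoisCondition:
    """Hypothetical bundle of the inputs CHOIS conditions on."""
    text: str                         # e.g. "pick up the trashcan and move it next to the sofa"
    object_geometry: np.ndarray       # (N, 3) points sampled from the object mesh
    initial_human_state: np.ndarray   # starting human pose (e.g. joint positions)
    initial_object_state: np.ndarray  # starting object rotation and translation
    waypoints: np.ndarray             # (K, 3) sparse object waypoints in the scene
```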

Interest in generative human motion modeling, including text-conditioned motion synthesis, has grown with the availability of large-scale, high-quality motion capture datasets such as AMASS. While previous works have used VAE formulations to generate diverse human motion from text, CHOIS focuses on human-object interactions. Unlike existing methods that often concentrate on hand motion synthesis, CHOIS accounts for the whole-body movements that precede object grasping and predicts object motion conditioned on human motion, providing a more complete solution for interactive 3D scene simulation.

CHOIS addresses the need to capture realistic human behaviors in 3D environments, which is critical for computer graphics, embodied AI, and robotics. It advances the field by generating simultaneous human and object motion from language descriptions, initial states, and sparse object waypoints. In doing so, it tackles challenges such as producing realistic motion, accounting for environmental clutter, synthesizing interactions from language descriptions, and providing a complete system for controllable human-object interaction in diverse 3D scenes.

The model uses a conditional diffusion approach to generate simultaneous object and human motion from language descriptions, object geometry, and initial states. Constraints are incorporated as guidance during the sampling process to ensure realistic human-object contact. During training, an object geometry loss guides the model in predicting object transformations, without explicitly enforcing contact constraints at that stage.
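To make the sampling-time guidance concrete, here is a minimal sketch of a DDPM-style reverse loop with a constraint gradient applied at each denoising step. This is not the authors' implementation: the `model(x, t, cond)` signature, the tensor shapes, and `guidance_fn` (an assumed differentiable scalar cost such as hand-object distance) are all illustrative assumptions.

```python
import torch

@torch.no_grad()
def guided_sample(model, cond, betas, guidance_fn, scale=1.0, shape=(1, 120, 263)):
    """DDPM-style reverse loop with guidance applied at each denoising step.

    A minimal sketch: `model(x, t, cond)` is assumed to predict the noise in
    the joint human+object motion tensor `x`, `cond` bundles the
    text/geometry/waypoint features, and `guidance_fn(x, cond)` is an assumed
    differentiable scalar cost whose gradient nudges each step toward
    realistic contact.
    """
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # start from pure noise

    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, cond)  # predicted noise at step t
        # Standard DDPM posterior mean.
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])

        # Guidance: move the mean down the gradient of the contact cost.
        with torch.enable_grad():
            x_g = mean.detach().requires_grad_(True)
            grad = torch.autograd.grad(guidance_fn(x_g, cond), x_g)[0]
        mean = mean - scale * grad

        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x
```

A design note: applying the cost gradient only at sampling time means contact constraints need not be hard-coded into training, so the same trained model can be steered by different guidance terms.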

The CHOIS system is rigorously evaluated against baselines and ablations, showing superior performance on metrics such as condition matching, contact accuracy, reduced hand-object penetration, and foot floating. On the FullBodyManipulation dataset, the object geometry loss improves the model's capabilities. CHOIS also outperforms baselines and ablations on the 3D-FUTURE dataset, demonstrating its generalization to novel objects. Human perceptual studies find that CHOIS aligns better with the text input and produces higher-quality interactions than the baselines. Quantitative metrics, including position and orientation errors, measure the deviation of generated results from the ground-truth motion.
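The paper's exact metric definitions may differ, but position and orientation errors against ground truth are typically computed along these lines; the function names and shapes below are illustrative only.

```python
import numpy as np

def position_error(pred_pos, gt_pos):
    """Mean Euclidean distance between predicted and ground-truth
    object translations over all frames. Shapes: (T, 3)."""
    return np.linalg.norm(pred_pos - gt_pos, axis=-1).mean()

def orientation_error(pred_rot, gt_rot):
    """Mean geodesic angle (radians) between predicted and ground-truth
    object rotation matrices. Shapes: (T, 3, 3)."""
    rel = np.matmul(np.transpose(pred_rot, (0, 2, 1)), gt_rot)
    trace = np.trace(rel, axis1=1, axis2=2)
    angles = np.arccos(np.clip((trace - 1.0) / 2.0, -1.0, 1.0))
    return angles.mean()
```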

In conclusion, CHOIS is a system that generates realistic human-object interactions from a language description and sparse object waypoints. The approach applies an object geometry loss during training and effective guidance terms during sampling to enhance the realism of the results. The interaction module learned by CHOIS can be integrated into a pipeline that synthesizes long-horizon interactions given language and a 3D scene. Overall, CHOIS significantly improves the generation of realistic human-object interactions consistent with the provided language descriptions.

Future research could explore enhancing CHOIS with additional supervision, such as the object geometry loss, to better match generated object motion to the input waypoints. Investigating more advanced guidance terms to enforce contact constraints may yield even more realistic results. Expanding evaluations to more diverse datasets and scenarios would further test the generalization capabilities of CHOIS. Additional human perceptual studies could provide deeper insight into the quality of the generated interactions. Applying the learned interaction module to synthesize long-horizon interactions from object waypoints in 3D scenes would also broaden the applicability of CHOIS.


Check out the Paper and project page. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook community, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you’ll love our newsletter.

Hello, my name is Adnan Hassan. I am a Consultant Trainee at Marktechpost and soon to be a Management Trainee at American Express. I am currently pursuing my dual degree at Indian Institute of Technology Kharagpur. I’m passionate about technology and want to create new products that make a difference.



