Facebook-parent Meta releases OpenEQA for home robots and smart glasses: Here’s how it may work


Meta has introduced the Open-Vocabulary Embodied Question Answering (OpenEQA) framework to test how well artificial intelligence (AI) can understand the world around it. The open-source framework is designed to measure whether AI agents (such as smart glasses and home robots) can use sensory inputs to gather clues from their environment, "see" the spaces they're in, and offer value to humans who ask for AI assistance.
In its blog post, the social media giant explained: “Imagine an embodied AI agent that acts as the brain of a home robot or a stylish pair of smart glasses. Such an agent needs to leverage sensory modalities like vision to understand its surroundings and be capable of communicating in clear, everyday language to effectively assist people.”

How OpenEQA will work

Apart from explaining the technology, Meta has offered several examples to demonstrate how OpenEQA could work. These include users asking AI agents where they've placed an item they need, or whether they still have food left in the pantry.
Meta explains: “Let’s say you’re getting ready to leave the house and can’t find your office badge. You could ask your smart glasses where you left them, and the agent might respond that the badge is on the dining table by leveraging its episodic memory. Or if you were hungry on the way back home, you could ask your home robot if there’s any fruit left. Based on its active exploration of the environment, it might respond that there are ripe bananas in the fruit basket.”

This means that, before long, an at-home robot or a pair of smart glasses could help run our daily lives. However, a major challenge remains in developing such a technology.

Meta’s ‘problems’ with vision language models

Meta discovered that vision language models (VLMs) don't perform accurately on these tasks. The company noted: "In fact, for questions that require spatial understanding, today's VLMs are nearly 'blind': access to visual content provides no significant improvement over language-only models."
The company explains that this was one of the reasons for releasing OpenEQA as an open-source framework. According to Meta, an AI model that can truly "see" the world around it as humans do, recollect where things were placed and when, and then provide contextual value to a human based on abstract queries is extremely difficult to create.
As per the company, a community of researchers, technologists, and experts will need to work together to make it a reality.
Meta also said that OpenEQA contains more than 1,600 "non-templated" question-and-answer pairs that represent how a human might interact with an AI agent. The company has validated the pairs to ensure that they can be answered correctly. Even so, more work needs to be done on this technology.
The company claimed: “As an example, for the question ‘I’m sitting on the living room couch watching TV. Which room is directly behind me?’, the models guess different rooms essentially at random without significantly benefitting from visual episodic memory that should provide an understanding of the space. This suggests that additional improvement on both perception and reasoning fronts are needed before embodied AI agents powered by such models are ready for primetime.”
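To make the benchmark idea concrete, here is a minimal sketch of what checking a model's answer against an OpenEQA-style question-answer pair might look like. The field names and the naive exact-match scoring rule are assumptions for illustration only, not Meta's actual data schema or evaluation metric (real benchmarks typically use fuzzier, often model-based scoring of free-form answers):

```python
# Hypothetical illustration of scoring an answer against an
# OpenEQA-style question-answer pair. Schema and metric are assumed.

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial wording
    differences don't break a match."""
    return "".join(
        ch for ch in text.lower() if ch.isalnum() or ch.isspace()
    ).strip()

def naive_match(model_answer: str, ground_truth: str) -> bool:
    """Crude exact-match check after normalization."""
    return normalize(model_answer) == normalize(ground_truth)

# An episodic-memory-style question, like the badge example above.
example_pair = {
    "question": "Where did I leave my office badge?",
    "answer": "On the dining table",
}

print(naive_match("on the dining table.", example_pair["answer"]))  # True
print(naive_match("in the kitchen", example_pair["answer"]))        # False
```

Because the pairs are "non-templated" free-form language, a rigid check like this would undercount correct answers ("on the table in the dining room" would fail), which is part of why evaluating such benchmarks is itself a research problem.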
