Inside Fei-Fei Li’s Plan to Build AI-Powered Virtual Worlds

Recent AI progress has followed a pattern. With text, images, audio, and video, once the right technical foundations were in place, it took only a few years for AI-generated results to go from merely passable to indistinguishable from human creations. While it's still early days, recent advances suggest that virtual worlds, 3D environments you can explore and interact with, could be next.

It's a bet being made by a pioneering artificial intelligence researcher: Fei-Fei Li, often called the “godmother” of artificial intelligence for her contributions to computer vision. In November, her new startup, World Labs, launched its first commercial offering: a platform called Marble, where users can create exportable 3D environments from text, image, or video prompts.

The platform can provide immediate value to design professionals by automating some technically complex creative work. But Li's ultimate goal is far more ambitious: to build not just virtual worlds, but what she calls “spatial intelligence.” Her manifesto describes it as the boundary beyond language: “the faculty that links imagination, perception and action.” AI systems can already see the world; with spatial intelligence, she argues, they could begin to interact with it in meaningful ways.

Worlds on request

Although virtual worlds already exist in the form of video games that we interact with through screens or headsets, their creation is technically complex and time-consuming. With the help of AI, virtual worlds could be created much more easily, personalized for their users, and infinitely expandable—at least in theory.

In practice, world models, including those from other companies such as Google DeepMind's Genie 3, are still early in realizing their potential. Ben Mildenhall, one of Li's co-founders at World Labs, says he expects them to follow the same trajectory we've seen with text, audio, and video: people moving from “that's cute” to “that's interesting” to “I didn't realize this was made by artificial intelligence.”

Indeed, AI video generation models have improved rapidly. That improvement is behind the recent popular success of video models from OpenAI and Midjourney. Companies like Captions, Runway, and Synthesia have also built businesses around AI-generated video. According to Vincent Sitzmann, an assistant professor at MIT and an expert in AI world modeling, video models can be thought of as “proto-world models.”

Li's new platform, Marble, offers several ways to create. You can give it a written description, photographs, video, or an existing 3D scene, and it will produce a “world” that you can navigate in first person, much like a video game. But these worlds, static at first (though developers can add movement and more using specialized tools), have clear limitations. It takes only a few beats of exploration before the visuals begin to distort and the world takes on a hallucinatory, disjointed texture.
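To make that workflow concrete, here is a minimal sketch of what prompt-to-world generation could look like from a developer's seat. World Labs has not published a public API, so the endpoint, field names, and export format below are hypothetical, meant only to illustrate the text-or-image-in, exportable-3D-scene-out interaction described above.

```python
# Hypothetical sketch of a Marble-style prompt-to-world request.
# The endpoint, payload fields, and export format are invented for
# illustration; World Labs has not published a public API.
import requests

API_URL = "https://api.example.com/v1/worlds"  # hypothetical endpoint


def generate_world(prompt: str, reference_image: str | None = None) -> bytes:
    """Request a 3D environment from a text prompt (plus an optional
    conditioning image), returning an exportable scene file as bytes."""
    payload = {"prompt": prompt, "export_format": "glb"}
    files = {}
    if reference_image is not None:
        files["image"] = open(reference_image, "rb")  # image conditioning
    resp = requests.post(API_URL, data=payload, files=files or None, timeout=600)
    resp.raise_for_status()
    return resp.content  # binary glTF scene, ready for a game engine


if __name__ == "__main__":
    scene = generate_world("a sunlit Mediterranean courtyard with a fountain")
    with open("courtyard.glb", "wb") as f:
        f.write(scene)  # import into Blender, Unity, or Unreal to explore
```

The key design point is the last step: unlike a generated video, the output is an exportable scene that downstream tools can open and extend.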

Modeling entire worlds is much harder than generating videos. Mildenhall argues that because the barrier to entry for creating 3D worlds is so much higher than for writing words, you start to see “glimpses of value” in tools like Marble much earlier. “World Labs has shown what can be achieved when you integrate and scale a number of advances made by the computer vision community over the last decade,” says Sitzmann. “This is a very impressive achievement. For the first time, you get a sense of what kind of products can be made with this.”

Li says that “we can use this technology to create many virtual worlds that connect, extend or complement our physical world.” The case for using world models to create new entertainment is fairly obvious. And in fields like architecture and engineering, “you can try a thousand times, exploring many potential alternatives at a much lower cost,” says Mildenhall. But major hurdles remain in other touted applications: robotics, science, and education.

The road ahead

Although we have plenty of video and camera data with which to train video models, suitable data for training robots, especially humanoid robots, is much harder to find. We lack proprioceptive or “action data,” says Sitzmann: the data that tells a robot which motor movements correspond to which physical actions.

For self-driving cars, which have just a few inputs (gears, pedals, and a steering wheel), we can “collect millions of hours of video that correlate with the actions of human drivers. But a humanoid robot has all these other joints and actions they can do. And we don't have the data for that,” he says.
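To see why this gap is a matter of dimensionality as much as volume, consider a sketch of the two action records. The field names and joint list below are illustrative assumptions, not any real vehicle's or robot's interface; the point is that each logged frame for a humanoid requires a far richer action record than a driving log does.

```python
# Illustrative sketch of the action-data gap between cars and humanoids.
# Field names and the joint list are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class DrivingAction:
    """A driver's full control state: three low-dimensional inputs.
    Millions of hours of road video can be paired with logs like this."""
    steering_angle: float  # radians
    throttle: float        # 0.0 to 1.0
    brake: float           # 0.0 to 1.0


@dataclass
class HumanoidAction:
    """One timestep of whole-body control: dozens of joint targets.
    Ordinary video of people contains none of these values, which is
    the missing "action data" Sitzmann describes."""
    joint_targets: dict[str, float] = field(default_factory=dict)


HUMANOID_JOINTS = [
    "neck_pitch", "torso_yaw",
    "l_shoulder_pitch", "l_shoulder_roll", "l_elbow", "l_wrist",
    "r_shoulder_pitch", "r_shoulder_roll", "r_elbow", "r_wrist",
    "l_hip_pitch", "l_hip_roll", "l_knee", "l_ankle",
    "r_hip_pitch", "r_hip_roll", "r_knee", "r_ankle",
]  # a simplified 18-joint body; real humanoids have more degrees of freedom

drive = DrivingAction(steering_angle=0.05, throttle=0.3, brake=0.0)
step = HumanoidAction({joint: 0.0 for joint in HUMANOID_JOINTS})
print(len(vars(drive)), "driving inputs vs", len(step.joint_targets), "joint targets")
```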

In her manifesto, Li argues that world models will play a “defining role” in solving the data problem for robotics. But while the manifesto lays out the concept, Sitzmann says it “doesn't really answer the question” of exactly how world models will get there, since an accurate simulator would require precisely the data correlating movement with action that we currently lack.

There are also challenges in applying world models to science and education. For entertainment, it's enough for everything to look realistic. But for science and education, what matters more is fidelity to the real-world dynamics being simulated. “I [could] go in and feel the inside of the cell,” or “if I am a surgeon training in laparoscopic surgery, I [could be] inside the gut,” Li says, describing what future world models might offer. But, of course, simulations of a cell or a surgery are only useful to the extent that they are accurate. The founders of World Labs are acutely aware of the trade-off between realism and fidelity, and they are optimistic that the models will eventually be good enough to provide both.

What if it works?

Compared to language, “spatial reasoning in modern AI is much worse,” says Li. This is true. But while Li is betting that solving spatial intelligence (as her company defines it) is necessary for AI to advance beyond a certain point, a wager with trillion-dollar stakes, whether it will pay off remains to be seen. Whether existing multimodal language models such as ChatGPT will hit a wall and stop improving is also an open question. What we do know is that across the industry and across modalities, models keep improving.

Mildenhall suggests that we will reach a point where “within the model you can experience everything that you can experience in reality.” In such a world, you could “interact with a thing multimodally and transform it at will, on any impulse,” he says.

As world models and virtual reality advance in parallel, we can imagine a strange future in which each of us has access to our own infinitely vast and fascinating generated worlds. Instead of watching a TikTok video of a cat, the cat is right in front of you. Instead of scrolling, you explore. Such a world would obey your will. Some users may be seduced by this, just as some fall in love with chatbots today. “We’re not at that level right now,” says Christoph Lassner, another co-founder of World Labs. Sitzmann agrees that the idea is “not crazy,” though he notes that prohibitive costs and long rendering times suggest such a future is still relatively distant.

Li emphasizes that this technology will complement and benefit people, and that our relationship with it will remain collaborative. Why? “Because I believe in humanity,” she says. “If you look at the course of history, civilization progresses and our knowledge increases.” She rejects both utopian and dystopian visions. “I think we all have a responsibility to improve the state of AI as it becomes more powerful,” she says. “We should all want humanity to prevail and prosper. So your hope should lie where your actions are directed.”
