“As these artificial intelligence systems become more powerful, they will become more and more integrated into very important areas,” Leo Gao, a research scientist at OpenAI, told MIT Technology Review in an exclusive preview of the new work. “It’s really important to make sure they’re safe.”
This is still early research. The new model, called a weight-sparse transformer, is much smaller and far less capable than top-tier mass-market models such as the company's GPT-5, Anthropic's Claude, and Google DeepMind's Gemini. Gao says it is at most as capable as GPT-1, a model that OpenAI developed back in 2018 (though he and his colleagues haven't done a direct comparison).
But the goal isn't to compete with the best in class (at least not yet). Instead, by looking at how this experimental model works, OpenAI hopes to learn about the hidden mechanisms inside bigger and better versions of the technology.
It's an interesting study, says Elisenda Grigsby, a mathematician at Boston College who studies how LLMs work and was not involved in the project: “I'm sure the methods it introduces will have a significant impact.”
Lee Sharkey, a research scientist at artificial intelligence startup Goodfire, agrees. “This work aims at the right target and seems well executed,” he says.
Why are models so difficult to understand?
OpenAI's work is part of a hot new area of research known as mechanistic interpretability, which attempts to map the internal mechanisms that models use to perform various tasks.
That's harder than it sounds. LLMs are built on neural networks, which consist of nodes, called neurons, arranged in layers. In most networks, each neuron is connected to every neuron in the adjacent layers. Such a network is known as a dense network.
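To make that concrete, here is a minimal sketch in Python (using numpy, with sizes chosen purely for illustration, not taken from OpenAI's model) of what "dense" means: every neuron in one layer connects to every neuron in the next, so each layer is just a full weight matrix applied to the previous layer's activations.

```python
# Toy dense (fully connected) network: two weight matrices, every entry used.
# Sizes are hypothetical and chosen only to illustrate the structure.
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 4, 8, 3          # neurons in each layer
W1 = rng.normal(size=(n_hidden, n_in))   # 8 x 4 = 32 connections, all active
W2 = rng.normal(size=(n_out, n_hidden))  # 3 x 8 = 24 connections, all active

x = rng.normal(size=n_in)                # activations of the input layer
h = np.maximum(0, W1 @ x)                # every hidden neuron sees every input
y = W2 @ h                               # every output neuron sees every hidden neuron

print(f"nonzero weights: {np.count_nonzero(W1) + np.count_nonzero(W2)} of {W1.size + W2.size}")
```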
Dense networks are relatively efficient to train and run, but they spread what they learn across a vast web of connections. As a result, simple concepts or functions can end up split between neurons in different parts of the model. At the same time, a single neuron can also represent several different features, a phenomenon known as superposition (a term borrowed from quantum physics). The upshot is that you can't tie specific parts of the model to specific concepts.
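The toy example below (a simplified sketch, not the experiment described in the article) shows the flavor of superposition: three conceptual "features" are packed into just two neurons by giving each feature its own direction in the two-dimensional activation space. Each neuron then takes part in representing several features, and reading any one feature back out picks up interference from the others, which is why no single neuron maps cleanly to a single concept.

```python
# Toy superposition: 3 hypothetical features stored in a 2-neuron activation space.
import numpy as np

# Assign each feature a direction 120 degrees apart in the 2-D space.
angles = np.deg2rad([0, 120, 240])
feature_dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# Activate only feature 1: this is the resulting 2-neuron activation vector,
# and note that both neurons fire to represent this one feature.
activation = feature_dirs[1]

# Reading the activations back out, every feature direction responds a little,
# because with 3 directions in 2 dimensions they cannot all be orthogonal.
readout = feature_dirs @ activation
print(np.round(readout, 2))   # feature 1 reads ~1.0, the other two read ~-0.5
```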






