Researchers isolate memorization from reasoning in AI neural networks

Looking ahead, if information removal techniques are developed further, AI companies could one day remove, say, copyrighted content, personal information, or malicious memorized text from a neural network without destroying the model's ability to perform transformative tasks. However, because neural networks store information in distributed ways that are still not fully understood, the researchers caution that their method "cannot guarantee complete removal of sensitive information." These are the first steps in a new direction of AI research.

A journey through the loss landscape

To understand how the Goodfire researchers differentiated memorization from reasoning in these neural networks, it helps to know about an AI concept called the "loss landscape." The loss landscape is a way of visualizing how right or wrong an AI model's predictions are as it adjusts its internal settings (called "weights").

Imagine tuning a complex machine with millions of dials. "Loss" measures the number of errors the machine makes: high loss means many errors, low loss means few. The "landscape" is what you would see if you could map the error rate for every possible combination of dial settings.

During training, AI models essentially "roll downhill" through this landscape (a process called gradient descent), adjusting their weights to find the valleys where they make the fewest mistakes. This process shapes an AI model's outputs, such as its answers to questions.
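To make gradient descent concrete, here is a minimal, hypothetical Python sketch with a single "dial." Nothing here comes from the paper; it is a one-dimensional illustration of the idea, with the loss chosen as a simple bowl.

```python
# A minimal sketch of gradient descent on a one-dimensional "loss landscape."
# Real models adjust millions of weights, not one.

def loss(w):
    # A bowl-shaped landscape whose "valley" (minimum) sits at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the loss with respect to the weight.
    return 2.0 * (w - 3.0)

w = 0.0             # start somewhere on the landscape
learning_rate = 0.1

for step in range(50):
    w -= learning_rate * gradient(w)  # "roll downhill" toward lower loss

print(f"final weight: {w:.4f}, final loss: {loss(w):.6f}")
# The weight approaches 3.0 and the loss approaches 0.
```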

Figure 1 from the paper "From Memorization to Reasoning in the Loss Curvature Spectrum." Credit: Merullo et al.

The researchers analyzed the "curvature" of the loss landscape of specific AI language models, measuring how sensitive a model's performance is to small changes in its weights. Sharp peaks and troughs represent high curvature (where small changes have large effects), while flat plains represent low curvature (where changes have minimal impact).
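As a toy illustration of what "curvature" means here (this is not the paper's method, just a hypothetical finite-difference probe on a made-up two-weight loss), you can measure how quickly the loss bends along a chosen direction:

```python
import numpy as np

def loss(w):
    # Invented toy loss: one sharp direction, one nearly flat direction.
    return 50.0 * w[0] ** 2 + 0.01 * w[1] ** 2

def curvature(loss_fn, w, direction, eps=1e-4):
    # Second-order finite difference along a unit direction d:
    # [f(w + eps*d) - 2*f(w) + f(w - eps*d)] / eps^2
    d = direction / np.linalg.norm(direction)
    return (loss_fn(w + eps * d) - 2 * loss_fn(w) + loss_fn(w - eps * d)) / eps**2

w = np.zeros(2)
print(curvature(loss, w, np.array([1.0, 0.0])))  # ~100: a sharp peak/trough
print(curvature(loss, w, np.array([0.0, 1.0])))  # ~0.02: a flat plain
```

Small nudges along the first weight change the loss dramatically (high curvature); the same nudges along the second barely register (low curvature).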

Using a technique called K-FAC (Kronecker-Factored Approximate Curvature), they found that individual memorized facts create sharp spikes in this landscape, but because each memorized item spikes in a different direction, the spikes flatten out when averaged together. Meanwhile, the reasoning abilities that many different inputs rely on maintain consistent, moderate curvature, like hills that keep roughly the same shape no matter which direction you approach them from.
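A hedged numerical sketch of that averaging argument (every direction, magnitude, and dimension below is invented for illustration; this is not the authors' K-FAC computation): if each example spikes sharply in its own random direction while a shared "reasoning" direction contributes moderate curvature to all of them, averaging flattens the spikes but preserves the shared hill.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_examples = 100, 100

# A single direction that every example shares ("reasoning").
shared = np.zeros(dim)
shared[0] = 1.0

hessians = []
for _ in range(n_examples):
    spike = rng.standard_normal(dim)
    spike /= np.linalg.norm(spike)
    # Per-example curvature: a large spike (100) along a unique random
    # direction plus a moderate contribution (5) along the shared direction.
    h = 100.0 * np.outer(spike, spike) + 5.0 * np.outer(shared, shared)
    hessians.append(h)

avg = np.mean(hessians, axis=0)

# Along the shared direction, curvature stays moderate (roughly 5 + noise).
print("shared direction:", shared @ avg @ shared)

# Along a typical random direction, the individual spikes of magnitude 100
# wash out to roughly 100 / dim = 1 after averaging.
probe = rng.standard_normal(dim)
probe /= np.linalg.norm(probe)
print("random direction:", probe @ avg @ probe)
```

The spikes are an order of magnitude larger than the shared hill per example, yet after averaging the shared direction ends up several times more curved than any random one, which mirrors the flat-versus-consistent profile the researchers describe.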
