At NeurIPS, Melanie Mitchell Says AI Needs Better Tests

When people want a sober assessment of the state of artificial intelligence and what it all means, they tend to turn to Melanie Mitchell, a computer scientist and professor at the Santa Fe Institute. Her 2019 book, Artificial Intelligence: A Guide for Thinking Humans, helped define the modern conversation about what AI systems can and cannot do.


Today at NeurIPS, the largest gathering of AI professionals this year, she gave a keynote address called “The Science of ‘Alien’ Minds: Assessing the Cognitive Abilities of Infants, Animals, and AI.” Before the talk she spoke with IEEE Spectrum about her themes: why today's AI systems should be studied more like nonverbal minds, what developmental and comparative psychology can teach AI researchers, and how better experimental methods could change the way machine cognition is measured.

You use the phrase “alien intelligence” for both artificial intelligence and biological intelligences such as those of babies and animals. What do you mean by that?

Melanie Mitchell: I hope you noticed the quotation marks around the phrase “alien intelligence.” I'm quoting from an article by [the neural network pioneer] Terrence Sejnowski in which he talks about how ChatGPT seems like a space alien that can communicate with us and appears intelligent. And there's another article, by the developmental psychologist Michael Frank, that plays on this theme and says: in developmental psychology, we study alien intelligences, namely babies, and we have methods that we think could be useful for analyzing the intelligence of AI. That's what I'm playing on.

When people talk about assessing the intelligence of AI, what kind of intelligence are they trying to measure? Reasoning, abstraction, world modeling, or something else?

Mitchell: All of the above. People mean different things when they use the word intelligence, and intelligence itself, as you say, has all these different dimensions. So I use the term “cognitive abilities,” which is more specific. I look at how various cognitive abilities are assessed in developmental and comparative psychology and try to apply some principles from those fields to AI.

Current Challenges in Assessing AI Cognitive Abilities

You say that the AI field lacks good experimental protocols for assessing cognitive abilities. What does AI evaluation look like today?

Mitchell: A typical way to evaluate an AI system is to take some set of benchmarks, run your system on those test tasks, and report the accuracy. But it often turns out that while the AI systems we have now do extremely well on benchmarks, even outperforming humans, that performance doesn't translate into real-world performance. Just because an AI system passes the bar exam doesn't mean it will be a good lawyer in the real world. Often the machines are good at those specific questions but can't generalize. Additionally, tests designed to evaluate humans make assumptions that aren't necessarily relevant or true for AI systems, such as how much the system can memorize.

As a computer scientist, I never received any training in experimental methodology. Running experiments on AI systems has become a core part of evaluating them, and most people who come into computer science haven't received that training either.

What do developmental and comparative psychologists know about cognitive research that AI researchers should also know?

Mitchell: As a psychology student, you learn all sorts of experimental methodologies, especially in fields like developmental and comparative psychology, where the subjects are nonverbal agents. You have to think really creatively to find ways to probe them. So those fields have all sorts of methodologies that involve very carefully controlled experiments and many variations of the stimuli to test robustness. They look carefully at the kinds of failures, at why the system [being tested] might fail, because those failures may provide more insight into what's going on than the successes.

Can you give me a specific example of what these experimental methods look like in developmental or comparative psychology?

Mitchell: One classic example is Clever Hans. There was this horse, Clever Hans, that seemed to be able to do all kinds of arithmetic, counting, and other numerical tasks, tapping out the answers with its hoof. For years people studied this and said, “I think it's real. This is not a hoax.” But then a psychologist came along and said, “I'm going to think hard about what's going on here and do some controlled experiments.” His controls were, first, to blindfold the horse, and second, to put a screen between the horse and the person asking the question. It turned out that if the horse couldn't see the questioner, it couldn't do the task. He found that the horse was actually picking up very subtle cues from the questioner's facial expressions to know when to stop tapping. So it's important to look for alternative explanations for what's happening, and to be skeptical not only of other people's research but even of your own research, your own favorite hypothesis. I don't think that happens often enough in AI.

Do you have any case studies that focus on babies?

Mitchell: I have one case in which it was claimed that babies have an innate moral sense. In the experiment, babies were shown videos of a cartoon character trying to climb a hill. In one case there was another character that helped it up the hill, and in another case there was a character that pushed it down the hill. So there was a helper and a hinderer. The babies were assessed on which character they preferred (and there were several ways of doing this), and they overwhelmingly preferred the helper character. [Editor's note: The babies were 6 to 10 months old, and the assessment techniques included seeing whether the babies reached for the helper or the hinderer.]

But another research group looked at those videos very closely and found that in all the helping videos, the climber being helped got excited at reaching the top of the hill and bounced up and down. So they asked, “Well, what if, in the hindering condition, the climber bounces up and down at the bottom of the hill?” And that completely changed the results. The babies always chose the one that bounced.

Again, coming up with alternative explanations, even when you have a pet hypothesis, is how we do science. One thing that always shocks me a little about AI is that people use the word “skeptic” in a negative sense: “You're an LLM skeptic.” But our job is to be skeptics, and that should be a compliment.

The Importance of Replication in AI Research

Both of these examples illustrate the theme of looking for alternative explanations. Are there other important lessons you think AI researchers should learn from psychology?

Mitchell: Well, in science generally, the idea of replicating experiments and building on other people's work is really important. But unfortunately, that's somewhat frowned upon in the AI world. If, for example, you submit a paper to NeurIPS in which you've reproduced someone else's work and then done something incremental to better understand it, the reviewers will say, “There's no novelty, and it's incremental.” That's the kiss of death for your paper. I think it should be valued more, because that's how good science gets done.

Returning to measuring the cognitive abilities of AI: There's a lot of talk about how we can measure progress toward AGI [artificial general intelligence]. Is that a completely different set of questions?

Mitchell: The term AGI is a little fuzzy. People define it differently. I think it's hard to measure progress on something that isn't very clearly defined. And our understanding of it keeps changing, partly in response to what's happening in AI. In the old days of artificial intelligence, people talked about human-level intelligence and about robots being able to do all the physical things humans do. But then people looked at robotics and said, “Well, okay, that's not going to happen anytime soon. Let's just talk about what people call the cognitive side of intelligence,” which I don't think is really so separable. So I'm a little skeptical about AGI, in the best sense, if you will.
