Problem Solver or Problem Maker? Learning Math With AI
Three years ago, the public release of OpenAI’s ChatGPT, the world’s most popular large language model, signaled that the Age of Artificial Intelligence had well and truly arrived. Since then, ChatGPT and other generative AI tools—which can create text, images, code, and other content in response to human prompts—have increasingly reshaped many aspects of daily life, including education. Every day, millions of students use ChatGPT to help write essays, finish homework, get feedback, and serve as a virtual tutor.
How is the rapid spread of generative AI changing the way we learn? Are those changes for the better? The technology is evolving so quickly that researchers are racing to keep up.
A new study by Drs. Hamsa Bastani, Osbert Bastani, Alp Sunga, and colleagues at the University of Pennsylvania, recently published in the Proceedings of the National Academy of Sciences, marks one of the first large-scale investigations into the impact of generative AI on human learning. In a field experiment, nearly 1,000 high school students learned math with or without the help of ChatGPT. The results, which shed light on the potential benefits and pitfalls of AI as a learning tool, are eye-opening.
The study involved nearly fifty 9th, 10th, and 11th-grade classes in Türkiye (Turkey). In each class, teachers first reviewed a math topic (such as combinatorics) from the regular course curriculum. Students then completed practice exercises on that topic—either with or without the help of ChatGPT. Afterward, the teacher briefly reviewed the correct answers, and then the students took a closed-book, closed-laptop exam covering what they had learned.
Crucially, students completed the practice exercises under one of three conditions:
- In the “GPT Base” condition, students had unfettered access to a version of the public ChatGPT platform. They could use it however they chose. This condition resembled the way many students currently engage with ChatGPT—freely and with little to no structure or guidance.
- In the “GPT Tutor” condition, students used a special version of ChatGPT that the researchers had developed in collaboration with teachers. This version had several built-in “guardrails”: It had access to the correct solution for each problem, detailed knowledge of the solution steps, and, most importantly, was instructed to provide hints rather than dole out correct answers. It helped students learn the necessary solution steps, allowing them to solve the problems on their own.
- In the control condition, students did not have access to generative AI and instead relied solely on the textbook and their notes.
Several remarkable findings emerged. First, having access to ChatGPT during learning helped students do better on the practice exercises. In the “GPT Base” condition, students performed an impressive 127 percent better than those in the control condition. This improvement occurred despite ChatGPT often producing incomplete or inaccurate solutions, consistent with its well-documented tendency to “hallucinate,” misinterpret, or fabricate information. Students in the “GPT Tutor” condition also demonstrated substantial gains, performing 48 percent better than peers who lacked AI support.
The benefits of learning with generative AI, however, turned out to be a mirage. On the exam, students in the “GPT Base” condition scored 17 percent lower than the control condition, whereas students in the “GPT Tutor” condition performed no better than the control condition. These findings illustrate a longstanding principle from the learning sciences: Performance during training or practice does not necessarily reflect the learning that is actually happening. Indeed, without access to AI, students who had used it were no better off—and in the “GPT Base” condition, significantly worse off—than those who had practiced on their own.
The research team uncovered a contributing factor for these results. In the “GPT Base” condition, many students had ChatGPT solve the practice problems for them and simply copied the answers it provided. They relied on AI to do the hard work and, in the process, deprived themselves of learning opportunities.
Students were also largely unaware of how generative AI had influenced their learning. When surveyed, those in the “GPT Base” condition rated their learning and exam performance as on par with their peers in the control condition, whereas those in the “GPT Tutor” condition gave even higher estimates. Those beliefs did not comport with reality.
It appears, then, that the impact of generative AI on math learning is mixed at best. At present, there are considerable downsides: The standard version of ChatGPT is all too easily used as a shortcut for solving problems, allowing students to bypass the deeper understanding that was once essential for learning math and related skills; there is a risk of encountering incorrect information, which can be compounded when students cannot evaluate the quality of AI-generated responses; and, worryingly, many students are blissfully unaware of the disadvantages they can incur when relying on AI.
The findings for the “GPT Tutor” condition, however, suggest ways to mitigate some of these downsides. By adding “guardrails” that keep students responsible for doing their own work and mastering the material themselves, generative AI can be used in ways that are less likely to harm math learning. Input from human teachers can also help ensure that correct information is consistently provided. For now, perhaps the best advice for using generative AI to learn math is to resist the temptation to have it generate the answers and, moreover, not to blindly accept its output as foolproof.
Given students’ growing reliance on generative AI, this study raises serious concerns. Rather than aiding learning, unrestricted access to ChatGPT and other large language model-based chatbots can undermine it, and many students may already be learning less effectively as a result. Even the custom-designed “GPT Tutor,” which avoided some negative effects, still failed to outperform traditional learning methods.
Ultimately, these findings add fuel to the notion that generative AI remains a “solution in search of a problem,” at least in some educational contexts. Significant improvements—and greater caution on the part of human users—are likely needed before generative AI can be relied on to truly improve math learning.
