AI Reaches Silver-Medal Level at This Year’s Math Olympiad

During the 2024 International Mathematical Olympiad, Google DeepMind debuted an AI program that can generate complex mathematical proofs

While Paris was preparing to host the 33rd Olympic Games, more than 600 students from nearly 110 countries came together in the idyllic English town of Bath in July for the International Mathematical Olympiad (IMO). They had two sessions of four and a half hours each to answer six problems from various mathematical disciplines. Chinese student Haojia Shi took first place in the individual rankings with a perfect score. In the rankings by country, the team from the U.S. came out on top. The most noteworthy results at the event, however, were those achieved by two machines from Google DeepMind that entered the competition. DeepMind’s artificial intelligence programs were able to solve a total of four out of six problems, which would correspond to the level of a silver medalist. The two programs scored 28 out of a possible 42 points. Only around 60 students scored better, wrote mathematician and Fields Medalist Timothy Gowers, a previous gold medalist in the competition, in a thread on X (formerly Twitter).

To achieve this impressive result, the DeepMind team used two different AI programs: AlphaProof and AlphaGeometry 2. The former works in a similar way to the algorithms that mastered chess, shogi and Go. Using what is called reinforcement learning, AlphaProof repeatedly competes against itself and improves step-by-step. This method can be implemented quite easily for board games. The AI executes several moves; if these do not lead to a win, it is penalized and learns to pursue other strategies.

To do the same for mathematical problems, however, a program must be able not only to check that it has solved the problem but also to verify that the reasoning steps it took to arrive at the solution were correct. To accomplish this, AlphaProof uses so-called proof assistants—algorithms that go through a logical argument step-by-step to check whether answers to the problems posed are correct. Although proof assistants have been around for several decades, their use in machine learning been constrained by the very limited amount of mathematical data available in a formal language, such as Lean, that the computer can understand.

## On supporting science journalism

If you’re enjoying this article, consider supporting our award-winning journalism by subscribing. By purchasing a subscription you are helping to ensure the future of impactful stories about the discoveries and ideas shaping our world today.

Solutions to math problems that are written in natural language, on the other hand, are available in abundance. There are numerous problems on the Internet that humans have solved step-by-step. The DeepMind team therefore trained a large language model called Gemini to translate a million such problems into the Lean programming language so that the proof assistant could use them to train. “When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean,” the developers wrote on DeepMind’s website. By doing so, AlphaProof gradually learns which proof steps are useful and which are not, enhancing its ability to solve more complex problems.

Geometry problems, which also appear in the IMO, usually require a completely different approach. Back in January DeepMind presented an AI called AlphaGeometry that can successfully solve such problems. To do this, the experts first generated a large set of geometric “premises,” or starting points: for example, a triangle with heights drawn in and points marked along the sides. The researchers then used what is called a “deduction engine” to infer further properties of the triangle, such as which angles coincide and which lines are perpendicular to each other. By combining these diagrams with the derived properties, the experts created a training dataset consisting of theorems and corresponding proofs. This procedure was coupled with a large language model that sometimes also uses what are known as auxiliary constructions; the model might add another point to a triangle to make it quadrilateral, which can assist in solving a problem. The DeepMind team has now come out with an improved version, called AlphaGeometry 2, by training the model with even more data and speeding up the algorithm.

To test their programs, the DeepMind researchers had the two AI systems compete in this year’s Math Olympiad. The team first had to manually translate the problems into Lean. AlphaGeometry 2 managed to solve the geometry problem correctly in just 19 seconds. AlphaProof, meanwhile, was able to solve one number theory and two algebra problems, including one that only five of the human contestants were able to crack. The AI failed to solve the combinatorial problems, however, which might be because these problems are very difficult to translate into programming languages such as Lean.

AlphaProof’s performance was slow, requiring more than 60 hours to complete some of the problems—significantly longer than the total nine hours the students were allotted. “If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher,” Gowers wrote on X. “Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.”

Gowers and mathematician Joseph K. Myers, another previous gold medalist, evaluated the solutions of the two AI systems using the same criteria as was used for the human participants. According to these standards, the programs scored an impressive 28 points, which corresponds to a silver medal. This meant that the AI only narrowly missed out on reaching a gold-medal level of performance, which was awarded for a score of 29 points or more.

On X, Gowers emphasized that the AI programs have been trained with a fairly wide range of problems and that these methods are not limited to Mathematical Olympiads. “We might be close to having a program that would enable mathematicians to get answers to a wide range of questions,” he explained. “Are we close to the point where mathematicians are redundant? It’s hard to say.”

*This article originally appeared in *Spektrum der Wissenschaft* and was reproduced with permission*.