Humanity's last exam

Jan 24, 2025

Big news 🦅! Two of my questions were selected to be part of "Humanity's last exam", the new benchmark from Scale AI.

Currently, all frontier models, including the newly released Deepseek-R1, score poorly on this benchmark. The submission process involved reviewing models’ attempts to solve these questions: essentially, they are challenges that models failed to answer back in October. Deepseek-R1 didn’t even exist at that time! This suggests that the benchmark is, to some extent, a meaningful measure of how well models can tackle mathematical problems.

In this post, I want to talk about the experience of developing those questions, because I think it is interesting and I want to write it down and share it with people.

At the end of September I found out about “Humanity’s last exam” through Twitter¹. It sounded interesting for three reasons:

Definite reward: generous prizes ($5000 for great questions, $500 for good questions), and co-authorship on a big paper.
Perceived fairness: unlike many other places, which prioritised quantity, this one seemed to genuinely want the best questions.
Convenient interface. The initial process worked as follows: you write a question and give an answer (separately). Then, your prompt (no answer) is sent to three “basic” models. If all three of them answer incorrectly, it is sent to two “advanced” models. If those also answer incorrectly, your question is eligible for submission and will be considered for the database.
This means that you could write questions and check the performance of five models on them (not all available for free elsewhere either!) in a single browser tab, simultaneously. This alone was great! I haven’t yet seen a better interface for “casual” experiments with LLMs.
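The two-stage check described above can be sketched in a few lines of Python. This is only my reading of the process, not the platform's actual code: the `Model` type, the `is_eligible` function and the exact-match comparison are all assumptions made for illustration, and the real grading was presumably more forgiving than a string comparison.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a model: takes a prompt, returns an answer string.
Model = Callable[[str], str]


def is_eligible(
    prompt: str,
    answer: str,
    basic_models: Sequence[Model],
    advanced_models: Sequence[Model],
) -> bool:
    """Return True only if every model in both tiers answers incorrectly."""

    def wrong(model: Model) -> bool:
        # Assumption: exact string match after trimming whitespace.
        return model(prompt).strip() != answer.strip()

    # Stage 1: the question must defeat all the "basic" models.
    if not all(wrong(m) for m in basic_models):
        return False
    # Stage 2: only then is it sent to the "advanced" models.
    return all(wrong(m) for m in advanced_models)


# Toy usage with stub models instead of real LLM calls.
def always_wrong(prompt: str) -> str:
    return "0"


def always_right(prompt: str) -> str:
    return "42"


passes = is_eligible("toy question", "42", [always_wrong] * 3, [always_wrong] * 2)
blocked = is_eligible("toy question", "42", [always_wrong] * 3, [always_right, always_wrong])
```

With the stubs above, `passes` is `True` and `blocked` is `False`: a single correct answer from an advanced model is enough to reject the question, which is what kept happening to me with o1.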

So, off I went. I carefully made time in my schedule and I started to submit questions. The guidelines were quite general, but obviously aimed at objective, easily verifiable answers, so asking for proofs, or for interpretations of concepts, did not seem like a good idea. I decided to go even further: I would only ask questions whose answer was a single integer value. Again, this was my decision.

I started by sending a few group-theoretic questions at a level that a good undergraduate student, say from Sapienza, would solve. No luck. I went through my MSc notes and came up with questions inspired by those topics… no luck.

Sadly, there appears to be no record of unsuccessful questions, and I don’t really remember the fine details. I wish I had kept a record, but the interface was so good that doing so would honestly have interrupted my workflow. However, I have my tweets!

Most “hard” questions passed the first models, but were one-shotted by o1. These were hard questions at MSc level, on topics on which I am not an “expert”. Stuff that you could expect a successful MMath graduate to be able to solve in a few hours at most, exercises I could give in a Year 4 course, etc.

This specific “HA!” is not among the two included in the benchmark

At some point, I ran out of ideas and I switched to stuff I did in my PhD. An important point that I want to highlight is that this is not necessarily “harder”. My MSc at Sapienza saw things like the hydrogen atom in quantum mechanics, or somewhat exotic applications of blow-ups and Riemann-Roch. These are definitely harder than dealing with characters of finite groups and their normal subgroups… but they’re less common.

One thing that I noticed is that AI had a much easier time with hard, but quite “popular” topics. Going a little off the main road had it struggling a little more.

So, I pushed it to the limit: my PhD was in modular representation theory, which was once described as “a very big deal for a very small number of people”. I went and asked it some questions about calculations based on relatively obscure theorems that I had encountered… and I hit the first genuine area where models did not excel. They still figured out many answers, especially o1, but I was able to submit some questions.

A few weeks later, having exhausted PhD-related ideas (it is of course pointless to submit more than one question on the same topic), and also believing that modular representation theory wouldn’t get more than one or two hits in a benchmark based on the entirety of mathematics anyway, I went back to wandering through my old notebooks, textbooks and so on, searching for suitable questions.

This tweet came from a particularly annoying session where I tested the models on Galois groups and field theory. This is an area where I would have expected models to struggle quite a bit: answers can easily be big numbers… and very similar questions can have very different answers. But this did not happen; not a single Galois theory question went through.

My hypothesis? Galois stuff is well-represented in the training data. Obscure computations on block subsections are not.

Now, the following things remain true:

Some correct answers were guesses. The reasoning did not make sense, or it was not correct. You would expect this to allow me to “entrap” the AI by asking another question where the answer you’d guess is not the correct one… but this worked exactly once (and it is one of the two questions that made it into the benchmark).
Many answers would have received full marks if they were in an exam, or an assessment of some sort: well-written and correct. In three cases, AI figured out a better answer than the one I had in mind. When I tried to trick it into annoying brute-force calculations where I was hoping it would make mistakes, it often dodged those with an elegant answer.
Yes, there is more to mathematics than just questions with a single integer answer.

Right now, this is a solid benchmark: even the new Deepseek model, which did not exist in autumn, gets less than 10% of the questions right. Inevitably, it will be surpassed, but when that happens we’ll know that definite progress has been made. When? I would take a bet that it doesn’t eat the panettone².

Meanwhile, I'm happy that the University of Manchester is represented in the long list of contributors, from all over the world. This was a human team effort and, so far, we appear to still have an edge!

The paper, dataset, and everything else are available here.

¹ Twitter is the social network that’s still by far the absolute best place to be up-to-date about this stuff, and every other place is at least 10x less useful. While there has been some controversy, I’ve never been in the camp of inconveniencing myself for principled stances - and, in terms of dodging ideologies, I grew up in the internet of the 2000s. I feel that this statement should be included because “leaving Twitter” is becoming some sort of positive signal, and if I know my chickens someone will at some point argue that not doing it is somehow negative. It is not.

² This is an Italian expression meaning that it won’t make it past Christmas.
