LLM Benchmarking: Surprising Task Complexity Gains

The chief goal of many large language models (LLMs) is to produce convincing text that comes as close as possible to being indistinguishable from human writing. And that is a major reason it is so difficult to gauge the relative performance of LLMs using traditional benchmarks: quality of writing does not necessarily correlate with the metrics traditionally used to measure processor performance, such as instruction execution rate.

But researchers at METR (for Model Evaluation and Threat Research), a think tank in Berkeley, Calif., came up with an ingenious idea. First, identify a series of tasks of varying complexity and record the average time it takes a group of humans to complete each task. Then have various versions of LLMs attempt the same tasks, noting the cases in which a version of an LLM successfully completes a task with some specified level of reliability, say 50 percent of the time. Plots of the resulting data confirm that, over time, successive generations of an LLM can reliably complete longer and longer (more and more complex) tasks.

No surprise there. But the shock was that this improvement in the ability of LLMs to reliably complete harder tasks has been exponential, with a doubling period of about seven months.
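To make the shape of that finding concrete, here is a minimal sketch of how such a trend can be fit, using invented data points rather than METR's actual measurements: an exponential trend in the task-length horizon is a straight line in log space, and the slope of that line gives the doubling period.

```python
# Illustrative sketch (hypothetical data, not METR's): fit an exponential
# trend to "time horizon" measurements. Each point pairs a model's release
# date with the task length (in human-hours) that the model completes
# with 50 percent reliability.
import numpy as np

years_since_2019 = np.array([0.0, 1.5, 3.0, 4.5, 5.5])   # made-up dates
horizon_hours = np.array([0.005, 0.03, 0.25, 1.0, 3.0])  # made-up horizons

# log2(horizon) = slope * years + intercept; the slope is doublings/year.
slope, intercept = np.polyfit(years_since_2019, np.log2(horizon_hours), 1)

print(f"Doubling period: {12.0 / slope:.1f} months")
```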

IEEE Spectrum spoke with Megan Kinniment, one of the authors of the METR research paper describing this work and its surprising implications.

Evaluation of LLM performance metrics

Did you suspect that you'd get these results?

Megan Kinniment: I, at least personally, wasn't expecting us to see an exponential as clear as the one we got. But the models have certainly been improving quickly, so a fast rate of progress wasn't unexpected.

As you point out in the paper, it's always dangerous to look toward the future and extrapolate. Still, you suggest that the trend is likely to continue, which would mean that by 2030 the most advanced large language models will be reliably completing tasks that take humans a month.

Kinniment: Let's unpack what that means. By one month, we mean about 167 hours of work, the number of [human] working hours in a month. And that's at 50 percent reliability. But longer tasks typically seem to require higher reliability to actually be useful. So that's something that could make the practical, real-world economic impacts not as intense as what is predicted.

There are a number of things that would have to continue for this prediction to come true. Hardware would have to keep improving at roughly the rate it's been improving; software would have to keep improving. You'd have to have sufficient training data, and sufficient availability of that data, to continue training at the breathtaking clip of recent years.

Kinniment: The forecasts and the dates that we've found just come from extrapolating the trend that we see on our task suite. [The trends do] not take into account real-world factors or changes in the scaling of compute.
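For intuition about the arithmetic behind such an extrapolation, here is a back-of-the-envelope sketch. The starting horizon of 3 hours is a placeholder assumption of ours, not a figure from the paper:

```python
# How long until a 50%-reliability horizon reaches one month (167 hours),
# if it doubles every 7 months? The starting value is an assumed placeholder.
import math

current_horizon_hours = 3.0   # assumption for illustration
target_hours = 167.0          # about one month of human working hours
doubling_months = 7.0

doublings = math.log2(target_hours / current_horizon_hours)
print(f"{doublings:.1f} doublings, about "
      f"{doublings * doubling_months / 12:.1f} years")
```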

If a large language model could somehow achieve the ability to complete 167-hour tasks with 50 percent reliability, what kinds of things would then be within the capability of a large language model?

Kinniment: Well, the big one that we often think about is accelerating AI R&D research itself. To the extent that you can make models that accelerate your company's ability to make better models, you could end up in a situation where AI capabilities develop really quite rapidly.

What exponential growth of AI means for humanity

What you describe is reminiscent of the idea of the singularity, where AIs create other AIs on their own, without help from human beings.

Kinniment: I think you could get quite an intense acceleration, and things could get significantly harder to control, without it necessarily leading to this massively explosive growth. There are reasons to think that you might have various bottlenecks that slow things down in practice. Even if we had very, very intelligent AIs, that pace of progress could still be bottlenecked by things like hardware and robotics. But yeah, the singularity is certainly an idea that's relevant to this whole area of things.

Things could go quite quickly, but it's not as if it's the singularity or nothing. [AI-development rates] that are mild compared with a singularity could still be intense enough that the world would need to adapt.

You indicated in the paper that some large language models seem to be improving in their ability to adapt to and correct errors.

Kinniment: I think it's been a relatively gradual thing since ChatGPT, and potentially before that. They're less likely to get stuck. They're a little better at changing strategies when things aren't working, though that's a bit hit or miss. And they're certainly a lot better at doing things than before, and better at using tools. But it seems like there are some fundamental aspects that haven't changed a great deal.

One thing I like to look at when I get a new model is this: On each task, we give the model a budget of tokens, a certain number of words it can say. If you imagine giving it more and more time, or more and more tokens, to do a task, how does that affect how likely it is to succeed? Basically, what we see is that they plateau quite hard. There comes a point where you give them more tokens and it doesn't really help. And with each new model, that plateau gets a little higher.

Megan Kinniment was part of the METR team that published the results of the LLM performance study.

Humans, I imagine, also have diminishing returns. But if you give a human a lot of time to do something, they'll probably do a better job, especially if you have multiple humans. And I think I'd be quite impressed by a large language model that, even if its absolute score were lower, seemed like it could just keep doing things and improving. That could be a big deal.
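As an illustration of the token-budget analysis Kinniment describes, here is a minimal sketch with invented trial data (not METR's data or code). A plateau shows up as the per-budget success rate flattening out once the budget is large enough:

```python
# Hypothetical trials: (token_budget, task_succeeded). Success rates
# rising with budget and then flattening would indicate a plateau.
import numpy as np

trials = [
    (1_000, False), (1_000, False), (4_000, True),  (4_000, False),
    (16_000, True), (16_000, True), (64_000, True), (64_000, True),
    (256_000, True), (256_000, True),
]

budgets = np.array([b for b, _ in trials])
outcomes = np.array([s for _, s in trials], dtype=float)

# Success rate at each budget level.
for budget in np.unique(budgets):
    rate = outcomes[budgets == budget].mean()
    print(f"budget {budget:>7,}: success rate {rate:.2f}")
```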

You found that the models performed worse on tasks that had higher "messiness" scores. Was there any signal in the data suggesting this state of affairs could change? In other words, that models might gain a greater ability to handle tasks that have higher messiness?

Kinniment: Messiness was a measure I created to try to get a somewhat quantitative gauge of how unrealistic our tasks are relative to the real world. Most of our tasks aren't that messy. It's a 16-point scale; the average is around 3, and the messiest tasks are about 8 out of 16.

So what would a 16 be in terms of messiness?

Kinniment: Something like espionage, where you have a lot of resource limitations. It's very punishing. There are agents actively optimizing against you. It's easy to mess up. It's novel.

Do you all plan to do a follow-up to this study?

Kinniment: OpenAI released o3, and o3 was a little bit more capable than expected given the trend. So we're doing some amount of follow-up in terms of measuring other models. But our main focus is informing the world about AI development and the catastrophic risks of AI systems.

Catastrophic risks of advanced AI

What are the most likely catastrophic risks from AI? I mean, the ones that come to my mind are massive dislocations in employment if and when AI becomes supremely capable.

Kinniment: When we talk about catastrophic risks, we're not just talking about mass unemployment. We're talking about things more like this: If everybody became unemployed, or you simply no longer needed human workers for the vast majority of things, you might not need human workers to maintain your military, or you'd need far fewer of them. That could essentially make it easier for somebody to stage a coup. Or, if you have a vast quantity of geniuses in a data center, that would make you a very powerful person. If you use that to produce military hardware, it's possible we could get a concentration of power, and you might no longer have a democratic state.

All this would presumably occur without any form of consciousness. These would be machines with the capacity to scheme, plot, and plan, but without the kind of consciousness that characterizes the human ability to do so. Consciousness isn't necessary for that.

Kinniment: Consciousness is a hard problem. I don't know whether consciousness is necessary for any particular behavior. It feels a little bit above my pay grade. I also think it's not crazy that they could be conscious at that point. They would be very intelligent.

So you think they may be conscious at some point in the future?

Kinniment: I mean, if they're as intelligent as you and me, then it doesn't seem completely crazy. It doesn't seem crazy for them not to be conscious, and it doesn't seem crazy for them to be.
