LLM Links, 2/12/2025 and Live Event Friday
Nathan Lambert on Reasoning Models; Dario Amodei on DeepSeek; The Zvi on OAI Deep Research; Min Choi asks Deep Research which jobs AI will replace
On Friday, at 11 AM New York time, I will be talking with Bloomberg’s James Cham about what the latest developments in AI mean for business. My current plan is to use Substack Live for this. Check back on Friday morning to make sure I have not changed that plan.
A realistic outcome for reasoning-heavy models in the next 0-3 years is a world where:
Reasoning-trained models are superhuman on tasks in verifiable domains, like those where progress has already begun: code, math, etc.
Reasoning-trained models are well better in peak performance than existing autoregressive models in many domains we would not expect, including ones that are not necessarily verifiable.
Reasoning-trained models are still better in performance on the long tail of tasks, but worse in cost, given the high inference costs of long-context generation.
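To make "verifiable domains" concrete: these are tasks where a grader can check the model's final answer mechanically, with no human judge or learned reward model. A minimal sketch, with names and the `ANSWER:` convention purely my own illustration (not any lab's actual grading setup):

```python
# Toy "verifiable reward" for a math-style task: reward is 1.0 exactly
# when the model's final answer matches the ground truth, else 0.0.
# The "ANSWER:" marker convention is hypothetical, for illustration only.

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final stated answer matches, else 0.0."""
    marker = "ANSWER:"
    if marker not in model_output:
        return 0.0
    answer = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(verifiable_reward("...long chain of thought... ANSWER: 42", "42"))  # 1.0
print(verifiable_reward("...reasoning... ANSWER: 41", "42"))              # 0.0
```

The point of such a reward is that it scales: reinforcement learning can run over millions of problems without any human in the loop, which is why code and math saw initial progress.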
I picture the models trying out a sequence of steps, like a chess player imagining a sequence of moves, and then when the sequence doesn’t seem to lead to good results going back and trying a different sequence. The evaluation function for a sequence of steps comes from reinforcement learning somehow.
I am sorry that I am so hazy about this, but I have yet to see an explanation that I can grok.
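The picture described above, trying sequences of steps, scoring them, and backing up when a branch looks bad, resembles classical best-first search with a learned evaluation function. A hedged sketch under that assumption, where the toy value function stands in for whatever reinforcement learning actually provides (everything here is illustrative, not how any real model works):

```python
# Best-first search over sequences of steps: always extend the
# highest-valued partial sequence, abandoning branches that score poorly.
# In the reasoning-model picture, value() would be learned via RL.
import heapq

def best_first_search(start, expand, value, is_goal, max_nodes=1000):
    """Search from start; expand() yields child sequences, value() scores them."""
    # Max-heap via negated scores; each entry is (-score, sequence).
    frontier = [(-value(start), start)]
    visited = 0
    while frontier and visited < max_nodes:
        _, seq = heapq.heappop(frontier)
        visited += 1
        if is_goal(seq):
            return seq
        for child in expand(seq):
            heapq.heappush(frontier, (-value(child), child))
    return None

# Toy problem: assemble the sequence (1, 2, 3) one step at a time.
target = (1, 2, 3)
result = best_first_search(
    start=(),
    expand=lambda s: [s + (d,) for d in (1, 2, 3)] if len(s) < 3 else [],
    value=lambda s: sum(1 for a, b in zip(s, target) if a == b),  # stand-in for an RL value
    is_goal=lambda s: s == target,
)
print(result)  # (1, 2, 3)
```

The chess analogy maps cleanly: `expand` proposes candidate next moves, `value` plays the role of the position evaluator, and popping a lower-valued branch off the frontier is the "going back and trying a different sequence."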
Pointer from Alexander Kruel.
DeepSeek-V3 was actually the real innovation and what should have made people take notice a month ago (we certainly did). As a pretrained model, it appears to come close to the performance of state of the art US models on some important tasks, while costing substantially less to train (although, we find that Claude 3.5 Sonnet in particular remains much better on some other key tasks, such as real-world coding). DeepSeek's team did this via some genuine and impressive innovations, mostly focused on engineering efficiency.
…R1, which is the model that was released last week and which triggered an explosion of public attention (including a ~17% decrease in Nvidia's stock price), is much less interesting from an innovation or engineering perspective than V3. It adds the second phase of training — reinforcement learning…However, because we are on the early part of the scaling curve, it’s possible for several companies to produce models of this type, as long as they’re starting from a strong pretrained model. Producing R1 given V3 was probably very cheap. We’re therefore at an interesting “crossover point”, where it is temporarily the case that several companies can produce good reasoning models. This will rapidly cease to be true as everyone moves further up the scaling curve on these models.
Pointer from Rowan Cheung.
One thing that frustrates me about Gemini Deep Research, and seems to be present in OpenAI’s version as well, is that it will give you an avalanche of slop whether you like it or not. If you ask it for a specific piece of information, like one number that is ‘the average age when kickers retire,’ you won’t get it, at least not by default. This is very frustrating. To me, what I actually want - very often - is to answer a specific question, for a particular reason.
I think that I would react similarly. For many folks with academic training, the “literature search” is something you always do early in a project. Instead, for a book like Crisis of Abundance or Invisible Wealth, I start out with a lot of background knowledge about the topic, including sources that I trust. My research consists of searching for specific information that helps flesh out what I am writing. For example, I might want to know: instead of comparing longevity across countries, where car accidents and homicides affect the results, what is the conditional longevity of someone aged 70 in different countries; or what is longevity adjusted for deaths from non-natural causes?
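The "conditional longevity" statistic mentioned above can be computed from a life table. A rough sketch using entirely made-up death probabilities (real analyses would use actuarial life tables and cause-of-death adjustments, neither of which is represented here):

```python
# Remaining life expectancy conditional on surviving to a given age,
# from a life table qx, where qx[a] is the probability of dying
# between age a and a+1, given that one is alive at age a.

def remaining_life_expectancy(qx, from_age):
    """Expected additional years of life for someone alive at from_age."""
    expectancy = 0.0
    survival = 1.0
    for age in range(from_age, len(qx)):
        survival *= (1.0 - qx[age])  # probability of reaching age + 1
        expectancy += survival       # each survived year contributes one year
    return expectancy

# Hypothetical table: death probability rises geometrically with age.
qx = [min(0.0005 * 1.09 ** age, 1.0) for age in range(110)]
print(round(remaining_life_expectancy(qx, 70), 1))
```

Conditioning on reaching age 70 strips out deaths earlier in life, which is exactly why it dampens the effect of car accidents and homicides on cross-country comparisons.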
Deep Research just dropped a wild list... 20 jobs in which OpenAI o3 will replace humans.
Pointer from Mark McNeilly.
Number one on the list is “tax preparer.” I wish.
My procedure for tax preparation used to be: open the envelopes from everyone sending me tax information, take out the documents, and give them to my tax preparer.
The new procedure is: get emails from everyone with tax information, go to their web sites one by one, log in, go through two-factor authentication, find the information, download it, move it to a folder for tax documents, go to my tax preparer’s online system, log in, go through two-factor authentication, then upload the documents. Not a less costly process, by any means.
We see AI everywhere but in the productivity statistics.
In the absence of stated benchmarks it's hard for me to understand what Nathan Lambert means by "superhuman." A calculator can significantly outperform me if I have to do math in my head, and even if I have access to pencil and paper. How should I interpret phrases like "superhuman" and "peak performance"?