LLM Links, 2/12/2025 and Live Event Friday
Nathan Lambert on Reasoning Models; Dario Amodei on DeepSeek; The Zvi on OAI Deep Research; Min Choi asks Deep Research which jobs AI will replace
On Friday, at 11 AM New York time, I will be talking with Bloomberg’s James Cham about what the latest developments in AI mean for business. My current plan is to use Substack Live for this. Check back on Friday morning to make sure I have not changed that plan.
A realistic outcome for reasoning-heavy models in the next 0-3 years is a world where:
Reasoning-trained models are superhuman on tasks in verifiable domains, like those where progress has already begun: code, math, etc.
Reasoning-trained models are well better in peak performance than existing autoregressive models in many domains we would not expect, including ones that are not necessarily verifiable.
Reasoning-trained models are still better in performance on the long tail of tasks, but worse in cost, given the high inference costs of long-context generation.
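To make "verifiable domains" concrete: these are tasks where a grader can check the model's final answer mechanically, with no human judge or learned reward model. A minimal sketch, with names and the `ANSWER:` convention purely my own illustration (not any lab's actual grading setup):

```python
# Toy "verifiable reward" for a math-style task: reward is 1.0 exactly
# when the model's final answer matches the ground truth, else 0.0.
# The "ANSWER:" marker convention is hypothetical, for illustration only.

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final stated answer matches, else 0.0."""
    marker = "ANSWER:"
    if marker not in model_output:
        return 0.0
    answer = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(verifiable_reward("...long chain of thought... ANSWER: 42", "42"))  # 1.0
print(verifiable_reward("...reasoning... ANSWER: 41", "42"))              # 0.0
```

The point of such a reward is that it scales: reinforcement learning can run over millions of problems without any human in the loop, which is why code and math saw initial progress.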
I picture the models trying out a sequence of steps, like a chess player imagining a sequence of moves, and then when the sequence doesn’t seem to lead to good results going back and trying a different sequence. The evaluation function for a sequence of steps comes from reinforcement learning somehow.
I am sorry that I am so hazy about this, but I have yet to see an explanation that I can grok.
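The picture described above, trying sequences of steps, scoring them, and backing up when a branch looks bad, resembles classical best-first search with a learned evaluation function. A hedged sketch under that assumption, where the toy value function stands in for whatever reinforcement learning actually provides (everything here is illustrative, not how any real model works):

```python
# Best-first search over sequences of steps: always extend the
# highest-valued partial sequence, abandoning branches that score poorly.
# In the reasoning-model picture, value() would be learned via RL.
import heapq

def best_first_search(start, expand, value, is_goal, max_nodes=1000):
    """Search from start; expand() yields child sequences, value() scores them."""
    # Max-heap via negated scores; each entry is (-score, sequence).
    frontier = [(-value(start), start)]
    visited = 0
    while frontier and visited < max_nodes:
        _, seq = heapq.heappop(frontier)
        visited += 1
        if is_goal(seq):
            return seq
        for child in expand(seq):
            heapq.heappush(frontier, (-value(child), child))
    return None

# Toy problem: assemble the sequence (1, 2, 3) one step at a time.
target = (1, 2, 3)
result = best_first_search(
    start=(),
    expand=lambda s: [s + (d,) for d in (1, 2, 3)] if len(s) < 3 else [],
    value=lambda s: sum(1 for a, b in zip(s, target) if a == b),  # stand-in for an RL value
    is_goal=lambda s: s == target,
)
print(result)  # (1, 2, 3)
```

The chess analogy maps cleanly: `expand` proposes candidate next moves, `value` plays the role of the position evaluator, and popping a lower-valued branch off the frontier is the "going back and trying a different sequence."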
Pointer from Alexander Kruel.
DeepSeek-V3 was actually the real innovation and what should have made people take notice a month ago (we certainly did). As a pretrained model, it appears to come close to the performance of state of the art US models on some important tasks, while costing substantially less to train (although, we find that Claude 3.5 Sonnet in particular remains much better on some other key tasks, such as real-world coding). DeepSeek's team did this via some genuine and impressive innovations, mostly focused on engineering efficiency.
…R1, which is the model that was released last week and which triggered an explosion of public attention (including a ~17% decrease in Nvidia's stock price), is much less interesting from an innovation or engineering perspective than V3. It adds the second phase of training — reinforcement learning…However, because we are on the early part of the scaling curve, it’s possible for several companies to produce models of this type, as long as they’re starting from a strong pretrained model. Producing R1 given V3 was probably very cheap. We’re therefore at an interesting “crossover point”, where it is temporarily the case that several companies can produce good reasoning models. This will rapidly cease to be true as everyone moves further up the scaling curve on these models.
Pointer from Rowan Cheung.
One thing that frustrates me about Gemini Deep Research, and seems to be present in OpenAI’s version as well, is that it will give you an avalanche of slop whether you like it or not. If you ask it for a specific piece of information, like one number that is ‘the average age when kickers retire,’ you won’t get it, at least not by default. This is very frustrating. To me, what I actually want - very often - is to answer a specific question, for a particular reason.
I think that I would react similarly. For many folks with academic training, the “literature search” is something you always do early in a project. Instead, for a book like Crisis of Abundance or Invisible Wealth, I start out with a lot of background knowledge about the topic, including sources that I trust. My research consists of searching for specific information that helps flesh out what I am writing. For example, I might want to know: instead of comparing longevity across countries, where car accidents and homicides affect the results, what is the conditional longevity of someone aged 70 in different countries; or what is longevity adjusted for deaths from non-natural causes?
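The "conditional longevity" statistic mentioned above can be computed from a life table. A rough sketch using entirely made-up death probabilities (real analyses would use actuarial life tables and cause-of-death adjustments, neither of which is represented here):

```python
# Remaining life expectancy conditional on surviving to a given age,
# from a life table qx, where qx[a] is the probability of dying
# between age a and a+1, given that one is alive at age a.

def remaining_life_expectancy(qx, from_age):
    """Expected additional years of life for someone alive at from_age."""
    expectancy = 0.0
    survival = 1.0
    for age in range(from_age, len(qx)):
        survival *= (1.0 - qx[age])  # probability of reaching age + 1
        expectancy += survival       # each survived year contributes one year
    return expectancy

# Hypothetical table: death probability rises geometrically with age.
qx = [min(0.0005 * 1.09 ** age, 1.0) for age in range(110)]
print(round(remaining_life_expectancy(qx, 70), 1))
```

Conditioning on reaching age 70 strips out deaths earlier in life, which is exactly why it dampens the effect of car accidents and homicides on cross-country comparisons.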
Deep Research just dropped a wild list... 20 jobs in which OpenAI o3 will replace humans.
Pointer from Mark McNeilly.
Number one on the list is “tax preparer.” I wish.
My procedure for tax preparation used to be: open the envelopes from everyone sending me tax information, take out the documents, and give them to my tax preparer.
The new procedure is: get emails from everyone with tax information, go to their web sites one by one, log in, go through two-factor authentication, find the information, download it, move it to a folder for tax documents, go to my tax preparer’s online system, log in, go through two-factor authentication, then upload the documents. Not a less costly process, by any means.
We see AI everywhere but in the productivity statistics.
In the absence of stated benchmarks it's hard for me to understand what Nathan Lambert means by "superhuman." A calculator can significantly outperform me if I have to do math in my head, and even if I have access to pencil and paper. How should I interpret phrases like "superhuman" and "peak performance"?