In My Tribe

I think you have the details largely right and I agree that Magoon misspoke. That said, maybe he isn't far off. I suspect internet sources get a lot of weight in how AIs address a question presented to them. In most cases probably way too much weight, not that books and academic journals are all that close to being "true."

13hEdited

Before you call my writing "a bizarre flub," "completely false nonsense," and "bizarre", I would recommend that you actually read my article, which you obviously did not.

Relying on announcements of what Anthropic says that they are trying to do is very different from actual evidence.

We do not know:

The exact size of any book corpus used

The percentage of total training tokens that came from books

Whether the books were licensed, purchased, scraped, or obtained via third parties

How comprehensive the coverage was

Whether the corpus included most modern copyrighted works

Those details remain opaque.

I stand by my statement that:

1) The vast majority of data used to train AI comes from web material.

2) AI has not been trained on the overwhelming majority of books published over the last century.

3) That is a big problem that needs to be addressed.

If AI companies are forced to commit criminal offenses to train AI properly, then that suggests that there is a really problem that needs to be solved very soon. I think that the proposal made in my article is the best solution.

Lex Spoon

That's not quite accurate about Anthropic's class action suit.

AI companies are generally allowed to buy a book and train on it. Anthropic has trained on millions of books and appears to have bought most of the books that are on Amazon.

What the class action was about was books on pirate sites that they did not purchase, e.g. things that used to be on Amazon but no longer are. Usually if a book is valuable, the author will keep listing it, so a lot of this just isn't worth a lot of debate over. For example, if an author publishes a new version of a book on Amazon and removes the old version, the old version could be part of the class action case.

The whole thing appears to be a money grab. The lawers obtained $300 million in fees that Anthropic has to pay for, and they structured in a case that shed no light at all on what the rules are supposed to be.

I feel really bad about extracting money from researchers like this. It's not like AI companies were out in the field and found an oil well and got a windfall from it. It's more like Red Cross shows up and starts putting a tent up, and this army of people with pitchforks and torches are taking down the tent and demanding that the resources be spread among everyone. Well, the resources needed to be in the tent in order for it to work! If I take a table, you take a chair, another person takes a roll of gauze, then we destroyed what was valuable about the enterprise.

Handle

Note that I wasn't characterizing the nature of the legal dispute in the case, just the point which the adversaries of Anthropic were trying to leverage in a PR strategy to create a phony "scandal" about it. My point that the model has definitely trained on the content of every book Anthropic could get its hands on stands, and that Magoon's false assertion to the contrary remains a bizarre flub

Lex Spoon

Yes, I agree very much. I should have said so.

I like to think about the analogy to how a college professor is trained. In grad school, and as life-long study, they will ready every book and paper they can get their hands on. It's not considered stealing so long as the exact text isn't reproduced.

On the contrary, you pretty much have to have to do it. A prof who doesn't read is not going to be any good.

James Hudson

Noah Smith wrote: “The Native Americans simply lost the power to decide what their future would look like.” If we interpret this sentence distributively—as telling us about each individual Indian—we must consider it a silly sentence. The individual Indian never had much power to decide the future, and even after the European invasion he typically still had some such power—at most, *slightly* diminished. More charitably, we may interpret the author as speaking collectively, pretending that Indians (never mind in what geographical region, exactly) had a single mind, which had beliefs and desires and, most of all, will, and that before the coming of the Europeans this entity was exercising considerable power to determine its future (i.e., that of its members), but that afterwards it no longer had such power.

I mildly object to this anthropomorphizing of the collective entity; in particular, the analogy between an individual’s *determining his future*, by taking actions aimed at desired outcomes, and what the collection of Indians was “doing” is quite weak. Furthermore, the rhetorical force of Smith’s passage depends on our having the same sort of concern for the collective entity that we have for each individual person, and this is wrong. Individuals matter inherently; collections do not.

When current hunter gatherer Indians in the Amazon are questioned, they note the total destruction of their way of live.

Previously, hunt game food, and men of other tribes, to kill. Hunt women to steal and/or rape. Celebrate when successful.

Under white cultural dominance, neither killing other tribe's men, nor taking other women by force, are allowed. That similar-to-the-past future was taken away.

Too few anti-Americans know much about Indian life, like the Comanche Indians invading & capturing Apaches, killing the men and gang raping the women.

I wish more "land acknowledgements" would list all the Native tribes that fought over the particular land, none of whom claimed ownership, merely control With their right of control based on their might.

Yancey Ward

Since I have no idea when to short this mania, I won't try but mark my words, the AI bubble will burst spectacularly due to the real world's physical and political constraints.

Reply (3)

Handle

If the AI hype is false, sure, that's the bubble. But if the AI hype is real, everything else was the bubble.

It's hard to zoom out our assumptions and perspectives from the narrow experience of the time and context with which we are most familiar.

When one takes longer historical views, that which seemed robust and permanent at the time ended up withering away to nothing in the blink of an eye. In a similar blink-at-zoom-out, that which seemed to be on the extreme fringe barely clinging to the edge of survival ended up taking over the world. #ManySuchCases.

So, perhaps some things we imagine today to the most stubborn political constraints are actually the things that are more like mere bubbles that themselves will burst when popped by the new reality of AI.

Maybe it's cynical of me to think that politicians are either for sale or will be replaced because out-spent by those who are. But my spin on Mark Felt's rear-looking investigative advice to "Follow the Money" is forward-looking "Anticipate the Money", looking for how much of it is likely to accumulate pressure behind what we imagine to be a sturdy dam right until the very second it cracks and crumbles to nothing and forever transforms the whole region's terrain like a Missoula Flood.

Yancey Ward

I think the technology is transformative but the capital spenders on this are way, way out in front of their skis on this. This is no different from the period of 1995-2000- vast sums are going to be lost and it won't be easy right now to figure out who the survivors are going to be.

I agree that some of the medium and big players will fail, maybe spectacularly, and others will get absorbed or otherwise fade away. Maybe this will even be disruptive but I doubt this will be a bubble to an extent even near as big as the dot com bubble.

MikeW

This seems right to me, but then, what do I know?

Alan

Get to the other side, Arnold. Once you use Claude Code, the world changes. I'm talking to it 8 hours a day now except for the brief moments I read your posts. Actually, maybe that's not a recommendation but a warning!

I've been training business owners and getting out of the chat interface is the single hardest thing about AI. Windows computers are just a pain with this. You have to download so many packages and Anthropic installation in something like a GitHub Codespace is a UI nightmare. Cowork is an improvement but very slow and a bit unstable at this point but it'll get better.

Seth Ariel Green

> I had not noticed that the AI’s were not well read in academic sources. Is that really true?

Possibly for things behind paywalls but I co-authored an open-access piece that's been cited a lot in 2019 and Gemini and Claude both just gave me reasonably accurate summaries. ChatGPT started BSing though.

13h

Did you ask Gemini and Claude whether they have been trained on the exact text of the article? I would be curious as to its answer.

T Benedict

"I had not noticed that the AI’s were not well read in academic sources. Is that really true?" Regarding the Magoon essay. I thought I'd ask Gemini and the abbreviated answer that although Magoon is exaggerating, it is a problem. An excerpt from Gemini - Magoon’s claim that books are "almost completely missing" is a slight exaggeration. GPT-3’s training set included two "Books" datasets (estimated at ~16% of the total), and newer models like GPT-4 have been shown to contain verbatim text from paywalled books. However, compared to the trillions of tokens of web text, high-quality edited literature is indeed a minority voice in the "brain" of the AI.

Reply (3)

I stated more or less the same elsewhere as an opinion. Thx for providing some info to back it up.

13h

16% of what?

16% of total books published over the last century. That sounds very high compared to estimates that I received from ChatGPT. It gave me low single digit percentages.

Or does 16% of the data used to train AI comes from published books?

That still suggests that web content dominates over books.

Stevec

It doesn’t matter what they were trained on. If you ask a question - to the paid up version GPT5 Plus, set to thinking mode - "from scientific sources give me the trend in severe tropical cyclones hitting the east coast of Australia" you will get the right answer (decline of 60%) and the reference, Callahan and Powers 2011. You couldn’t get that from GPT4. It was the average of "all the slop on the internet".

The training plus the toolkit means they understand the question, then grab hold of the correct tools (in this case searching scientific references that discuss this topic), then write out an answer based on your question and the context. Your context is key, if you have asked for "scientific sources", or "properly referenced data", or your settings say, "I prefer scientific data", then you'll get a correspondingly good answer. If you are on the free tier, or you are set to "instant" then you’ll are more likely to get the average of all the PR on the internet.

"humans who refuse to embrace AI will have much less power over their future than will humans who become AI-native."

Changes due to AI may be larger or faster but there's no reason the think they will be different than past tech improvements. If so, we can expect some people to gain a lot, some to lose in varying amounts, and most of us to come out ahead from general gains in efficiency and productivity. Besides that, disruption may have significant impacts as things change.

Most of the links Arnold provides are giving reasons to think the ai tech revolution will be different. I think Freddie's 3 year bet would 80% lose if it was a 6 year bet -- and that means lots more possibilities for wider outcomes, tho you're right there will be gainers, losers, and most of us more (or much much more?) or less better off.

More uncertainty, and far more exciting. And frightening & disruptive.

14h

If you want to know if AI has been trained on books, just ask it. Pick ten important books that have been published within the last 50 years:

“ Have you been trained on the actual text of <book title> book?”

My guess is that you will find that AI is relying on second-hand summaries from internet sources.

J. Frank

I personally found Claude Code extremely easy to use. I had to download Microsoft VS and google the install code for Claude, which I googled and copied and pasted the install command. After that... super easy. I sometimes even use Claude Code for non-programming projects. I'll download a number of PDFs and Word documents into a file, open the file in VS and ask Claude Code questions like it was a Notebook LM.

Lex Spoon

4dEdited

The Native American comment seems subtly wrong in an important way: yesterday's "Native Americans" are today's "Americans", and they have all the same access and capabilities as immigrants from Europe.

To make sense of saying Native Americans can't keep up, you have to frame it in a contorted way like "Native Americans who kept doing things the way they did in the past were not as capable as Native Americans that joined into society with the immigrating Europeans".

https://www.anthropic.com/news/anthropic-rwanda-mou

AI, and Magoon's blind spot on reading academic sources, and China's success, confirms my own bias:

Intel Property is the kind of property that socialism can redistribute effectively. When socialism raises taxes to pay for others, the taxpayers lose. When socialism allows copying all digital data, i.e. no IP, the originator & the copy both exist, the wealth value of that knowledge. Tho the saleable value of an IP monopoly grant by govt goes away -- and there becomes an incentive issue.

Every nation state should be pushing to digitize & copy for their own AI all of human knowledge, to be available, cheaply, to all of its citizens. China's sharing/ copying socialism is better for humanity, tho not IP owners.

Arnold: "humans who refuse to embrace AI will have much less power over their future than will humans who become AI-native." Maybe, but the politics means the majority against AI will mostly want policies to reduce the difference in outcomes from not using AI.

One of my quick idea models: 90 people don't use aigents, 10 people do, mostly for work. The next year, 5 of those using it for work lose their jobs because some better experts use the aigents they were using, but better. The 90 who don't use aigents have lower wages, but also lower prices on stuff that ai-enhanced producers make.

I'm telling my kids to use ai-- and even my wife, a professor in social work. She's strongly resisting. So far.

Since I spent many months in often-corrupt Rwanda, I was especially interested in their govt-Anthropic agreement.

It would be cool wild if current lousy corrupt governments started to massively switch to far less corrupt ai govt aigents for form filling & permissions.

Keith Eubanks

Whatever database(s) AI is trained on today, it is still largely limited to what has been written. Yet, in the larger sphere of human knowledge, most of it is not written. Most skills are still passed from person to person via practice and exchange.

On usage: today, AI can be directed to perform task that we do on a computer or screen. Yet, I cannot direct it to pour and finish a concrete slab or frame a house or build a fence. That will require additional technology coupled with AI and that will require learning mechanisms beyond written databases.

I don't believe the database limitation is quite so true -- vast stores of actual knowledge are available in YouTube videos. But definitely, as I've said for years now, aigents will be able to do computer/ digital tasks as well or better than humans. All digital tasks, tho judging image quality will be a subjective issue.

I'm wondering when there will be ai/human image generations contests, which is better? But I'm more towards text & such ideas, rather than images. Which are always very surface -- maybe even superficial.

commenter

Magoon should be lauded for taking on the closed world of academic journals and for championing the preservation of books as a source of knowledge.

However, FWIW, one minor qualification to Magoon’s point about LLMs not being trained on books, regarding online libraries of digitized public domain books may be an exception that hopefully is offsetting some of the recency and ideological bias. With something like 75,000 digitized public domain books, Project Gutenberg is a widely used source of training material. And the Library of Congress digital collections also make public domain books available for training. I read that “A key example is the Common Pile v0.1, an 8TB dataset of public domain and openly licensed text, which includes public domain books from curated collections covering literature, science, and history. This dataset was used to train Comma v0.1-1T and Comma v0.1-2T, 7-billion-parameter LLMs that perform competitively with models trained on unlicensed data like Llama 1 and 2 7B” but have no idea of what that might mean.

A possible complementary legislative approach to support his approach that he might want to consider, is undoing some of the harmful copyright extension legislation that has been pointlessly enacted over the past decades. For example, repeal of the 1998 Copyright Term Extension Act lengthened the term for pre-1978 works to 95 years, delaying the public domain entry of works from 1923–1977, would be a good start. It also extended post-1978 copyright terms in the United States to life of the author plus 70 years, up from the previous life of the author plus 50 years established by the 1976 Copyright Act. For works made for hire, corporate works, anonymous, or pseudonymous works, protection was extended to 95 years from publication or 120 years from creation, whichever comes first.

Chartertopia

Michael Magoon shows the value of judicious use of commas.

> our current AI is almost exclusively trained on internet sources and published books and academic articles are almost completely missing.

Assuming he meant

> our current AI is almost exclusively trained on internet sources and (published books and academic articles) are almost completely missing.

that doesn't seem right to me. It assumes no books are available on the internet, and didn't Google get in trouble for putting copyrighted books on the internet, available as a few random pages at a time? I don't remember the details and don't know the current state. Academic articles also seem to be widely available, usually at least the abstracts, although again, I don't know the details.

Lots of books are open domain and freely available on the Internet with no restrictions.

John Alcorn

Re: "I had not noticed that the AI’s were not well read in academic sources. Is that really true?"

"Elicit" is a product that focusses AI on peer-reviewed research:

https://elicit.com/

Jeff Abrams

As is becoming increasingly clear/written about, 2028 is shaping to be the first election where AI will be a major - if not the major - issue. Won’t fit neatly within party lines. Based on recent headlines I get a sense that De Santis is going to position himself as relatively AI skeptical - perhaps to try to contrast himself with Vance who he’ll try to imply is in techs pocket.