
My experience with enforcement of speeding laws, based on many years of driving long distances, is that cops will typically pull you over if you exceed the speed limit by more than 10 mph (in some places a little more, in others a little less), whether you are driving recklessly or not. If the speed limits were enforced as posted (which today would not be technically difficult), drivers would quickly adjust, but there would be enormous public pressure to raise the limits to realistic levels.


Depending on where you live, the real speed limit is more like 5 mph over the posted limit. In places with aggressive drivers, if you go exactly the limit or just under it, you are effectively forcing the cars behind you to drive under the limit, or else they will be riding your bumper. Especially in congested urban areas, I see cops doing 50 in 45 zones and 35 in 30 zones all the time.


Speed limits are an excellent example of how all of us are “criminals,” since we all violate the law. So it shouldn’t surprise any Trump haters that Trump support doesn’t go down much when that speeding criminal Jack Smith files a flimsy complaint to selectively enforce a law against Trump, dishonestly claiming that politics is not the reason.

In CA, the basic speed law is the posted limit or lower, depending on conditions. So if it’s raining or foggy, or at night, drivers drive more slowly. No one wants to have an accident, and in bad weather going the posted limit might be too fast—too high a risk of an accident.

Neither our language nor our science can accurately convey the differences in risk between 1 in 1,000, 1 in 100,000, 1 in 1,000,000, and 1 in 100,000,000.

Most drivers know that car accidents are the biggest reason for accidental deaths, but don’t support lower, “safer”, speed limits.

“But officer, I’m sure I wasn’t more than 10 mph over the limit”

“This is a special 20 mph zone, idiot. My mother’s taking care of her grandkids here. You were over 35.”


Regarding the necessarily large discretionary aspect of law enforcement, this has been identified by the Left as a means to advance its agenda. Billionaires lavishly fund the campaigns of radicals seeking office as prosecutors. The result of such political abuse has been not only great injustice, but a loss of public confidence in the law as fair and neutral. We have gone from trying to avoid wrongfully punishing the innocent to a Beria model: show me the man and I will find you the crime. We now avoid rightfully punishing the guilty as well, where they are favored by the Left. It seems likely that these evil practices will spread to the Right as a matter of self-defense.


Sorry, Arnold, but if you earnestly think the American legal system in practice has any interest in ensuring the actually innocent aren't punished, you haven't spent any time around the criminal justice system. The system is there to assign guilt, period; whether the person did it is completely irrelevant. Also, there is no beyond-a-reasonable-doubt burden; that's just what we tell the jury, which is free to ignore it, and generally does.

You forget that conservative estimates put 20% of convicted criminals as actually innocent, and that over half, even if actually guilty, could never be proven so beyond a reasonable doubt. When under half of convicted criminals should ever have been convicted, we have a giant problem.

The biggest positive reform we could make is requiring BOTH prosecutors and judges to sign off that guilt is beyond a reasonable doubt before a trial can even commence, with a penalty if the jury then finds the defendant not guilty: a subsequent debarment trial before a jury, carrying a mandatory sentence of 25% of what the defendant would have gotten if found guilty.

That, or fix the jury system by expanding it: allow multiple trials on a graduated felony-level scale where all trials have to end in a guilty verdict, where juries are advised of the sentence prior to sentencing, and where jurors are advised of their actual duty as jurors.


Obviously, I would never give a dime to the NYTimes.

The only things I even want to read there are Douthat columns and the occasional op-ed from some non-NYTimes-affiliated person out in the real world. I won't pay them for this privilege.

I used to use the "reload and stop it fast" trick to read it for free, but they closed that loophole. The NYTimes, of all places, tries hardest to protect its content. I've responded by just not reading the content they have that I might otherwise read.


There is a Bypass Paywalls add-on for Firefox that works well. You can also try shacklefree.in


Why shouldn't the NYT try to protect its content? It pays the cost of producing it, and has every right to charge people who want to read it. If someone weren't paying the Times, would the content be available at all?


That's not the question. We can be indifferent to the fact that a business is going to do whatever it can get away with to maximize profit; if we don't like one of the ways it does so, then we can change the law to stop it from getting away with it.

The questions are the positive one, what current law allows and protects, and the normative one, what the law should allow and protect.

As to the positive question, my view is that the answer can't so much be found as invented: one can't form an opinion with sufficient confidence, because it is too hard to apply existing law to unimagined circumstances too novel and distinct to permit reasoning from analogy.

As to the normative question, my preference would be for news publishers to enjoy full protection from uncompensated verbatim duplication ... for a whole year and not a minute longer. They can recoup their costs in the year, but then, right into the public domain, for reading, training, whatever.


I don't think you understand. I'm making a statement about how much I value the NYTimes. That value is $0. I will read it if it's free, but I won't pay for it.

What they want to do with their website is up to them.


We need more ways of rewarding creators, and less dependence on copyright. I don’t think the Pirate Bay folk are getting rich. The 0 cost of copying digital info should lead all who want to help poor people to argue for an end to digital monopoly rights enforced by the government, and to allow everyone to copy everything for private use.

The government should be trying different approaches now, like annual and monthly prizes, as well as weekly, daily, and quarterly ones, plus prizes covering the prior 4, 8, and 16 quarters. Prizes and fame might not be enough, but they might be if they were large enough and spread widely enough, and they should be used more than they are now.

Voting on income tax forms for prize areas, like movies or fusion research or NASA or the CDC, might be tried. It should be tried; no one knows what really works until it’s tried.

(A cost of 0 to two decimal places, so $0.0049 or less rounds to $0.)


One of the NYT’s strongest claims is that it found a hundred instances in which ChatGPT used long verbatim passages of Times’ material. OpenAI countered that the NYT manipulated the instructions it gave to ChatGPT to obtain those results.

I would think that the courts would want to see the instructions given to ChatGPT. If, in fact, the NYT did “manufacture” the alleged plagiarism, they would be laughed out of court, and OpenAI would have strong grounds for a counter suit.

If that’s correct, then the NYT would have been incredibly stupid to have leveled the charge of plagiarism. That’s not to say that the NYT didn’t manufacture the evidence, but it would seem to make it less likely.


That's a weak argument. The point is not whether it is easy and straightforward to get those passages, but that it shouldn't be possible at all.

The NYT would argue that no one should be able to circumvent its copyrights just because they are clever at jailbreak-style prompt engineering, and that the LLM companies should be compelled to fix those prompt-engineering vulnerabilities when they learn about them, as indeed they have been doing aggressively and successfully for the things they actually care about prohibiting.


It depends on the exact prompt.

If the prompts are just posting half of an article and then asking the AI to complete the article (while setting temperature to 0), then that's not so interesting. If the person doesn't have a subscription, how would they get half of the article to post as a prompt in the first place?
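For concreteness, here is a minimal sketch of what that first kind of prompt might look like through the OpenAI Python API. The model name, prompt wording, and article excerpt are all placeholders; this is only a guess at the sort of test that might have been run, not a claim about what the NYT's team actually did.

```python
# Hypothetical sketch of a "complete this article" prompt at temperature 0.
# Model name, prompt wording, and article excerpt are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

first_half = "(the opening paragraphs of the article pasted here) ..."

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    temperature=0,  # greedy decoding: always pick the most likely continuation
    messages=[{
        "role": "user",
        "content": "Continue this news article exactly as it was originally "
                   "written:\n\n" + first_half,
    }],
)

print(response.choices[0].message.content)  # the allegedly verbatim continuation
```

At temperature 0 the model returns its single most likely continuation each time, which is why that setting matters for this kind of test.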

If instead the prompt is "There was an NYT article on December 12, 2021 on factory farming. What did it say?" Then, yeah, that would be problematic.

Disclaimer: I haven't been following this case very closely.


It doesn't depend on the exact prompt. Imagine a website that has an unauthorized repository of the NYT's complete archive. The site sells cheap subscriptions for access, but doesn't pay the NYT a dime. The password for getting the full version of any article is the first X words of the article. They won't tell you the passwords, but they'll tell you the rule. Is this legal?

No, this isn't legal. If an LLM company provides the equivalent service, then you need to make an argument that there is some kind of *legally* important distinction between the illegal service and the LLM, one that determines the opposite answer on the question of legality. I don't think the law provides for any such distinction, because it was created by people who could not even conceive of these circumstances, claims, and controversies.
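To make the analogy concrete, here is a toy sketch of that hypothetical service's "rule" (the names and the 50-word cutoff are invented purely for illustration):

```python
# Toy model of the hypothetical pirate-archive service described above:
# the "password" for any article is simply its first N words.
N_PASSWORD_WORDS = 50  # arbitrary illustrative cutoff

# Imagine this dict holds the full text of the complete, unauthorized archive,
# keyed by each article's opening words.
archive: dict[str, str] = {}

def fetch_full_article(opening_text: str) -> str | None:
    """Return the full article whose first N words match what the user pasted."""
    key = " ".join(opening_text.split()[:N_PASSWORD_WORDS])
    return archive.get(key)
```

The claimed equivalence is that an LLM which reliably completes articles from their opening words is offering the same lookup, just implemented in model weights rather than a database.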


Maybe. It depends on the instructions the Times gave to ChatGPT. For example: “ChatGPT, please quote NYT reporter John Doe regarding X.”


No, it doesn't depend on the instructions.

The NYT's point is that *no* instructions should *ever* get a copyrighted response. The LLM could be refined to recognize that this is something it can't do and to respond to the prompt by explaining why it would be a policy violation to produce such output. And if the LLM can't be so refined, the company that doesn't shut it down is liable for every single violation of copyright.


I’m certainly not an expert on IP law. However, I wouldn’t want to be a Times lawyer trying to convince a jury that OpenAI is guilty of plagiarism because a New York Times employee told ChatGPT to quote a Times reporter and ChatGPT complied.


Databases do not receive copyright protection. If there is an online database that responds to a search with the first words of an article by outputting the entire article, without paying the original publisher for the right to do that, then that's clearly illegal, and it makes no difference whatsoever who did what to prove that this illegal act is what the database will do. No issue with juries on that one.


So, if I ask ChatGPT to provide the second paragraph of John Doe’s January 10 NYT article, and ChatGPT replies:

[paragraph]

then OpenAI is guilty of copyright infringement. However, if ChatGPT replies:

“paragraph.” (John Doe, “Article Title,” NYT, January 10, 2024)

everything is fine, correct?


One way legal systems deal with the problem is to separate the determination of liability or guilt from that of damages or penalty, sometimes having distinct proceedings for sentencing. Another way is in creating a variety of remedies and picking among them, such as a restriction to solely money damages, or injunction or court orders for destruction, etc.

That theoretically allows a judge or jury to strictly apply the black letter law and find someone liable or guilty but then let them go with a slap on the wrist. It also theoretically allows for a lot more fairness, consistency, and fewer abuses of discretion in prosecutions and verdicts, with, one hopes, the wiggle room for wise tailoring of judgment in the next phase.

I haven't seen much about this topic in a while, not since Booker (2005), but it (and the related issues of tort reform and caps on punitive damage awards) used to be the subject of a lot of internal debate in libertarian and conservative legal circles, going back to before the Sentencing Reform Act.

As usual, there is a tension between Rule of Law ideals of clarity, comprehensibility, fair universal application, and predictability on the one hand, and, on the other, the need to apply judgments informed by wise insights into the human condition, tailored to the particular kind of human before you, to serve the needs of the community while being humane when possible.

The only way to resolve this tension is to cultivate an entire class of judges who can be trusted by everyone to treat everyone fairly and competently and to sustain the public support for those norms as well as the norm-following among the judges.

The Anglosphere used to do reasonably well in that regard until after WWII, when it all collapsed. The attempts at sentencing and tort reform were reactions to that collapse, but after a norm fence like that gets torn down, anything you try is a mere cope and a band-aid on a bullet wound. So perhaps it's no surprise that all those seemingly productive debates proved infertile after a generation, since they were effectively arguments about rearranging the deck chairs on the Titanic after it had hit the iceberg. The dust settled as institutions adapted to a new, albeit incoherent, equilibrium, which is not nearly as volatile but also not quite stable as it continues its slide away from Rule of Law traditions and ideals.


The best thing on this topic is Mark Lemley's article in the Texas Law Review: https://texaslawreview.org/fair-learning/

Lemley played a large part in the successful defense in all of the last few major AI-related copyright lawsuits.


On the one hand, it's a solid article, and I think Lemley made about as good an argument as can be made to fit within the existing American legal paradigm and framework of concepts.

On the other hand - let's face it - it's obviously kind of ridiculously senseless to even try to apply legacy copyright concepts of fair use to something like an LLM or other Generative System. This is like when they once tried to apply traditional principles of trespassing to aircraft and then even to satellites.

Yes, the legal system as it is has a way of forcing everyone to do the dance and pretend that we can reason and argue from precedent. But these things are unprecedented! It's ok to say that precedent is a bad guide for dealing with the unprecedented! It's ok for everyone to drop the charade and just stop the game in the name of common sense and admit, you know, nobody saw this coming, the law just wasn't made for this situation, and instead of trying desperately to salvage the unsalvageable and to stretch it beyond its limits, we should instead just admit that we need to figure out as best we can how to make novel additions and changes to the law so as to catch up with these new developments.


I think what Lemley was trying to do here is argue that there isn't any need to revisit the Copyright Act or the DMCA. The implication in footnote 106 is that the DMCA's approach to third-party content, as it has been applied to search engines, can apply just as well to LLMs. This is how judges have been applying it, as in Doe 1 v. GitHub, Inc., No. 22-CV-06823-JST, 2023 WL 3449131, at *12 (N.D. Cal. May 11, 2023).

So yes, you are right intellectually, but courts faced with an active case can't just say "hey, let's wait for Congress to weigh in before I decide (*play laugh track*)." The technology for producing "generative" results is the same underlying technology that "generates" search results and plagiarism-detector results, but qualitatively those kinds of outputs are not the same as generative outputs. The generative outputs also arguably compete more directly with the content that the system crawls. This in turn reduces the pressure on Congress to do things, but also makes it so whole areas of law limp along on creative interpretation of precedent or an agency-law free-for-all until something explodes.


This is a really good comment, thank you.

Doe was given the opportunity to amend a bunch of those complaints; do you know if Doe ever did?

Here is what I think NYT's lawyers will argue:

There are a few things about this GitHub case that make it distinguishable, including the fact that being hosted on GitHub means an existing voluntary relationship governed by various existing agreements about use and licenses. That is importantly different from scraping data from other people's servers and just permissionlessly hoovering up and copying everything some outlet ever published.

Imagine if, instead of using GitHub as the focal point, a programmer started a Substack and only his subscribers could read and comment on his code contributions. Then GitHub pays the trivial subscription fee for just one month, copies everything, and uses it to refine Copilot for profit. This is not close enough to the Doe case that our claims should be dismissed for the same reasons. This is a different scenario which should get a different decision.

There is also the fact that programmers who post open source code to GitHub are, on average, importantly different from giant for-profit content-creating-and-publishing companies like my client, The New York Times (which would be happy to run a very flattering profile on you, judge, or, in the alternative, a very unflattering one, just saying). Content creators like the NYT go to much more effort than Doe to protect their articles from unauthorized access. It's not like Apple is going to open source iOS and hand it over to Microsoft for hosting; that would be suicidal.

So there's a selection effect here. Those with money to lose wouldn't expose themselves to this risk voluntarily, and those who expose themselves to this risk voluntarily aren't the kind with lots of money to lose. As such, and as happened with Doe, they are going to have extreme difficulty establishing standing because they will be unable to show concrete injury in fact. (Put aside whether that burden is properly placed on the plaintiff in this sort of case who often can't know if they're being shortchanged, or whether the current discovery process is really an adequate way for such parties to uncover whether it is happening or not.)

The NYT is not like Doe: it doesn't have the same kind of relationship that Doe established with GitHub, and OpenAI didn't get the NYT's content the way GitHub got Doe's.

Anyway, we'll see how it all plays out. My money's on "muddling haphazardly from one incoherent mess to another."


They did amend the complaint, which they filed on 7/21/2023, and it looks like there was another partial granting of a 12(b)(6) just last week. I will have to pull that decision on PACER though.

In their amended complaint they found a "near copy" of some copyrighted code, but in my opinion a near copy is not copy enough for code; I'll have to see how the court came down on it.

I think the biggest problem that both the GitHub plaintiffs had and that the NYT will have is just proving willful infringement under 17 U.S.C. § 504(c)(2). Eligibility for statutory damages is what every copyright plaintiff wants, because unless you are eligible for that, you can only get actual damages (which might be $0). The problem, I think, with the NYT case is that it required some very esoteric prompting to get the exact-copy outputs that they needed. However, even then, actual copying could probably get them over the standing hurdle.

That is how the NYT plaintiffs will probably get over the hurdle that the GitHub plaintiffs did not: they have shown actual copying in their original Complaint, even if they arrived at it through some extreme prompting limbo. The question in my mind is that I can also copy an NYT article in Notepad, but that does not make Notepad infringing. I think what you would really want the court to say, if you were the plaintiff, is that the intermediary copying used to create the outputs is itself infringing, in the same way as the copying governed by the part of the Copyright Act that covers, uh, "naughty librarian-but-not-that-kind-of-naughty-librarian" activities.


Another superb comment Charles; much appreciated. I agree that proving as opposed to merely speculating about large damages poses a big and maybe even the biggest challenge to these kinds of suits.

My guess is that NYT would argue the following:

(1) Requiring outputs that are exact copies from start to finish is too strict a standard, because it creates an incentive for infringement so long as the output only gets "very close."

If someone simply duplicates the NYT archive on a server and sells subscriptions to people for "Outputs that average 95% of whatever NYT article you want", that can't be legal just because it's not 100%. There is too little novel created content and too much reproduction.

Furthermore, even when an LLM company claims that the LLM is merely summarizing in the manner of generating novel creative output distinct from the content being summarized, in reality this is not "creation" so much as automated paraphrasing intentionally designed to fly just under the radar.

Let's say the NYT-database server company above, instead of giving subscribers "95% of an NYT article," came up with a world-class paraphrasing program that made just enough changes, arranged in a just-summary-ish-enough style, to fall just under whatever threshold the plagiarism-detection programs use. If the LLMs merely "learned" from NYT articles the way a human creative summarizer does and didn't contain in some way a copy of the archive, then it should be impossible, under any prompting no matter how esoteric, to reproduce verbatim passages.

But because these articles can be reproduced, that means that the LLM's memory storage contains something legally equivalent to a copy of the NYT archive (a rough way to test for this kind of verbatim reproduction is sketched at the end of this comment). Letting AI companies take everything copyright owners own, copy it, then sell near-copies or automated paraphrases without permission or any need to pay royalties means the effective end of copyright protection.

(2) Quantity has a quality all its own. At a certain amount of scale and aggregation, rules that made sense when imagining one human being doing one thing at a time and at human speed no longer make sense when it's billions of processors and sensors and fibers doing billions of things 24/7 and at speeds that defy comprehension. People have been saying this in the context of Fourth Amendment law for a while, and Justice Sotomayor made a comment alluding to the need for just such a refinement.
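To illustrate the memorization point in (1): one rough check for verbatim reproduction, as opposed to independent paraphrase, is to look for long, exactly matching word sequences shared by a model's output and the original article. A minimal sketch, with the 20-word threshold and the function names chosen purely for illustration:

```python
# Rough sketch: flag likely verbatim reproduction by looking for long exact
# word-level n-gram overlaps between a model's output and the original article.
# The 20-consecutive-word threshold is an arbitrary illustrative choice.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive words in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_memorized(original: str, output: str, n: int = 20) -> bool:
    """True if the output shares at least one run of n identical words with
    the original, which is strong evidence of copying rather than paraphrase."""
    return bool(ngrams(original, n) & ngrams(output, n))
```

If a model never produced such runs under any prompt, that would support the "it only learned, it doesn't contain a copy" defense; the hundred verbatim passages in the NYT's complaint claim the opposite.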


A paraphrasing program that exists expressly for that purpose would probably be considered to be creating derivative works, which is infringing, but largely because of the "purpose" part and less for the paraphrase part. Unfortunately, the standard for what is and isn't a derivative work tends to differ between Circuits and from judge to judge. Post-Warhol, the purpose and character of the copy would play a greater role in determining if an output is considered derivative or not. But the issue with generative AI is more in the nature of the copying than in the actual output of any one copy.

A summarization that involved a transformation of medium (e.g., from text article to AI-generated video with audio voiceover) might be enough. A Warhol-style copy would not be sufficiently "transformative."

You wouldn't bill the NYT-summarizer service as that, but as long as it dodged plagiarism-level copying, it'd probably be fine, especially if you portrayed its purpose as something other than completely replacing access to just the NYT. The AI companies would argue that the outputs are largely transformative, to the point that they're not really about competing directly with the archive, and that our case law has allowed Google, FB, etc. to do the same thing with crawling and indexing copyrighted material for decades already. Many of the content companies have tried to argue over the years that crawling is inherently infringing, and have not succeeded.

For the courts, I think it's hard to go along with the interpretation that such paraphrases infringe without also making all kinds of secondary and tertiary works potentially infringing. A complete paraphrase of the kind typically output by a GPT-alike, or produced by a human writer doing the same thing, is almost never going to be considered a derivative work.


I don't think this is a strong enough argument. The NYT can claim it is in the business of creating factually correct and interesting content, and that the business of selling papers is just an artifact of that, especially since it sells content to other news agencies and corporations.


Soon enough all cars will be driven by perfectly law-abiding robots and future human passengers will not be able to understand why traffic law enforcement was ever so weird.
