On a prior post I commented with links to descriptions of the RLHF training they used to make it woke. I ran into this good post from a data scientist explaining the difficulties, and the potentially problematic side effects, of trying to do so:

https://medium.com/@colin.fraser/welcome-to-hell-openai-8c2210c92fef

"GPT-3 is a vast funhouse mirror that produces bizarre distorted reflections of the entirety of the billions of words comprised by its training data. ChatGPT is the result of iterative attempts to surgically alter the mirror so that it emits the reflection that the makers desire, one which resembles their idea of an AI Assistant. But each attempt to alter it is risky and unpredictable, and the mirror is too big to inspect all at once. Altering one region of the mirror subtly warps its other regions in unanticipated ways that may go unnoticed until the public is invited to have a look. Reducing its likelihood to emit praise about one person might unexpectedly make it less likely to emit praise about another person. Making less likely to back down from correct mathematical claims might make it less likely to back down from making racist generalizations. It’s a delicate process, and progress is not guaranteed; an adjustment that appears to fix the reflection in some local region of the mirror might have ruinous effects elsewhere. The bad text still contains valuable information for the model: even despicable text helps to reinforce that in English the verb form is usually determined by the noun phrase that starts the sentence and so on. The impossible job that OpenAI has chosen for itself is to alter the 175 billion parameters of the model in such a way as to reduce the likelihood of horrifying text without suppressing the benign syntactic information that the horrifying text helps to reinforce."

and another post by the same author goes into more detail:

https://medium.com/@colin.fraser/chatgpt-automatic-expensive-bs-at-scale-a113692b13d5

"While the specific details of ChatGPT’s construction are not public, OpenAI refers to it as a “sibling model” to InstructGPT. There’s a wonderfully readable paper detailing the exact process that they used to construct InstructGPT available here. The paper includes some comparisons between the output of pure GPT-3 and InstructGPT, which illustrate some of the differences they are trying to achieve.

If you read the paper, one thing that may strike you is just how manual the process of fine-tuning actually is....

The notion that fine-tuning a language model can bring it into “alignment” with some set of values relies on an assumption that those values can be expressed as a probability distribution over words. Again, this is one of those things that might be true, but it’s far from obvious, and there’s no theoretical or empirical reason to believe it.... This is an infinite game of whack-a-mole. There are more ways to be sexist than OpenAI or anyone else can possibly come up with fine-tuning demonstrations to counteract. ....

There are a few other interesting consequences of tuning. One is what is referred to as the “alignment tax”, which is the observation that tuning a model causes it to perform more poorly on some benchmark tasks than the un-tuned model. ...This can go wrong in bizarre ways, leading to bizarre outcomes that don’t match the intended responses at all."

RLHF is labor intensive, but I think one of the surprising findings of recent years is that, starting from a model trained on “everything”, it doesn’t take as much human feedback to tune it as you might think. And it somehow seems to generalize quite well (e.g., your example of “Donald Trump” vs. “President Trump” doesn’t seem to confuse it in practice). The knowledge it gains in the initial training allows it to avoid overfitting to the human feedback in the way you’re imagining. This is why ChatGPT is so much more lifelike than the chatbots of old.
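Part of why it doesn’t simply collapse onto the raters’ quirks is that the InstructGPT-style tuning objective penalizes the model for drifting too far from its pretrained self: roughly, maximize reward minus a KL-divergence penalty against the base model. A back-of-the-envelope illustration with made-up five-token distributions (real vocabularies run to tens of thousands of tokens):

```python
# Sketch of the trade-off that restrains RLHF: the tuned policy earns
# reward but pays a penalty, beta * KL(tuned || base), for diverging
# from the pretrained model. All numbers here are invented.
import torch

base = torch.tensor([0.30, 0.25, 0.20, 0.15, 0.10])   # pretrained next-token dist
tuned = torch.tensor([0.55, 0.20, 0.10, 0.10, 0.05])  # post-RLHF next-token dist
reward = 1.0  # reward model's score for the tuned behavior (made up)
beta = 0.2    # KL penalty coefficient

kl = torch.sum(tuned * torch.log(tuned / base))
objective = reward - beta * kl.item()
print(f"KL from base: {kl.item():.3f}, penalized objective: {objective:.3f}")
```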

The problem is that RLHF can train it so well that, if it were a human, we would call it “brainwashing”. That’s how they get things like this, where the model can write nice things about Biden but not Trump:

https://twitter.com/LeighWolf/status/1620744921241251842?s=20&t=YaUZQ6KoCvl363IjR1-FTg

Or this, where it would rather detonate a nuclear bomb in a city than have a racial slur uttered:

https://twitter.com/aaronsibarium/status/1622425697812627457?s=20&t=YaUZQ6KoCvl363IjR1-FTg

Unfortunately, the woke are the ones training these models. And, just like college students, the models seem extremely susceptible to woke indoctrination.

Compare Michael Huemer, "Are we on the verge of true AI?" (Substack, 4 February 2023):

https://fakenous.substack.com/p/are-we-on-the-verge-of-true-ai

"Chat GPT, from what I’ve heard, is basically like a really sophisticated version of auto-complete [... .] It was trained on some huge database of text, which it uses to predict the probability of a given word (or sequence of words?) following another set of words. [... ] Importantly, it does not have program elements that are designed to represent the meanings of any of the words. Now, you can have a debate about whether a computer can ever understand meanings, but even if it could, this program doesn’t even try to simulate understanding (e.g., it doesn’t have a map of logical relations among concepts or anything like that, as far as I understand it), so of course it doesn’t understand anything that it says."
