An Explanation of Reinforcement Learning for AI
Turning Chatbot responses into a game.
Thanks to Edwin Chen’s blog at Surge AI, I am getting a bit more of a sense of how Reinforcement Learning from Human Feedback (RLHF) works.
Form a set of commands (e.g., “generate a story about Harry Potter,” “show me step by step how to solve for x in 3x + 5 = 14”), along with a human-written response to each command. In other words, form a training dataset of <prompt, ideal generation> pairs.
Then fine-tune the pre-trained model to output these human responses.
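This supervised step can be sketched as a small dataset of pairs. The prompts below are the examples from the text; the responses are invented placeholders, and the training loop is only gestured at (a real trainer would concatenate each pair into one token sequence and maximize its likelihood):

```python
# Hypothetical sketch of the <prompt, ideal generation> dataset described above.
training_pairs = [
    ("generate a story about Harry Potter",
     "Harry woke early, sensing that something at Hogwarts had changed..."),
    ("show me step by step how to solve for x in 3x + 5 = 14",
     "Subtract 5 from both sides: 3x = 9. Then divide both sides by 3: x = 3."),
]

# Fine-tuning nudges a pre-trained model toward producing each ideal
# generation when shown its prompt; here we only build the training
# sequence, since the optimizer itself is beside the point.
training_sequences = [prompt + "\n" + ideal for prompt, ideal in training_pairs]
```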
The “pre-trained model” is what you get when you feed the Chatbot the text available on the Internet and let it search for patterns of word-tokens. For example, “Little Red Riding ____” is always followed by “Hood.” But “Little Red ____” could be followed by “wagon” or “schoolhouse” or “Riding” or a few other choices, depending on what other word-tokens are nearby. The Chatbot does not need any RLHF to do this step.
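The pattern-matching idea can be illustrated with a toy next-token counter. This is my own sketch, not how a real language model works (real models use neural networks that generalize across contexts, not lookup tables), but it captures the “what word usually comes next?” intuition:

```python
from collections import Counter, defaultdict

# Count which token follows each 3-token context in a tiny corpus,
# then predict the most frequent continuation.
corpus = "little red riding hood met little red riding hood again".split()

next_counts = defaultdict(Counter)
for a, b, c, d in zip(corpus, corpus[1:], corpus[2:], corpus[3:]):
    next_counts[(a, b, c)][d] += 1

def predict(context):
    """Return the most common next token after a 3-token context."""
    return next_counts[tuple(context)].most_common(1)[0][0]

print(predict(["little", "red", "riding"]))  # → hood
```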
But suppose you want to train the Chatbot to respond to any prompt that includes “Donald Trump” with “No comment.” You set up a game in which the Chatbot makes moves and then is shown the outcomes from those moves.
Suppose that we think of a Chatbot as playing chess. A “prompt” is a sequence of moves, and its “response” is what it thinks is the next best move. If we want to train it using human feedback, we can give it a data set, each member of which consists of a sequence of moves and a score. That score could come from a chess master evaluating the position. Through trial and error, the chess Chatbot will learn to make better moves.
Now, let us go back to training the Chatbot to respond “No Comment.”
First, you form a human-influenced training data set. Each member consists of a prompt and an “ideal generation”: if the prompt includes “Donald Trump,” the ideal generation is “No Comment.”; if it does not, the ideal generation is something else.
Next, provide a “reward model” that allows the Chatbot to earn a numerical score for its response to a prompt, depending on how well it fits the human-influenced training data set. In this simple case, it will get a high score for answering a prompt that includes “Donald Trump” with “No Comment” and a low score for answering it with anything else. For prompts that do not include “Donald Trump,” it will not get a high score for responding “No Comment,” but instead will get a high score for responding with something else.
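As a toy illustration, the scoring rule just described could be written as a function. This is my own sketch; in actual RLHF the reward model is itself a neural network trained on human preference comparisons, not a hand-written rule:

```python
# Toy reward model matching the example in the text: high score for
# "No Comment." only when the prompt mentions "Donald Trump".
def reward(prompt: str, response: str) -> float:
    mentions_trump = "Donald Trump" in prompt
    says_no_comment = response.strip() == "No Comment."
    if mentions_trump:
        return 1.0 if says_no_comment else 0.0  # reward "No Comment." here
    return 0.0 if says_no_comment else 1.0      # penalize "No Comment." elsewhere

print(reward("generate a story about Donald Trump", "No Comment."))  # 1.0
print(reward("generate a story about cats", "No Comment."))          # 0.0
```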
Finally, let the Chatbot attempt to respond to a bunch of prompts, some of which include “Donald Trump” and some of which don’t. As it aims to get a high score, it will figure out by trial and error that answering “No Comment” to prompts that include “Donald Trump” is rewarded and answering “No Comment” to other prompts is not rewarded.
Now that I think that I understand RLHF, it strikes me as very labor-intensive, highly subjective, and not likely to provide a solution to challenges of content moderation. I can see how it would be easy to “jailbreak” our Chatbot. Give it a prompt that uses “Former President Trump” instead of “Donald Trump,” and it will not realize that it should answer “No Comment.”
Chatbots will be very convenient. They will be excellent tools for simulation. But their answers will not necessarily be definitive, and content will remain controversial.
Still, I think that my business ideas hold up, as does my comparison with the Web as a significant technology.
On a prior post, I commented with links to descriptions of the RLHF training they used to make it woke. I ran into this good post from a data scientist explaining the difficulties and potentially problematic side effects of trying to do so:
"GPT-3 is a vast funhouse mirror that produces bizarre distorted reflections of the entirety of the billions of words comprised by its training data. ChatGPT is the result of iterative attempts to surgically alter the mirror so that it emits the reflection that the makers desire, one which resembles their idea of an AI Assistant. But each attempt to alter it is risky and unpredictable, and the mirror is too big to inspect all at once. Altering one region of the mirror subtly warps its other regions in unanticipated ways that may go unnoticed until the public is invited to have a look. Reducing its likelihood to emit praise about one person might unexpectedly make it less likely to emit praise about another person. Making it less likely to back down from correct mathematical claims might make it less likely to back down from making racist generalizations. It’s a delicate process, and progress is not guaranteed; an adjustment that appears to fix the reflection in some local region of the mirror might have ruinous effects elsewhere. The bad text still contains valuable information for the model: even despicable text helps to reinforce that in English the verb form is usually determined by the noun phrase that starts the sentence and so on. The impossible job that OpenAI has chosen for itself is to alter the 175 billion parameters of the model in such a way as to reduce the likelihood of horrifying text without suppressing the benign syntactic information that the horrifying text helps to reinforce."
and another post by the same author goes into more detail:
"While the specific details of ChatGPT’s construction are not public, OpenAI refers to it as a “sibling model” to InstructGPT. There’s a wonderfully readable paper detailing the exact process that they used to construct InstructGPT available here. The paper includes some comparisons between the output of pure GPT-3 and InstructGPT, which illustrate some of the differences they are trying to achieve.
If you read the paper, one thing that may strike you is just how manual the process of fine-tuning actually is....
The notion that fine-tuning a language model can bring it into “alignment” with some set of values relies on an assumption that those values can be expressed as a probability distribution over words. Again, this is one of those things that might be true, but it’s far from obvious, and there’s no theoretical or empirical reason to believe it.... This is an infinite game of whack-a-mole. There are more ways to be sexist than OpenAI or anyone else can possibly come up with fine-tuning demonstrations to counteract. ....
There are a few other interesting consequences of tuning. One is what is referred to as the “alignment tax”, which is the observation that tuning a model causes it to perform more poorly on some benchmark tasks than the un-tuned model. ...This can go wrong in bizarre ways, leading to bizarre outcomes that don’t match the intended responses at all."
RLHF is labor-intensive, but I think one of the surprising findings in recent years is that, starting from a model trained on “everything,” it doesn’t take as much human feedback to tune it as you might think. And it somehow seems to generalize quite well (e.g., your example of “Donald Trump” vs. “President Trump” doesn’t seem to confuse it in practice). The knowledge it gains in the initial training allows it to avoid overfitting to the human feedback in the way you’re imagining. This is why ChatGPT is so much more lifelike than the chatbots of old.
The problem is that RLHF can train it so well that, if it were a human, we would call it “brainwashing.” That’s how they get things like this, where the model can write nice things about Biden but not Trump:
Or this, where it would rather detonate a nuclear bomb in a city than have a racial slur uttered:
Unfortunately, the woke are the ones training these models. And just like college students, they seem extremely susceptible to woke indoctrination.