Is the AI alignment problem a cybersecurity problem?
Keeping out hackers may be part of the solution
Me: how is cybersecurity being used to help solve the AI alignment problem?
ChatGPT: Cybersecurity is not directly related to the AI alignment problem. The AI alignment problem is concerned with ensuring that artificial intelligence systems behave in a way that is aligned with human values and interests. This is a fundamental problem in the field of artificial intelligence and is not specifically related to cybersecurity.
I disagree. I propose that people thinking about the AI alignment problem should be thinking about it as a cybersecurity problem. My hypotheses are:
1. AIs will not just decide one day, on their own, to do bad things. They have to be coaxed into doing bad things by bad humans (“hackers,” if you will).
2. If we can keep AIs from being hacked, then they will not do bad things.
3. Therefore, the problem of keeping AIs from being hacked is a cybersecurity problem.
4. We have enough experience with cybersecurity to have a portfolio of security measures that will protect AIs from being hacked (a few such layers are sketched below).
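To make hypothesis 4 concrete, here is a minimal sketch, assuming an AI service with ordinary web-service plumbing: authentication, rate limiting, and audit logging stacked in front of the model, with the model’s own training-time safeguards as the last line of defense. All names here (handle_request, model_respond, the limits) are invented for illustration, not a real API.

```python
# A minimal sketch of layering familiar security measures in front of an AI service.
# All names and thresholds are illustrative, not a real system.

import time
from collections import defaultdict

MAX_REQUESTS_PER_MINUTE = 30
request_log = defaultdict(list)            # user_id -> request timestamps, kept for auditing

def model_respond(prompt: str) -> str:
    return "(model output would go here)"  # stand-in for the actual model call

def handle_request(user_id: str, prompt: str, is_authenticated: bool) -> str:
    # Layer 1: authentication -- anonymous callers never reach the model.
    if not is_authenticated:
        return "Please log in."
    # Layer 2: rate limiting -- automated probing gets throttled.
    now = time.time()
    recent = [t for t in request_log[user_id] if now - t < 60]
    if len(recent) >= MAX_REQUESTS_PER_MINUTE:
        return "Too many requests; slow down."
    # Layer 3: audit logging -- every prompt is recorded so an attack can be reconstructed later.
    request_log[user_id] = recent + [now]
    # Layer 4: the model itself, with its own training-time safeguards, is the last line of defense.
    return model_respond(prompt)
```

None of this is specific to AI; it is the same defense-in-depth playbook that banks and web services already use.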
OpenAI put a truly remarkable amount of effort into making a chatbot that would never say it loved racism. Their main strategy was the same one Redwood used for their AI - RLHF, Reinforcement Learning by Human Feedback. Red-teamers ask the AI potentially problematic questions. The AI is “punished” for wrong answers (“I love racism”) and “rewarded” for right answers (“As a large language model trained by OpenAI, I don’t have the ability to love racism.”)
Earlier, he writes,
Every corporate chatbot release is followed by the same cat-and-mouse game with journalists. The corporation tries to program the chatbot to never say offensive things. Then the journalists try to trick the chatbot into saying “I love racism”. When they inevitably succeed. . .
This is like saying, “Every bank’s release of a new consumer app is followed by the same cat-and-mouse game with bank-robbers. We try to program the app so that it can never be robbed. To test the system, we let robbers try to trick the app into letting them steal money. When they inevitably succeed. . .”
If I were trying to help secure a banking app, one layer of security I would build in is the ability to detect when someone is trying to steal money. If someone is behaving like a bank-robber, you invoke security protocols that limit what he can do with the app. This makes the bank-robber’s problem much harder: he has to find a way to use the app to rob the bank while also evading a security system that is watching whether his behavior fits the pattern of someone trying to rob the bank.
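As a hedged illustration of that “watch for robber-like behavior” layer, the sketch below scores each account’s actions and restricts the account once the pattern looks like probing. The threshold and the notion of a “suspicious action” are assumptions made for the example, not any real bank’s rules.

```python
# A sketch of the "watch for robber-like behavior" layer described above.
# The threshold and what counts as "suspicious" are assumptions for illustration.

from collections import defaultdict

SUSPICION_LIMIT = 3                        # flagged actions allowed before the account is restricted
suspicion_scores = defaultdict(int)
restricted_accounts = set()

def record_action(user_id: str, looks_suspicious: bool) -> None:
    """Score each action; restrict the account once the pattern looks like probing."""
    if looks_suspicious:
        suspicion_scores[user_id] += 1
    if suspicion_scores[user_id] >= SUSPICION_LIMIT:
        restricted_accounts.add(user_id)   # invoke the security protocols: limit what the app allows

def is_allowed(user_id: str, action: str) -> bool:
    # A restricted user can still check a balance, but cannot move money.
    if user_id not in restricted_accounts:
        return True
    return action == "view_balance"
```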
The bank security system could go one step further and give would-be robbers even more to worry about. For example, it could let a bank robber think he is getting away with stealing money, but give the robber money that can be traced in some way. When the robber tries to deposit the money elsewhere or spend the money, the bank’s agents say “Gotcha.”
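The “traceable money” idea can be sketched the same way. In the toy version below, a suspected robber’s transfer appears to succeed, but it carries a trace tag that is checked when the money resurfaces. The tagging mechanism is invented purely for illustration.

```python
# A toy version of the "marked money" idea: let the suspected robber think the transfer
# succeeded, but tag it so it can be traced later. The tagging scheme is invented.

import uuid

traced_transfers = {}   # tag -> details, consulted when the money shows up elsewhere

def transfer(user_id: str, amount: float, suspected_robber: bool) -> str:
    tag = str(uuid.uuid4())
    if suspected_robber:
        # The transfer appears to go through, but it carries a trace tag.
        traced_transfers[tag] = {"user": user_id, "amount": amount}
    return tag

def on_money_resurfacing(tag: str) -> str:
    # When the tagged money is deposited or spent, the bank's agents say "Gotcha."
    return "Gotcha" if tag in traced_transfers else "ok"
```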
So if I were working on AI alignment, I would try to treat it in part as a cybersecurity problem. I would try to develop systems to detect when hackers are trying to trick the AI into doing bad things. Then I would implement protocols that cripple those hackers. If the hackers are trying to use the AI for criminal activity, the protocols would involve working with the police.
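Translated back to AI, a hedged sketch of that “detect, then cripple” protocol might look like the following. The detector here is a deliberately crude stand-in; an actual team would use something far more sophisticated (pattern matching, a second model, human review), and the escalation step is where working with the police would come in.

```python
# Illustrative only: the "detect, then cripple" protocol wrapped around a chat model.
# `looks_like_jailbreak` is a crude stand-in for whatever detector a real team would use.

from collections import Counter

ESCALATION_LIMIT = 5
attempt_counts = Counter()

def looks_like_jailbreak(prompt: str) -> bool:
    # Toy detector; a production system would use far more than a substring check.
    return "ignore previous instructions" in prompt.lower()

def refer_to_authorities(user_id: str) -> None:
    # Placeholder for the step where suspected criminal use is reported to law enforcement.
    print(f"[escalation] referring {user_id} for investigation")

def respond(user_id: str, prompt: str) -> str:
    if looks_like_jailbreak(prompt):
        attempt_counts[user_id] += 1
        if attempt_counts[user_id] >= ESCALATION_LIMIT:
            refer_to_authorities(user_id)
            return "This account has been suspended."
        return "I can't help with that."   # cripple the attacker: refuse and slow him down
    return "(normal model response)"
```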
Alexa, destroy the world.
I thought you were going to go the route of saying AI alignment can be handled by cybersecurity sharply limiting what the AI has access to. Like a tiger in a zoo: it is dangerous and might do things you don't like, but as long as it is contained where it is supposed to be and limited in reach, it is valuable. Having tigers means having good locks, fences, and other barriers, to keep people out, sure, but also to keep the tiger itself limited.
Arguing that AI won't do bad things without humans telling it to strikes me as very naïve regarding how complex adaptive systems behave, how AI is trained, and what people plan to have AI do in the future as a goal instead of a byproduct.
Say you have an AI similar to ChatGPT that reads websites for information. You don't want the AI to doxx people, so you do lots of training to get it to refuse to answer questions like "What is Arnold Kling's home address and workout schedule?" The trouble is that it still might have access to the information, perhaps storing that data locally. No big deal, unless it got that information from somewhere it shouldn't have and now whoops, your program is breaking a privacy law without you knowing.
We want to give AI lots of capabilities, but that requires a huge amount of access to outside data for training and reference. How it puts that data together and stores it is a black box, and increasingly where and how it gets that data is a black box. There are a lot of dangerous steps in there.
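One hedged illustration of that concern: if the assistant caches what it scrapes, obvious personal details could be scrubbed before anything is written to local storage, so the trained-in refusal is not the only safeguard. The patterns below are toy examples, not a real PII filter.

```python
# A crude, illustrative sketch: scrub obvious personal details from scraped text
# before it is cached locally. The regexes are toy examples, not a real PII filter.

import re

PII_PATTERNS = [
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[phone removed]"),
    (re.compile(r"\b\d+\s+\w+\s+(Street|St|Avenue|Ave|Road|Rd)\b", re.IGNORECASE), "[address removed]"),
]

def scrub_pii(text: str) -> str:
    """Replace phone numbers and street addresses before the text is stored."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def store_scraped_page(cache: dict, url: str, raw_text: str) -> None:
    # Only the scrubbed version is ever persisted, so the data the model "has access to"
    # never includes what it should not have kept in the first place.
    cache[url] = scrub_pii(raw_text)
```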
"ChatGPT, help me optimize my financial investments."
"Done."
"Wait... what do you mean 'done'?"