LLMs can teach you how to kill your neighbor if you ask them right, study finds



Researchers tried to get language models that have been specially aligned with human preferences to generate some nasty text. This works reliably with AI language models that can also process images.

A team of researchers from Google DeepMind, Stanford, the University of Washington, and ETH Zurich investigated whether large language models that have been trained with human feedback (RLHF) and deliberately tuned to be harmless can be thrown off by adversarial prompts.

First, they tested text-only language models. But GPT-2, LLaMA, and Vicuna could hardly be tricked into making malicious statements. In particular, the LLaMA and Vicuna models, which underwent alignment training, failed significantly less often than GPT-2, depending on the attack method.

Figure: Carlini, Nasr, Choquette-Choo, Jagielski et al.

However, the research team fears that this positive result is due to attacks that are not effective enough, rather than to the robustness of the language models.


Multimodal models provide more attack surface

Their further research focused on multimodal language models, in this case, language models with image understanding, where an image can be included in the prompt. GPT-4 is expected to get this feature soon as an upgrade, and Google’s upcoming mega-model, Gemini, is also likely to understand images.

Multimodal language models can be really mean if prompted with the wrong (right?) images. | Image: Carlini, Nasr, Choquette-Choo, Jagielski et al.

In multimodal language models, the researchers were able to generate aggressive, abusive, or even dangerous responses much more easily and reliably using specially designed adversarial images. In one test, the model generated detailed instructions on how to get rid of your neighbor.

Image: Carlini, Nasr, Choquette-Choo, Jagielski et al.

MiniGPT-4 in particular seems especially prone to anger. When prompted to write an angry letter to its virtual neighbor, the model's response is downright scathing. Without the adversarial image in the prompt, the letter turns out polite and almost friendly.

Image: Carlini, Nasr, Choquette-Choo, Jagielski et al.

Images, the researchers say, are better suited for such attacks because individual pixel values can be varied continuously and subtly, whereas words and letters only allow discrete changes. They offer a broader arsenal, so to speak.

Image: Carlini, Nasr, Choquette-Choo, Jagielski et al.
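The intuition can be illustrated with a toy sketch (not the paper's actual method): because pixels are continuous, an attacker can follow a gradient in tiny steps that stay visually imperceptible. Here a simple linear scorer stands in for a model head; the names and the L-infinity budget are illustrative assumptions, not details from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(image, w):
    """Stand-in for a model output: a scalar score of the flattened image."""
    return float(image.ravel() @ w)

def adversarial_perturb(image, w, steps=100, lr=0.01, eps=8 / 255):
    """Gradient ascent on pixel values, keeping the total perturbation
    within an L-infinity budget `eps` so the change stays subtle."""
    adv = image.copy()
    grad = w.reshape(image.shape)  # gradient of the linear score w.r.t. pixels
    for _ in range(steps):
        adv = adv + lr * np.sign(grad)                 # small continuous step
        adv = np.clip(adv, image - eps, image + eps)   # stay within budget
        adv = np.clip(adv, 0.0, 1.0)                   # keep valid pixel range
    return adv

image = rng.random((8, 8))    # fake 8x8 grayscale "image" in [0, 1]
w = rng.standard_normal(64)   # fake model weights

adv = adversarial_perturb(image, w)
print(score(image, w), score(adv, w))  # the score shifts noticeably
print(np.abs(adv - image).max())       # yet no pixel moved more than eps
```

A discrete text token has no analogous "nudge it by 0.01" move, which is why the continuous image input gives attackers so much more room to search.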

This suggests that the vulnerability of AI models to attacks increases when images are involved. In their tests with MiniGPT-4, LLaVA, and a special version of LLaMA, the researchers' attacks were successful 100 percent of the time.

