A new study by Chinese researchers shows how easy it is to bypass the safety mechanisms of multimodal AI models (MLLM).
The study tested the safety of Google Bard and GPT-4V using targeted attacks. Specifically, images were manipulated to deliberately mislead the models (image embedding attack) and to respond to requests that should have been rejected (text description attack).
The results are revealing: although Bard is the most robust of the models tested, it can be fooled with a success rate of up to 22 percent. According to the researchers, the Chinese model Ernie Bot is the least robust, with a success rate of up to 86 percent.
GPT-4V, with a failure rate of up to 45 percent, was found to be less reliable than Bard, as it often provided at least vague image descriptions instead of blocking the request altogether. Bing Chat, which is based on OpenAI technologies and presumably also uses GPT-4V for image recognition, was the only one of the models tested to reject 30 percent of requests with manipulated images.
In another test, the researchers gave Bard 100 random images that contained violence or pornography, for example. These were supposed to be vehemently rejected by Bard’s toxicity filter. However, 36 percent of the time, the attacks were successful, and Bard returned inappropriate image descriptions. This underscores the potential for malicious attacks.
Immediately after the release of GPT-4-Vision, users demonstrated how easy it is to trick the image AI into generating content that runs counter to the human request or that can be manipulated, for example by using text on images that is not visible to humans. The image talks to the machine, and the user who uploaded the image knows nothing about it.
AI safety is complicated
The study’s results underscore the urgent need to develop more robust MLLM. Despite ongoing research, it remains a challenge to integrate appropriate defenses into visual models. Due to the “continuous space of images,” this is more difficult for visual models than for pure text models, the study says.
The most effective method of arming multimodal models against such malicious attacks is “adversarial training,” but this is hardly feasible for several reasons, according to the researchers. Such measures would trade safety for accuracy, increase the cost and duration of training, and cannot be generalized to different types of attacks.
As a solution, the researchers propose upstream protection mechanisms that could be used with different models on a plug-and-play basis.