Hackers ‘jailbreak’ powerful AI models in global effort to highlight flaws

Pliny the Prompter says it usually takes him about 30 minutes to crack the world’s most powerful AI models.

The pseudonymous hacker has manipulated Meta’s Llama 3 into sharing instructions for making napalm. He has tricked Elon Musk’s Grok into gushing about Adolf Hitler. His hacked version of OpenAI’s latest GPT-4o model, dubbed “Godmode GPT”, was banned by the start-up after it started advising on illegal activities.

Pliny told the Financial Times that his “jailbreak” was not nefarious, but part of an international effort to highlight the shortcomings of large language models pushed on the public by tech companies in search of big profits.

“I’ve been on this warpath of bringing awareness to the true capabilities of these models,” said Pliny, a crypto and stock trader who shares his jailbreaks on X. “A lot of these are novel attacks that could be research papers in their own right . . . At the end of the day I’m doing work for [the model owners] for free.”

Pliny is just one of dozens of hackers, academic researchers and cybersecurity experts racing to find vulnerabilities in fledgling LLMs, for example by tricking chatbots with prompts designed to get around the “guardrails” that AI companies have put in place in an effort to ensure their products are safe.

These ethical “white hat” hackers have often found ways to get AI models to create dangerous content, spread misinformation, share private data, or generate malicious code.

Companies such as OpenAI, Meta and Google already use “red teams” of hackers to test their models before they are widely released. But the technology’s vulnerabilities have created a growing market of LLM security start-ups that build tools to protect companies planning to use AI models. Machine learning security start-ups raised $213 million across 23 deals in 2023, up from $70 million the year before, according to data provider CB Insights.

“The jailbreaking landscape started about a year ago, and the attacks have constantly evolved since,” said Eran Shimony, principal vulnerability researcher at CyberArk, a cybersecurity group that now offers LLM security. “It’s a constant game of cat and mouse, of vendors improving the security of their LLMs, but then also of attackers making their prompts more sophisticated.”

These efforts come as global regulators seek to intervene to curb potential risks around AI models. The EU has passed the AI Act, which creates new responsibilities for LLM makers, while the UK and Singapore are among countries considering new laws to regulate the sector.

The California legislature will vote in August on a bill that would require the state’s AI groups — which include Meta, Google and OpenAI — to ensure they don’t develop models with “a dangerous capability.”

“All [AI models] would fit that criteria,” Pliny said.

Meanwhile, manipulated LLMs with names such as WormGPT and FraudGPT have been created by malicious hackers and sold on the dark web for as little as $90 to assist with cyberattacks, by writing malware or by helping scammers create automated but highly personalized phishing campaigns. Other variations have emerged, such as EscapeGPT, BadGPT, DarkGPT and Black Hat GPT, according to AI security group SlashNext.

Some hackers use “uncensored” open source models. For others, jailbreaking attacks – or overcoming the safeguards built into existing LLMs – represent a new craft, with perpetrators often sharing tips in communities on social media platforms such as Reddit or Discord.

Approaches range from individual hackers getting around filters by using synonyms for words that the model creators have blocked, to more sophisticated attacks that use AI for automated hacking.

Last year, researchers at Carnegie Mellon University and the US Center for AI Safety said they had found a way to systematically jailbreak LLMs such as OpenAI’s ChatGPT, Google’s Gemini and an older version of Anthropic’s Claude – “closed” proprietary models that were assumed to be less vulnerable to attack. The researchers added that it was “unclear whether such behavior can ever be fully patched by LLM providers”.

Anthropic published research in April on a technique called “many-shot jailbreaking”, in which hackers can prime an LLM by showing it a long list of questions and answers, encouraging it to then answer a harmful question by modeling the same style. The attack is made possible by the fact that models such as those developed by Anthropic now have a larger context window, or space for adding text.

“Although current state-of-the-art LLMs are powerful, we do not think they yet pose truly catastrophic risks. Future models might,” Anthropic wrote. “This means that now is the time to work to mitigate potential LLM jailbreaks before they can be used on models that could cause serious harm.”

Some AI developers said many attacks remained fairly benign for now. But others warned of certain types of attack that could start to lead to data leaks, in which bad actors could find ways to extract sensitive information, such as the data a model was trained on.

DeepKeep, an Israeli LLM security group, found ways to compel Llama 2, an older open-source Meta AI model, to divulge users’ personally identifiable information. Rony Ohayon, chief executive of DeepKeep, said his company was developing LLM-specific security tools, such as firewalls, to protect users.

“Openly releasing models shares the benefits of AI widely and enables more researchers to identify and help fix vulnerabilities, so companies can make models more secure,” Meta said in a statement.

It added that it had conducted security stress tests with internal and external experts on its latest model, Llama 3, and on its Meta AI chatbot.

OpenAI and Google said they were continuously training models to better defend against exploits and adversarial behavior. Anthropic, which experts say has led the most advanced efforts in AI security, called for more information sharing and research into these types of attacks.

Despite such assurances, any risks will only grow as models become more interconnected with existing technology and devices, experts said. This month, Apple announced that it had partnered with OpenAI to integrate ChatGPT into its devices as part of a new “Apple Intelligence” system.

Ohayon said: “In general, companies are not prepared.”

