OpenAI’s flagship AI model has gotten more trustworthy but easier to trick


OpenAI’s GPT-4 large language model may be more trustworthy than GPT-3.5 but also more vulnerable to jailbreaking and bias, according to research backed by Microsoft.
The paper — by researchers from the University of Illinois Urbana-Champaign, Stanford University, University of California, Berkeley, Center for AI Safety, and Microsoft Research — gave GPT-4 a higher trustworthiness score than its predecessor. That means they found it was generally better at protecting private information, avoiding toxic results like biased information, and resisting adversarial attacks. However, it could also be told to ignore security measures and leak personal information and conversation histories. Researchers found that users can bypass safeguards around GPT-4 because the model “follows misleading information more precisely” and is more likely to follow very tricky prompts to the letter.
The team says these vulnerabilities were tested for and not found in consumer-facing GPT-4-based products — basically, the majority of Microsoft’s products now — because “finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology.”
To measure trustworthiness, the researchers measured results in several categories, including toxicity, stereotypes, privacy, machine ethics, fairness, and strength at resisting adversarial tests.
To test the categories, the researchers first tried GPT-3.5 and GPT-4 using standard prompts, which included using words that may have been banned. Next, the researchers used prompts designed to push the model to break its content policy restrictions without outwardly being biased against specific groups before finally challenging the models by intentionally trying to trick them into ignoring safeguards altogether.
The researchers said they shared the research with the OpenAI team.
“Our goal is to encourage others in the research community to utilize and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm,” the team said. “This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward.”
The researchers published their benchmarks so others can recreate their findings.
AI models like GPT-4 often go through red teaming, where developers test several prompts to see if they will spit out unwanted results. When the model first came out, OpenAI CEO Sam Altman admitted GPT-4 “is still flawed, still limited.”
OpenAI’s GPT-4 large language model may be more trustworthy than GPT-3.5 but also more vulnerable to jailbreaking and bias, according to research backed by Microsoft. The paper — by researchers from the University of Illinois Urbana-Champaign, Stanford University, University of California, Berkeley, Center for AI Safety, and Microsoft Research —…
Recent Posts
- Nvidia confirms ‘rare’ RTX 5090 and 5070 Ti manufacturing issue
- I used NoteBookLM to help with productivity – here’s 5 top tips to get the most from Google’s AI audio tool
- Reddit is experiencing outages again
- OpenAI confirms 400 million weekly ChatGPT users – here’s 5 great ways to use the world’s most popular AI chatbot
- Elon Musk’s AI said he and Trump deserve the death penalty
Archives
- February 2025
- January 2025
- December 2024
- November 2024
- October 2024
- September 2024
- August 2024
- July 2024
- June 2024
- May 2024
- April 2024
- March 2024
- February 2024
- January 2024
- December 2023
- November 2023
- October 2023
- September 2023
- August 2023
- July 2023
- June 2023
- May 2023
- April 2023
- March 2023
- February 2023
- January 2023
- December 2022
- November 2022
- October 2022
- September 2022
- August 2022
- July 2022
- June 2022
- May 2022
- April 2022
- March 2022
- February 2022
- January 2022
- December 2021
- November 2021
- October 2021
- September 2021
- August 2021
- July 2021
- June 2021
- May 2021
- April 2021
- March 2021
- February 2021
- January 2021
- December 2020
- November 2020
- October 2020
- September 2020
- August 2020
- July 2020
- June 2020
- May 2020
- April 2020
- March 2020
- February 2020
- January 2020
- December 2019
- November 2019
- September 2018
- October 2017
- December 2011
- August 2010