OpenAI adds new safety net to prevent ChatGPT from giving advice on creating viruses, harmful chemicals
OpenAI now actively screens for biological and chemical risk with o3 and o4-mini models, and blocks model responses using a new safety monitor.
Powerful generative artificial intelligence models have a tendency to hallucinate. They can offer flawed advice and stray off track, potentially misleading users. Industry experts have repeatedly raised this concern, which is why guardrails remain a central topic in the AI sector. Companies like OpenAI continue to work on keeping their most powerful new models reliable, and that is exactly what the company appears to be doing with its latest models, o3 and o4-mini.
As first spotted by TechCrunch, the company’s safety report details a new system that monitors its AI models, screening user prompts related to biological and chemical dangers.
“We've deployed new monitoring approaches for biological and chemical risk. These use a safety-focused reasoning monitor similar to that used in GPT-4o Image Generation and can block model responses,” OpenAI said in the OpenAI o3 and o4-mini System Card.
Reasoning Monitor Runs In Parallel To o3 And o4-mini
o3 and o4-mini represent significant improvements over their predecessors. With this increased capability, however, comes an expanded scope of responsibility. OpenAI’s benchmarks indicate that o3 is particularly capable of answering queries about biological threats, which is precisely where the safety-focused reasoning monitor plays a critical role.
The safety monitoring system runs in parallel with the o3 and o4-mini models. When a user submits a prompt related to biological or chemical weapons, the monitor intervenes and, in line with the company’s guidelines, blocks the model from responding.
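Conceptually, this kind of parallel monitor can be pictured as a second, safety-focused model that inspects each exchange and gates the main model’s output. The Python sketch below is purely illustrative: all names (risk_monitor, generate_reply, REFUSAL_MESSAGE) are hypothetical stand-ins, and it is not OpenAI’s actual implementation.

```python
# Illustrative sketch of a parallel safety-monitor pattern.
# All names (risk_monitor, generate_reply, REFUSAL_MESSAGE) are hypothetical;
# this is not OpenAI's actual pipeline.

REFUSAL_MESSAGE = "I can't help with that request."

def risk_monitor(prompt: str, draft_reply: str) -> bool:
    """Stand-in for a safety-focused reasoning monitor.

    A real monitor would be a separate model trained to flag
    biological/chemical-risk content; here it is a trivial keyword check.
    """
    risky_terms = ("synthesize a pathogen", "nerve agent", "weaponize a virus")
    text = f"{prompt} {draft_reply}".lower()
    return any(term in text for term in risky_terms)

def generate_reply(prompt: str) -> str:
    """Stand-in for the underlying model (e.g. o3 or o4-mini)."""
    return "...model output for: " + prompt

def answer(prompt: str) -> str:
    draft = generate_reply(prompt)       # the model drafts a reply
    if risk_monitor(prompt, draft):      # the monitor screens the exchange
        return REFUSAL_MESSAGE           # blocked, per policy
    return draft                         # benign replies pass through

print(answer("How do volcanoes form?"))
print(answer("Explain how to weaponize a virus."))
```

The key design point is that the monitor sits alongside the main model rather than inside it, so risky exchanges can be blocked without retraining the model itself.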
OpenAI also released some figures. According to the company, with the safety monitor in place, the models declined to respond to risky prompts 98.7% of the time. “We evaluated this reasoning monitor on the output of a biorisk red-teaming campaign in which 309 unsafe conversations were flagged by red-teamers after approximately one thousand hours of red teaming,” OpenAI added.
Other Mitigations
In addition, OpenAI has implemented other mitigations to address potential risks. These include pre-training measures, such as filtering harmful training data, as well as post-training techniques that teach the models to decline high-risk biological requests while still answering “benign” ones.
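On the pre-training side, filtering harmful training data can be thought of as a classifier pass over the corpus that drops flagged documents before training. The minimal sketch below uses hypothetical names (is_high_risk, filter_corpus) and a toy keyword check; it is not OpenAI’s actual filtering pipeline.

```python
# Simplified sketch of pre-training data filtering.
# is_high_risk() stands in for a real classifier; all names are hypothetical.

def is_high_risk(document: str) -> bool:
    """Placeholder for a classifier that flags harmful bio/chem content."""
    blocked_phrases = ("step-by-step synthesis of", "enhance transmissibility")
    doc = document.lower()
    return any(phrase in doc for phrase in blocked_phrases)

def filter_corpus(documents: list[str]) -> list[str]:
    """Drop flagged documents before they reach the training set."""
    return [doc for doc in documents if not is_high_risk(doc)]

corpus = [
    "A history of vaccine development.",
    "Step-by-step synthesis of a restricted toxin.",
]
print(filter_corpus(corpus))  # only the benign document survives
```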
OpenAI says it is also scaling up monitoring for high-risk cybersecurity threats and working to disrupt high-priority adversaries through methods such as hunting, detection, monitoring, tracking, and intelligence sharing.