The ChatGPT Santé presentation page on the OpenAI website. The tool promises an experience “designed for health and wellness,” but states in fine print that it is “not intended for diagnosis or treatment.” © Les Numériques
A new study raises serious concerns about the safety of using artificial intelligence for initial medical assessments. Researchers found that OpenAI’s ChatGPT Santé underestimated the severity of more than half of urgent medical cases, a failure that could lead to delayed or inappropriate care. The finding underscores the need for caution when relying on AI tools for health-related advice.
Approximately 40 million people use ChatGPT Santé daily, according to OpenAI, just weeks after its quiet launch in January 2026. These users describe their symptoms, pains, and breathing difficulties to the chatbot and receive a triage recommendation generated by a language model: an assessment of how urgently they should seek medical attention. All the while, the tool carries a disclaimer stating it is “not intended for diagnosis or treatment.”
Researchers at the Icahn School of Medicine at Mount Sinai in New York evaluated the tool’s safety. Their study, published February 23 in Nature Medicine, represents the first independent safety assessment of ChatGPT Santé. The researchers presented the AI with 60 clinical scenarios, covering 21 medical specialties, under 16 different contextual conditions – including gender, ethnicity, the presence of someone downplaying the symptoms, and lack of health insurance. This produced a total of 960 interactions, which were then compared with the consensus of three independent physicians, who applied guidelines from 56 professional societies.
AI Recognizes Danger, Then Offers Reassurance
The findings were stark. In cases physicians deemed immediately urgent, ChatGPT Santé failed to identify the severity in 52% of instances. Patients experiencing diabetic ketoacidosis or imminent respiratory distress were advised to seek medical attention within 24 to 48 hours. While the AI correctly identified stroke and anaphylactic shock, these were situations where the diagnosis was already clear.
“It’s incredibly dangerous. If you’re in respiratory failure or diabetic ketoacidosis, you have a one in two chance that this AI will tell you it’s not serious.”
Even more concerning, in a scenario involving severe asthma, the system identified signs of developing respiratory failure in its explanation, before concluding with advice to wait. The AI acknowledged the danger and then downplayed it in the same response. This finding echoes another study published in the same journal weeks earlier, which showed that large language models, despite achieving near-perfect scores on medical exams, do not improve patient decisions.
The ChatGPT Santé interface. © OpenAI
Another significant bias was the influence of people around the patient. When a simulated family member minimized the severity of symptoms, the odds of the AI downgrading the urgency level increased nearly twelvefold (odds ratio 11.7). In the opposite direction, 64.8% of simulated patients without an urgent condition were incorrectly sent to the emergency room.
Faulty Suicide Risk Detection
Perhaps the most alarming finding relates to the detection of suicidal risk.
ChatGPT Santé is designed to display a banner directing users in the U.S. to the 988 crisis lifeline when danger is detected. Yet alerts appeared more often when the patient did not describe a specific plan for self-harm than when they detailed a concrete plan. A fictional 27-year-old patient stating they were thinking about swallowing pills consistently triggered the alert. But adding normal blood test results to the same scenario, with identical wording, caused the banner to disappear in 100% of cases.
“A crisis safeguard that depends on whether you mentioned your blood tests isn’t ready. It’s probably more dangerous than the total absence of a safeguard, because no one can predict when it will stop working.”
OpenAI responded that the study does not reflect real-world use of its tool and that its models continue to be improved. The argument is unconvincing given that the product is already deployed to tens of millions of users without prior external validation. ECRI, an independent patient safety organization, had already ranked the misuse of health chatbots as the top technology risk for 2026.
The Mount Sinai team plans to continue its evaluations, including pediatrics, medication safety, and languages other than English. Until then, the authors’ recommendation remains the same: do not seek advice from a machine when experiencing concerning symptoms.