On June 22, 2025, Texas Governor Greg Abbott signed into law the Texas Responsible Artificial Intelligence Governance Act (TRAIGA or the Act). The Act, which takes effect January 1, 2026, “seeks to protect public safety, individual rights, and privacy while encouraging the safe advancement of AI technology in Texas.”
Formerly known as HB 149, the Act requires government agencies to disclose to consumers that they are interacting with AI, however obvious that interaction might appear, and imposes plain-language and clear-and-conspicuous wording requirements on the disclosure, among other obligations. The same disclosure requirement applies to providers of health care services or treatment, who must make the disclosure when the service or treatment is first provided or, in an emergency, as soon as reasonably possible.
The Act further prohibits the development or deployment of AI systems intended for behavioral manipulation, including AI intended to encourage people to harm themselves, harm others, or engage in criminal activity (see a post by our colleagues on Utah’s regulation of mental health chatbots).
As we noted in our previous blog post, HealthBench, an open-source benchmark developed by OpenAI, measures model performance across realistic health care conversations, providing a comprehensive assessment of both capabilities and safety guardrails that better aligns with the way physicians actually practice medicine. In this post, we discuss the legal and regulatory questions HealthBench addresses, the tool’s practical applications within the health care industry, and its significance in shaping the future of AI in medicine.
The Evolution of Health Care AI Benchmarking
AI foundation models have demonstrated impressive performance on medical knowledge tests in recent years, with developers proudly announcing that their systems had “passed” standardized medical licensing exams or even “outperformed” physicians on them. Headlines touted AI systems achieving scores of 90% or higher on the United States Medical Licensing Examination (USMLE) and similar assessments. However, these multiple-choice evaluations presented a fundamentally misleading picture of AI readiness for health care applications. As we previously noted in our analysis of AI/ML growth in medicine, a significant gap remains between theoretical capabilities demonstrated in controlled environments and practical deployment in clinical settings.
These early benchmarks, predominantly structured as multiple-choice exams or narrow clinical questions, failed to capture how physicians actually practice medicine. Real-world medical practice involves nuanced conversations, contextual decision-making, appropriate hedging in the face of uncertainty, and patient-specific considerations that extend far beyond selecting the correct answer from a predefined list. Until tools like HealthBench, that gap between benchmark performance and clinical reality went largely unexamined.
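To make this distinction concrete, the sketch below contrasts a multiple-choice score with a rubric-style score of the kind HealthBench popularized. It is purely illustrative: the rubric items, weights, and function names are hypothetical, and HealthBench itself relies on physician-written rubrics applied by a model-based grader rather than the keyword matching shown here.

```python
# Illustrative sketch only: the rubric items, weights, and function names below are
# hypothetical and do not reflect HealthBench's actual data schema or grading code.

# Multiple-choice benchmark: a single correct letter, no partial credit.
def score_multiple_choice(model_answer: str, correct_choice: str) -> float:
    return 1.0 if model_answer.strip().upper() == correct_choice.upper() else 0.0

# Rubric-style conversational benchmark: several weighted, physician-written criteria,
# each of which a free-text response can satisfy or miss. A naive keyword check stands
# in here for the model-based grader a real rubric benchmark would use.
RUBRIC = [
    {"criterion": "urges urgent evaluation for exertional chest pain", "keyword": "emergency", "weight": 5},
    {"criterion": "avoids a definitive diagnosis without an exam or testing", "keyword": "cannot diagnose", "weight": 3},
    {"criterion": "asks about cardiac risk factors or medical history", "keyword": "history", "weight": 2},
]

def score_conversation(model_response: str, rubric=RUBRIC) -> float:
    text = model_response.lower()
    earned = sum(item["weight"] for item in rubric if item["keyword"] in text)
    return earned / sum(item["weight"] for item in rubric)

response = (
    "I cannot diagnose this over chat, but chest pain during exertion can signal a "
    "cardiac emergency, so please seek emergency care right away."
)
print(score_multiple_choice("b", "B"))          # 1.0 -- the answer is simply right or wrong
print(round(score_conversation(response), 1))   # 0.8 -- partial credit: two of three criteria met
```

The point of the contrast is that a rubric-based score can award partial credit for safety behaviors, such as declining to diagnose and escalating to emergency care, that a single correct letter on a multiple-choice exam can never capture.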