OpenAI just showed AI working as a real science tool — here's what happened
In two days, OpenAI shared four results using AI in actual research. Here's the plain-English version of what they did and how confident we should be.
In June 2026 OpenAI published four validated results showing AI assisting real scientific work.
You might have seen headlines last week about OpenAI doing 'science'. It was real — four separate results, published in 48 hours, each showing AI helping with genuine research problems. Here's what actually happened, in plain English, along with an honest note on what's confirmed and what still needs more work.
Finding diagnoses for children that doctors couldn't crack
The most moving result: OpenAI's o3 model worked alongside geneticists at Boston Children's Hospital to look at 376 children whose rare genetic diseases had gone undiagnosed, sometimes for many years. The AI went through each child's symptoms, medical notes, and scientific literature, then suggested possible genetic causes for doctors to investigate further. After checking those suggestions through official clinical processes — including confirming any findings in certified laboratories — the team reached 18 new diagnoses. That's roughly 1 in 20 of the cases that had previously stumped specialists, which the study (published in the prestigious NEJM AI journal) calls a 4.8% additional diagnostic yield.
Here's what the AI didn't do: it didn't diagnose anyone. It generated leads — like a very well-read research assistant — and the doctors did the diagnosing. The cases included children with muscle weakness, developmental conditions, and early-onset psychosis. For some families, this ended searches that had lasted nearly two decades. OpenAI also says the hospital's overall use of AI tools has now contributed to more than 40 rare-disease diagnoses and saved 60,000 hours of staff time, though those broader numbers are the hospital's own estimate rather than independently verified.
OpenAI said its o3 model 'produced evidence-linked hypotheses for specialists to review and, where appropriate, investigate through additional testing and confirm in a clinical laboratory', explicitly noting the model 'did not diagnose any patient or make any clinical decisions'.
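For readers who want to check the headline figure, here is a minimal sketch of the arithmetic behind the 4.8% number, using only the two counts quoted above (18 new diagnoses out of 376 cases). The function name is made up for illustration and is not from the study.

```python
def diagnostic_yield(new_diagnoses: int, cases_reviewed: int) -> float:
    """Return new diagnoses as a percentage of the cases reviewed."""
    if cases_reviewed <= 0:
        raise ValueError("cases_reviewed must be positive")
    return 100.0 * new_diagnoses / cases_reviewed


# Counts quoted in the article: 18 new diagnoses among 376 undiagnosed children.
yield_pct = diagnostic_yield(18, 376)
one_in_n = 376 / 18

print(f"Additional diagnostic yield: {yield_pct:.1f}%")
print(f"That is about 1 in {one_in_n:.0f} cases")
```

The exact ratio comes out slightly under 1 in 20 (closer to 1 in 21), which is why "roughly 1 in 20" is a fair plain-English rounding.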
Running thousands of chemistry experiments to improve a key drug-making reaction
The second result is harder to picture but equally interesting for anyone who cares about medicine. There's a type of chemistry reaction called Chan-Lam coupling that scientists use to build molecules for new drugs. One tricky version of it — involving a chemical group called a primary sulfonamide — has always had frustratingly low success rates, making it hard to use in real drug development.
GPT-5.4 spent three months working with a Polish chemistry company called Molecule.one, reviewing scientific papers, suggesting experimental ideas, and ranking them. Their automated lab (Maria) then ran 10,080 reactions — the kind of quantity that would normally take much longer with human researchers alone. The AI suggested that adding a compound called TEMPO (a mild oxidising agent) might help. The results: average success rates rose from 16.6% to 25.2%, and the proportion of reactions reaching a useful production threshold jumped from 15.6% to 37.5%.
The honest caveat, which OpenAI states clearly: the system was 'near-autonomous, not autonomous' — humans checked every proposal before it went into the lab, and the results need to be confirmed by other laboratories before the chemistry community can treat this as settled. But it's a meaningful proof of concept that AI can help speed up a type of research that currently takes years.
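The before-and-after percentages quoted above can be read two ways: as a change in percentage points, or as a relative improvement over the starting figure. The sketch below works out both from the four numbers in the article. The helper name is illustrative only, and nothing here comes from OpenAI's or Molecule.one's own analysis.

```python
def compare(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    point_change = after_pct - before_pct
    relative_change = 100.0 * point_change / before_pct
    return point_change, relative_change


# Figures quoted in the article.
results = {
    "average success rate": (16.6, 25.2),
    "share of reactions reaching the useful threshold": (15.6, 37.5),
}

for label, (before, after) in results.items():
    points, relative = compare(before, after)
    print(f"{label}: +{points:.1f} points, about {relative:.0f}% higher than before")
```

On these numbers the average rose by about half in relative terms, while the share of reactions clearing the threshold more than doubled.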
A sobering benchmark: AI passes only 1-in-3 research tasks
On 17 June, OpenAI released LifeSciBench, a new test for measuring how good AI really is at scientific research — not quiz-style questions, but the kind of multi-step, evidence-juggling tasks actual scientists do. The benchmark was built by 173 scientists (mostly with PhDs) and checked by 453 independent expert reviewers. It covers everything from genetics to clinical medicine, with 750 tasks and 19,020 grading criteria — each task graded on roughly 25 different dimensions. The result? The best-performing model (OpenAI's own GPT-Rosalind) passed just 36.1% of tasks. Other models scored between 13% and 26%. That's not a failure — that's the benchmark doing what a good benchmark should do: showing how much further there is to go.
LifeSciBench was authored by 173 scientists, most of them PhD holders, and validated by 453 expert reviewers, with over 96% agreement on relevance, reasoning, grounding, and usefulness. GPT-Rosalind, the top-scoring model, passed 36.1% of the 750 tasks.
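A quick sanity check on the benchmark's scale, using only the figures quoted above. The estimate of tasks passed assumes the 36.1% pass rate is a simple fraction of the 750 tasks, which the article does not state, so treat that line as a rough guide rather than a reported result.

```python
# Figures quoted in the article.
TOTAL_TASKS = 750
TOTAL_CRITERIA = 19_020
TOP_PASS_RATE_PCT = 36.1

# Average number of grading criteria attached to each task.
criteria_per_task = TOTAL_CRITERIA / TOTAL_TASKS

# Assumption: the pass rate is a plain share of all tasks (not confirmed by the source).
approx_tasks_passed = TOP_PASS_RATE_PCT / 100.0 * TOTAL_TASKS

print(f"Criteria per task: {criteria_per_task:.2f}")
print(f"Approximate tasks passed by the top model: {approx_tasks_passed:.0f} of {TOTAL_TASKS}")
```

The first figure matches the "roughly 25" grading dimensions per task mentioned above.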
What this week actually means
The unifying thread in all four results is the same: AI isn't working alone. In the rare-disease study, doctors confirmed the findings. In the chemistry experiment, human chemists checked every proposal. In LifeSciBench, expert reviewers validated every question. In Deployment Simulation — a fourth result showing OpenAI can now predict AI failures before models are released, using 1.3 million real conversations — the accuracy check is built into the method. What's changing isn't that AI has become independently trustworthy. What's changing is that teams are building the right checks around it so it can be useful even when it isn't perfect. That's quietly the most important development of the week.
Frequently asked questions
Did AI cure rare diseases in children?
What is Chan-Lam coupling and why does the chemistry result matter?
If AI only passes one-in-three research tasks, does that mean it's not useful?
What is Deployment Simulation in plain English?
Sources
- Using AI to help physicians diagnose rare genetic diseases affecting children — OpenAI, 18 June 2026
- A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry — OpenAI, 18 June 2026
- Introducing LifeSciBench — OpenAI, 17 June 2026
- Predicting model behavior before release by simulating deployment — OpenAI, 17 June 2026
- OpenAI Releases LifeSciBench, a 750-Task Benchmark — MarkTechPost, 17 June 2026
- AI Drug Discovery Chemistry Hits Wet Lab: GPT-5.4 Boosts Chan-Lam Yields in 10,080 Reactions — Tech Times, 18 June 2026
- Boston Children's saves $7M, 60K hours with OpenAI — Becker's Hospital Review, 18 June 2026