# OpenAI just showed AI working as a real science tool — here's what happened

> In June 2026 OpenAI published four validated results showing AI assisting real scientific work.

*In four days, OpenAI shared four results using AI in actual research. Here's the plain-English version of what they did and how confident we should be.*

By The SuggestedTech Team · SuggestedTech
Canonical: https://suggestedtech.com/news/openai-ai-science-tool-explained-june-2026

You might have seen headlines last week about OpenAI doing 'science'. It was real: four separate results, published within days of each other, each showing AI helping with genuine research problems. Here's what actually happened, in plain English, along with an honest note on what's confirmed and what still needs more work.

> **Info:** **The short version:** AI helped find answers to medical mysteries, run thousands of chemistry experiments, and spot its own future mistakes — all in the same week. In each case, humans checked the results. That's the part that matters most.

## Finding diagnoses for children that doctors couldn't crack

The most moving result: **OpenAI's o3 model** worked alongside geneticists at **Boston Children's Hospital** to look at **376 children** whose rare genetic diseases had gone undiagnosed, sometimes for many years. The AI went through each child's symptoms, medical notes, and scientific literature, then suggested possible genetic causes for doctors to investigate further. After checking those suggestions through official clinical processes — including confirming any findings in certified laboratories — the team reached **18 new diagnoses**. That's roughly 1 in 20 of the cases that had previously stumped specialists, which the study (published in the prestigious *NEJM AI* journal) calls a **4.8% additional diagnostic yield**.
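
For anyone who wants to check the headline figure, here is a quick back-of-the-envelope sketch in Python. It uses only the two numbers quoted above (376 cases reviewed, 18 new diagnoses); the variable names are ours for illustration, not the study's.

```python
# Figures quoted in the article; variable names are illustrative.
cases_reviewed = 376
new_diagnoses = 18

# Additional diagnostic yield: new diagnoses as a share of all cases reviewed.
yield_pct = 100 * new_diagnoses / cases_reviewed

# The same ratio expressed as "one diagnosis per N cases".
cases_per_diagnosis = cases_reviewed / new_diagnoses

print(f"Additional diagnostic yield: {yield_pct:.1f}%")           # 4.8%
print(f"About 1 diagnosis per {cases_per_diagnosis:.0f} cases")   # 21
```

So "roughly 1 in 20" is a fair rounding: the exact ratio is closer to 1 in 21.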

Here's what the AI didn't do: it didn't diagnose anyone. It generated leads — like a very well-read research assistant — and the doctors did the diagnosing. The cases included children with muscle weakness, developmental conditions, and early-onset psychosis. For some families, this ended searches that had lasted nearly two decades. OpenAI also says the hospital's overall use of AI tools has now contributed to more than 40 rare-disease diagnoses and saved 60,000 hours of staff time, though those broader numbers are the hospital's own estimate rather than independently verified.

> OpenAI said its o3 model 'produced evidence-linked hypotheses for specialists to review and, where appropriate, investigate through additional testing and confirm in a clinical laboratory', explicitly noting the model 'did not diagnose any patient or make any clinical decisions'.
> — [OpenAI](https://openai.com/index/diagnose-rare-childhood-diseases/), 2026-06-18

## Running thousands of chemistry experiments to improve a key drug-making reaction

The second result is harder to picture but equally interesting for anyone who cares about medicine. There's a type of chemical reaction called **Chan-Lam coupling** that scientists use to build molecules for new drugs. One tricky version of it, involving a chemical group called a primary sulfonamide, has always had frustratingly low success rates, making it hard to use in real drug development.

**GPT-5.4** spent three months working with a Polish chemistry company called **Molecule.one**, reviewing scientific papers, suggesting experimental ideas, and ranking them. Their automated lab (**Maria**) then ran **10,080 reactions** — the kind of quantity that would normally take much longer with human researchers alone. The AI suggested that adding a compound called **TEMPO** (a mild oxidising agent) might help. The results: average success rates rose from **16.6% to 25.2%**, and the proportion of reactions reaching a useful production threshold jumped from **15.6% to 37.5%**.
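
Percentages like these can be read two ways: as a change in percentage points, or as a relative improvement over the starting value. The sketch below works out both from the four figures above. The helper name `change` is made up for illustration, and it assumes the reported before-and-after figures can be compared directly.

```python
def change(before_pct, after_pct):
    """Return (percentage-point change, relative change in %)."""
    points = after_pct - before_pct
    relative = 100 * points / before_pct
    return points, relative

# Average success rate, as reported: 16.6% -> 25.2%
avg_points, avg_relative = change(16.6, 25.2)

# Share of reactions reaching the useful threshold, as reported: 15.6% -> 37.5%
thr_points, thr_relative = change(15.6, 37.5)

print(f"Average: +{avg_points:.1f} points, +{avg_relative:.1f}% relative")    # +8.6, +51.8%
print(f"Threshold: +{thr_points:.1f} points, +{thr_relative:.1f}% relative")  # +21.9, +140.4%
```

In other words, the share of reactions clearing the threshold rose to about 2.4 times its starting level, while the average rose by about half.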

The honest caveat, which OpenAI states clearly: the system was **'near-autonomous, not autonomous'** — humans checked every proposal before it went into the lab, and the results need to be confirmed by other laboratories before the chemistry community can treat this as settled. But it's a meaningful proof of concept that AI can help speed up a type of research that currently takes years.

## A sobering benchmark: AI passes only 1-in-3 research tasks

> **Note:** **Worth knowing:** alongside the impressive results, OpenAI also released an honest measure of how limited AI still is. LifeSciBench showed that even the best AI model only passes about one in three expert-level biology research tasks. That's a feature, not a flaw — it's what an honest benchmark looks like.

On **17 June**, OpenAI released **LifeSciBench**, a new test for measuring how good AI really is at scientific research — not quiz-style questions, but the kind of multi-step, evidence-juggling tasks actual scientists do. The benchmark was built by **173 scientists** (mostly with PhDs) and checked by **453 independent expert reviewers**. It covers everything from genetics to clinical medicine, with **750 tasks** and **19,020 grading criteria** — each task graded on roughly 25 different dimensions. The result? The best-performing model (OpenAI's own GPT-Rosalind) passed just **36.1%** of tasks. Other models scored between 13% and 26%. That's not a failure — that's the benchmark doing what a good benchmark should do: showing how much further there is to go.
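
A quick sanity check on those numbers, as a minimal Python sketch. It uses only the figures quoted above. The pass count is an approximation, since the article reports a percentage rather than a raw count.

```python
# Figures quoted in the article; variable names are illustrative.
tasks = 750
grading_criteria = 19_020
top_pass_rate_pct = 36.1

# Average number of grading criteria attached to each task.
criteria_per_task = grading_criteria / tasks

# Approximate number of tasks the top model passed (derived, not reported).
approx_tasks_passed = top_pass_rate_pct / 100 * tasks

print(f"Criteria per task: {criteria_per_task:.1f}")                           # 25.4
print(f"Tasks passed (approx.): {approx_tasks_passed:.0f} of {tasks}")         # 271 of 750
```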

> LifeSciBench was authored by 173 PhD-holding scientists and validated by 453 expert reviewers, achieving over 96% agreement on relevance, reasoning, grounding, and usefulness. GPT-Rosalind, the top-scoring model, passed 36.1% of the 750 tasks.
> — [MarkTechPost](https://www.marktechpost.com/2026/06/17/openai-releases-lifescibench-a-750-task-benchmark-grading-ai-models-on-real-life-science-research-with-expert-written-rubric/), 2026-06-17

## What this week actually means

The thread running through all four results: **AI isn't working alone**. In the rare-disease study, doctors confirmed the findings. In the chemistry experiment, human chemists checked every proposal. In LifeSciBench, expert reviewers validated every task. In Deployment Simulation (a fourth result showing OpenAI can now predict AI failures before models are released, using 1.3 million real conversations), the accuracy check is built into the method. What's changing isn't that AI has become independently trustworthy. What's changing is that teams are building the right checks around it so it can be useful even when it isn't perfect. That's quietly the most important development of the week.

## Key takeaways

- AI helped find diagnoses for 18 children whose rare diseases had stumped doctors for years — and every answer was double-checked through official clinical labs before it counted.
- Over three months, an AI model proposed and ranked ideas for 10,080 chemistry experiments run by an automated lab, finding a way to improve a difficult drug-discovery reaction. The results still need outside labs to confirm them.
- Even the best AI model only passed about one-in-three expert-level biology research tasks on OpenAI's new LifeSciBench — a reminder of how far there is still to go.
- OpenAI also published a smarter way to test AI for bad behaviour before it's released, catching problems that standard tests miss.
- The common thread in all four results: the AI isn't working alone — it's being checked by doctors, chemists, expert reviewers, or real user data at every step.

## FAQ

### Did AI cure rare diseases in children?
No — it helped doctors find leads. OpenAI's o3 model suggested possible causes for 376 unsolved cases; geneticists at Boston Children's Hospital checked those suggestions and confirmed 18 new diagnoses through certified clinical labs. The AI assisted the doctors; the doctors made the decisions.

### What is Chan-Lam coupling and why does the chemistry result matter?
Chan-Lam coupling is a way chemists build certain carbon–nitrogen bonds used in drug development. One difficult version has historically had low success rates. The AI-chemist collaboration raised those rates measurably across 10,080 experiments, though other labs still need to confirm the results.

### If AI only passes one-in-three research tasks, does that mean it's not useful?
Not at all. It means the benchmark is doing its job, showing where AI is genuinely limited. The diagnostic and chemistry results this week came from precisely the kind of narrow tasks where AI can already assist meaningfully. The 36.1% pass rate shows how much room remains for improvement on the harder tasks.

### What is Deployment Simulation in plain English?
A method for testing AI for bad behaviour before it's released, using real conversations from previous models instead of practice tests the AI can recognise and prepare for. It correctly predicted whether failure rates would go up or down 92% of the time — far better than the standard approach.
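
To make "predicted whether failure rates would go up or down" concrete, here is a minimal sketch of how a directional-accuracy score could be computed. This is not OpenAI's method or code: the function name `directional_accuracy` and the toy numbers are hypothetical, chosen only to show the idea.

```python
def directional_accuracy(predicted_changes, observed_changes):
    """Share of cases where the predicted and observed change point the same way.

    A positive number means the failure rate went up; zero or negative
    is treated as "down or flat" (a simplifying assumption).
    """
    if len(predicted_changes) != len(observed_changes):
        raise ValueError("Both lists must have the same length")
    hits = sum(
        (predicted > 0) == (observed > 0)
        for predicted, observed in zip(predicted_changes, observed_changes)
    )
    return hits / len(predicted_changes)

# Hypothetical toy data (NOT from OpenAI): change in failure rate, in points.
predicted = [0.4, -0.2, 0.1, -0.3, 0.2]
observed = [0.5, -0.1, -0.1, -0.2, 0.3]

score = directional_accuracy(predicted, observed)
print(f"Direction predicted correctly in {score:.0%} of cases")  # 80%
```

A score of 92% on this kind of measure would mean the direction was called correctly in roughly 9 out of 10 comparisons. It says nothing by itself about how close the predicted size of the change was.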
