AI research

OpenAI just showed AI working as a real science tool — here's what happened

In four days, OpenAI shared four results using AI in actual research. Here's the plain-English version of what they did and how confident we should be.

The SuggestedTech Team · Verified June 2026

In June 2026 OpenAI published four validated results showing AI assisting real scientific work.

You might have seen headlines last week about OpenAI doing 'science'. It was real: four separate results, published within days of each other, each showing AI helping with genuine research problems. Here's what actually happened, in plain English, along with an honest note on what's confirmed and what still needs more work.

Finding diagnoses for children that doctors couldn't crack

The most moving result: OpenAI's o3 model worked alongside geneticists at Boston Children's Hospital to look at 376 children whose rare genetic diseases had gone undiagnosed, sometimes for many years. The AI went through each child's symptoms, medical notes, and scientific literature, then suggested possible genetic causes for doctors to investigate further. After checking those suggestions through official clinical processes — including confirming any findings in certified laboratories — the team reached 18 new diagnoses. That's roughly 1 in 20 of the cases that had previously stumped specialists, which the study (published in the prestigious NEJM AI journal) calls a 4.8% additional diagnostic yield.
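For readers who want to check the arithmetic, the yield figure follows directly from the two counts quoted above. This is a minimal sketch; the function name is ours for illustration and is not taken from the study.

```python
def diagnostic_yield(new_diagnoses: int, cases_reviewed: int) -> float:
    """Return the additional diagnostic yield as a percentage."""
    if cases_reviewed <= 0:
        raise ValueError("cases_reviewed must be positive")
    return 100 * new_diagnoses / cases_reviewed


# Counts as quoted in the article: 18 new diagnoses out of 376 unsolved cases.
yield_pct = diagnostic_yield(18, 376)
print(f"{yield_pct:.1f}% additional yield")      # 4.8% additional yield
print(f"about 1 in {376 / 18:.0f} cases")        # about 1 in 21 cases
```

So "roughly 1 in 20" is a fair rounding of about 1 in 21.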

Here's what the AI didn't do: it didn't diagnose anyone. It generated leads — like a very well-read research assistant — and the doctors did the diagnosing. The cases included children with muscle weakness, developmental conditions, and early-onset psychosis. For some families, this ended searches that had lasted nearly two decades. OpenAI also says the hospital's overall use of AI tools has now contributed to more than 40 rare-disease diagnoses and saved 60,000 hours of staff time, though those broader numbers are the hospital's own estimate rather than independently verified.

OpenAI said its o3 model 'produced evidence-linked hypotheses for specialists to review and, where appropriate, investigate through additional testing and confirm in a clinical laboratory', explicitly noting the model 'did not diagnose any patient or make any clinical decisions'.

Source: OpenAI · 18 June 2026

Running thousands of chemistry experiments to improve a key drug-making reaction

The second result is harder to picture but equally interesting for anyone who cares about medicine. There's a type of chemical reaction called Chan-Lam coupling that scientists use to build molecules for new drugs. One tricky version of it, involving a chemical group called a primary sulfonamide, has always had frustratingly low success rates, making it hard to use in real drug development.

GPT-5.4 spent three months working with a Polish chemistry company called Molecule.one, reviewing scientific papers, suggesting experimental ideas, and ranking them. Their automated lab (Maria) then ran 10,080 reactions — the kind of quantity that would normally take much longer with human researchers alone. The AI suggested that adding a compound called TEMPO (a mild oxidising agent) might help. The results: average success rates rose from 16.6% to 25.2%, and the proportion of reactions reaching a useful production threshold jumped from 15.6% to 37.5%.
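Percentages of percentages are easy to misread, so it helps to separate the change in percentage points from the relative change. The sketch below does that for the two before-and-after pairs quoted above; the helper name is ours, and the inputs are just the article's figures.

```python
def describe_change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    if before_pct <= 0:
        raise ValueError("baseline must be positive")
    point_change = after_pct - before_pct
    relative_change = 100 * point_change / before_pct
    return point_change, relative_change


# Figures as quoted in the article.
for label, before, after in [
    ("average success rate", 16.6, 25.2),
    ("share reaching the useful threshold", 15.6, 37.5),
]:
    points, relative = describe_change(before, after)
    print(f"{label}: +{points:.1f} points, +{relative:.0f}% relative")
```

On these figures, the average rises by about 8.6 percentage points (roughly 52% in relative terms), while the share of reactions clearing the threshold rises by about 21.9 points, or about 2.4 times the starting level.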

The honest caveat, which OpenAI states clearly: the system was 'near-autonomous, not autonomous' — humans checked every proposal before it went into the lab, and the results need to be confirmed by other laboratories before the chemistry community can treat this as settled. But it's a meaningful proof of concept that AI can help speed up a type of research that currently takes years.

A sobering benchmark: AI passes only 1-in-3 research tasks

On 17 June, OpenAI released LifeSciBench, a new test for measuring how good AI really is at scientific research — not quiz-style questions, but the kind of multi-step, evidence-juggling tasks actual scientists do. The benchmark was built by 173 scientists (mostly with PhDs) and checked by 453 independent expert reviewers. It covers everything from genetics to clinical medicine, with 750 tasks and 19,020 grading criteria — each task graded on roughly 25 different dimensions. The result? The best-performing model (OpenAI's own GPT-Rosalind) passed just 36.1% of tasks. Other models scored between 13% and 26%. That's not a failure — that's the benchmark doing what a good benchmark should do: showing how much further there is to go.
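The "roughly 25 dimensions" figure is simply the criteria count divided by the task count. The short check below also converts the top pass rate into an approximate number of tasks, on the assumption (ours, not stated in the source) that the rate is measured over all 750 tasks.

```python
TASKS = 750            # tasks in the benchmark, as quoted
CRITERIA = 19_020      # grading criteria, as quoted
TOP_PASS_RATE = 0.361  # best model's pass rate, as quoted

criteria_per_task = CRITERIA / TASKS
# Assumption: the pass rate is computed over all 750 tasks.
approx_tasks_passed = round(TOP_PASS_RATE * TASKS)

print(f"{criteria_per_task:.1f} criteria per task on average")    # 25.4
print(f"roughly {approx_tasks_passed} of {TASKS} tasks passed")   # roughly 271 of 750
```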

LifeSciBench was authored by 173 scientists, most of them PhD holders, and validated by 453 expert reviewers, achieving over 96% agreement on relevance, reasoning, grounding, and usefulness. GPT-Rosalind, the top-scoring model, passed 36.1% of the 750 tasks.

Source: MarkTechPost · 17 June 2026

What this week actually means

The unifying thread in all four results: AI isn't working alone. In the rare-disease study, doctors confirmed the findings. In the chemistry experiment, human chemists checked every proposal. In LifeSciBench, expert reviewers validated every question. The fourth result, Deployment Simulation, is a method OpenAI says can predict AI failures before models are released, using 1.3 million real conversations; there, the accuracy check is built into the method. What's changing isn't that AI has become independently trustworthy. What's changing is that teams are building the right checks around it so it can be useful even when it isn't perfect. That's quietly the most important development of the week.

Frequently asked questions

Did AI cure rare diseases in children?
No. It helped doctors find leads. OpenAI's o3 model suggested possible causes for 376 unsolved cases; geneticists at Boston Children's Hospital checked those suggestions and confirmed 18 new diagnoses through certified clinical labs. The AI assisted the doctors; the doctors made the decisions.

What is Chan-Lam coupling and why does the chemistry result matter?
Chan-Lam coupling is a way chemists build certain carbon–nitrogen bonds used in drug development. One difficult version has historically had low success rates. The AI-chemist collaboration raised those rates measurably across 10,080 experiments, though other labs still need to confirm the results.

If AI only passes one-in-three research tasks, does that mean it's not useful?
Not at all. It means the benchmark is doing its job, showing where AI is genuinely limited. The diagnostic and chemistry results this week are precisely the narrow tasks where AI can already assist meaningfully. The 36.1% pass rate shows how much room remains for improvement in the harder tasks.

What is Deployment Simulation in plain English?
A method for testing AI for bad behaviour before it's released, using real conversations from previous models instead of practice tests the AI can recognise and prepare for. It correctly predicted whether failure rates would go up or down 92% of the time, far better than the standard approach.
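
A figure like "92% of the time" is what's usually called directional accuracy: the share of cases where the predicted direction of change matches what was later observed. The sketch below shows one way such a number could be computed. It is not OpenAI's published method, the function name is ours, and the data are invented purely for illustration.

```python
from typing import Sequence


def directional_accuracy(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Fraction of cases where predicted and observed changes point the same way.

    A change counts as "up" if it is greater than zero; zero or negative
    counts as "not up". That tie-breaking rule is an assumption of this sketch.
    """
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum((p > 0) == (o > 0) for p, o in zip(predicted, observed))
    return hits / len(predicted)


# Invented changes in failure rate (percentage points) for five behaviours.
predicted_change = [0.4, -0.2, 0.1, -0.5, 0.3]
observed_change = [0.6, -0.1, -0.2, -0.3, 0.2]

print(directional_accuracy(predicted_change, observed_change))  # 0.8
```

In this toy example the prediction gets the direction right in four of five cases, so the score is 0.8; the reported 92% would be the same kind of ratio over a much larger set of comparisons.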
