- On December 16, 2025, OpenAI announced FrontierScience, a new benchmark designed to evaluate AI’s expert-level scientific reasoning capabilities in three core fields: physics, chemistry, and biology.
- The focus of FrontierScience is not on memorizing knowledge, but on true scientific thinking: hypothesis formation, verification, refinement, and interdisciplinary synthesis.
- OpenAI stated that advanced models like GPT-5 have been used by scientists in actual research, ranging from interdisciplinary literature reviews and multilingual research synthesis to complex mathematical proofs.
- According to OpenAI, many research tasks that once took days or weeks can now be completed in a few hours with these models.
- FrontierScience was created because older scientific benchmarks have become outdated or saturated, or are limited to multiple-choice questions, and so no longer measure genuine reasoning.
- For example, on the 2023 GPQA benchmark GPT-4 scored only 39% against an expert level of 70%, yet by 2025 GPT-5.2 had reached 92%, underscoring the need for tougher evaluations.
- FrontierScience consists of over 700 text-based questions, including 160 “gold-standard” questions directly compiled and validated by experts.
- The benchmark is divided into two branches: FrontierScience-Olympiad with 100 short-answer questions of International Science Olympiad difficulty, and FrontierScience-Research with 60 multi-step research problems constructed by PhDs.
- The Research section is graded against a 10-point rubric that evaluates both the final result and the intermediate reasoning steps; a score of 7/10 or higher counts as correct (see the sketch after this list).
- Initial results show GPT-5.2 achieving 77% on the Olympiad and 25% on the Research section, while Gemini 3 Pro reached 76% on the Olympiad, reflecting significant progress but also much room for improvement.
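Below is a minimal Python sketch of the pass/fail rule described for the Research branch, purely for illustration: the `GradedAnswer` type, its field names, and the sample scores are assumptions made here, not OpenAI’s actual grading code. Only the 10-point rubric and the 7/10 pass threshold come from the announcement.

```python
from dataclasses import dataclass

# Score of 7/10 or higher counts as correct, per the announcement.
PASS_THRESHOLD = 7

@dataclass
class GradedAnswer:
    question_id: str
    rubric_score: int  # 0-10, awarded against the expert rubric

def accuracy(graded: list[GradedAnswer]) -> float:
    """Fraction of answers whose rubric score clears the pass threshold."""
    if not graded:
        return 0.0
    passed = sum(1 for g in graded if g.rubric_score >= PASS_THRESHOLD)
    return passed / len(graded)

# Illustrative data: 60 research problems, 15 of which clear the threshold,
# reproducing the 25% reported for GPT-5.2 on the Research branch.
sample = [GradedAnswer(f"q{i}", 8 if i < 15 else 4) for i in range(60)]
print(f"Research accuracy: {accuracy(sample):.0%}")  # -> Research accuracy: 25%
```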
📌 OpenAI announced FrontierScience, a new benchmark that evaluates AI’s expert-level scientific reasoning in physics, chemistry, and biology, shifting the focus from memorized knowledge to genuine scientific thinking: hypothesis formation, verification, refinement, and interdisciplinary synthesis. Across more than 700 expert-built questions, GPT-5.2 reached 77% on theoretical problems but only 25% on open research tasks. AI is already strong enough to support working scientists, but a large gap remains before it can generate new scientific breakthroughs on its own.

