Developing modern machine-learning (ML) models requires large, diverse datasets of high-quality training data. State-of-the-art quantum chemical methods, and density functional theory (DFT) in particular, can generate a variety of molecular and atomic descriptors for building ML potentials. Creating millions to billions of data points challenges not only the theoretical methods but also the quantum chemistry code that runs them. In this context, efficiency and robustness become integral to generating reference data successfully at large scale.
The recent release of the Open Molecules 2025 (OMol25) dataset, led by Meta and researchers at Berkeley Lab, marks a new milestone in the development of quantum chemical datasets. It comprises 83 million unique molecular systems with up to 350 atoms, computed in more than 100 million hybrid DFT calculations that consumed more than 6 billion CPU core-hours, which underlines the herculean effort put into OMol25.
“Built with the high-performance quantum chemistry program package ORCA (Version 6.0.1), OMol25 contains simulations of large atomic systems that, until now, have been out of reach.” – Meta
We are particularly proud that the researchers behind the OMol25 project placed their trust in ORCA, and that our continuous development of ORCA and its highly efficient algorithms, such as RIJCOSX, became part of this achievement.
Would you like to learn more about the OMol25 dataset and the models developed by its authors?