A new artificial intelligence model, Evo 2, is capable of analyzing and designing genomic sequences from all forms of life, representing a significant leap forward in our ability to understand and manipulate DNA.
A key application of this technology is predicting the impact of genetic variations on human health.
For years, computational biology has sought to replicate the success of large language models by applying them to the study of DNA. Early AI tools in this field were limited to analyzing isolated proteins or specific bacterial genomes. However, the complexity of eukaryotic life – and the intricate interactions within genomes like that of humans – presented a substantial technical hurdle. Evo 2 is designed to address those challenges.
Evo 2 was developed by researchers at the Arc Institute and NVIDIA, in collaboration with scientists from Stanford University, the University of California, Berkeley, and the University of California, San Francisco. It marks a turning point in DNA interpretation. The model operates on the principle that DNA sequences can be analyzed much like human language: just as language models learn patterns from text, Evo 2 learns regularities within nucleotide sequences. It was trained on more than 9.3 trillion nucleotides from more than 128,000 complete genomes and metagenomic data encompassing bacteria, archaea, eukaryotes, and bacteriophages.
The paper describing the model, recently published in Nature, presents a system capable of identifying functional patterns in DNA and using that information both to interpret genetic variants and to generate new biological sequences. This advancement could accelerate research into genetic diseases and personalized medicine.
How Evo 2 Was Developed
To handle the immense size of genomic sequences, researchers developed a specialized computational architecture called StripedHyena 2. This architecture allows Evo 2 to analyze sequences of up to one million nucleotides in a single input, enabling the detection of relationships between genetic elements separated by large distances within the genome – a particularly relevant capability for organisms with complex genomes.
The model was trained on the OpenGenome2 dataset, a curated, non-redundant collection of genomic sequences from organisms across all domains of life. Training lasted several months and used the NVIDIA DGX Cloud AI platform with more than 2,000 H100 GPUs to process trillions of nucleotides, with the goal of learning the relationships between sequences and biological functions.
Predicting the Impact of Genetic Variants with Evo 2
Evo 2 has a wide range of applications within computational biology, including identifying functional elements of the genome, generating new DNA sequences, studying genome organization, and predicting the effect of mutations on proteins and organisms. The ability to predict the impact of genetic variations is particularly valuable for human genetics, as understanding the functional consequences of these variations is central to interpreting genomic data in biomedical research and molecular diagnostics.
The model learns evolutionary patterns present in large sets of genomic sequences. When a mutation is introduced into a sequence, Evo 2 can calculate how much the mutation changes the probability it assigns to that sequence, that is, how compatible the sequence remains with the patterns the model has observed. Mutations that reduce this probability may be associated with detrimental effects on biological function.
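The scoring idea described above can be illustrated with a minimal sketch. It does not use Evo 2 or its API: a tiny k-mer Markov model stands in for the genomic language model, and the function names (train_markov, log_likelihood, delta_score) are hypothetical, chosen only for this illustration. The point is the comparison itself: score the reference sequence, score the mutated sequence, and take the difference in log-likelihood.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_markov(sequences, k=2):
    """Toy stand-in for a genomic language model (NOT Evo 2):
    count which base follows each k-mer context in the training sequences."""
    counts = {}
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]
            counts.setdefault(context, Counter())[seq[i]] += 1
    return {"k": k, "counts": counts}

def log_likelihood(model, seq, pseudocount=1.0):
    """Sum of log P(base | previous k bases), with add-one smoothing
    so unseen contexts and bases still get a nonzero probability."""
    k = model["k"]
    total = 0.0
    for i in range(k, len(seq)):
        context_counts = model["counts"].get(seq[i - k:i], Counter())
        numerator = context_counts[seq[i]] + pseudocount
        denominator = sum(context_counts.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

def delta_score(model, reference, position, alt_base):
    """Log-likelihood of the mutated sequence minus that of the reference.
    Negative values mean the mutation makes the sequence less probable."""
    mutant = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(model, mutant) - log_likelihood(model, reference)

if __name__ == "__main__":
    # Hypothetical training data: a simple repeating pattern.
    model = train_markov(["ACGT" * 10])
    reference = "ACGTACGTACGT"
    # Substituting the C at index 5 with T breaks the learned pattern.
    print(delta_score(model, reference, 5, "T"))
```

In this toy setting a substitution that breaks the learned pattern receives a negative score, while "mutating" a base to itself scores exactly zero. A real genomic model would replace the k-mer counts with learned next-nucleotide probabilities over a much longer context.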
Researchers demonstrated this capability by analyzing variants of the BRCA1 gene, which is associated with hereditary breast and ovarian cancer. When tested on sets of BRCA1 variants with known functional effects, the model was able to differentiate between benign variants and those with loss-of-function effects based on changes in sequence probability upon mutation. Evo 2 showed the ability to analyze variants in both coding regions and non-coding regions near messenger RNA processing sites, suggesting this type of model could help prioritize clinically relevant variants in genomic studies.
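One common way to quantify how well a score separates two groups of variants is the area under the ROC curve (AUROC). The article does not say which metric the researchers used, so the sketch below is only an assumption about how such an evaluation could look. The function name auroc is hypothetical, and the score values are invented purely to show the mechanics; they are not results from the study.

```python
def auroc(damaging_scores, benign_scores):
    """Probability that a randomly chosen damaging variant gets a higher
    score than a randomly chosen benign one (ties count as half)."""
    wins = 0.0
    for damaging in damaging_scores:
        for benign in benign_scores:
            if damaging > benign:
                wins += 1.0
            elif damaging == benign:
                wins += 0.5
    return wins / (len(damaging_scores) * len(benign_scores))

# Invented log-likelihood changes, for illustration only (not study data).
loss_of_function_deltas = [-4.0, -2.5, -3.1]
benign_deltas = [-0.2, 0.1, -2.8]

# A larger drop in likelihood is treated as "more damaging",
# so the sign is flipped before ranking.
damage_lof = [-d for d in loss_of_function_deltas]
damage_benign = [-d for d in benign_deltas]

if __name__ == "__main__":
    print(auroc(damage_lof, damage_benign))
```

A value of 1.0 would mean every loss-of-function variant scores as more damaging than every benign one, while 0.5 corresponds to no separation at all.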
Evo 2 Can Design Genomic Sequences
Beyond analyzing genetic variants, Evo 2 can also be used to generate new DNA sequences. This capability is based on learning patterns present in the genomes of different organisms, allowing the model to produce sequences that maintain characteristics similar to those observed in nature.
The model can produce coherent sequences at the genome scale, including complete human mitochondrial sequences and bacterial genomes of hundreds of thousands of base pairs. In tests using the minimal genome of the bacterium Mycoplasma genitalium, researchers generated sequences of approximately 580 kilobases containing genes with structural characteristics similar to natural genes.
Researchers have also used Evo 2 to design functional synthetic bacteriophages as potential therapeutic alternatives to antibiotics. These results suggest that Evo 2 can generate genomic sequences with plausible biological characteristics, opening the door to the computational design of biological components. For example, Evo 2 can generate regulatory sequences capable of modifying chromatin accessibility in human cells, potentially facilitating the development of more specific gene editing tools or gene therapies.
“If you have a gene therapy that you want to activate only in neurons or liver cells, it would be possible to design a genetic element that is only accessible in those cell types,” noted Hani Goodarzi, a researcher at the Arc Institute and co-author of the study.
A New Generation of Genomic Models
The development of Evo 2 illustrates how artificial intelligence can contribute to integrating different levels of biological information from genomic sequences. By learning conserved patterns in highly diverse organisms, the model allows for the exploration of the genome from a comparative perspective, encompassing everything from small regulatory elements to complete genomes. The findings could lead to a more comprehensive understanding of genetic function and disease.
As the study authors state: “The Evo series of models establishes the foundations for biological modeling and design that unifies the different length scales of biology through a common representation. These capabilities, combined with large-scale DNA manipulation technologies, could enable the programmable design of more complex biological functions. We anticipate that future work integrating genomic sequence data with other modalities may lead to models capable of usefully simulating complex phenotypes in health and disease.”
As models that integrate different types of biological data – such as transcriptomic, epigenomic, or proteomic information – are developed, tools like Evo 2 could contribute to advancing computational models capable of more accurately predicting how genetic variations influence phenotypes and the development of diseases.
The Evo 2 model has been released as an open-source resource, including the code, model parameters, and the OpenGenome2 training dataset. This will allow other research groups to use the system and build new applications on it. Alongside AlphaGenome, a tool developed by the Google DeepMind team to interpret the human genome, Evo 2 promises significant advances for biology and medicine.
Scientific Article
Brixi, G., Durrant, M.G., Ku, J. et al. Genome modelling and design across all domains of life with Evo 2. Nature (2026). https://doi.org/10.1038/s41586-026-10176-5
Sources
With Evo 2, AI can model and design the genetic code for all domains of life. https://www.eurekalert.org/news-releases/1118060
