Pulse Brain · Growing Health Evidence Index
Tier 3 — Observational / field trialPreprint

Advancing ab initio genome annotation with OrionGeno

Liu, L.; Cai, X.; Wang, S.; Deng, Y.; Wu, Y.; Pan, Y.; Wang, J.; Zhang, C.; Xia, H.; Tan, N.; Su, K.; Liu, Y.; Zhou, X.; Liu, L.; Wei, T.; Zhang, Y.; Li, Q.; Li, Y.; Yin, P.; Xu, X.

bioRxiv · 2026

Read source ↗ All evidence

Summary

The rapid expansion of eukaryotic genome sequencing has created an urgent demand for scalable and accurate gene annotation, particularly for large-scale genomic initiatives such as the Earth BioGenome Project (EBP). Existing ab initio methods often struggle with complex gene architectures and exhibit limited cross-lineage generalizability. Moreover, these frameworks typically treat repetitive DNA sequences (repeats) as genomic noise to be pre-masked, leaving the joint modeling of genes and repeats largely unexplored. Here we present OrionGeno, a multispecies phylogeny-aware deep learning framework for end-to-end eukaryotic genome annotation. By integrating phylogenetic context into model learning, OrionGeno resolves complex gene structure variations across divergent lineages, jointly predicting exon-intron architectures, UTRs, and repeats directly from genomic sequences. Across Vertebrates, Invertebrates, Viridiplantae and Fungi, OrionGeno consistently outperforms state-of-the-art methods, achieving a 37.2% relative improvement in protein-level F1 score over the existing best-performing method. Beyond benchmarking, OrionGeno identifies novel loci within well-curated model genomes and generates high-confidence annotations for ~1,200 previously uncharacterized species, expanding NCBI's family-level coverage by 40.5%. As an evidence-independent approach, OrionGeno bridges the gap between genome sequencing and functional discovery, holding promise for large-scale biodiversity initiatives like the EBP.

Outcomes reported

The rapid expansion of eukaryotic genome sequencing has created an urgent demand for scalable and accurate gene annotation, particularly for large-scale genomic initiatives such as the Earth BioGenome Project (EBP). Existing ab initio methods often struggle with complex gene architectures and exhibit limited cross-lineage generalizability. Moreover, these frameworks typically treat repetitive DNA sequences (repeats) as genomic noise to be pre-masked, leaving the joint modeling of genes and repeats largely unexplored. Here we present OrionGeno, a multispecies phylogeny-aware deep learning framework for end-to-end eukaryotic genome annotation. By integrating phylogenetic context into model learning, OrionGeno resolves complex gene structure variations across divergent lineages, jointly predicting exon-intron architectures, UTRs, and repeats directly from genomic sequences. Across Vertebrates, Invertebrates, Viridiplantae and Fungi, OrionGeno consistently outperforms state-of-the-art methods, achieving a 37.2% relative improvement in protein-level F1 score over the existing best-performing method. Beyond benchmarking, OrionGeno identifies novel loci within well-curated model genomes and generates high-confidence annotations for ~1,200 previously uncharacterized species, expanding NCBI's family-level coverage by 40.5%. As an evidence-independent approach, OrionGeno bridges the gap between genome sequencing and functional discovery, holding promise for large-scale biodiversity initiatives like the EBP.

Theme
Farming systems, soils & land use
Subject
Other / interdisciplinary
Study type
Research
Source type
Preprint
Status
Preprint
Geography
United Kingdom
System type
Other
DOI
10.64898/2026.04.26.720859
Catalogue ID
IRmoq83umo-99eccd
Pulse AI · ask about this record

Dig deeper with Pulse AI.

Pulse AI has read the whole catalogue. Ask about this record, its theme, or how the findings apply to UK farming and policy — every answer cites the underlying studies.