Accurate estimation of intraspecific microbial gene content variation in metagenomic data with MIDAS v3 and StrainPGC

Byron J. Smith, Chunyu Zhao, Veronika Dubinkina, Xiaofan Jin, Liron Zahavi, Saar Shoer, Jacqueline Moltzau-Anderson, Eran Segal, Katherine S. Pollard*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Metagenomics has greatly expanded our understanding of the human gut microbiome by revealing a vast diversity of bacterial species within and across individuals. Even within a single species, different strains can have highly divergent gene content, affecting traits such as antibiotic resistance, metabolism, and virulence. Methods that harness metagenomic data to resolve strain-level differences in functional potential are crucial for understanding the causes and consequences of this intraspecific diversity. The enormous size of pangenome references, strain mixing within samples, and inconsistent sequencing depth present challenges for existing tools that analyze samples one at a time. To address this gap, we updated the MIDAS pangenome profiler, now released as version 3, and developed StrainPGC, an approach to strain-specific gene content estimation that combines strain tracking and correlations across multiple samples. We validate our integrated analysis using a complex synthetic community of strains from the human gut and find that StrainPGC outperforms existing approaches. Analyzing a large, publicly available metagenome collection from inflammatory bowel disease patients and healthy controls, we catalog the functional repertoires of thousands of strains across hundreds of species, capturing extensive diversity missing from reference databases. Finally, we apply StrainPGC to metagenomes from a clinical trial of fecal microbiota transplantation for the treatment of ulcerative colitis. We identify two Escherichia coli strains, from two different donors, that are both frequently transmitted to patients but have notable differences in functional potential. StrainPGC and MIDAS v3 together enable precise, intraspecific pangenomic investigations using large collections of metagenomic data without microbial isolation or de novo assembly.

Original language: English
Pages (from-to): 1247-1260
Number of pages: 14
Journal: Genome Research
Volume: 35
Issue number: 5
Early online date: 10 Apr 2025
Publication status: Published - May 2025

Funding

This work was funded by National Heart, Lung, and Blood Institute grant HL160862, the Chan Zuckerberg Biohub San Francisco, Gladstone Institutes, and the San Simeon Fund. B.J.S. was supported by a Computational Innovation Postdoctoral Fellowship from the Noyce Initiative for Digital Transformation in Computational Biology and Health Data Science. J.M-A. was supported by funding from the Kenneth Rainin Foundation and the Crohn's and Colitis Foundation. We thank Françoise Chanut for extensive editorial support.

All Science Journal Classification (ASJC) codes

  • Genetics
  • Genetics (clinical)
