Comparative genomics: the bacterial pan-genome
Introduction
The advent of ultra-high-throughput next-generation sequencing technologies, such as Roche-454 Life Sciences (www.roche-applied-science.com), Solexa-Illumina (www.illumina.com), and ABI-SOLiD (www.appliedbiosystems.com), together with large-scale comparative genomics sequencing projects (e.g. http://www3.niaid.nih.gov/research/resources/mscs/, http://www.sanger.ac.uk/Projects/Microbes/, and http://genome.jgi-psf.org/mic_home.html), is making whole-genome sequences available for many strains of several bacterial species. Although until recently there has been a strong bias toward medically and environmentally relevant species, it is conceivable that in the near future many genome sequences will be available for most known bacterial species. Even unculturable species can now be tackled thanks to the emerging field of single-cell genomics [1].
Comparative genomic analyses of multiple genomes from individual species have revealed extensive intra-species genomic diversity [2]. Given today’s ease of generating draft whole-genome sequences, it would be valuable to know how many genomes must be sequenced for any given species to accurately represent its entire gene repertoire. One way to tackle this problem is to ask how many new genes are identified each time a new genome of the species of interest is sequenced. Tettelin et al. [3••] pioneered this approach using multiple genomes of Streptococcus agalactiae, followed by Hogg et al. [4••], who studied Haemophilus influenzae genomes. In both cases, the analyses defined a core genome, which consists of genes shared by all the strains studied and probably encodes functions related to the basic biology and phenotypes of the species. The striking finding of both studies was that a significant percentage of each genome sequence was specific to an individual strain, so each newly sequenced genome contributed genes not previously characterized. Thus, the species’ gene repertoire was significantly larger than that of any single strain, and a large number of genomes would have to be sequenced to characterize it. This led the authors to the concept of the bacterial pan-genome or supragenome, the topic of this review. The pan-genome is the sum of the core genome and the dispensable genome, which is composed of genes present in some but not all of the strains studied as well as strain-specific genes. The dispensable genome contributes to the species’ diversity and probably provides functions that are not essential to its basic lifestyle but confer selective advantages, including niche adaptation, antibiotic resistance, and the ability to colonize new hosts.
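The definitions above can be illustrated with a minimal sketch. It assumes each strain has already been reduced to a set of gene-family identifiers (the clustering step that produces such sets is not shown); the function name, input format, and toy data are hypothetical and are not taken from the studies cited.

```python
from collections import Counter


def partition_pan_genome(strains):
    """Split gene families into pan, core, dispensable and strain-specific sets.

    `strains` maps a strain name to the set of gene-family identifiers
    found in that strain (hypothetical input format).
    """
    # Number of strains in which each gene family occurs.
    occurrences = Counter(gene for genes in strains.values() for gene in genes)
    n_strains = len(strains)

    pan = set(occurrences)                                        # every gene seen
    core = {g for g, c in occurrences.items() if c == n_strains}  # in all strains
    dispensable = pan - core                                      # in some, not all
    strain_specific = {g for g, c in occurrences.items() if c == 1}
    return {
        "pan": pan,
        "core": core,
        "dispensable": dispensable,
        "strain_specific": strain_specific,
    }


# Toy data: three strains, five gene families (made up for illustration).
strains = {
    "strainA": {"g1", "g2", "g3"},
    "strainB": {"g1", "g2", "g4"},
    "strainC": {"g1", "g5"},
}
result = partition_pan_genome(strains)
print(sorted(result["core"]))             # ['g1']
print(sorted(result["strain_specific"]))  # ['g3', 'g4', 'g5']
```

In this sketch the strain-specific genes are treated as a subset of the dispensable genome, following the definition given in the text.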
The Streptococcus agalactiae pan-genome model
The first pan-genome analysis was conducted on S. agalactiae, a major cause of disease in newborns, infants, and the elderly [5, 6]. On the basis of the first S. agalactiae genome sequence and its use as the reference strain in microarray-based comparative genomic hybridizations [7], it was determined that this species’ genomic diversity was fairly extensive. Eighteen percent of the genes in the reference genome were absent in at least one of the 19 isolates tested on the microarray and most of
The Haemophilus influenzae pan-genome model
Hogg et al. [4••] further developed the pan-genome analysis by proposing a new model for the analysis of 13 H. influenzae genomes. The model takes into account the way in which the individual genes are distributed among the different genomes. Rather than trying to extrapolate the trend for new genes discovered by means of a simple regression, they probabilistically assign genes to classes that represent the proportion of strains in the population that possess the gene. Then they generate
Heaps’ law and a new model for open pan-genomes
From a fundamental point of view, measuring the size of the pan-genome is one instance of a general class of measurements where, given a collection of ‘entities’ and their ‘attributes’, the number of distinct attributes that have been observed is monitored as a function of the number of entities considered. For the case of the pan-genome, ‘entities’ are genomes and their ‘attributes’ are genes. In many cases of practical interest it is known that the number n of distinct attributes grows
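The entity/attribute measurement described here can be sketched in a few lines. The example below counts the distinct attributes observed as entities are added one at a time; the function name and toy gene sets are hypothetical. The resulting curve depends on the order in which entities are considered, so a real analysis would presumably repeat the count over many orderings.

```python
def distinct_attribute_counts(entities):
    """Return the number of distinct attributes seen after each entity is added.

    `entities` is an ordered sequence of attribute sets; for a pan-genome,
    each set holds the gene identifiers of one genome.
    """
    seen = set()
    counts = []
    for attributes in entities:
        seen.update(attributes)   # attributes already observed add nothing new
        counts.append(len(seen))
    return counts


# Toy genomes in one fixed order (made up for illustration).
genomes = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g5"},
    {"g1", "g2"},
]
curve = distinct_attribute_counts(genomes)
print(curve)  # [3, 4, 5, 5]
```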
Applying Heaps’ law to a number of species
A subset of nine bacterial species for which nine or more whole genome sequences were available with annotation in GenBank (either in the complete genome or the whole genome shotgun (WGS) sections) was selected for application of the pan-genome model with power law regression using medians (Figure 3). In all cases the model fitted the data well. The analysis revealed five species with an open pan-genome: S. agalactiae, S. pneumoniae, E. coli, B. cereus, and P. marinus. The diversity in lifestyle
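As a rough illustration of power law regression using medians, the sketch below fits median new-gene counts to the form κ·N^(−α) by ordinary least squares in log-log space. The parameterization, the function name, the input format, and the synthetic data are all assumptions made for this example; the excerpt does not give the exact fitting procedure used for Figure 3.

```python
import math
from statistics import median


def fit_power_law_to_medians(new_gene_counts):
    """Fit median(new genes) = kappa * N**(-alpha) in log-log space.

    `new_gene_counts` maps N (number of genomes considered) to a list of
    new-gene counts observed over different genome orderings (hypothetical
    input format). N values whose median is zero are skipped because the
    logarithm is undefined there.
    """
    points = []
    for n, counts in sorted(new_gene_counts.items()):
        m = median(counts)
        if m > 0:
            points.append((math.log(n), math.log(m)))
    if len(points) < 2:
        raise ValueError("need at least two N values with a positive median")

    # Closed-form simple linear regression: log(m) = intercept + slope * log(N).
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope  # alpha is the negated slope


# Synthetic data whose medians follow 100 * N**-0.5 exactly (not real genomes).
synthetic = {}
for n in range(2, 10):
    v = 100 * n ** -0.5
    synthetic[n] = [0.5 * v, v, 2 * v]  # the median of each list is v

kappa, alpha = fit_power_law_to_medians(synthetic)
print(round(kappa, 3), round(alpha, 3))
```

Under a decay of this form, the cumulative gene count keeps growing without bound when α ≤ 1, because the sum of N^(−α) diverges, which corresponds to an open pan-genome; for α > 1 the sum converges and the repertoire approaches a finite size.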
Conclusions
Bacterial intra-species diversity keeps unveiling its depth and secrets as the power and speed of whole genome sequencing technologies increase. Several groups have made use of pan-genome models in an attempt to better characterize the breadth of the gene repertoire accessible to individual microbes and understand the amount of additional genomic data required for proper characterization of this repertoire. Defining the pan-genome of a bacterium sheds light on its biology and lifestyle and
References and recommended reading
Papers of particular interest published within the period of review have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgements
We are grateful to Joshua Orvis, Anup Mahurkar, and Dave Kemeza for their dedication in pushing the intensive pan-genome computes through the IGS computer grid.
References (20)
- et al. Microbiology in the post-genomic era. Nat Rev Microbiol (2008)
- et al. Single-cell genomics. Nat Rev Microbiol (2008)
- et al. Bacterial pathogenomics. Nature (2007)
- et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial ‘pan-genome’. Proc Natl Acad Sci U S A (2005)
- et al. Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biol (2007)
- et al. Molecular pathogenesis of neonatal group B streptococcal infection: no longer in its infancy. Mol Microbiol (2004)
- et al. Epidemiology of group B streptococcal disease. Risk factors, prevention strategies, and vaccine development. Epidemiol Rev (1994)
- et al. Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc Natl Acad Sci U S A (2002)
- et al. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition. Genome Biol (2007)
- et al. Whole-genome comparison of disease and carriage strains provides insights into virulence evolution in Neisseria meningitidis. Proc Natl Acad Sci U S A (2008)