Comparative genomics: the bacterial pan-genome

doi:10.1016/j.mib.2008.09.006

Current Opinion in Microbiology

Volume 11, Issue 5, October 2008, Pages 472-477

Bacterial genome sequencing has become so easy and accessible that the genomes of multiple strains of more and more individual species have been, and will continue to be, generated. These data sets allow in-depth analysis of intra-species diversity from various aspects. The pan-genome analysis, whereby the size of the gene repertoire accessible to any given species is characterized together with an estimate of the number of whole genome sequences required for proper analysis, is being increasingly applied. Different models exist for the analysis, and their accuracy and applicability depend on the case at hand. Here we discuss current models and suggest a new model of broad applicability, including examples of its implementation.

Introduction

The advent of ultra-high-throughput next-generation sequencing technologies, for example, Roche-454 Life Sciences (www.roche-applied-science.com), Solexa-Illumina (www.illumina.com), and ABI-SOLiD (www.appliedbiosystems.com), together with large-scale comparative genomics sequencing projects (e.g. http://www3.niaid.nih.gov/research/resources/mscs/, http://www.sanger.ac.uk/Projects/Microbes/, and http://genome.jgi-psf.org/mic_home.html), is leading to the availability of whole genome sequences for many strains of several bacterial species. Although until recently there has been a strong bias toward medically and environmentally relevant species, it is conceivable that in the near future many genome sequences will be available for most known bacterial species. Even unculturable species can now be tackled thanks to the emerging field of single cell genomics [1].

Comparative genomic analyses of multiple genomes of individual species have revealed extensive genomic intra-species diversity [2]. Given today’s ease of generating draft whole genome sequences, it would be of value to know how many genomes should be sequenced for any given species to accurately represent its entire gene repertoire. A way to tackle this problem is to ask how many new genes are identified every time a new genome of the species of interest is sequenced. Tettelin et al. [3••] pioneered this approach using multiple genomes of Streptococcus agalactiae, followed by Hogg et al. [4••] who studied Haemophilus influenzae genomes. In both cases, the analyses resulted in the determination of a core genome that consists of genes shared by all the strains studied and that probably encode functions related to the basic biology and phenotypes of the species. The striking feature of the studies was the realization that a significant percentage of each genome sequence was specific to each individual strain and therefore each new genome sequenced provided a number of new genes not previously characterized. Thus, the species’ gene repertoire was significantly larger than that of any single strain of that species and a large number of genomes would have to be sequenced to characterize it. This led the authors to the concept of the bacterial pan-genome or supragenome, the topic of this review. The pan-genome is the sum of the above core genome and the dispensable genome, which is composed of genes present in some but not all the strains studied as well as the strain-specific genes. The dispensable genome contributes to the species’ diversity and probably provides functions that are not essential to its basic lifestyle but confer selective advantages including niche adaptation, antibiotic resistance, and the ability to colonize new hosts.
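The definitions above can be illustrated with a minimal Python sketch. It assumes each strain has already been reduced to a set of gene-family identifiers; the strain names, gene names, and the function `partition_pan_genome` are hypothetical and serve only to show how the core, dispensable, and strain-specific partitions relate to one another.

```python
from collections import Counter


def partition_pan_genome(strains):
    """Split gene families into pan, core, dispensable and strain-specific sets.

    `strains` maps a strain name to the set of gene families it carries
    (a hypothetical input format chosen for this illustration).
    """
    n_strains = len(strains)
    # Count in how many strains each gene family occurs.
    counts = Counter(gene for genes in strains.values() for gene in set(genes))
    pan = set(counts)
    # Core genome: families present in every strain studied.
    core = {gene for gene, c in counts.items() if c == n_strains}
    # Dispensable genome: everything in the pan-genome that is not core.
    dispensable = pan - core
    # Strain-specific genes: dispensable families seen in exactly one strain.
    strain_specific = {gene for gene in dispensable if counts[gene] == 1}
    return {
        "pan": pan,
        "core": core,
        "dispensable": dispensable,
        "strain_specific": strain_specific,
    }


# Toy data, invented for illustration only.
strains = {
    "strain_A": {"dnaA", "gyrB", "recA", "capX"},
    "strain_B": {"dnaA", "gyrB", "recA", "capX", "tetM"},
    "strain_C": {"dnaA", "gyrB", "recA", "phage1"},
}
result = partition_pan_genome(strains)
```

In this toy input the pan-genome holds six gene families, three of which are core; the strain-specific genes form a subset of the dispensable genome, in line with the definition given in the text.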

Section snippets

The Streptococcus agalactiae pan-genome model

The first pan-genome analysis was conducted on S. agalactiae, a major cause of disease in newborns, infants, and the elderly [5, 6]. On the basis of the first S. agalactiae genome sequence and its use as the reference strain in microarray-based comparative genomic hybridizations [7], it was determined that this species’ genomic diversity was fairly extensive. Eighteen percent of the genes in the reference genome were absent in at least one of the 19 isolates tested on the microarray and most of

The Haemophilus influenzae pan-genome model

Hogg et al. [4••] further developed the pan-genome analysis by proposing a new model for the analysis of 13 H. influenzae genomes. The model takes into account the way in which the individual genes are distributed among the different genomes. Rather than trying to extrapolate the trend for new genes discovered by means of a simple regression, they probabilistically assign genes to classes that represent the proportion of strains in the population that possess the gene. Then they generate

Heaps’ law and a new model for open pan-genomes

From a fundamental point of view, measuring the size of the pan-genome is one instance of a general class of measurements where, given a collection of ‘entities’ and their ‘attributes’, the number of distinct attributes that have been observed is monitored as a function of the number of entities considered. For the case of the pan-genome, ‘entities’ are genomes and their ‘attributes’ are genes. In many cases of practical interest it is known that the number n of distinct attributes grows
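The measurement described in this snippet, counting distinct attributes as entities are added, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: genomes are represented as sets of gene identifiers, the order of addition is sampled at random rather than enumerated, and the names `new_genes_along_order` and `median_new_genes` are hypothetical.

```python
import random
from statistics import median


def new_genes_along_order(genomes, order):
    """Return the number of previously unseen genes contributed by each
    genome when genomes are added in the given order."""
    seen = set()
    new_counts = []
    for name in order:
        genes = genomes[name]
        new_counts.append(len(genes - seen))
        seen |= genes
    return new_counts


def median_new_genes(genomes, n_permutations=200, seed=0):
    """Median number of new genes at each step over random addition orders.

    Random sampling of orders is an assumption of this sketch; the seed
    only makes the run reproducible.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    per_step = [[] for _ in names]
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        for step, count in enumerate(new_genes_along_order(genomes, order)):
            per_step[step].append(count)
    return [median(counts) for counts in per_step]


# Toy data, invented for illustration only.
toy_genomes = {
    "strain_A": {"dnaA", "gyrB", "recA", "capX"},
    "strain_B": {"dnaA", "gyrB", "recA", "capX", "tetM"},
    "strain_C": {"dnaA", "gyrB", "recA", "phage1"},
}
fixed_order_counts = new_genes_along_order(
    toy_genomes, ["strain_A", "strain_B", "strain_C"]
)
curve = median_new_genes(toy_genomes)
```

Whatever the order, the counts along a single order sum to the number of distinct genes observed in total, so the cumulative sum of such a curve traces the growth of the observed pan-genome.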

Applying Heaps’ law to a number of species

A subset of nine bacterial species for which nine or more whole genome sequences were available with annotation in GenBank (either in the complete genome or the whole genome shotgun, WGS, sections) was selected for application of the pan-genome model with power law regression using medians (Figure 3). In all cases the model fitted the data well. The analysis revealed five species with an open pan-genome: S. agalactiae, S. pneumoniae, E. coli, B. cereus, and P. marinus. The diversity in lifestyle
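A power law regression of the kind mentioned above can be sketched as an ordinary least-squares fit in log-log space. The snippet does not give the authors' fitting procedure, so the method, the function name `fit_power_law`, and the synthetic input values are all assumptions made for illustration; no real species data are used.

```python
import math


def fit_power_law(n_genomes, median_new_genes):
    """Fit y = kappa * N**exponent by least squares on log-transformed data.

    Points with non-positive values are skipped because their logarithm
    is undefined (an assumption of this sketch).
    """
    points = [
        (math.log(n), math.log(y))
        for n, y in zip(n_genomes, median_new_genes)
        if n > 0 and y > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct genome counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    exponent = sxy / sxx
    kappa = math.exp(mean_y - exponent * mean_x)
    return kappa, exponent


# Synthetic medians generated from a known power law, for illustration only.
n_values = list(range(2, 10))
synthetic_medians = [100.0 * n ** -0.5 for n in n_values]
kappa, exponent = fit_power_law(n_values, synthetic_medians)
```

On this synthetic input the fit recovers the generating parameters up to rounding error. As a purely mathematical remark, the cumulative sum of a curve decaying as N raised to the power minus alpha grows without bound when alpha is at most 1 and converges otherwise, which is one way to read the exponent of such a fit.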

Conclusions

Bacterial intra-species diversity keeps unveiling its depth and secrets as the power and speed of whole genome sequencing technologies increase. Several groups have made use of pan-genome models in an attempt to better characterize the breadth of the gene repertoire accessible to individual microbes and understand the amount of additional genomic data required for proper characterization of this repertoire. Defining the pan-genome of a bacterium sheds light on its biology and life style and

References and recommended reading

Papers of particular interest published within the period of review have been highlighted as:

• of special interest
•• of outstanding interest

Acknowledgements

We are grateful to Joshua Orvis, Anup Mahurkar, and Dave Kemeza for their dedication in pushing the intensive pan-genome computes through the IGS computer grid.

References (20)

D. Medini et al. Microbiology in the post-genomic era. Nat Rev Microbiol (2008)
A. Walker et al. Single-cell genomics. Nat Rev Microbiol (2008)
M.J. Pallen et al. Bacterial pathogenomics. Nature (2007)
H. Tettelin et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial ‘pan-genome’. Proc Natl Acad Sci U S A (2005)
J.S. Hogg et al. Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biol (2007)
K.S. Doran et al. Molecular pathogenesis of neonatal group B streptococcal infection: no longer in its infancy. Mol Microbiol (2004)
A. Schuchat et al. Epidemiology of group B streptococcal disease. Risk factors, prevention strategies, and vaccine development. Epidemiol Rev (1994)
H. Tettelin et al. Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc Natl Acad Sci U S A (2002)
T. Lefebure et al. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition. Genome Biol (2007)
C. Schoen et al. Whole-genome comparison of disease and carriage strains provides insights into virulence evolution in Neisseria meningitidis. Proc Natl Acad Sci U S A (2008)

There are more references available in the full text version of this article.

