  • Technical Report

Robust and scalable inference of population history from hundreds of unphased whole genomes

Abstract

It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing). SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.

Figure 1: The effect of phasing error.
Figure 2: Performance of SMC++ in comparison to MSMC and ∂ai.
Figure 3: SMC++ results for jointly inferring population size histories and divergence times.
Figure 4: Computational performance of SMC++, MSMC and ∂ai.
Figure 5: Results of effective population size inference across eight extant human populations and an ancient Ust'-Ishim individual.
Figure 6: Inference of split times in modern humans.
Figure 7: Results of effective population size inference for two finch species and Drosophila.

Accession codes

Accessions

European Nucleotide Archive

Acknowledgements

We thank J. Pool and C. Langley for helpful comments on our inferred Drosophila demography. We also thank H. Li for providing us with the Ust'-Ishim genome sequence. This research is supported in part by NIH grants R01 GM094402 and R01 GM108805 and by a Packard Fellowship for Science and Engineering (Y.S.S.).

Author information

Contributions

J.T., J.A.K. and Y.S.S. conceived the study, developed the theoretical model and wrote the manuscript. J.T. developed software implementing the method and performed data analysis. J.A.K. contributed benchmarks of ∂ai.

Corresponding author

Correspondence to Yun S Song.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Results of demographic inference when ρ is not known.

Each step plot represents inference on a single simulated data set with sample size n = 50. The colors of the estimated size histories correspond to the ratio of recombination to mutation used in each simulation, which was not known to SMC++ during model fitting. The ratio ranged from 1:10 (black) to 10:1 (light blue). The true demography used for simulation is shown in bold black. The nested scatterplot compares the true versus estimated ratio of recombination to mutation rates. The mutation rate θ/2 was assumed to be known. SMC++ estimates the recombination rate fairly accurately across two orders of magnitude of the recombination-to-mutation ratio, and it is most accurate when the mutation and recombination rates are approximately equal.
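The range of ratios described in this caption can be illustrated with a short sketch. It assumes the simulated ratios were spaced evenly on a log scale and that five values were used; the caption states only the endpoints (1:10 and 10:1), so the spacing, the count, the function name and the value of theta below are all hypothetical.

```python
import math

def log_spaced_ratios(low=0.1, high=10.0, count=5):
    """Return `count` recombination-to-mutation ratios, evenly spaced in log10.

    The endpoints match the caption (1:10 to 10:1); the log spacing and the
    number of points are assumptions made for illustration only.
    """
    step = (math.log10(high) - math.log10(low)) / (count - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(count)]

theta = 1e-4  # hypothetical per-site mutation parameter, treated as known
ratios = log_spaced_ratios()
rhos = [r * theta for r in ratios]  # recombination parameter implied by each ratio
```

Under these assumptions the middle value of the grid is a ratio of 1, the regime in which the caption reports the most accurate estimates.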

Supplementary Figure 2 Results of demographic inference across three African subpopulations.

Supplementary Figure 3 Results of demographic inference across three Asian subpopulations.

Supplementary Figure 4 Results of demographic inference across two European subpopulations.

Supplementary Figure 5 Sensitivity analysis for human demographic inference.

Blue lines are reproduced from Figure 5. Red lines represent the result of randomly downsampling the data to contain 90% of the original set of chromosomes and rerunning the analysis.
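The downsampling step in this sensitivity analysis can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the haplotype labels, the function name and the fixed seed are hypothetical, and the sketch assumes sampling without replacement.

```python
import random

def downsample(chromosome_ids, fraction=0.9, seed=0):
    """Return a random subset holding `fraction` of the IDs, drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = round(fraction * len(chromosome_ids))
    return sorted(rng.sample(chromosome_ids, k))

ids = [f"hap{i}" for i in range(20)]  # hypothetical chromosome labels
subset = downsample(ids)
```

The analysis would then be rerun on the retained chromosomes and the resulting curves compared with those from the full data set.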

Supplementary Figure 6 Results of analyzing non-human species, in generations.

Supplementary Figure 7 Schematic of the differences between PSMC, MSMC and SMC++.

The HMM used in PSMC tracks the hidden TMRCA of a pair of haploid lineages and emits binary symbols based on the heterozygosity of this pair at each block of sites. MSMC tracks the hidden time to first coalescence among several haploid lineages, as well as the identity (denoted by the bolded bars) of the two lineages that coalesce first. It considers as emissions the allelic state of all lineages in the sample. SMC++, like PSMC, tracks the TMRCA in only a pair of individuals and emits 2-tuples whose distribution is given by the conditioned SFS (Section S1).
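The two kinds of emission symbol contrasted in this caption can be sketched in a few lines. The sketch covers only how observations might be encoded, not the HMMs themselves. The block size, the 0/1 allele coding and both function names are hypothetical, and reading the 2-tuple as (derived count in the distinguished pair, derived count in the remaining lineages) is an assumption, since the caption defers the definition to Section S1.

```python
def psmc_style_emissions(hap_a, hap_b, block_size):
    """Emit 1 for each block in which the two haplotypes differ at any site, else 0."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have equal length")
    symbols = []
    for start in range(0, len(hap_a), block_size):
        block_a = hap_a[start:start + block_size]
        block_b = hap_b[start:start + block_size]
        symbols.append(int(any(x != y for x, y in zip(block_a, block_b))))
    return symbols

def two_tuple_emissions(pair_haps, other_haps):
    """Per site: (derived count in the tracked pair, derived count in the rest).

    This encoding is an assumed reading of the caption's "2-tuples".
    """
    n_sites = len(pair_haps[0])
    return [
        (sum(h[i] for h in pair_haps), sum(h[i] for h in other_haps))
        for i in range(n_sites)
    ]

# Toy data: 0 = ancestral allele, 1 = derived allele (hypothetical values).
hap_a = [0, 0, 1, 0, 0, 0, 0, 1]
hap_b = [0, 0, 0, 0, 0, 0, 0, 1]
others = [[1, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0, 0, 1]]

binary = psmc_style_emissions(hap_a, hap_b, block_size=4)
tuples = two_tuple_emissions([hap_a, hap_b], others)
```

In this encoding the pair count is the same whichever of the two haplotypes carries the derived allele, so the summary does not depend on how the pair is phased.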

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7, Supplementary Tables 1–3 and Supplementary Note (PDF 1875 kb)

About this article

Cite this article

Terhorst, J., Kamm, J. & Song, Y. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet 49, 303–309 (2017). https://doi.org/10.1038/ng.3748

  • DOI: https://doi.org/10.1038/ng.3748
