Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Letter
  • Published:

Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

Abstract

We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Figure 1: Quantification of transcript abundance from RNA-seq experiments.
Figure 2: Problems with current transcript abundance estimation methods.
Figure 3: Modeling and correcting fragment sequence bias.

Similar content being viewed by others

Accession codes

Accessions

Gene Expression Omnibus

NCBI Reference Sequence

References

  1. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

    Article  CAS  Google Scholar 

  2. Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A. & Dewey, C.N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).

    Article  Google Scholar 

  3. 't Hoen, P.A. et al. Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31, 1015–1022 (2013).

    Article  CAS  Google Scholar 

  4. Su, Z. & SEQC/MAQC-III Consortium A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol. 32, 903–914 (2014).

    Article  CAS  Google Scholar 

  5. Li, S. et al. Detecting and correcting systematic variation in large-scale RNA sequencing data. Nat. Biotechnol. 32, 888–895 (2014).

    Article  CAS  Google Scholar 

  6. Hansen, K.D., Irizarry, R.A. & Wu, Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13, 204–216 (2012).

    Article  Google Scholar 

  7. Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-content normalization for RNA-Seq data. BMC Bioinformatics 12, 480 (2011).

    Article  CAS  Google Scholar 

  8. Zheng, W., Chung, L.M. & Zhao, H. Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics 12, 290 (2011).

    Article  CAS  Google Scholar 

  9. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).

    Article  CAS  Google Scholar 

  10. Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).

    Article  CAS  Google Scholar 

  11. Leek, J.T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).

    Article  Google Scholar 

  12. Li, J., Jiang, H. & Wong, W.H. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 11, R50 (2010).

    Article  Google Scholar 

  13. Hansen, K.D., Brenner, S.E. & Dudoit, S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131 (2010).

    Article  Google Scholar 

  14. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12, R22–R14 (2011).

    Article  CAS  Google Scholar 

  15. Nicolae, M., Mangul, S., Maˇndoiu, I.I. & Zelikovsky, A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol. Biol. 6, 9 (2011).

    Article  Google Scholar 

  16. Li, W. & Jiang, T. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28, 2914–2921 (2012).

    Article  CAS  Google Scholar 

  17. Lahens, N.F. et al. IVT-seq reveals extreme bias in RNA sequencing. Genome Biol. 15, R86 (2014).

    Article  Google Scholar 

  18. Hayer, K.E., Pizarro, A., Lahens, N.F., Hogenesch, J.B. & Grant, G.R. Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinformatics 31, 3938–3945 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  19. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).

    Article  CAS  Google Scholar 

  20. Aird, D. et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12, R18 (2011).

    Article  CAS  Google Scholar 

  21. Benjamini, Y. & Speed, T.P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).

    Article  CAS  Google Scholar 

  22. Li, J.J., Jiang, C.R., Brown, J.B., Huang, H. & Bickel, P.J. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc. Natl. Acad. Sci. USA 108, 19867–19872 (2011).

    Article  CAS  Google Scholar 

  23. Hron, T., Pajer, P., Pačes, J., Baru˚tneˇk, P. & Elleder, D. Hidden genes in birds. Genome Biol. 16, 164 (2015).

    Article  Google Scholar 

  24. Patro, R., Mount, S.M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462–464 (2014).

    Article  CAS  Google Scholar 

  25. Bray, N.L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).

    Article  CAS  Google Scholar 

  26. Patro, R., Duggal, G. & Kingsford, C. Accurate, fast, and model-aware transcript expression quantification with Salmon. bioRxiv (2015).

  27. Katz, Y., Wang, E.T., Airoldi, E.M. & Burge, C.B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).

    Article  CAS  Google Scholar 

  28. Frazee, A.C., Jaffe, A.E., Langmead, B. & Leek, J.T. Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics 31, 2778–2784 (2015).

    Article  CAS  Google Scholar 

  29. Li, S. et al. Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nat. Biotechnol. 32, 915–925 (2014).

    Article  Google Scholar 

  30. Katz, Y. et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics 31, 2400–2402 (2015).

    Article  CAS  Google Scholar 

  31. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

    Article  CAS  Google Scholar 

  32. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

    Article  Google Scholar 

  33. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).

    Article  CAS  Google Scholar 

  34. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).

    Article  CAS  Google Scholar 

  35. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).

    Article  CAS  Google Scholar 

Download references

Acknowledgements

The authors are grateful for helpful suggestions from Y. Benjamini, W. Huber, N. Lahens, L. Pinello, C. Meyer, R. Patro, Z. Xu, and Y. Li. M.I.L. was supported by NIH grant 5T32CA009337-35. J.B.H. was supported by NIH R01 grant HG005220, the National Institute of Neurological Disorders and Stroke (5R01NS054794-08 to J.B.H.), the Defense Advanced Research Projects Agency (DARPA-D12AP00025, to John Harer, Duke University). R.A.I. was supported by NIH R01 grant HG005220.

Author information

Authors and Affiliations

Authors

Contributions

M.I.L. and R.A.I. designed the method. M.I.L., J.B.H., and R.A.I. wrote the manuscript.

Corresponding author

Correspondence to Rafael A Irizarry.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–5 and Supplementary Figures 1–25 (PDF 6827 kb)

Supplementary Note (PDF 376 kb)

Supplementary Code (ZIP 30 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Love, M., Hogenesch, J. & Irizarry, R. Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 34, 1287–1291 (2016). https://doi.org/10.1038/nbt.3682

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/nbt.3682

This article is cited by

Search

Quick links

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics