Proteogenomic characterization of human colon and rectal cancer

Abstract

Extensive genomic characterization of human cancers presents the problem of inference from genomic abnormalities to cancer phenotypes. To address this problem, we analysed proteomes of colon and rectal tumours characterized previously by The Cancer Genome Atlas (TCGA) and performed integrated proteogenomic analyses. Somatic variants displayed reduced protein abundance compared with germline variants. Messenger RNA transcript abundance did not reliably predict protein abundance differences between tumours. Proteomics identified five proteomic subtypes in the TCGA cohort, two of which overlapped with the TCGA ‘microsatellite instability/CpG island methylation phenotype’ transcriptomic subtype, but had distinct mutation, methylation and protein expression patterns associated with different clinical outcomes. Although copy number alterations showed strong cis- and trans-effects on mRNA abundance, relatively few of these extended to the protein level. Thus, proteomics data enabled prioritization of candidate driver genes. The chromosome 20q amplicon was associated with the largest global changes at both mRNA and protein levels; proteomics data highlighted potential 20q candidates, including HNF4A (hepatocyte nuclear factor 4, alpha), TOMM34 (translocase of outer mitochondrial membrane 34) and SRC (SRC proto-oncogene, non-receptor tyrosine kinase). Integrated proteogenomic analysis provides functional context to interpret genomic abnormalities and affords a new paradigm for understanding cancer biology.

Figure 1: Summary of detected single amino acid variants (SAAVs) and the impact of single nucleotide variants (SNVs) on protein abundance.
Figure 2: Correlations between mRNA and protein abundance in TCGA tumours.
Figure 3: Effects of copy number alterations on mRNA and protein abundance.
Figure 4: Proteomic subtypes of colon and rectal cancers, associated genomic features, and relative abundance of HNF4α.

Acknowledgements

This work was supported by National Cancer Institute (NCI) CPTAC awards U24CA159988, U24CA160035, and U24CA160034; by NCI SPORE award P50CA095103 and NCI Cancer Center Support Grant P30CA068485; by National Institutes of Health grant GM088822; and by contract 13XS029 from Leidos Biomedical Research, Inc. Genomics data for this study were generated by The Cancer Genome Atlas pilot project established by the NCI and the National Human Genome Research Institute. Information about TCGA and the investigators and institutions comprising the TCGA research network can be found at http://cancergenome.nih.gov/.

Author information

Contributions

B.Z., R.J.C.S., D.L.T., L.J.Z. and D.C.L. designed the proteomic analysis experiments, data analysis workflow, and proteomic–genomic data comparisons. K.F.S., L.J.Z., R.J.C.S. and D.C.L. directed and performed proteomic analysis of colon tumour and quality control samples. J.W., X.W., J.Z., Q.L., Z.S., P.W., S.W., R.J.C.S. and B.Z. performed proteomic–genomic data analyses. M.C.C., S.K., R.J.C.S. and D.L.T. performed analyses of mass spectrometry data and adapted algorithms and software for data analysis. S.R.D., R.R.T. and M.J.C.E. developed and prepared breast xenografts used as quality control samples. S.A.C., K.F.S. and D.C.L. designed the strategy for quality control analyses. R.J.C.S., C.R.K., R.C.R. and H.R. coordinated acquisition, distribution and quality control evaluation of TCGA tumour samples. B.Z., J.W., R.J.C.S., R.J.C. and D.C.L. interpreted data in the context of colon cancer biology. B.Z., R.J.C.S. and D.C.L. wrote the manuscript.

Corresponding author

Correspondence to Daniel C. Liebler.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

All of the primary mass spectrometry data on TCGA tumour samples are deposited at the CPTAC Data Coordinating Center as raw and mzML files for public access (https://cptac-data-portal.georgetown.edu).

Extended data figures and tables

Extended Data Figure 1 Mass-spectrometry-based proteomics workflow.

Protein was extracted from frozen tumour tissue and used to generate tryptic digests. The resulting tryptic peptides were fractionated using off-line basic reverse-phase (high-pressure) liquid chromatography (basic RPLC). Collected fractions were pooled and used for reverse-phase HPLC in line with a Thermo Orbitrap-Velos MS instrument. Raw data were processed by MSConvert and then used for database and spectral library searching using three different search engines (MyriMatch, Pepitome and MS-GF+). Identified peptides were assembled using IDPicker 3 with selected filters as described in the methods. IDPicker 3 stores its protein assemblies for a specified set of filters in the idpDB format. These SQLite databases associate spectra with peptides, peptides with proteins, and LC-MS/MS experiments with a hierarchy of experiments.
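The relational layout described in this legend can be pictured with a toy SQLite database. The table and column names below are hypothetical stand-ins chosen for illustration and are not the actual idpDB schema; the sketch only shows how spectra, peptides and proteins might be joined to obtain per-protein spectral counts.

```python
import sqlite3

# Hypothetical, simplified schema (NOT the real idpDB layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE peptide (id INTEGER PRIMARY KEY, sequence TEXT);
CREATE TABLE peptide_protein (peptide_id INTEGER, protein_id INTEGER);
CREATE TABLE psm (spectrum_id INTEGER PRIMARY KEY, peptide_id INTEGER);
""")

# Invented toy rows: two proteins, three peptides, four matched spectra.
conn.executemany("INSERT INTO protein VALUES (?, ?)",
                 [(1, "PROT_A"), (2, "PROT_B")])
conn.executemany("INSERT INTO peptide VALUES (?, ?)",
                 [(1, "PEPTIDEK"), (2, "SAMPLER"), (3, "ANOTHERK")])
conn.executemany("INSERT INTO peptide_protein VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2)])
conn.executemany("INSERT INTO psm VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 2), (13, 3)])

# Spectral count per protein: spectra -> peptides -> proteins.
rows = conn.execute("""
SELECT pr.accession, COUNT(*) AS spectral_count
FROM psm
JOIN peptide_protein pp ON pp.peptide_id = psm.peptide_id
JOIN protein pr ON pr.id = pp.protein_id
GROUP BY pr.accession
ORDER BY pr.accession
""").fetchall()
spectral_counts = dict(rows)
print(spectral_counts)
```

In this toy layout a peptide shared between proteins would be counted once per protein; how shared peptides are actually resolved during protein assembly is not covered here.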

Extended Data Figure 2 Relaxing the false discovery rate of peptide-spectrum matches for high-confidence proteins increases spectral counts.

To increase spectral counts and improve statistical comparisons, we first created a protein assembly that maximized the number of proteins identified (at 0.1% peptide-spectrum match false discovery rate (PSM FDR)) and then relaxed the PSM FDR to 1% exclusively for the set of confidently identified proteins. This strategy led to increased spectral counts from 4,896,831 to 6,299,756, a 29% increase. a, Spectral count plot of all 7,526 confidently identified proteins demonstrates the increase in the absolute number of spectra identified for each protein, but no decrease for any of the proteins. Each dot in the figure represents one of the 7,526 proteins; x axis and y axis represent the spectral counts obtained in the data sets with 0.1% and 1% PSM FDR, respectively, both plotted on a log scale. b, Density plot showing the distribution of PSM FDR scores for all rescued PSMs. Rescued PSMs are of high quality with a median PSM FDR score of less than 0.2%, indicating the maintained integrity of the data set.
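The two-stage filtering idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the PSM records and scores are invented, and "confident protein" is simplified to "has at least one PSM at the strict threshold", whereas the real protein assembly applies further filters described in the methods. The final lines check the stated 29% figure from the reported totals.

```python
from collections import Counter

# Toy PSMs: (protein, PSM-level FDR score). Values are invented for illustration.
psms = [
    ("P1", 0.0005), ("P1", 0.004), ("P1", 0.008),
    ("P2", 0.0009), ("P2", 0.02),
    ("P3", 0.006),  # never passes the strict threshold, so it is never rescued
]

STRICT_FDR = 0.001   # 0.1% PSM FDR, used to define the confident protein set
RELAXED_FDR = 0.01   # 1% PSM FDR, applied only to those confident proteins


def two_stage_counts(psms, strict=STRICT_FDR, relaxed=RELAXED_FDR):
    """Return spectral counts at the strict threshold and after rescue."""
    confident = {prot for prot, q in psms if q <= strict}
    strict_counts = Counter(prot for prot, q in psms if q <= strict)
    rescued_counts = Counter(
        prot for prot, q in psms if q <= relaxed and prot in confident
    )
    return strict_counts, rescued_counts


strict_counts, rescued_counts = two_stage_counts(psms)

# Totals reported in the legend, used to check the stated percentage.
increase = (6_299_756 - 4_896_831) / 4_896_831
print(dict(strict_counts), dict(rescued_counts), f"{increase:.1%}")
```

Because the relaxed set is a superset of the strict set for every confident protein, counts can only stay the same or increase, which matches the behaviour described for panel a.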

Extended Data Figure 3 Read mapping, exon coverage and missense somatic variants in RNA-seq data.

a, Summary of total RNA-Seq read counts and mapping results for individual samples. b, Distribution of percentage sequence coverage in exons for individual samples. Among all 228,157 exons, 76% were expressed, but only 64% had an average coverage greater than 1. Exons with no coverage were not included in the box plots. c, Number of missense somatic variants detected by RNA-seq in individual samples. Approximately 54% of the mutation positions were covered by RNA-seq reads and only 43% were covered by three or more reads.

Extended Data Figure 4 Parallel-reaction-monitoring validation results.

Single amino acid variants (SAAVs) identified in the TCGA shotgun data set were validated using parallel-reaction-monitoring (PRM) analyses. Three distinct SAAVs in four TCGA samples were selected for validation. The TCGA samples were freshly prepared in the same manner as the original samples analysed by shotgun proteomics. Each sample was spiked with 12.5 fmol μl⁻¹ of a mixture of all isotopically labelled peptides. Using an inclusion list containing the precursor m/z values representing both unlabelled (endogenous) and labelled peptides, each fraction was analysed by PRM for the variant peptides. This figure shows the PRM data for the variant sequence LVVVGADGVGK (KRAS(Gly12Asp)) in TCGA-AA-3818. a, The MS/MS spectrum identified in the initial shotgun analyses. b, The annotated MS/MS spectrum of the unlabelled endogenous variant peptide in the PRM analysis. c, The annotated MS/MS spectrum of the spiked, labelled peptide in the PRM analysis. d, The chromatographic trace showing the overlapping transitions and retention time of the endogenous variant peptide. e, The chromatographic trace showing the overlapping transitions and retention time of the labelled variant peptide.
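To make the inclusion list concrete, the precursor m/z of the variant peptide can be computed from monoisotopic residue masses. This is a sketch under two assumptions that the legend does not state: a doubly charged precursor, and a heavy C-terminal lysine (13C6, 15N2) as the isotopic label.

```python
# Monoisotopic residue masses (Da) for the amino acids in this peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "V": 99.06841,
    "L": 113.08406, "D": 115.02694, "K": 128.09496,
}
WATER = 18.010565    # added once per peptide for the free termini
PROTON = 1.007276    # mass of each charge-carrying proton

# Assumption: heavy lysine (13C6, 15N2); the legend does not name the label.
HEAVY_K_SHIFT = 8.014199


def precursor_mz(sequence, charge, label_shift=0.0):
    """m/z of a peptide ion with the given charge and optional label mass shift."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + label_shift
    return (neutral + charge * PROTON) / charge


peptide = "LVVVGADGVGK"  # KRAS(Gly12Asp) variant peptide from the legend
inclusion_list = {
    "endogenous_2+": precursor_mz(peptide, 2),                # assumed charge 2+
    "labelled_2+": precursor_mz(peptide, 2, HEAVY_K_SHIFT),   # assumed label
}
for name, mz in inclusion_list.items():
    print(f"{name}: {mz:.4f}")
```

With these assumptions the endogenous and labelled precursors differ by about 4.007 m/z units at charge 2+, which is what lets the two forms be monitored separately while co-eluting.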

Extended Data Figure 5 Platform evaluation and analysis method selection using quality control samples.

a, The lower-left half (uncoloured) depicts pairwise scatter plots of the samples, with x and y axes representing quantile-normalized spectral counts for samples in corresponding columns and rows, respectively. The upper-right half (red) depicts pairwise Spearman’s correlation coefficients for the same comparisons. b, For each normalization method (none, global, quantile and normalized spectral abundance factor (NSAF)), we calculated the intraclass correlation coefficients (ICCs) for individual proteins in the quality control data set. The analysis was done for the top 1,000, 500 or 100 proteins with the largest variance and the cumulative fraction curves were plotted. In most scenarios, quantile normalization generated slightly higher ICC scores than global normalization, and both methods clearly outperformed the NSAF normalization. c, We sorted all proteins in the quality control data set based on their total spectral counts and then divided the proteins into 10 bins with equal numbers of proteins. Average spectral count ranges for each bin are shown in the brackets in the legend box. For each bin, we calculated the ICCs for individual proteins in the bin. The analysis was done for the top 300, 200 or 100 proteins with the largest variance in each bin. The cumulative fraction curves were plotted. Protein bins with average spectral counts less than 1.4 showed clearly lower ICC scores, whereas the ICC score curves started to converge when the average spectral count was greater than 1.4.
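Quantile normalization, one of the methods compared in panel b, can be sketched as follows. This is a generic textbook version written for illustration, not the authors' implementation; the counts are invented, and ties are resolved by sort order rather than by averaging.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples given as equal-length lists (one list per sample)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by increasing value
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference quantile
        normalized.append(out)
    return normalized


# Toy spectral counts for four proteins in three samples (invented values).
samples = [[5, 2, 3, 4], [4, 1, 3, 2], [3, 4, 6, 8]]
normalized = quantile_normalize(samples)
for col in normalized:
    print(col)
```

After normalization every sample has the same set of values, so only the within-sample ranking of proteins carries information.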

Extended Data Figure 6 Extended data for mRNA–protein correlation analysis.

a, Evaluation of the length bias in different RNA-Seq-based gene abundance estimation methods. The plot shows the distribution of correlation between gene length and estimated transcript abundance based on FPKM (fragments per kilobase of exon per million fragments mapped, blue curve) and RSEM (RNA-seq expectation maximization, red curve), respectively. The FPKM measure is independent of gene length, whereas the RSEM measure strongly correlates with gene length. b, Relationship between mRNA–protein correlation and the stability of the molecules. Human genes were separated into four categories based on the mRNA and protein half-lives of their mouse orthologues: stable mRNA–stable protein, stable mRNA–unstable protein, unstable mRNA–stable protein, and unstable mRNA–unstable protein. The distribution of mRNA–protein correlations for genes in each category is shown in the box plots. Genes with stable mRNA and stable protein showed relatively higher mRNA–protein correlation whereas those with unstable mRNA and unstable protein showed relatively lower mRNA–protein correlation. Only common genes in both our study and the mouse study were included in the analysis. The total number of genes in each category (N) is labelled in the figure. The P value indicating correlation difference among the four categories was calculated based on the Kruskal–Wallis non-parametric analysis of variance (ANOVA) test. The P value indicating correlation difference between the stable mRNA–stable protein group and the unstable mRNA–unstable protein group was calculated based on the two-sided Wilcoxon rank-sum test.
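The length normalization built into FPKM follows directly from its definition and can be shown with a short calculation. The gene lengths and fragment counts below are invented; the point is only that two genes with the same fragment density per base receive the same FPKM despite a fourfold difference in raw counts.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


TOTAL_MAPPED = 20_000_000  # invented library size

# Two hypothetical genes with identical coverage per base but different lengths.
fpkm_short = fpkm(fragments=100, exon_length_bp=1_000, total_mapped_fragments=TOTAL_MAPPED)
fpkm_long = fpkm(fragments=400, exon_length_bp=4_000, total_mapped_fragments=TOTAL_MAPPED)
print(fpkm_short, fpkm_long)
```

A count-like abundance measure without the division by length would instead scale with gene length, which is the kind of bias panel a evaluates.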

Extended Data Figure 7 mRNA and protein-level cis-effect of copy number alterations in focal amplification, focal deletion and non-focal regions.

The figure plots cumulative fraction curves of copy number alteration (CNA)–mRNA (dashed lines) and CNA–protein (solid lines) expression correlations for genes in the focal amplification regions (red), focal deletion regions (green), and non-focal regions (blue), respectively. Focal alteration regions were defined in the TCGA study. Any chromosomal regions outside the focal amplification and deletion regions were considered as non-focal regions. CNA–mRNA correlations were significantly higher than CNA–protein correlations for genes in any of the three groups. Moreover, genes in the focal amplification regions showed the highest level of CNA–mRNA and CNA–protein correlations among the three groups of genes. P values were based on the two-sided Kolmogorov–Smirnov test.
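The comparison of two cumulative fraction curves rests on the Kolmogorov–Smirnov statistic, the largest vertical gap between two empirical cumulative distribution functions. The sketch below computes only that statistic on invented correlation values; it does not compute a P value.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: maximum gap between the two ECDFs."""
    points = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        # Fraction of observations less than or equal to x.
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


# Invented per-gene correlations, for illustration only.
cna_mrna = [0.2, 0.4, 0.5, 0.6, 0.8]
cna_protein = [0.0, 0.1, 0.2, 0.3, 0.5]
d = ks_statistic(cna_mrna, cna_protein)
print(round(d, 3))
```

A larger statistic means the two curves are further apart; in the figure this corresponds to the horizontal shift between the dashed (mRNA) and solid (protein) lines.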

Extended Data Figure 8 HNF4α isoforms and the effect of HNF4A shRNA on the proliferation of colon cancer cells.

a, Multiple sequence alignment of the HNF4α isoforms, with peptides detected by shotgun proteomics and sequences corresponding to the shRNA target sequences highlighted. Different colours of the letters indicate different levels of sequence coverage in the shotgun proteomics study, as indicated by the colour scale bar. Yellow boxes highlight sequences corresponding to the shRNA target sequences. TRCN0000019193 specifically targets P1 promoter-driven isoforms, whereas the other two target both types of isoforms. b–d, The P1-HNF4α-specific shRNA showed mixed impacts (b), whereas shRNAs simultaneously targeting both P1 and P2 HNF4α showed a primarily negative impact on cell proliferation (c, d). Moreover, a stronger negative impact was associated with increased copy number, both for the P1-HNF4α-specific shRNA (P = 0.04, Spearman’s correlation (r)) and for all shRNAs (P = 0.01, Spearman’s correlation P values for individual shRNAs summarized by the Fisher’s combined probability test).
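Fisher's combined probability test summarizes k independent P values through the statistic −2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below uses the closed-form survival function available for even degrees of freedom; the input P values are invented and are not the per-shRNA values from this study.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # For 2k degrees of freedom the chi-square survival function is
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Invented example: three modest P values combine into a smaller one.
combined = fisher_combined_p([0.04, 0.10, 0.08])
print(f"{combined:.4f}")
```

With a single input the function returns that P value unchanged, which is a convenient sanity check on the formula.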

Extended Data Figure 9 Consensus matrices, the empirical cumulative distribution function plot and core sample identification.

a, Consensus matrices of the 90 CRC samples for k = 2 to k = 8. The consensus matrices show the robustness of the discovered clusters to sampling variability (resampling 80% of samples) for cluster numbers k = 2 to 8. In each consensus matrix, both the rows and the columns were indexed with the same sample order, and samples belonging to the same cluster frequently are adjacent to each other. For each pair of samples, a consensus index, which is the percentage of times they belong to the same cluster during 1,000 runs of the clustering algorithm based on resampling, was calculated. The consensus index for each pair of samples was represented by a colour gradient from white (0%) to red (100%) in the consensus matrix. b, Cumulative distribution function (CDF) plots corresponding to the consensus matrices for k = 2 to k = 8. This plot shows the cumulative distribution of the entries of the consensus matrices within the 0–1 range. Skew towards 0 and 1 indicates good clustering. As k increases, the area under the CDF is hypothesized to increase markedly until k reaches k_true. In this case, 7 was considered as k_true because the change of the area under the CDF was close to zero when k increased from 7 to 8. c, Silhouette plot for core sample identification. For each sample (y axis), the silhouette width (x axis) compares its similarity to its assigned class and to any other classes. Samples with higher similarity to their assigned class than to any other class receive a positive silhouette width score and were selected as core samples.
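The consensus index in panel a can be sketched with a small resampling loop. Two assumptions are made here: the index is computed as the fraction of runs in which both samples were drawn that also placed them in the same cluster, and the clustering step is replaced by a hypothetical threshold rule (`toy_cluster`) purely so the example is self-contained.

```python
import random
from itertools import combinations


def consensus_matrix(n_samples, cluster_fn, n_runs=1000, fraction=0.8, seed=0):
    """Pairwise consensus index from repeated clustering of random subsets."""
    rng = random.Random(seed)
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]
    size = max(2, round(fraction * n_samples))
    for _ in range(n_runs):
        subset = sorted(rng.sample(range(n_samples), size))
        labels = cluster_fn(subset)  # dict: sample index -> cluster label
        for i, j in combinations(subset, 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [
            1.0 if i == j
            else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
            for j in range(n_samples)
        ]
        for i in range(n_samples)
    ]


# Hypothetical stand-in for a real clustering algorithm: split on a 1-D value.
values = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95, 0.85, 0.05, 0.12, 0.99]


def toy_cluster(subset):
    return {i: int(values[i] > 0.5) for i in subset}


matrix = consensus_matrix(len(values), toy_cluster, n_runs=200)
print(matrix[0][1], matrix[0][3])
```

Because the toy rule is deterministic, every entry is exactly 0 or 1, the idealized case of a CDF fully skewed towards the extremes; a real clustering algorithm on noisy data would produce intermediate values.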

Extended Data Figure 10 Network analysis of the subtype signature proteins.

a, The number of signature proteins for each subtype. For a given subtype, the red circle represents proteins that were different in abundance between the subtype and all other subtypes; the green circle represents proteins that were different in abundance between the subtype and normal colon tissues. The intersection between red and green circles contains the signature proteins for each subtype. b, Visualizing subtype-C-signature proteins in NetGestalt. Proteins in the iRef protein–protein interaction network are placed in a linear order together with the hierarchical modular organization of the network. Alternating bar colours (green and orange) are used to distinguish neighbouring modules. Proteins in the up and down signatures of subtype C were visualized as two separate tracks below the network modules, where each bar represents a protein. These proteins are not randomly distributed in the network. Highlighted by red or blue arrows are four network modules (I, IV, V, VI) significantly enriched with up-signature proteins and two modules (II and III) significantly enriched with down-signature proteins (adjusted P value < 0.01). c, d, Heat maps depicting relative abundance of down- and up-signature proteins of subtype C in modules III and I, respectively. Tumours are displayed as rows, grouped by normal controls (N) and proteomic subtypes (A–E) as indicated by different side bar colours. Proteins are displayed as columns. e, f, Network diagrams depicting the interaction of down- and up-signature proteins of subtype C in modules III and I, respectively. Node and node-border colours represent relatively higher or lower abundance in the subtype compared to other subtypes and normal colon tissues, respectively. Red and blue in the heat maps and the network diagrams represent relatively higher or lower abundance, respectively.
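Module enrichment of the kind described in panel b is commonly assessed with a one-sided hypergeometric test. The legend does not state which test or multiple-testing adjustment was used, so the sketch below is an assumption for illustration, with invented numbers, and it returns an unadjusted P value.

```python
from math import comb


def hypergeom_enrichment_p(network_size, signature_size, module_size, overlap):
    """P(X >= overlap) for the number of signature proteins falling in a module.

    Assumes the module is a random draw of module_size proteins from a network
    of network_size proteins, signature_size of which are signature proteins.
    """
    upper = min(signature_size, module_size)
    numerator = sum(
        comb(signature_size, k) * comb(network_size - signature_size, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / comb(network_size, module_size)


# Invented example: 12 of 40 module proteins are in a 300-protein signature,
# within a network of 8,000 proteins.
p_value = hypergeom_enrichment_p(8000, 300, 40, 12)
print(f"{p_value:.3e}")
```

In practice such a P value would be computed for every module and then adjusted for multiple testing before applying a cut-off such as the 0.01 threshold quoted in the legend.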

Supplementary information

Supplementary Information

This file contains Supplementary Tables 1-15. (XLSX 13426 kb)

About this article

Cite this article

Zhang, B., Wang, J., Wang, X. et al. Proteogenomic characterization of human colon and rectal cancer. Nature 513, 382–387 (2014). https://doi.org/10.1038/nature13438

  • DOI: https://doi.org/10.1038/nature13438
