Life Science Alliance
Research Article
Transparent Process
Open Access

Statistical guidelines for quality control of next-generation sequencing techniques

Maximilian Sprang, Matteo Krüger, Miguel A Andrade-Navarro, Jean-Fred Fontaine
Maximilian Sprang
Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Mainz, Germany
Roles: Formal analysis, Visualization, Methodology, Writing—original draft, review, and editing
Matteo Krüger
Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Mainz, Germany
Roles: Formal analysis, Visualization, Methodology
Miguel A Andrade-Navarro
Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Mainz, Germany
Roles: Supervision, Funding acquisition, Methodology, Project administration, Writing—original draft, review, and editing
Jean-Fred Fontaine
Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Mainz, Germany
Roles: Conceptualization, Data curation, Formal analysis, Supervision, Investigation, Visualization, Methodology, Writing—original draft, review, and editing
  • For correspondence: fontaine@uni-mainz.de
Published 30 August 2021. DOI: 10.26508/lsa.202101113

Article Figures & Data

Figures

  • Figure 1.
    Figure 1. Distribution of low- and high-quality files for different experimental parameters in the dataset.

    (A, B, C) Distribution of files by organism, biological sample type, and assay in the dataset. (D, E, F) Distribution of files by ChIP protein, ChIP antibody and biological sample (only the 10 files with the biggest difference between high- and low-quality are shown). (G, H, I) Distribution of files by ChIP protein, ChIP antibody and biological sample (only 10 most frequent annotations in low-quality files shown). There was a total of 269 proteins, 349 antibodies and 212 biological sample types.

  • Figure 2.
    Figure 2. Methods to identify features of possible importance towards assessing the quality of a file.

    High- and low-quality files are defined by their ENCODE status (released and revoked, respectively). The raw FastQ files are used as input for the FastQC and Bowtie2 tools, which return the RAW and MAP feature sets, respectively. The mapped reads from Bowtie2 are used as input for the ChIPseeker and ChIPpeakAnno packages, which return the LOC and TSS feature sets, respectively.

  • Figure 3.
    Figure 3. Comparison of the manually curated quality and the guidelines given by ENCODE for high-quality files.

    Horizontal dashed lines mark the minimal values given as guidelines. Guidelines and files are all from ENCODE version 3 (2013–2018). (A) Number of aligned reads for H3K9me3 ChIP-Seq and RNA-Seq. The blue dashed line denotes the 45 million reads given as the minimal guideline for H3K9me3 ChIP-Seq, and the red line denotes the 30 million reads given as the minimal guideline for RNA-Seq. (B) Uniquely mapped reads for DNase-Seq data. The orange dashed line indicates the ENCODE guideline of 20 million reads. (C) Usable fragments for narrow and broad peak ChIP-Seq. The blue dashed line is at 45 million and the red dashed line at 20 million, indicating ENCODE's guidelines for broad and narrow peak data, respectively.

  • Figure 4.
    Figure 4. Six features used for quality assessment in the Cistrome database.

    (A) From left to right: FastQC’s raw sequence median quality score (5), uniquely mapped reads ratio of BWA’s mapping (7), PCR bottleneck coefficient (PBC), fraction of reads in peaks (FRiP) (17), proportion of the 500 most significant peaks overlapping with a union of DNase-seq peaks derived from ENCODE files (PeaksUnionDHSRatio) (20, 34), and number of peaks called by MACS2 with a fold change above 10 (PeaksFoldChangeAbove10). The thresholds for the features are 25, 0.6, 0.8, 0.01, 0.7, and 500, respectively. The x-axis shows how many quality flags indicated low quality, excluding the flag of the feature being represented. Boxplots do not show outlier points. (B) Pairwise Pearson’s correlation coefficients of Cistrome’s features. (C) Pearson’s, Spearman’s, and Kendall’s correlation coefficients of each Cistrome feature with the number of low-quality flags of the other features.
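The x-axis quantity described in panel A (the number of low-quality flags raised by the other features) can be sketched as follows. This is a minimal illustration, not Cistrome's implementation: the dictionary keys are hypothetical names, only three of the six thresholds are used for brevity, and treating a value below its threshold as a low-quality flag is an assumption.

```python
from typing import Dict

# Three of the six thresholds from the caption, for brevity.
# The keys are hypothetical labels, not Cistrome field names.
THRESHOLDS: Dict[str, float] = {
    "uniquely_mapped_ratio": 0.6,
    "pbc": 0.8,
    "frip": 0.01,
}


def low_quality_flags(values: Dict[str, float],
                      thresholds: Dict[str, float] = THRESHOLDS) -> Dict[str, bool]:
    """Flag a feature as low quality when its value is below the threshold (assumed direction)."""
    return {name: values[name] < cutoff for name, cutoff in thresholds.items()}


def flags_excluding(values: Dict[str, float], feature: str,
                    thresholds: Dict[str, float] = THRESHOLDS) -> int:
    """Count low-quality flags raised by every feature except `feature`."""
    flags = low_quality_flags(values, thresholds)
    return sum(flag for name, flag in flags.items() if name != feature)


# Made-up values for one file, for illustration only.
sample = {"uniquely_mapped_ratio": 0.55, "pbc": 0.9, "frip": 0.005}
print(flags_excluding(sample, "pbc"))   # flags from the two other features
print(flags_excluding(sample, "frip"))  # flags from the two other features
```

Excluding the plotted feature's own flag keeps the x-axis independent of the value shown on the y-axis, which is what makes the correlations in panel C meaningful.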

  • Figure 5.
    Figure 5. Comparing features in custom subsets using the dashboard.

    The data were filtered for polyA plus RNA-seq files, a higher-level data subset than the subsets defined for groups A, B, and C. Two example features were selected: per sequence quality score (top row) and multiple mapping (bottom row). On the left-hand side, boxplots show the distributions of values for high- (blue) and low-quality (orange) files for the respective features. On the right-hand side, the histograms of values are shown. Using the dashboard, we can conclude that the multiple mapping feature is more powerful than the per sequence quality score at differentiating files by quality.

  • Figure S1.
    Figure S1. Antibody selection by quality.

    The graphic shows the distribution of next-generation sequencing files obtained in human ChIP-Seq assays with H3K4me3 and H3K27me3 as targets, using the antibody shown on the y-axis. For H3K4me3 the choice is quite clear, since the two most abundant antibodies have produced only good-quality files. H3K27me3 shows that even frequently used antibodies such as ENCAB036YAO led only to bad-quality files. It would be interesting to see whether these files all come from one batch or from the same biological sample, or whether there is a genuine problem with this antibody and target combination. Indeed, ENCAB036YAO was used in spleen, ascending aorta, and Peyer’s patch tissues. Its more successful competitor ENCAB000ANB was used with primary cells and cell lines.

  • Figure 6.
    Figure 6. Top features most often significant to differentiate files by quality in each of the three subset groups.

    For each of the three subset groups (A, B, and C), the number of subsets (y-axis) in which a quality feature is significant is shown. Subsets have a minimum of 10 files and the P-value cutoff for significance is defined as either 0.01 (top row), 0.001 (middle row) or 0.0001 (bottom row). The total number of subsets in each subset group is 436 for subset group A, 354 for B and 461 for C. The number of subsets with at least 10 files in each subset group is 38 for subset group A, 41 for B and 33 for C.
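The counting behind each bar can be sketched as below. This is an illustration under stated assumptions: the input layout (a list of dictionaries with `n_files` and `p_values`) is hypothetical, the P-values are made up, and the caption does not say which statistical test produced them or whether the cutoff is strict.

```python
from collections import Counter

MIN_FILES = 10  # minimum subset size stated in the caption


def count_significant(subsets, cutoff, min_files=MIN_FILES):
    """Count, per feature, the subsets in which its P-value is below `cutoff`.

    `subsets` is a list of dicts with keys "n_files" and "p_values"
    ({feature: p}); this layout is hypothetical.
    """
    counts = Counter()
    for subset in subsets:
        if subset["n_files"] < min_files:
            continue  # subsets with fewer than 10 files are ignored
        for feature, p in subset["p_values"].items():
            if p < cutoff:  # strict inequality is an assumption
                counts[feature] += 1
    return counts


# Made-up P-values for illustration only.
subsets = [
    {"n_files": 24, "p_values": {"MAP_SE_multiple": 0.00005, "RAW_Adapter_Content": 0.2}},
    {"n_files": 12, "p_values": {"MAP_SE_multiple": 0.004, "RAW_Adapter_Content": 0.008}},
    {"n_files": 6, "p_values": {"MAP_SE_multiple": 0.00001, "RAW_Adapter_Content": 0.00001}},
]

for cutoff in (0.01, 0.001, 0.0001):  # the three rows of the figure
    print(cutoff, dict(count_significant(subsets, cutoff)))
```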

  • Figure S2.
    Figure S2. Ranking of features in group 1 subsets.

    Y-axis: number of files for which a feature is significant (false discovery rate < 0.01). The top 10 features are dominated by MAP. The LOC features follow closely and most are significant in all but one subset, which differs markedly from the smaller subsets A, B, and C. Results are similar when decreasing the false discovery rate threshold to 0.001 (data not shown).

  • Figure S3.
    Figure S3. Ranking of features in group 2 subsets.

    Y-axis: number of files for which a feature is significant (false discovery rate < 0.01). The top 10 features contain features of MAP, RAW, and LOC. The best performing is a LOC feature. Results are similar when decreasing the false discovery rate threshold to 0.001 (data not shown).

  • Figure 7.
    Figure 7. Decision trees derived with the CART algorithm.

    The Gini criterion was used and a maximum depth of three was set. The two decision trees relate to human single-ended (SE) H3K4me1 ChIP-seq (left-hand side; p58 in Supplemental Data 1) and human SE CTCF ChIP-seq (right-hand side; p64 in Supplemental Data 1) and achieve 100% accuracy in classifying the related files by quality. Every split node contains the number of files in the node (samples) and the files’ true classification of quality (samples by class = [high-quality files, low-quality files]), as well as the prediction for this node (predicted class). At the bottom of every node, the quality feature that will be used for the split and the corresponding threshold are given. MAP_SE_multiple: percentage of reads that are mapped to multiple genomic locations in an SE experiment; LOC_Other_Intron: percentage of reads in non-first intron regions; TSS_+2500 and TSS_+500: percentage of reads in the [2,000, 3,000] or [0, 1,000] bp region, respectively, relative to transcription start sites; MAP_SE_no_mapping: percentage of reads that could not be mapped to the reference genome in an SE experiment; LOC_Distal_Intergenic: percentage of reads in distal intergenic regions.
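How a single node picks its feature threshold under the Gini criterion can be sketched in plain Python. This is a simplified illustration of one CART split on one feature, not the authors' code; the function names and the example values are hypothetical. A full tree repeats this search over all features and recurses until the depth limit (three in the caption) is reached.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = high quality, 0 = low quality)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_high = sum(labels) / n
    return 1.0 - p_high ** 2 - (1.0 - p_high) ** 2


def best_split(values, labels):
    """Return (threshold, weighted_gini) minimising the children's weighted Gini impurity.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, parent_gini) when no split improves on the parent node.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical values cannot be separated
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up percentages of multi-mapped reads for six files (illustration only).
multi_mapped = [2.1, 3.4, 4.0, 9.5, 11.2, 12.8]
quality = [1, 1, 1, 0, 0, 0]
print(best_split(multi_mapped, quality))
```

In practice a library implementation would normally be used; scikit-learn's `DecisionTreeClassifier` accepts `criterion="gini"` and `max_depth=3`, although the caption does not state which software produced the trees.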

Tables

    Table 1.

    Average classification performance of the quality features in each subset group.

    Quality feature                   Group A  Group B  Group C
    RAW_Basic_Statistics              0.500    0.500    0.500
    RAW_Per_base_sequence_quality     0.681    0.738    0.747
    RAW_Per_tile_sequence_quality     0.645    0.668    0.658
    RAW_Per_sequence_quality_scores   0.693    0.738    0.754
    RAW_Per_base_sequence_content     0.650    0.622    0.632
    RAW_Per_sequence_GC_content       0.684    0.686    0.652
    RAW_Per_base_N_content            0.674    0.732    0.745
    RAW_Sequence_Length_Distribution  0.502    0.523    0.511
    RAW_Sequence_Duplication_Levels   0.606    0.586    0.602
    RAW_Overrepresented_sequences     0.771    0.763    0.740
    RAW_Adapter_Content               0.551    0.526    0.538
    RAW_Kmer_Content                  0.559    0.574    0.549
    MAP_SE_no_mapping                 0.840    0.817    0.841
    MAP_SE_uniquely                   0.829    0.821    0.850
    MAP_SE_multiple                   0.860    0.805    0.859
    MAP_SE_overall                    0.840    0.817    0.841
    MAP_MI_no_mapping                 0.837    0.809    0.825
    MAP_MI_uniquely                   0.839    0.817    0.841
    MAP_MI_multiple                   0.847    0.798    0.841
    MAP_MI_overall                    0.846    0.810    0.827
    LOC_Promoter                      0.719    0.717    0.717
    LOC_5_UTR                         0.706    0.692    0.687
    LOC_3_UTR                         0.745    0.711    0.725
    LOC_1st_Exon                      0.699    0.688    0.702
    LOC_Other_Exon                    0.733    0.724    0.733
    LOC_1st_Intron                    0.687    0.716    0.705
    LOC_Other_Intron                  0.686    0.717    0.697
    LOC_Downstream                    0.681    0.688    0.697
    LOC_Distal_Intergenic             0.727    0.719    0.710
    TSS_−4500                         0.690    0.692    0.676
    TSS_−3500                         0.696    0.707    0.710
    TSS_−2500                         0.682    0.694    0.699
    TSS_−1500                         0.702    0.685    0.708
    TSS_−500                          0.703    0.731    0.723
    TSS_+500                          0.706    0.718    0.718
    TSS_+1500                         0.686    0.702    0.724
    TSS_+2500                         0.691    0.709    0.722
    TSS_+3500                         0.677    0.703    0.712
    TSS_+4500                         0.691    0.692    0.701
    MAP_PE_con_no_mapping             0.833    0.711    0.711
    MAP_PE_con_uniquely               0.852    0.772    0.772
    MAP_PE_con_multiple               0.830    0.708    0.708
    MAP_PE_dis_uniquely               0.771    0.676    0.676
    MAP_PE_cod_no_mapping             0.837    0.626    0.626
    MAP_PE_cod_uniquely               0.773    0.673    0.673
    MAP_PE_cod_multiple               0.849    0.655    0.655
    MAP_PE_overall                    0.853    0.724    0.724
    • Classification performance is measured as the area under the receiver operating characteristic curve (auROC), which ranges from 0.5 for a random classification to 1.0 for a perfect classification. MAP features perform best over all three groups. The RAW features that perform well do so over all three groups. For LOC and TSS, only some of the features show good performance on average; however, these can still be more important for some of the subsets in each group.
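For a single feature, the auROC in the table footnote equals the probability that a randomly chosen high-quality file has a larger feature value than a randomly chosen low-quality file, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the example values are hypothetical, and it is not the authors' code.

```python
def au_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    `labels` uses 1 for high-quality files and 0 for low-quality files.
    """
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one file of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up feature values for six files (illustration only).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(round(au_roc(scores, labels), 3))

# A feature with the same value in every file cannot separate the classes.
print(au_roc([1, 1, 1, 1], [1, 1, 0, 0]))
```

With this definition, a feature whose values are lower in high-quality files would score below 0.5. The footnote's range of 0.5 to 1.0 suggests the direction of each feature is accounted for, but how that is done is not stated here.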

Supplementary Materials

  • Supplemental Data 1.

    Supplementary_trees.pdf is the PDF document containing all decision trees, respective file counts, and information about the features. All decision trees, an interactive tableau of the data, and static tables of all metrics for each feature and subset can be found at: https://cbdm.uni-mainz.de/ngs-guidelines. [LSA-2021-01113_Supplemental_Data_1.pdf]

  • Table S1 Quality features.

  • Table S2 Feature TSS_+4500 in group A subsets related to mouse paired-ended DNAse-seq.

  • Table S3 Feature MAP_MI_overall_mapping in group B subsets related to mouse single-ended histone ChIP-seq.

  • Table S4 Feature MAP_MI_multiple_mapping in group C subsets related to CTCF mouse single-ended TF ChIP-seq.

  • Table S5 Classification performance of individual features in group 1 subsets (area under ROC curve).

  • Table S6 Classification performance of individual features in group 2 subsets (area under ROC curve).

  • Table S7 Classification performance of the decision trees (classification on the training set).

Statistical NGS quality control guidelines
Maximilian Sprang, Matteo Krüger, Miguel A Andrade-Navarro, Jean-Fred Fontaine
Life Science Alliance Aug 2021, 4 (11) e202101113; DOI: 10.26508/lsa.202101113

In this Issue

Volume 4, No. 11
November 2021
Subjects

  • Genomics & Functional Genomics
  • Methods & Resources
  • Systems & Computational Biology

ISSN: 2575-1077
© 2023 Life Science Alliance LLC

Life Science Alliance is registered as a trademark in the U.S. Patent and Trade Mark Office and in the European Union Intellectual Property Office.