Statistical analysis of over-represented words in human promoter sequences

Leonardo Mariño-Ramírez; John L Spouge; Gavin C Kanga; David Landsman

doi:10.1093/nar/gkh246

Nucleic Acids Res. 2004 Feb 12;32(3):949-58. doi: 10.1093/nar/gkh246. Print 2004.

Authors

Leonardo Mariño-Ramírez¹, John L Spouge, Gavin C Kanga, David Landsman

Affiliation

¹ Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA.

Abstract

The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, we extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from -2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of fewer than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many do not. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
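The word-counting and z-score step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the background model, so the sketch assumes independent bases with frequencies estimated from the input sequences and a binomial variance that ignores word self-overlap. The function names (count_words, base_freqs, z_score, upper_tail_p) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import erfc, sqrt

BASES = "ACGT"

def count_words(seqs, k=8):
    """Count every k-letter word made only of A, C, G, T (overlapping windows)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            w = s[i:i + k]
            if all(ch in BASES for ch in w):
                counts[w] += 1
    return counts

def base_freqs(seqs):
    """Single-nucleotide frequencies, used here as an ASSUMED background model."""
    c = Counter(ch for s in seqs for ch in s.upper() if ch in BASES)
    total = sum(c.values())
    return {b: c[b] / total for b in BASES}

def z_score(word, observed, n_positions, freqs):
    """z = (observed - expected) / sd under an independent-base binomial model.

    Assumption: word occurrences at different positions are treated as
    independent, which ignores the overlap structure of the word.
    """
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    expected = n_positions * p
    variance = n_positions * p * (1.0 - p)
    return (observed - expected) / sqrt(variance)

def upper_tail_p(z):
    """One-sided P-value from the standard normal approximation."""
    return 0.5 * erfc(z / sqrt(2.0))
```

To apply the sketch, one would pass the PPR sequences to count_words and base_freqs, take n_positions as the total number of valid eight-letter windows (the sum of all counts), and rank words by z_score. The normal-approximation P-value from upper_tail_p is only a first pass, since the abstract itself notes that additional statistical controls were needed to evaluate significance.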

MeSH terms

Base Sequence
Binding Sites
Computer Simulation
Data Interpretation, Statistical
Genome, Human
Humans
Markov Chains
Promoter Regions, Genetic*
Transcription Factors / metabolism
Transcription Initiation Site

Substances

Transcription Factors

Grants and funding

Z99 LM999999/Intramural NIH HHS/United States