Assessing protein coding region integrity in cDNA sequencing projects

Bioinformatics. 1998 Jun;14(5):384-90. doi: 10.1093/bioinformatics/14.5.384.

Abstract

Motivation: In cDNA sequencing projects, it is vital to know whether the protein coding region of a sequence is complete, or whether errors have occurred during library construction. Here we present a linear discriminant approach that predicts this completeness by estimating the probability of each ATG being the initiation codon.
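
As an illustration of the general idea only, the sketch below scores every ATG in a sequence with a linear discriminant and maps the score to a probability-like value. The abstract does not list the features, weights, or probability mapping used by ATGpr, so everything here is an assumption: the features (a purine at position -3, a G at position +4, log ORF length, count of upstream ATGs), the weight values, the logistic transform, and the function names are all hypothetical.

```python
import math

STOPS = {"TAA", "TAG", "TGA"}

# Hypothetical weights for illustration; NOT the values used by ATGpr.
WEIGHTS = {"minus3_purine": 1.0, "plus4_g": 0.8, "log_orf": 0.5, "upstream_atgs": -0.7}
BIAS = -2.0


def find_atgs(seq):
    """Return the 0-based positions of every ATG in the sequence."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def orf_length(seq, start):
    """Count codons from the ATG at `start` up to the first in-frame stop or the sequence end."""
    n = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            break
        n += 1
    return n


def atg_features(seq, start, n_upstream):
    """Assumed feature set: simple sequence-context and ORF-length signals."""
    return {
        "minus3_purine": 1.0 if start >= 3 and seq[start - 3] in "AG" else 0.0,
        "plus4_g": 1.0 if start + 3 < len(seq) and seq[start + 3] == "G" else 0.0,
        "log_orf": math.log(max(orf_length(seq, start), 1)),
        "upstream_atgs": float(n_upstream),
    }


def score_atgs(seq):
    """Return (position, probability-like score) for each ATG, best first."""
    seq = seq.upper()
    results = []
    for rank, pos in enumerate(find_atgs(seq)):
        feats = atg_features(seq, pos, rank)
        linear = BIAS + sum(WEIGHTS[k] * v for k, v in feats.items())
        # Logistic squashing is an assumption made here to get a value in (0, 1).
        results.append((pos, 1.0 / (1.0 + math.exp(-linear))))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Calling `score_atgs` on a cDNA-like string returns the candidate ATGs ranked by score; under the single-prediction criterion, only the top-ranked ATG would be reported for each sequence.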

Results: Because full-length cDNA data on which to base this work are currently scarce, tests were performed on a non-redundant set of 660 initiation codon-containing DNA sequences that had been conceptually spliced into mRNA/cDNA. As a negative control, we also used an edited set of the same sequences containing only the region following the initiation codon. Allowing only a single prediction per sequence, we selected a cut-off at which the positive and negative sets were discriminated equally well. At this cut-off, 67% of each set was correctly distinguished, and in the positive set the correct ATG codon was also identified. Reliability could be increased further by raising the cut-off or by including homologues; the relative merits of these options are discussed.
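
The cut-off selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `balanced_cutoff`, the input layout, and the grid of candidate cut-offs are all hypothetical. A positive sequence counts as correct when its single top-scoring ATG is the true initiation codon and scores at or above the cut-off; a negative sequence counts as correct when its top score falls below the cut-off, so that no prediction is made.

```python
def balanced_cutoff(positives, negatives, candidates):
    """Pick the cut-off at which accuracy on both sets is closest to equal.

    positives: list of (top_score, top_is_true_start) pairs, one per sequence
    negatives: list of top scores, one per sequence lacking the true start
    candidates: iterable of cut-off values to try (hypothetical grid)
    """
    best = None
    for cutoff in candidates:
        # Positive set: the one allowed prediction must be made and be correct.
        pos_acc = sum(1 for s, ok in positives if ok and s >= cutoff) / len(positives)
        # Negative set: correct only if no ATG reaches the cut-off.
        neg_acc = sum(1 for s in negatives if s < cutoff) / len(negatives)
        gap = abs(pos_acc - neg_acc)
        if best is None or gap < best[0]:
            best = (gap, cutoff, pos_acc, neg_acc)
    return best[1], best[2], best[3]


# Invented toy scores for demonstration only; not data from the study.
toy_pos = [(0.9, True), (0.8, True), (0.6, True), (0.4, False)]
toy_neg = [0.7, 0.5, 0.3, 0.2]
cutoff, pos_acc, neg_acc = balanced_cutoff(toy_pos, toy_neg, [0.1, 0.3, 0.5, 0.6, 0.7, 0.9])
```

Raising the cut-off above this balance point trades coverage of the positive set for fewer false predictions on the negative set, which corresponds to the reliability trade-off the abstract mentions.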

Availability: The prediction program, called ATGpr, and other data are available at http://www.hri.co.jp/atgpr

Contact: swintech@hri.co.jp

MeSH terms

  • Base Sequence
  • Codon, Initiator / genetics
  • Computational Biology
  • DNA, Complementary / genetics*
  • Databases, Factual
  • Humans
  • Open Reading Frames
  • Proteins / genetics*
  • RNA, Messenger / genetics
  • Sequence Analysis, DNA*

Substances

  • Codon, Initiator
  • DNA, Complementary
  • Proteins
  • RNA, Messenger