Data mining tools for biological sequences

Huiqing Liu; Limsoon Wong

doi:10.1142/s0219720003000216

J Bioinform Comput Biol. 2003 Apr;1(1):139-67. doi: 10.1142/s0219720003000216.

Authors

Huiqing Liu¹, Limsoon Wong

Affiliation

¹ Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore. huiqing@i2r.a-star.edu.sg

PMID: 15290785

Abstract

We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
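
As an illustration of steps (a) and (b), the following is a minimal sketch and not the authors' implementation. It counts k-grams in the regions flanking a candidate ATG and scores one feature with a signal-to-noise measure. The function names (`kgram_counts`, `atg_features`, `signal_to_noise`), the `flank` window size, the use of overlapping windows, and the choice of population standard deviation are all assumptions made for this example. The signal-to-noise form shown, |mean1 - mean2| / (sd1 + sd2), is one common definition and may differ in detail from the one used in the paper.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev


def kgram_counts(seq, k, alphabet="ACGT"):
    """Count every k-gram over `alphabet` in `seq`, using overlapping windows."""
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    grams = ("".join(g) for g in product(alphabet, repeat=k))
    return {g: seen.get(g, 0) for g in grams}


def atg_features(seq, pos, flank, k):
    """Candidate features for the ATG starting at index `pos` (hypothetical layout):
    k-gram counts in up to `flank` bases upstream and downstream of the ATG."""
    upstream = seq[max(0, pos - flank):pos]
    downstream = seq[pos + 3:pos + 3 + flank]
    feats = {f"up_{g}": c for g, c in kgram_counts(upstream, k).items()}
    feats.update({f"down_{g}": c for g, c in kgram_counts(downstream, k).items()})
    return feats


def signal_to_noise(pos_vals, neg_vals):
    """|mean difference| / (sum of standard deviations) for one feature.
    Population standard deviation is an assumption of this sketch."""
    denom = pstdev(pos_vals) + pstdev(neg_vals)
    if denom == 0:
        return 0.0  # feature is constant in both classes; treat as uninformative
    return abs(mean(pos_vals) - mean(neg_vals)) / denom
```

In a sketch like this, each ATG in a training sequence would be turned into a feature dictionary by `atg_features`, each feature would be scored across true and false initiation sites with `signal_to_noise`, and the top-ranked features would then be passed to a classifier such as C4.5, SVM, or Naive Bayes.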

Publication types

Review

MeSH terms

Artificial Intelligence
Base Sequence
Computational Biology*
Databases, Nucleic Acid
Humans
Peptide Chain Initiation, Translational
Protein Biosynthesis*
RNA, Messenger / genetics
Sequence Analysis, RNA / statistics & numerical data*

Substances

RNA, Messenger