Informatics for Unveiling Hidden Genome Signatures

  1. Takashi Abe1,2,3,
  2. Shigehiko Kanaya3,4,5,
  3. Makoto Kinouchi3,5,6,
  4. Yuta Ichiba1,3,
  5. Tokio Kozuki2,3, and
  6. Toshimichi Ikemura1,3,7

  1. Division of Evolutionary Genetics, Department of Population Genetics, National Institute of Genetics, The Graduate University for Advanced Studies, Mishima, Shizuoka-ken 411-8540, Japan
  2. Xanagen Inc., Sakado, Takatsu-ku, Kawasaki, Kanagawa-ken 213-0012, Japan
  3. ACT-JST (Applying Advanced Computational Science and Technology, Japan Science and Technology Corp.), Kawaguchi, Saitama-ken, 332-0012, Japan
  4. Department of Bioinformatics and Genomes, Graduate School of Information Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara-ken 630-0101, Japan
  5. CREST JST (Core Research for Evolutional Science and Technology, Japan Science and Technology Corp.), Kawaguchi, Saitama-ken, 332-0012, Japan
  6. Department of Bio-System Engineering, Faculty of Engineering, Yamagata University, Yonezawa, Yamagata-ken 992-8510, Japan

Abstract

With the increasing number of available genome sequences, novel tools are needed for comprehensive analysis of species-specific sequence characteristics for a wide variety of genomes. We used an unsupervised neural network algorithm, a self-organizing map (SOM), to analyze di-, tri-, and tetranucleotide frequencies in a wide variety of prokaryotic and eukaryotic genomes. The SOM, which can cluster complex data efficiently, was shown to be an excellent tool for analyzing global characteristics of genome sequences and for revealing key combinations of oligonucleotides representing individual genomes. From analysis of 1- and 10-kb genomic sequences derived from 65 bacteria (a total of 170 Mb) and from 6 eukaryotes (460 Mb), clear species-specific separations of major portions of the sequences were obtained with the di-, tri-, and tetranucleotide SOMs. The unsupervised algorithm could recognize, in most 10-kb sequences, the species-specific characteristics (key combinations of oligonucleotide frequencies) that are signature features of each genome. We were able to classify DNA sequences within one and between many species into subgroups that corresponded generally to biological categories. Because the classification power is very high, the SOM is an efficient and fundamental bioinformatic strategy for extracting a wide range of genomic information from a vast amount of sequence data.
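
The pipeline the abstract describes (cut genomes into fixed-length fragments, represent each fragment by its oligonucleotide frequencies, and let a SOM arrange the fragments on a grid) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names (`kmer_frequencies`, `windows`, `train_som`, `best_matching_unit`), the grid size, the learning-rate and radius schedules, the random initialization, and the synthetic "genomes" are hypothetical, and the update rule is the textbook online Kohonen rule, which may differ from the SOM variant the authors used.

```python
import itertools
import math
import random

def kmer_frequencies(seq, k):
    """Relative frequency of each of the 4**k oligonucleotides in seq."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers containing N etc. are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)

def windows(seq, size):
    """Non-overlapping fragments of `size` bases; a short tail is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def best_matching_unit(weights, vec):
    """Grid position (row, col) of the node closest to vec (squared Euclidean)."""
    best, best_d = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, vec))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Online Kohonen SOM with linearly decaying learning rate and radius."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: small random initial weights (not the authors' initialization).
    weights = [[[rng.random() / dim for _ in range(dim)]
                for _ in range(cols)] for _ in range(rows)]
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)
        radius = max(radius0 * (1 - frac), 0.5)
        order = data[:]
        rng.shuffle(order)
        for vec in order:
            br, bc = best_matching_unit(weights, vec)
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighborhood centered on the winning node.
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-dist2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (vec[i] - w[i])
    return weights

if __name__ == "__main__":
    rng = random.Random(1)
    # Two synthetic "genomes" with different base composition (toy data only).
    genome_a = "".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=5000))
    genome_b = "".join(rng.choices("ACGT", weights=[1, 4, 4, 1], k=5000))
    labeled = [("A", f) for f in windows(genome_a, 500)]
    labeled += [("B", f) for f in windows(genome_b, 500)]
    vectors = [kmer_frequencies(f, 2) for _, f in labeled]  # dinucleotides
    som = train_som(vectors, rows=4, cols=4)
    for (label, _), vec in zip(labeled, vectors):
        print(label, best_matching_unit(som, vec))
```

The species labels are used only when printing the result, never during training, which mirrors the unsupervised setting described above. Switching `k` to 3 or 4 gives the tri- and tetranucleotide variants at the cost of 64- and 256-dimensional input vectors.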

[Supplemental material is available online at www.genome.org.]

Footnotes

  • 7 Corresponding author.

  • E-MAIL tikemura{at}lab.nig.ac.jp; FAX 81-55-981-6794.

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.634603.

    • Received July 16, 2002.
    • Accepted January 28, 2003.