Elsevier

Gene

Volume 365, 3 January 2006, Pages 27-34
Self-Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes

https://doi.org/10.1016/j.gene.2005.09.040

Abstract

Novel tools are needed for comprehensive interspecies comparison of the massive amounts of genomic sequence now available. The Self-Organizing Map (SOM), an unsupervised neural network algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. Building on the batch-learning SOM, we modified the conventional SOM for genome informatics so that the learning process and the resulting map are independent of the order of data input. We generated SOMs for tri- and tetranucleotide frequencies in 10- and 100-kb sequence fragments from 38 eukaryotes for which almost complete genome sequences are available. The SOM recognized species-specific characteristics (key combinations of oligonucleotide frequencies) in the genomic sequences, permitting species-specific classification of the sequences without any information regarding the species. We also generated a SOM for tetranucleotide frequencies in 1-kb sequence fragments from the human genome and found that sequences from four functional categories (5′ and 3′ UTRs, CDSs and introns) were classified primarily according to category. Because its classification and visualization power is very high, the SOM is an efficient and powerful tool for extracting a wide range of genome information.

Introduction

Genome sequences, even protein-noncoding sequences, contain a wealth of information. The G + C content (G + C%) is a fundamental characteristic of individual genomes and has long been used as a basic phylogenetic parameter to characterize genomes and genomic portions. The G + C%, however, is too simple a parameter to differentiate a wide variety of genomes. Many groups have reported that oligonucleotide frequency, which is an example of high-dimensional data, varies significantly among genomes and can be used to study genome diversity (Nussinov, 1984, Phillips et al., 1987, Karlin, 1998, Rocha et al., 1998, Gentles and Karlin, 2001, Bernardi, 2004; see also references cited by these papers), and various linguistic tools for DNA sequence analysis have been developed (Vinga and Almeida, 2003, Bolshoy, 2003). Kohonen's Self-Organizing Map (SOM), an unsupervised neural network algorithm, is a powerful tool for clustering and visualizing high-dimensional complex data on a two-dimensional map (Kohonen, 1990, Kohonen et al., 1996, Kohonen, 1982). On the basis of the batch-learning SOM, we have developed a modification of the conventional SOM for genome sequence analyses, which makes the learning process and resulting map independent of the order of data input (Kanaya et al., 1998, Kanaya et al., 2001, Abe et al., 2002, Abe et al., 2003). We previously constructed SOMs for di-, tri-, and tetranucleotide frequencies in 10-kb genomic sequences from 65 bacteria and 6 eukaryotes. In the resulting SOMs, the sequences were clustered (i.e., self-organized) according to species without any information regarding the species, and increasing the length of the oligonucleotides from di- to tetranucleotides increased the clustering power (Abe et al., 2003).
In the present study, to investigate the power of the SOM to detect and visualize differences among closely related eukaryotes, we analyzed tri- and tetranucleotide frequencies in 10- and 100-kb sequence fragments derived from 38 eukaryotic genomes that have been sequenced extensively. To analyze this massive amount of eukaryotic genome sequence, we used the Earth Simulator, one of the highest-performance supercomputers in the world.
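
The input to such an analysis is one frequency vector per sequence fragment. The following Python sketch illustrates how those vectors could be computed; it is not the authors' code. The function names (`oligo_index`, `fragment_frequencies`, `gc_percent`), the handling of ambiguous bases, and the dropping of a short trailing fragment are all assumptions made for illustration.

```python
from itertools import product

import numpy as np

BASES = "ACGT"


def oligo_index(k):
    """Map every length-k oligonucleotide to a column index (4**k columns)."""
    return {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}


def fragment_frequencies(sequence, fragment_len, k):
    """Return one row of relative k-mer frequencies per complete fragment.

    Assumptions of this sketch: windows containing characters other than
    A/C/G/T (e.g. N) are skipped, and a trailing fragment shorter than
    fragment_len is dropped.
    """
    index = oligo_index(k)
    sequence = sequence.upper()
    rows = []
    for start in range(0, len(sequence) - fragment_len + 1, fragment_len):
        fragment = sequence[start:start + fragment_len]
        counts = np.zeros(len(index))
        # Slide a window of width k along the fragment, one base at a time.
        for i in range(len(fragment) - k + 1):
            col = index.get(fragment[i:i + k])
            if col is not None:
                counts[col] += 1
        total = counts.sum()
        if total > 0:
            rows.append(counts / total)
    return np.array(rows)


def gc_percent(sequence):
    """G + C content as a percentage of the unambiguous bases."""
    sequence = sequence.upper()
    acgt = sum(sequence.count(b) for b in BASES)
    return 100.0 * (sequence.count("G") + sequence.count("C")) / acgt
```

For example, `fragment_frequencies(genome, 10_000, 4)` would yield one 256-dimensional row per 10-kb fragment (4^4 = 256 tetranucleotides), and `k=3` would yield 64-dimensional rows. The G + C% collapses the same fragment to a single number, which is why it separates genomes less well than the full frequency vector.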

Section snippets

Materials and methods

The SOM implements a nonlinear projection of multi-dimensional data onto a two-dimensional array of weight vectors, which effectively preserves the topology of the high-dimensional data space (Kohonen, 1990, Kohonen et al., 1996, Kohonen, 1982). On the basis of the batch-learning SOM, we modified Kohonen's conventional SOM for genome informatics to make the learning process and resulting map independent of the order of data input (Kanaya et al., 1998, Kanaya et al., 2001, Abe et al., 2002, Abe et
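
To show why a batch update removes the dependence on input order, here is a minimal Python sketch of a generic batch-learning SOM with a Gaussian neighbourhood. It is not the authors' modified algorithm, whose details are not given in this excerpt. The function name `batch_som`, the principal-component initialisation, the neighbourhood schedule, and all parameter names and defaults are assumptions chosen for illustration.

```python
import numpy as np


def batch_som(data, rows, cols, epochs=20, sigma_start=None, sigma_end=0.5):
    """Generic batch-learning SOM; returns weights of shape (rows, cols, dim)."""
    data = np.asarray(data, dtype=float)
    # Lattice coordinates of every node, shape (rows * cols, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    if sigma_start is None:
        sigma_start = max(rows, cols) / 2.0

    # Deterministic initialisation (an assumption of this sketch): spread the
    # nodes over the plane of the first two principal components of the data.
    mean = data.mean(axis=0)
    _, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    for i in range(2):
        # Fix the sign of each axis so it does not depend on row order.
        if vt[i, np.abs(vt[i]).argmax()] < 0:
            vt[i] = -vt[i]
    scale = s[:2] / np.sqrt(len(data))
    u = (grid[:, 0] / max(rows - 1, 1) - 0.5) * 2.0
    v = (grid[:, 1] / max(cols - 1, 1) - 0.5) * 2.0
    weights = mean + np.outer(u, scale[0] * vt[0]) + np.outer(v, scale[1] * vt[1])

    # Squared lattice distance between every pair of nodes.
    lattice_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)

    for epoch in range(epochs):
        # Shrink the neighbourhood radius geometrically over the epochs.
        frac = epoch / max(epochs - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac
        # Best-matching unit of every input, all found with the same weights.
        d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
        bmu = d2.argmin(axis=1)
        # Neighbourhood weight of each node for each input, shape (n, nodes).
        h = np.exp(-lattice_d2[bmu] / (2.0 * sigma ** 2))
        # Every node becomes a weighted mean over ALL inputs at once, so the
        # order in which the inputs are listed cannot change the result.
        weights = (h.T @ data) / h.sum(axis=0)[:, None]
    return weights.reshape(rows, cols, -1)
```

In the conventional online SOM the weights move after each input is presented, so presenting the same data in a different order can give a different map. In this sketch the weights change only once per pass and each update is a sum over the whole data set, so shuffling the rows of `data` gives the same map up to floating-point rounding.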

SOMs for 38 eukaryote genomes

To investigate the clustering power of the SOM for a wide range of eukaryote sequences, we analyzed tri- and tetranucleotide frequencies in 590,000 10-kb and 59,000 100-kb sequence fragments derived from the 38 eukaryote genomes listed in the Fig. 1 legend (a total of 5.9 Gb), which represent a wide range of eukaryote phylotypes. To prevent an excessive contribution from the large human genome, sequences from only half of the human chromosomes were used in the analysis in Fig. 1. First, oligonucleotide

Discussion

The SOM can classify genomic sequences into known biological categories (species in Fig. 1 and Fig. 3, functional categories in Fig. 5) with no information other than oligonucleotide frequencies. Because its classification and visualization power is very high, the SOM is a powerful bioinformatic tool for extracting a wide range of genomic information. In Fig. 3, the territories of all six warm-blooded vertebrates extended significantly in the horizontal direction, and the territory of each

Acknowledgements

This work was supported by grants from ACT-Japan Science and Technology Corporation and the Advanced and Innovational Research Program in Life Sciences and a Grant-in-Aid for Scientific Research on Priority Areas (C) “Genome Science” from the Ministry of Education, Culture, Sports, Science and Technology of Japan. A part of the present computation was done with the Earth Simulator of Japan Agency for Marine-Earth Science and Technology.

References (28)

  • G. Bernardi. The mosaic genome of warm-blooded vertebrates. Science (1985)
  • A. Bolshoy. DNA sequence analysis linguistic tools: contrast vocabularies, compositional spectra and linguistic complexity. Appl. Bioinformatics (2003)
  • A.J. Gentles et al. Genome-scale compositional comparisons in eukaryotes. Genome Res. (2001)
  • P. Graziano et al. UTRdb: a specialized database of 5′- and 3′-untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. (1998)