Self-Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes
Introduction
Genome sequences, even protein-noncoding sequences, contain a wealth of information. The G + C content (G + C%) is a fundamental characteristic of individual genomes and has long been used as a basic phylogenetic parameter to characterize individual genomes and genomic portions. The G + C%, however, is too simple a parameter to differentiate wide varieties of genomes. Many groups have reported that oligonucleotide frequency, which is an example of high-dimensional data, varies significantly among genomes and can be used to study genome diversity (Nussinov, 1984, Phillips et al., 1987, Karlin, 1998, Rocha et al., 1998, Gentles and Karlin, 2001, Bernardi, 2004; see also references cited by these papers), and various linguistic tools for DNA sequence analysis have been developed (Vinga and Almeida, 2003, Bolshoy, 2003). Kohonen's Self-Organizing Map (SOM), an unsupervised neural network algorithm, is a powerful tool for clustering and visualizing high-dimensional complex data on a two-dimensional map (Kohonen, 1990, Kohonen et al., 1996, Kohonen, 1982). On the basis of batch learning SOM, we have developed a modification of the conventional SOM for genome sequence analyses, which makes the learning process and the resulting map independent of the order of data input (Kanaya et al., 1998, Kanaya et al., 2001, Abe et al., 2002, Abe et al., 2003). We previously constructed SOMs for di-, tri-, and tetranucleotide frequencies in 10-kb genomic sequences from 65 bacteria and 6 eukaryotes. In the resulting SOMs, the sequences were clustered (i.e., self-organized) according to species without any information regarding the species, and increasing the length of the oligonucleotides from di- to tetranucleotides increased the clustering power (Abe et al., 2003).
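The two sequence descriptors contrasted above can be illustrated with a minimal Python sketch. It assumes overlapping windows counted on a single strand and skips any window containing a symbol other than A, C, G, or T; the function names `gc_percent` and `oligo_frequencies` are illustrative and are not taken from the original work.

```python
from collections import Counter
from itertools import product


def gc_percent(seq):
    """G + C% of a sequence, ignoring symbols other than A, C, G, T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def oligo_frequencies(seq, k):
    """Relative frequency of every k-mer over A/C/G/T.

    Windows overlap (step of one base). Windows containing other
    symbols (for example N) are skipped, which is an assumption of
    this sketch. The result always has 4**k entries.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


if __name__ == "__main__":
    example = "ACGTACGT"
    print(gc_percent(example))                 # 50.0
    print(len(oligo_frequencies(example, 4)))  # 256
```

A single G + C% value becomes a 64-dimensional vector for trinucleotides and a 256-dimensional vector for tetranucleotides, which is the kind of high-dimensional input that SOM is designed to handle.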
In the present study, to investigate the power of SOM to detect and visualize differences among closely related eukaryotes, we analyzed tri- and tetranucleotide frequencies in 10- and 100-kb sequence fragments derived from 38 eukaryotic genomes that have been sequenced extensively. To analyze this massive amount of eukaryotic genome sequence, we used the Earth Simulator, one of the highest-performance supercomputers in the world.
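Cutting a genome into fixed-length fragments can be sketched as follows. This is only an illustration: it assumes non-overlapping windows and drops any trailing piece shorter than the window, choices that this excerpt does not specify, and the name `fragments` is hypothetical.

```python
def fragments(seq, size):
    """Yield consecutive, non-overlapping fragments of exactly `size` bases.

    A trailing piece shorter than `size` is dropped (an assumption of
    this sketch, not a detail stated in the text).
    """
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


if __name__ == "__main__":
    toy_sequence = "ACGT" * 25_000          # 100,000 bases of toy data
    ten_kb = list(fragments(toy_sequence, 10_000))
    hundred_kb = list(fragments(toy_sequence, 100_000))
    print(len(ten_kb), len(hundred_kb))     # 10 1
```

Each fragment would then be converted to an oligonucleotide frequency vector before being presented to the SOM.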
Materials and methods
SOM implements a nonlinear projection of multi-dimensional data onto a two-dimensional array of weight vectors, and this effectively preserves the topology of the high-dimensional data space (Kohonen, 1990, Kohonen et al., 1996, Kohonen, 1982). On the basis of batch learning SOM, we modified Kohonen's conventional SOM for genome informatics to make the learning process and the resulting map independent of the order of data input (Kanaya et al., 1998, Kanaya et al., 2001, Abe et al., 2002, Abe et
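Why a batch update makes the map independent of input order can be seen in a minimal, dependency-free sketch of a generic textbook batch-learning SOM. It is not the authors' modified algorithm: the initialization (evenly spaced picks from the sorted data), the Gaussian neighbourhood, the linear radius schedule, and all names (`init_weights`, `best_matching_unit`, `batch_som`) are assumptions made for illustration.

```python
import math


def init_weights(data, rows, cols):
    """Order-independent initialization (an assumption of this sketch):
    evenly spaced picks from the lexicographically sorted data."""
    ordered = sorted(data)
    n_nodes = rows * cols
    weights = {}
    for r in range(rows):
        for c in range(cols):
            idx = r * cols + c
            pick = idx * (len(ordered) - 1) // max(n_nodes - 1, 1)
            weights[(r, c)] = list(ordered[pick])
    return weights


def best_matching_unit(weights, x):
    """Grid node whose weight vector is closest to x (squared Euclidean).
    Ties are broken by node coordinates so the choice is deterministic."""
    def key(node):
        dist = sum((w - v) ** 2 for w, v in zip(weights[node], x))
        return (dist, node)
    return min(weights, key=key)


def batch_som(data, rows=4, cols=4, epochs=10, radius0=2.0):
    """Generic batch SOM: all samples are assigned to their best-matching
    units first, and only then are all weight vectors updated at once."""
    weights = init_weights(data, rows, cols)
    dim = len(data[0])
    for epoch in range(epochs):
        # Neighbourhood radius shrinks linearly, with a floor of 0.5.
        radius = max(radius0 * (1 - epoch / epochs), 0.5)
        bmus = [best_matching_unit(weights, x) for x in data]
        updated = {}
        for node in weights:
            # Gaussian neighbourhood weight of every sample for this node.
            h = [
                math.exp(-((node[0] - b[0]) ** 2 + (node[1] - b[1]) ** 2)
                         / (2 * radius ** 2))
                for b in bmus
            ]
            total = math.fsum(h)
            if total == 0.0:
                # All neighbourhood weights underflowed: keep the old vector.
                updated[node] = weights[node]
                continue
            # Weighted mean of all samples; fsum makes the sum order-free.
            updated[node] = [
                math.fsum(hi * x[d] for hi, x in zip(h, data)) / total
                for d in range(dim)
            ]
        weights = updated
    return weights
```

Because every weight vector is recomputed as a neighbourhood-weighted mean over the whole data set in each epoch, presenting the same vectors in a different order yields the same map, in contrast to the sequential update of the conventional SOM, in which each sample changes the weights before the next one is seen.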
SOMs for 38 eukaryote genomes
To investigate the clustering power of SOM for a wide range of eukaryote sequences, we analyzed tri- and tetranucleotide frequencies in 590,000 10-kb and 59,000 100-kb sequence fragments derived from 38 eukaryote genomes listed in the Fig. 1 legend (a total of 5.9 Gb), which represented a wide range of eukaryote phylotypes. To prevent the large human genome from contributing excessively, sequences from half of the human chromosomes were used in the analysis in Fig. 1. First, oligonucleotide
Discussion
SOM can classify genomic sequences into known biological categories (into species in Fig. 1 and Fig. 3, and into functional categories in Fig. 5) with no information other than oligonucleotide frequencies. Because its classification and visualization power is very high, SOM is a powerful bioinformatic tool for extracting a wide range of genomic information. In Fig. 3, territories of all six warm-blooded vertebrates extended significantly in the horizontal direction, and the territory of each
Acknowledgements
This work was supported by grants from ACT-Japan Science and Technology Corporation and the Advanced and Innovational Research Program in Life Sciences and a Grant-in-Aid for Scientific Research on Priority Areas (C) “Genome Science” from the Ministry of Education, Culture, Sports, Science and Technology of Japan. A part of the present computation was done with the Earth Simulator of Japan Agency for Marine-Earth Science and Technology.
References (28)
- et al. AU-rich elements: characterization and importance in mRNA degradation. Trends Biochem. Sci. (1995)
- et al. Translational regulation in development. Cell (1995)
- et al. Diversity of cytoplasmic functions for the 3′ untranslated region of eukaryotic transcripts. Curr. Opin. Cell Biol. (1995)
- et al. Global variation in G + C content along vertebrate genome DNA. Possible correlation with chromosome band structures. J. Mol. Biol. (1988)
- Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene (2001)
- Global dinucleotide signatures and analysis of genomic heterogeneity. Curr. Opin. Microbiol. (1998)
- et al. Translational control by the 3′-UTR: the ends specify the means. Trends Biochem. Sci. (2003)
- et al. A novel bioinformatic strategy for unveiling hidden genome signatures of eukaryotes: self-organizing map of oligonucleotide frequency. Genome Inform. Ser. Workshop Genome Inform. (2002)
- et al. Informatics for unveiling hidden genome signatures. Genome Res. (2003)
- Structural and Evolutionary Genomics: Natural Selection in Genome Evolution (2004)
- The mosaic genome of warm-blooded vertebrates. Science
- DNA sequence analysis linguistic tools: contrast vocabularies, compositional spectra and linguistic complexity. Appl. Bioinformatics
- Genome-scale compositional comparisons in eukaryotes. Genome Res.
- UTRdb: a specialized database of 5′- and 3′-untranslated regions of eukaryotic mRNAs. Nucleic Acids Res.