Elsevier

Gene

Volume 660, 20 June 2018, Pages 92-101
Research paper
The matrices and constraints of GT/AG splice sites of more than 1000 species/lineages

https://doi.org/10.1016/j.gene.2018.03.031

Highlights

  • A resource for matrices/constraints of GT/AG splice sites of Ensembl-annotated genomes

  • The matrices/constraints are highly diverse in a species-, genus- or phylum-specific way.

  • A bell-curve for the abundance of alternative splicing and the species constraint indexes of splice sites

Abstract

To provide a resource for the splice sites (SS) of different species, we calculated the nucleotide composition matrices of about 38 million splice sites from >1000 species/lineages. Overall, the matrices are enriched in aGGTAAGT (5′SS) or (Y)6N(C/t)AG(g/a)t (3′SS); however, they are quite diverse among hundreds of species. The diverse matrices remain prominent even under sequence selection pressures, suggesting the existence of diverse constraints, as well as of diverse U snRNAs and other spliceosomal factors and/or their interactions with the splice sites. Using an algorithm to measure and compare the splice site constraints across all species, we demonstrate their distinct differences quantitatively. As an example of the resource's application to specific questions, we confirm that, at highly constrained positions, the presence of uncommon nucleotides is significantly associated with a transcriptome-wide increase in alternative splicing. More interestingly, the abundance of alternative splicing in 16 species varies with the average constraint index of splice sites along a bell-shaped curve. This resource will allow users to assess specific sequences/splice sites against the consensus of every Ensembl-annotated species, and to explore evolutionary changes or the relationship to alternative splicing and transcriptome diversity. Web-search and update features are also included.

Introduction

Splice sites (SS) demarcate exons and introns, allowing the proper joining of exons during the expression of most eukaryotic genes. Their selective usage during alternative splicing produces more than one transcript from a single gene, thereby contributing to transcriptomic and proteomic diversity (Black, 2003; Nilsen and Graveley, 2010). Their importance has been clearly demonstrated by splice site mutations that cause diseases (Tazi et al., 2009; Scotti and Swanson, 2016; Daguenet et al., 2015; Feng and Xie, 2013). Genomic analyses of individual species or groups have given a glimpse of the consensus and diversity of both constitutive and alternative splice sites (Dou et al., 2006; Thanaraj and Stamm, 2003; Rogozin and Milanesi, 1997; Sibley et al., 2016; Szczesniak et al., 2013; Abril et al., 2005; Garg and Green, 2007; Burset et al., 2001). However, a centralized source giving an overview of the millions of annotated splice sites among all the currently sequenced eukaryotes remains to be created.

In biological or biomedical research, in fields such as genetics, cell biology, biochemistry or physiology, one often needs to assess the strength of particular sequences as splice sites or to compare them between species. A resource with quantitative, comparable measurements of the splice site consensus and constraints of different species would be a very helpful reference. We thus compiled this resource on the consensus and diversity of the splice sites of the GT/AG introns of the Ensembl-annotated eukaryotic species, as a reference for simple searches or further exploration.

The GT/AG splice sites are present in the majority of eukaryotic introns. In humans they are characterized by the consensus AGGTRAGT at the 5′ splice site and A(Y)nNYAGG at the 3′ splice site (GT and AG: intron start and end; A: branch point; Yn: polypyrimidine tract, Py; R: A or G; Y: C or T; N: A, C, G or T; n: 6–35), with variations (except in the GT/AG) in other species (Sibley et al., 2016; Moore, 2000; Burge et al., 1998; Spingola et al., 1999; Mount et al., 1992; Lorkovic et al., 2000). These sequences are recognized through direct base-pairing by the snRNAs of snRNP splicing factors (U1, U2, U5 and U6, with the participation of U4) or through contact by accessory proteins such as U2AFs during the dynamic assembly of spliceosomes (Will and Luhrmann, 2011; Shi, 2017). In this report, we used the Ensembl-annotated databases to compile a complete list of the matrices and constraints of the splice sites. Since the branch point is not as easy to assess accurately as the other motifs, it is not included here. Also not included are the minor AT/AC introns (<0.5%) (Burge et al., 1998; Verma et al., 2017; Wu and Krainer, 1999; Hall and Padgett, 1994; Levine and Durbin, 2001).
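As a small illustration of how a consensus such as AGGTRAGT can be read, the sketch below expands the degenerate symbols (R, Y, N) with standard IUPAC codes and counts how many positions of a candidate sequence agree with it. The function names are hypothetical, and a simple match count is only a crude stand-in for illustration; it is not the constraint measure used in this work.

```python
# Standard IUPAC nucleotide codes needed for the consensus notation above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any nucleotide
}


def matches_per_position(consensus, candidate):
    """Return a list of booleans: does each candidate base fit the consensus symbol?"""
    if len(consensus) != len(candidate):
        raise ValueError("consensus and candidate must have the same length")
    return [
        base.upper() in IUPAC[symbol.upper()]
        for symbol, base in zip(consensus, candidate)
    ]


def count_matches(consensus, candidate):
    """Count the positions at which the candidate agrees with the consensus."""
    return sum(matches_per_position(consensus, candidate))


# Hypothetical candidates compared with the human 5' splice site consensus.
for seq in ("AGGTGAGT", "AGGTCAGT"):
    print(seq, count_matches("AGGTRAGT", seq))
```

A match count treats every position equally; a percentage matrix per species carries more information, because it records how strongly each position prefers each nucleotide.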

Section snippets

The matrices of GT/AG splice sites of >1000 eukaryotic species/lineages

We calculated the percent nucleotide compositions of the 5′ and 3′ GT/AG splice sites of 1074 (5′)/1076 (3′) species or their lineages/strains (hereafter ‘lineage’, to represent all in the same species; S_Tables I–II). An example of the resulting matrix format is shown in Fig. 1A, with the average percentages of >300 thousand human splice sites. The matrices are enriched in (c/a)AGGT(A/g)AGt (5′SS) or (Y)20NCAGgt (3′SS; upstream beyond the (Y)20 is T/A-rich), similar to those based on about

A reference database for the splice site matrices/constraints of >1000 species/lineages

The matrix and constraint database of the annotated eukaryotic species covers the 5′GT, 3′ polypyrimidine tract and 3′AG. It will be useful at least in studies of genes or gene functions: as a reference for making effective mutations close to or within the splice sites in splicing assays, for assessing the strength of a particular splice site or mutant sequence in a species, for developing species-specific algorithms for more accurate prediction of unknown exons, for comparing splice

Genome data

The GenBank-format genome files of all the species examined here were downloaded from release 88 or Genome release 35 of the Ensembl database, in which the transcripts are based on experimental evidence (Aken et al., 2016).

Calculation of matrices and constraints

For matrices, the nucleotide composition (percentage) at each position of the 5′ GT (−5 to +50) or 3′ AG (−50 to +2) splice sites was counted according to the intron coordinates in the annotated GenBank files of each species/lineage/strain, using Python scripts.
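The counting step can be sketched as follows. This is a minimal illustration rather than the authors' scripts: the function names are hypothetical, the window helper handles only forward-strand introns with 0-based start coordinates, and it assumes that the window consists of the last 5 exonic bases followed by the first 50 intronic bases. Bases other than A, C, G and T are ignored when computing percentages, which is also an assumption.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def five_prime_window(genome, intron_start, upstream=5, downstream=50):
    """Slice a 5' splice-site window around a 0-based intron start (forward strand only).

    Returns `upstream` exonic bases followed by `downstream` intronic bases,
    or None if the window would run off either end of the sequence.
    """
    begin = intron_start - upstream
    end = intron_start + downstream
    if begin < 0 or end > len(genome):
        return None
    return genome[begin:end]


def percent_matrix(sites):
    """Return, for each position, the percentage of A, C, G and T across the sites.

    `sites` is a list of equal-length sequences aligned on the splice site.
    """
    if not sites:
        raise ValueError("no splice-site sequences given")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos].upper() for site in sites)
        total = sum(counts[n] for n in NUCLEOTIDES)  # skips N and other ambiguity codes
        matrix.append(
            {n: (100.0 * counts[n] / total if total else 0.0) for n in NUCLEOTIDES}
        )
    return matrix


# Toy example with four short, made-up 5' splice-site sequences.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
for position, row in enumerate(percent_matrix(toy_sites)):
    print(position, row)
```

In the toy example every sequence carries GT at the fourth and fifth positions, so those two rows show 100% G and 100% T, while the flanking positions show mixed percentages. A 3′ AG window could be sliced in the same way from the intron end coordinate.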

For

Acknowledgements

This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN-2016-06004) and a Manitoba Research Chair fund to J.X. We thank the anonymous reviewers for helpful suggestions.

References (80)

  • B.L. Aken, S. Ayling, D. Barrell, L. Clarke, V. Curwen, S. Fairley, J. Fernandez Banet, K. Billis, C. Garcia Giron, T....
  • S. Anders et al. Detecting differential usage of exons from RNA-seq data. Genome Res. (2012)
  • N.L. Barbosa-Morais et al. The evolutionary landscape of alternative splicing in vertebrate species. Science (2012)
  • D.L. Black. Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. (2003)
  • M. Burset et al. SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Res. (2001)
  • M. Chen et al. Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat. Rev. Mol. Cell Biol. (2009)
  • F. Clark et al. Categorization and characterization of transcript-confirmed constitutively and alternatively spliced introns and exons from human. Hum. Mol. Genet. (2002)
  • E. Daguenet et al. The pathogenicity of splicing defects: mechanistic insights into pre-mRNA processing inform novel therapeutic approaches. EMBO Rep. (2015)
  • B. Daines et al. The Drosophila melanogaster transcriptome by paired-end RNA sequencing. Genome Res. (2011)
  • Y. Dou et al. Genomic splice-site analysis reveals frequent alternative splicing close to the dominant splice site. RNA (2006)
  • H. Du et al. The U1 snRNP protein U1C recognizes the 5′ splice site in the absence of base pairing. Nature (2002)
  • D. Feng et al. Aberrant splicing in neurological diseases. Wiley Interdiscip. Rev. RNA (2013)
  • C. Fields. Information content of Caenorhabditis elegans splice site sequences varies with intron length. Nucleic Acids Res. (1990)
  • A. Firrincieli et al. Genome sequence of the plant growth promoting endophytic yeast Rhodotorula graminis WP1. Front. Microbiol. (2015)
  • M. Freund et al. Extended base pair complementarity between U1 snRNA and the 5′ splice site does not inhibit splicing in higher eukaryotes, but rather increases 5′ splice site recognition. Nucleic Acids Res. (2005)
  • K. Garg et al. Differing patterns of selection in alternative and constitutive splice sites. Genome Res. (2007)
  • T. Gehrmann et al. Schizophyllum commune has an extensive and functional alternative splicing repertoire. Sci. Rep. (2016)
  • C. Hollins et al. U2AF binding selects for the high conservation of the C. elegans 3′ splice site. RNA (2005)
  • G.A. Hudson et al. Thermodynamic contribution and nearest-neighbor parameters of pseudouridine-adenosine base pairs in oligoribonucleotides. RNA (2013)
  • R.M. Illias et al. l-Mandelate dehydrogenase from Rhodotorula graminis: cloning, sequencing and kinetic characterization of the recombinant enzyme and its independently expressed flavin domain. Biochem. J. (1998)
  • I. Kalvari et al. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. (2018)
  • S. Kandels-Lewis et al. Involvement of U6 snRNA in 5′ splice site selection. Science (1993)
  • E. Kim et al. Different levels of alternative splicing among eukaryotes. Nucleic Acids Res. (2007)
  • M. Kramer et al. Untangling the contributions of sex-specific gene regulation and X-chromosome dosage to sex-biased gene expression in Caenorhabditis elegans. Genetics (2016)
  • Y. Lee et al. Mechanisms and regulation of alternative pre-mRNA splicing. Annu. Rev. Biochem. (2015)
  • C.F. Lesser et al. Mutations in U6 snRNA that alter splice site specificity: implications for the active site. Science (1993)
  • A. Levine et al. A computational scan for U12-dependent introns in the human genome sequence. Nucleic Acids Res. (2001)
  • B.J. Loftus et al. The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. Science (2005)
  • P.P. Madsen et al. Short/branched-chain acyl-CoA dehydrogenase deficiency due to an IVS3+3A>G mutation that causes exon skipping. Hum. Genet. (2006)
  • T. Maniatis et al. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature (2002)