BAL31-NGS approach for identification of telomeres de novo in large genomes
Introduction
Development of this BAL31-NGS method was motivated by our effort to find telomeric sequences in two groups of plants, i) dicotyledons, genus Cestrum (C. elegans in this study) and ii) monocots, Allium species (e.g. A. ursinum and A. cepa) which lack typical plant telomere repeats (TTTAGGG). Classical attempts including analysis of repetitive DNA by reassociation experiments, searching by “terminal cloning” with ligation of adapters or labelling with terminal transferase, and detection of telomerase activity have always been unsuccessful or led to inconclusive results. In general, the reason for these unsuccessful attempts might be characterised by a quote: “direct analysis of chromosomal termini presents problems because of their very low abundance in nuclei” [1].
Therefore, we aimed to develop a method for de novo identification of telomere sequences in large genomes with large chromosomes (for example, ∼2 Gb per chromosome), the least favourable situation for telomere characterisation by previously existing methods. For this purpose, we developed the BAL31-NGS approach which should overcome the known shortcomings of previously existing methods [2], [3]. Besides its demonstrated power in large genomes and chromosomes, the method should be easily applicable to smaller genomes too, and in this case it can be even more accurate due to a higher per-base coverage. Nevertheless, there is a wide range of methods used to identify telomere repeats in previous studies which should be considered in parallel to BAL31-NGS depending on the genome/chromosome characteristics of a given organism, the availability of instrumentation, sequencing and bioinformatic services or know-how, and the overall budget. In order to facilitate the correct choice, we start this article with an overview of previously published strategies for telomere identification, which is followed by the description of the BAL31-NGS approach.
A wide range of model organisms, including protozoans, yeasts, animals and plants, have been used extensively in telomere biology research. Each of these model systems shows specific advantages and features. Plants, for example, in contrast to mammals, have advantages which include reversible telomerase regulation, the absence of permanent telomerase silencing, and the maintenance of stable telomeres during their entire ontogenesis [4], [5]. These characteristics are also relevant to the application of telomere biology for development of telomerase regulators e.g., to treat cancer. However, there are several additional reasons why knowledge of telomeres is useful and interesting. These include: i) the unknown origin of telomerase, ii) evolutionary and functional causes and consequences of telomere changes in taxons, and iii) construction of artificial chromosomes, e.g., plant minichromosomes for crop breeding.
The number of known telomeric sequence motifs is increasing, and despite rare exceptions [6], [7], [8], [9] these motifs are mostly derived from the consensus TxAyGz. The simplest explanation for such a conserved feature is that the chromosomal end-protection system has been evolutionarily optimised. Although alternatives exist and are available for selection through evolution, they are limited to linear mitochondrial DNA [10], linear bacterial genomes and plasmids [11], linear viruses [12], and some tumors [13]. Similarity among telomere motifs has on the one hand aided research (e.g. the Tetrahymena sequence used for screening genomic libraries from related taxons). On the other hand, however, it can lead to the sometimes erroneous conclusion that a hybridization signal means identical probe and target sequences, since in closely related taxons this usually holds true. The final proof of what sequence forms the real chromosomal end requires direct proof from DNA sequencing. Sequences of DNA fragments that are derived either from telomeres or from products of telomerase reactions are the best evidence for identity of telomeric DNA. We describe here our method of BAL31-NGS, which is a straightforward approach suitable even for large genomes and can be used for identification of an unknown telomeric sequence even when no preliminary data is available.
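As a rough illustration of the TxAyGz consensus mentioned above, the sketch below tests whether a repeat unit fits it. It rests on two assumptions that are not stated in the text: the consensus is read as one or more T, zero or more A, and one or more G, and a motif is accepted if any rotation of the unit matches, since the starting point of a tandem repeat unit is arbitrary. The function name is hypothetical.

```python
import re

# Assumption: "TxAyGz" is read as T (x >= 1), A (y >= 0), G (z >= 1).
_CONSENSUS = re.compile(r"T+A*G+")


def fits_txaygz(motif: str) -> bool:
    """Return True if any rotation of the repeat unit matches T+A*G+."""
    unit = motif.upper()
    # A tandem repeat has no fixed start, so every rotation is tried.
    rotations = (unit[i:] + unit[:i] for i in range(len(unit)))
    return any(_CONSENSUS.fullmatch(r) is not None for r in rotations)


if __name__ == "__main__":
    for m in ["TTTAGGG", "TTAGGG", "TTGGGG", "TTAGG", "TTAGGC", "TCAGG", "TTGCA"]:
        print(m, fits_txaygz(m))
```

Under this reading, motifs such as (TTAGGC), (TCAGG) and (TTGCA) fall outside the consensus because they contain a C on the G-rich strand.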
The first successful strategy to obtain a telomeric DNA sequence [14] was based on the advantageous ratio of the high number of telomeres to the small genome size. This straightforward mode of telomere detection (direct labelling, end-cloning) is powerful and convincing. On the other hand, it is difficult to use in the case of giant genomes with a low number of chromosomes (tens of Gb distributed in 2n = 14, 16…).
The unicellular free-living ciliated protozoan Tetrahymena thermophila exhibits nuclear dimorphism. Each cell contains a generative diploid micronucleus (mic) and a polyploid somatic macronucleus (mac). The mic is organised into 2n = 10 chromosomes which undergo site-specific fragmentation into 200–300 minichromosomes ranging from 20 kb to 3 Mb when the mac (∼2 × 104 Mb) is formed. An average of ca. 50 copies per minichromosome occurs in the mac, unlike minichromosomes containing rRNA genes (rDNA) which are amplified to approximately 10,000 copies and have a palindromic character with two rDNA loci arranged as inverted repeats, reviewed in [15]. Each rDNA minichromosome, referred to as an autonomously replicating piece, possesses a well-defined restriction map and, most interestingly, telomeres at its termini. Unlike internal sequences, the terminal restriction fragments of rDNA minichromosomes were found to be readily accessible for nick translation in vitro using DNA polymerase I (without DNase I treatment) due to natural discontinuities in its structure. This specific labelling enabled further analysis. The sequence “5′ CCCCAA 3′ was determined by depurination and nearest-neighbour analysis of the in vitro labelled rDNA, and by analysis of the products of digestion with T4 endonuclease IV” [14]. Historically, thanks to this study telomere biology was promoted to the molecular level.
The advantage of amplified linear extrachromosomal DNA has been exploited in other eukaryotes. During the 1980s and 1990s, a more-or-less modified method from the Tetrahymena studies was used to identify telomeric sequences in other Alveolata, Cryptophyta, Rhizaria, Excavata, and Amoebozoa (see Table 1). In some related species like Trichomonas vaginalis, Giardia lamblia, Euglena, Crithidia fasciculata, Leishmania tropica and L. tarentolae, Trypanosoma cruzi and Leptomonas, the presence of (TTAGGG) was tested in terminal restriction fragments by Southern hybridization after BAL31 digestion without previous cloning. Except for Trichomonas and Giardia, all species were positive [16]. This approach is widely used for telomeric sequence identification when information about closely related taxons is available (see STRATEGY 3). Sometimes it uncovers evolutionary switches in telomeric sequences through the lack of positive terminal signals, such as in Giardia (see the changed motif in Table 1). On the other hand, it might miss differences that do not change the hybridization profile, such as a mixture of two motifs in Leishmania braziliensis (TTAGGG/TCCACGGGTTAGGG) [17], and even species from the same genus may differ in telomeric sequences [18].
Soon after discovery of the first ciliate telomeric sequences, a search for yeast telomeres commenced. The genome size of both is comparable, of the order of tens or hundreds of Mb; the main difference is the absence of a macronucleus with tens of thousands of telomeres in yeasts. On the other hand, the promising and rapidly developing field of yeast genetic methodology provided other valuable tools such as yeast artificial chromosomes (YACs), whole genome libraries, and chromosomal walking. Functional YACs possess Autonomously Replicating Sequences (ARS), centromeric sequences for mitotic stability, selective markers, and telomeric sequences at the termini of linear versions. It has been shown that Tetrahymena telomeres can cover the termini of Saccharomyces cerevisiae linear YACs [34]. Conversely, the Tetrahymena telomere could be replaced by a functional equivalent (ca. 300 bp per telomere) from the S. cerevisiae genome (2n = 32). The Tetrahymena telomere was first removed from one arm of a YAC using a restriction enzyme; a restriction fragment from the S. cerevisiae genome that, after ligation, functionally replaced the lost arm was then identified and subcloned, and its complex telomeric sequence was characterised as TG1-3 [1]. For Neurospora crassa [35], [36], a genomic walk in a cosmid library, using the most distal gene in the selected linkage group as the start point, was used to search for the telomere motif (TTAGGG). In some studies, relatively small genomes made it possible to replace chromosome walking by whole genome shotgun sequencing, first using the Sanger method [37] (Rhizopus oryzae, genome 45.3 Mb, TTGTGG) [38] and later using NGS (Aspergillus fumigatus, genome 29.4 Mb, TTAGGG) [39].
To search for telomeres in giant plant genomes, methods based on artificial chromosomes and chromosome walking are inappropriate, since functional chromosomal elements like centromeres and origins of replication are not strongly determined by sequence and ubiquitously scattered repeats complicate assembly. Plant minichromosomes (equivalent to YACs) can be prepared by telomere-mediated truncation of natural chromosomes. The method uses knowledge of telomeric sequences to induce chromosomal breaks by introduction of telomeric repeats into chromosomes, resulting in truncation of the distal portion of the chromosome and formation of a new telomere at the integration site [40], [41], a process termed “telomere healing”. On the other hand, for the purpose of telomere identification a mitotically stable linear vector could serve as a substrate in telomere healing during transformation, even without its own proper telomeric sequence such as was observed for linear plasmids from Histoplasma capsulatum (TTAGGG) and Cryptococcus neoformans (TTAG3–5, BAL31 test inconclusive) [42], [43]. An approach based on the repetitive character of terminal sequences was used in Candida albicans; screening of a genomic library on nitrocellulose filters using genomic DNA to identify repetitive clones (rDNA and mtDNA were excluded by using specific probes) was successful and one clone was verified to carry a telomeric sequence. This sequence, (ACGGATGTCTAACTTCTTGGTGT), has been shown to heal linearized pBR322 ligated onto the chromosomal terminus [44], [45].
In many yeast species, a straightforward approach of direct end-cloning turned out to be efficient. The backbone of these experiments is uncut yeast genomic DNA (untreated or pre-treated with T4 DNA polymerase in order to produce blunt ends), which is ligated to a blunt-ended linearized plasmid vector; the ligated mix is then digested with a restriction enzyme specific for the polylinker. Plasmids with terminal fragments are recircularized and identified by sequencing randomly selected clones or by library screening with already known telomeric probes. In this way, telomeric sequences similar to the C. albicans motif were identified in Candida tropicalis, C. maltosa, C. guilliermondii, C. pseudotropicalis, K. lactis and C. glabrata [46]. Similarly, (TTAGGGTCAACA) in Aspergillus oryzae [47], (TCTGGGTG(TG)0–3) in Saccharomyces castellii, (TCTGGGTG(TG)0–2/TCTGGG) in S. dairenensis, (TGGGTG(TG)0–3/TGGTG(TG)0–4) in S. cerevisiae and (TGGGTG(TG)0–2/TGGTG(TG)0–4) in S. exiguus were characterised [48]. Intermingled telomeric motifs (G(A/G)GCCT(C/T)CT), (GAGCCTTGTTT) and (GAGACGCAGAGTGTTGCCAGGATG) were identified in the obligate intracellular amitochondriate parasite Encephalitozoon cuniculi in the same way [49].
Long telomeric repeat units might be used to identify the telomerase RNA subunit (TR) in closely related yeasts, and, in a reverse approach based on a putative TR, telomeric repeats can be predicted, as in Candida dubliniensis, C. lusitaniae, C. metapsilosis, C. orthopsilosis, C. parapsilosis, C. sojae, C. tropicalis and Lodderomyces elongisporus [50]. Searching for TR sequence homologs is strictly limited to closely related species. In addition, the first plant TR was identified only recently [51], and except for putative TR orthologues in related Brassicaceae, TR is not known in any other plant family.
All telomeric sequences identified in previous studies have a repetitive character, and thus a similar feature was expected for all higher eukaryotes. Approaches aimed at isolation of repetitive telomere DNA led to the identification of Arabidopsis and human telomeric sequences (TTTAGGG and TTAGGG, respectively) in the same year [52], [53]. A human telomeric sequence was found in a library of clones with highly repetitive DNA; clones which hybridized with hamster highly repetitive DNA and did not carry rDNA were then examined by FISH and BAL31 sensitivity tests [52]. Under stringent conditions, a large-scale screening FISH study on bony fish, reptiles, amphibians, birds, and mammals has shown a likely conservation of ancestral (TTAGGG) in vertebrates [54]. A similar extensive screening based on FISH, with a similar conclusion about the ancestral (TTAGGG) motif in animals, has been carried out for basal Metazoa [55].
It has been almost forgotten that Drosophila telomeres were also identified among a set of repetitive clones. Its tandem array of retrotransposon-like sequences was cloned and characterised by restriction enzyme digestion and in situ hybridization [8], [56] shortly after publication of the Tetrahymena sequence [14]. The relatively high complexity of retrotransposon unit sequences complicates conventional telomere approaches. A convincing demonstration of retrotransposon telomere function in Drosophila was presented by an opened ring chromosome with ends terminating in He-T retrotransposon DNA [57]. In other insects, the prevalent (TTAGG) motif in genomic clones was identified using Southern hybridization with a degenerate (TTNGGG)5 probe [58]. Further FISH screening has shown that the (TTAGG) motif, besides being found in most insects, is also common in other arthropods. Numerous exceptions were described, however, e.g. in Diptera, Heteroptera, Dermaptera and Chelicerata [59], [60]. An alternative motif (TCAGG) has been found in some Coleoptera, see review [61], [62], as a tandem repeat candidate from analysis in silico [63], [64]. Similarly, in nematodes a (TTAGGC) telomeric motif in Ascaris lumbricoides was predicted from randomly cloned genomic DNA fragments containing 16 tandem repeats [65].
The identification of telomere candidates in genomic DNA or libraries using specific probes with sequences from related species has only a limited predictive value. Until the clone or genomic fragment is sequenced, the motif can only be considered as a likely candidate, as in the case of Aspergillus nidulans where it was first predicted [66] and only later sequenced [67], or in Cladosporium fulvum where screening of a cosmid library using a (TTAGGG) probe identified a positive clone that was subcloned and the terminal fragment was sequenced [68].
As with the human telomeric motif and its distribution among other vertebrates, the first plant telomeric sequence from Arabidopsis was identified as a repetitive element. An enriched library of repetitive DNA fragments obtained from end-cloning was used for identification of the telomeric sequence, which was then tested in other plant species by FISH. The identification de novo included a protocol involving blunt-end ligation of an M13 cloning vector to the ends of high molecular weight DNA (hmwDNA). In the second step, MboI and BamHI digested most of genomic fragments and then the plasmid was circularized by ligation. A parallel library was prepared in the same way but the inverted orientation of MCS in the M13 vector was used. Single stranded DNA was prepared from both by M13 infection and the final mixture of ssDNA libraries was left to reanneal. Mostly repetitive sequences like telomeric, satellite and rDNA repeats, were expected to reanneal efficiently from these enriched end-cloning mixed libraries. For further enrichment, rDNA and 180 bp satellite clones were removed from analysis, leaving ca. 800 clones for a cherry-pick. At this point, BAL31 sensitivity was applied as a tool for telomeric sequence identification instead of using it as a final proof. Pools of six clones were tested as a probe for BAL31 sensitivity using Southern hybridization on nylon membranes to reveal progressively shortened terminal restriction fragments. After screening of 300 clones, a telomeric sequence, (TTTAGGG), was found [53]. It should be noted that Arabidopsis is a well-studied model, even for telomere searching, due to its small genome with relatively low content of repetitive DNA. This method was successfully used in the identification of (TTGCA) telomeric repeats in Parascaris univalens and (TTTTAGGG) in Chlamydomonas reinhardtii [69], [70]. 
The final screening of enriched repetitive libraries in Chlamydomonas was done by Southern hybridization, using the Arabidopsis telomere clone as a probe. Probably, giant plant genomes with chromosomes in the order of even several Gb, which are actually enriched with interspersed and other repeats, dissuaded researchers from direct end-cloning, chromosomal walking, etc. Instead, as in animals, identification of the first plant telomeric sequence was followed by studies focussing either on a limited repertoire of species or large-scale screenings based on FISH with a consensus probe [71], [72], [73], [74], [75]. Nearly identical telomeric sequences possessing minor variability, such as (TT(T/A)AGGG) in Lycopersicon esculentum [76], were neglected without cloning or sequencing genomic DNA or telomerase products. Interestingly, in Aloe spp. a human-like (TTAGGG) motif was detected by FISH [3], [77], [78], [79], [80].
A combination of genomic data mining, NGS, and TRAP (Telomere Repeat Amplification Protocol) has been used recently for a highly diverse phylogenetic survey of telomeric sequence evolution in Algae [81], [82]. Whole genomic NGS data were used in various species as an optimal way to obtain tandem repeat candidates for the design of a FISH probe and a reverse primer that is necessary for the second step of TRAP – amplification of telomerase products – which are subsequently cloned and sequenced [82], [83], [84]. The TRAP strategy is not useful unless at least partial knowledge of the annealing site for the reverse primer on the telomerase product is available, nor when telomerase activity has been lost in the organism. In addition, the availability of telomerase-active tissue may be problematic.
The exceptional plant family Alliaceae (recently subfamily Allioideae from Amaryllidaceae) and the genus Aloe from Xanthorrhoeaceae have been shown to lack typical plant telomeric sequences [72], [85], and replacement of typical plant telomeres with human-like repeats was then demonstrated in a number of Asparagales families [78], including Alliaceae [79], [80], by sequencing of telomerase products. At the very top of Asparagales telomere evolution, the sequence in the genus Allium had remained uncharacterised. Although previous physical mapping, end-cloning, or FISH approaches resulted in promising candidates such as 380 bp satellites, retrotransposons, or rDNA, none of these were confirmed as genuine telomeric sequences [86], [87], [88], [89]. To solve this problem, we developed the BAL31-NGS method, modified comparative whole genome shotgun sequencing, as described further in this article [2]. Other flowering plants possessing enigmatic telomeres are the closely related species Cestrum, Vestia and Sessea from Solanaceae in dicots [78], [90].
Eventually, information about species characterised only by the hybridization screenings mentioned in Section 1.2.3 will be updated by sequences of telomerase products such as in Daphnia [84], identification of the template region of TR subunits e.g. in [91], [92], [93], [94], or data from genome sequencing projects.
With the boom of genomic sequencing projects, information on telomeric sequences has expanded. From time to time, genome sequencing de novo uncovers surprising telomeric sequences such as (AATGGGGGG) in Cyanidioschyzon merolae (16.5 Mb, 20 chromosomes) [95]. Additional verification of terminal sequences using PolyC-tailing and PCR-based methods is then feasible [96].
Using tandem repeat candidates from NGS data in combination with verification by FISH, TRAP, and BAL31 sensitivity, diverse telomeric motifs including the plant consensus (TTTAGGG) and unusual (TTCAGG) and (TTTCAGG) telomeric repeats were found recently among species of the carnivorous plant genus Genlisea [18].
A valuable tool – RepeatExplorer – for analysis of interspersed and tandem repeats in large genomes such as identification de novo, frequency, evolution, and assessment of variability has been developed with an NGS approach [97], [98]. For low complexity sequences, particularly tandem repeats, Tandem Repeats Finder (TRFi) is more suitable [99]. Both tools are able to analyse multiple datasets in parallel, and we used both for identification of telomeric sequence candidates in Cestrum and Allium species using our BAL31-NGS strategy (see Section 2).
For the purpose of telomeric sequence identification de novo, we combined the NGS potential of massive parallel sequencing with the presumed BAL31 sensitivity of genuine telomeres, and the results were processed using new bioinformatic tools such as RepeatExplorer, Tandem Repeats Finder, and our custom-made scripts collectively called Tandem Repeat Merger [2], [3]. The principle of this approach is comparative NGS analysis of two aliquots of a hmwDNA sample with unknown telomeres (Fig. 1). One of these aliquots is digested with BAL31 and the other remains untreated. Although it is impossible to extract intact chromosomes that are several Gb long, BAL31 degrades telomeres systematically, unlike intrachromosomal regions which are degraded only from random nicks and double-strand breaks. To identify telomere candidates among BAL31-sensitive sequences, a second criterion is needed: as all known telomeres are composed of repeated DNA, we assume this also holds true for unknown telomeres. If the target telomeric repeat is reasonably frequent, it should be included in the barcoded genomic library (prepared by a PCR-free protocol), and even low-coverage sequencing is able to detect it since the repeat coverage increases with its copy number per genome. The target repeats can be assembled and their under-representation in BAL31-treated samples estimated, e.g. from RepeatExplorer and TRFi data, based on the number of reads belonging to the assembled repeats. The RepeatExplorer pipeline includes an option of comparative clustering and assembly, and comparative statistics (how many reads belong to sample 1 and sample 2, etc.) are available in the pipeline output. We used this feature for identification of Cestrum telomere candidates. The statistical significance of differences among samples must be verified by the user.
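When tandem repeat candidates are tallied per sample, the same repeat can be reported in different rotations or on the opposite strand (for example TTTAGGG, TAGGGTT and CCCTAAA). The following is a minimal sketch of one way to pool such reports before comparing samples. It is not the authors' Tandem Repeat Merger; the function names and the input format (a mapping from reported repeat unit to read count) are assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_unit(unit: str) -> str:
    """Return the lexicographically smallest rotation of the unit or of its
    reverse complement, so equivalent tandem repeat units share one key."""
    unit = unit.upper()
    revcomp = unit.translate(_COMPLEMENT)[::-1]
    candidates = []
    for s in (unit, revcomp):
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)


def merge_counts(read_counts: dict) -> Counter:
    """Sum read counts of units that are rotations or reverse complements
    of each other (hypothetical input: {reported_unit: number_of_reads})."""
    merged = Counter()
    for unit, n in read_counts.items():
        merged[canonical_unit(unit)] += n
    return merged


if __name__ == "__main__":
    # Illustrative numbers only, not data from the study.
    print(merge_counts({"TTTAGGG": 5, "CCCTAAA": 3, "TAGGGTT": 1, "TTAGGG": 2}))
```

The canonical key is only a bookkeeping device; the biologically meaningful orientation of a candidate still has to be established experimentally.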
Since we compared hundreds of repeats from two samples (BAL31-treated and untreated) we narrowed down the number of candidates using the Chi-squared test as described in [3]. Similarly, data from TRFi used for identification of Allium candidates were tested. The frequency of identified candidate tandems (number of reads with particular tandem repeats normalized according to size of dataset) was compared in BAL31-treated and untreated samples [2].
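The comparison described above can be sketched as a 2 × 2 contingency test per candidate repeat: reads carrying the repeat versus all other reads, in the BAL31-treated versus the untreated dataset. The code below is a minimal illustration using a Pearson chi-squared test with one degree of freedom and no continuity correction; the exact procedure used in [2], [3] may differ, and the function name and read counts are hypothetical.

```python
import math


def chi2_depletion(treated_hits, treated_total, untreated_hits, untreated_total):
    """Pearson chi-squared test (1 d.f.) on reads with/without a repeat in
    BAL31-treated vs untreated data. Returns (chi2, p_value, ratio), where
    ratio is the normalized frequency in treated divided by untreated."""
    a, b = treated_hits, treated_total - treated_hits
    c, d = untreated_hits, untreated_total - untreated_hits
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the 2x2 table sums to zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Frequencies are normalized by dataset size before taking the ratio.
    ratio = math.inf if c == 0 else (a / treated_total) / (c / untreated_total)
    return chi2, p_value, ratio


if __name__ == "__main__":
    # Illustrative counts only: 50 of 1000 reads (treated) vs 100 of 1000 (untreated).
    chi2, p, ratio = chi2_depletion(50, 1000, 100, 1000)
    print(f"chi2={chi2:.2f} p={p:.2e} treated/untreated={ratio:.2f}")
```

A candidate is of interest when the ratio is clearly below 1 and the difference is significant; because hundreds of repeats are tested, some correction for multiple comparisons is advisable.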
High molecular weight DNA preparation in low melting temperature agarose
HmwDNA was prepared in low melting temperature agarose as described earlier for preparation of samples for pulsed-field gel electrophoresis (PFGE), including deproteinization with Proteinase K, washing, and inactivation of Proteinase K. After extensive washing with 1 mM Tris-HCl – 0.1 mM EDTA, pH 8, the integrity of the DNA was checked by PFGE. HmwDNA suitable for analysis should migrate in the compression zone, corresponding to fragments longer than 1 Mb (see [79] for details).
Agarose plugs
Discussion
The DNA end-replication and end-protection problems have been solved in many different ways through evolution [107]. The most prevalent one in eukaryotes seems to be the telomerase system. Non-telomerase models (Drosophila, telomerase-negative survivors in yeasts, etc.) are interesting as examples of alternative telomere maintenance systems. Knowledge of the telomere DNA sequence is always the first step towards a deeper understanding of telomeres in selected models. A single, absolutely
Conclusions and perspectives of the BAL31-NGS method
NGS has become a common tool in research and diagnostics. Sequencing data of genomes with completion and release of assemblies have been progressively refined, and information on newly-sequenced organisms and clinical samples is increasing exponentially. Thus, NGS data will probably become a commonly used source of qualitative and quantitative information about telomeric sequences in model organisms, representatives of less explored species, species with unknown telomeric sequences, and
Acknowledgements
This research was supported by the Czech Science Foundation, projects no. 13-10948P and 16-01137S, and the Ministry of Education, Youth and Sports of the Czech Republic under the project CEITEC 2020 (LQ1601). We thank the GeneCore (EMBL, Heidelberg, Germany) for NGS sequencing, and Vladimír Beneš for his expertise. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme ‘Projects of Large
References
- et al. A new beginning with new ends: linearisation of circular chromosomes during bacterial evolution. FEMS Microbiol. Lett. (2000)
- et al. A tandemly repeated sequence at the termini of the extrachromosomal ribosomal RNA genes in Tetrahymena. J. Mol. Biol. (1978)
- et al. Identification of a telomeric DNA sequence in Trypanosoma brucei. Cell (1984)
- et al. Sequence-specific fragmentation of macronuclear DNA in a holotrichous ciliate. Cell (1981)
- et al. Sequence and structure of a Plasmodium falciparum telomere. Mol. Biochem. Parasitol. (1988)
- et al. Telomeric sequences of Cryptosporidium parvum. Mol. Biochem. Parasitol. (1998)
- et al. Structure of the growing telomeres of Trypanosomes. Cell (1984)
- et al. An irregular satellite sequence is found at the termini of the linear extrachromosomal rDNA in Dictyostelium discoideum. Cell (1981)
- et al. Cloning yeast telomeres on linear plasmid vectors. Cell (1982)
- Characterization of telomere DNA from Neurospora crassa. Gene (1990)