Review
The Drosophila genome
Introduction
‘The Lord in his wisdom made the fly
But forgot to tell us why’
Ogden Nash, The Fly.

With the sequencing of the Drosophila genome completed this year in an unprecedented collaboration between the Berkeley Drosophila Genome Project and Celera Genomics, we can start to address the question posed by Ogden Nash. A set of papers in the March 24th, 2000 issue of Science constitutes the first synthesis of the computational analysis of the fly's DNA sequence. Historically, studies of the fly have provided many of the essential components for this synthesis: the chromosome theory of heredity (1915), cytogenetic maps (1938), X-ray induced mutations (1927), the first genomic libraries (1975) and the first genome-wide mutational screens to identify the genes regulating development (1980) (for review, see [1]). The fly genome has been sequenced using the whole-genome ‘shotgun’ method [2], [3] and the assembly verified by comparison to a traditional physical map [4], [5]. This sequence and the annotation of the genome have permitted an initial comparative analysis [6]. The implications for biology and medicine are significant, as described both by Kornberg and Krasnow [7] and by Rubin et al. [6]. Another report [8] describes the status of the cDNA resources needed both to accurately annotate the genomic sequence and for functional studies. In this review, I highlight these major advances in Drosophila research and, in addition, pay particular attention to a recent advance, homologous recombination, that may make targeted gene disruption widely available in this model organism.
Drosophila is the third eukaryotic genome to be sequenced, following the 12 Mb yeast (Saccharomyces cerevisiae) [9] and the 97 Mb nematode worm (Caenorhabditis elegans) [10]. The Drosophila genome is ∼180 Mb, a third of which is centric heterochromatin. The centric heterochromatin cannot be cloned stably and therefore the sequence (Release 1) is primarily that of the euchromatic portion of the fly genome. The 120 Mb of euchromatin resides on four chromosomes: two large autosomes (second and third), the X chromosome, and a small fourth chromosome containing only ∼1 Mb of euchromatin.
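The size figures above can be checked with a little arithmetic. The following is a minimal sketch in Python, assuming only the approximate values quoted in the text; the variable and function names are illustrative and do not come from any of the cited papers.

```python
# Approximate genome sizes quoted in the text, in megabases (Mb).
GENOME_MB = {"yeast": 12, "worm": 97, "fly": 180}

# The text states that about a third of the fly genome is centric heterochromatin.
FLY_HETEROCHROMATIN_FRACTION = 1 / 3


def euchromatin_mb(total_mb: float, heterochromatin_fraction: float) -> float:
    """Return the euchromatic portion of a genome, in Mb."""
    return total_mb * (1 - heterochromatin_fraction)


# Euchromatic portion of the fly genome (the part covered by Release 1).
fly_euchromatin = euchromatin_mb(GENOME_MB["fly"], FLY_HETEROCHROMATIN_FRACTION)

# How many times larger the fly genome is than each genome in the table.
size_ratios = {name: GENOME_MB["fly"] / mb for name, mb in GENOME_MB.items()}

print(f"Fly euchromatin: ~{fly_euchromatin:.0f} Mb")
for name, ratio in size_ratios.items():
    print(f"Fly genome is ~{ratio:.1f}x the size of the {name} genome")
```

Under these rounded inputs the euchromatic portion comes out at about 120 Mb, matching the figure given in the text, and the fly genome is roughly 15 times the size of the yeast genome.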
Whole-genome shotgun assembly of Drosophila
With much skepticism from the sequencing community, Gene Myers and his colleagues at Celera Genomics [3] undertook and successfully generated an assembly of the euchromatic portion of the Drosophila genome. Approximately 24,000 sequence reads were generated from bacterial artificial chromosome (BAC) clones (∼163 kb) and 3 million sequence reads were generated from 2 kb and 10 kb genomic clones. Paired-end sequence was essential to the correct assembly: 72% of the sequence reads were in the form of
Gene annotation
Computational analysis was used to predict transcript and protein sequence as well as potential functions for each putative protein. Genes were identified using two gene-finding programs, ‘Genscan’ [12] and ‘Genie’ [13], [14], in conjunction with the results of complementary DNA and database searches. The final gene structures were determined by human curation. The initial computational analysis predicts 13,601 genes, just twice the number for the simple single-celled yeast and fewer than
Comparative genomics
With the sequencing of three eukaryotic genomes now complete, Rubin et al. [6] have compared their core proteomes. The core proteome is defined as the set of non-redundant proteins produced in each organism. The core proteomes of yeast, worms and flies contain 4383, 9453 and 8065 protein families, respectively. For the comparison of the core proteomes, a protein is defined as an ortholog if it shows similarity for at least 80% of the length of its sequence. Flies share 16% of their genes with
Implications for human disease
It is hard to imagine that this small invertebrate, the fruitfly, can serve as a model for human disease. Yet with the cloning of the Drosophila homeobox genes (for review, see [27]), it became apparent that numerous processes controlling metazoan development are conserved in higher organisms (reviewed in [7]). In an attempt to estimate the extent to which different types of human disease genes are found in flies, Rubin et al. [6] identified 289 genes. This set of genes implicated in human
Gene expression and protein function
Integral to interpreting the complete genomic sequence is having a cDNA that represents each gene. These cDNAs will be used for studies of protein function, and their sequence will be used to determine gene structure, including 5′ and 3′ non-coding exons and intron/exon boundaries. Rubin et al. [8] describe a set of cDNAs (‘Drosophila Gene Collection Release 1.0’) corresponding to >40% of the genes in the fly, and strategies for isolating cDNAs representing the remaining genes. This set of
Homologous recombination in Drosophila
One significant limitation of Drosophila melanogaster as a model organism has been the inability to make the targeted gene disruptions that are possible in yeast and mice. The first paper demonstrating a system for homologous recombination in flies was published this summer by Yikang Rong and Kent Golic [41]. Although they have not yet generated mutants, they have successfully rescued yellow (y) mutant flies by substituting the wild-type allele for the y mutant.
Their system is dependent on three
Conclusions and future directions
Two large questions remain. Embedded in the vast amount of non-coding sequence are the control elements that direct proper spatial and temporal gene expression. How do we identify them? And with the initial identification of the proteome of Drosophila, what are the functions of the proteins and how do they interact with one another? With the complete D. melanogaster sequence and anticipated sequence from another Drosophila species, possibly D. pseudoobscura, comparative analysis can be used to
Acknowledgements
I thank Catherine Nelson, Joanne Topol and Gerry Rubin for critically reading the manuscript. I thank Kent Golic, Bruce Hay, Paul Lasko, Troy Littleton and David Wassarman for providing preprints of their manuscripts. I would also like to thank members of the Berkeley Drosophila Genome Project and Celera Genomics for their united efforts to produce the invaluable Drosophila genome sequence. This work was supported by NIH grant P50HG00750.
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
of special interest
of outstanding interest
References (45)
- et al. Prediction of complete gene structures in human genomic DNA. J Mol Biol (1997)
- et al. Expanded polyglutamine protein forms nuclear inclusions and causes neural degeneration in Drosophila. Cell (1998)
- et al. Drosophila in cancer research. Trends Genet (2000)
- et al. Drosophila p53 binds a damage response element at the reaper locus. Cell (2000)
- et al. Drosophila p53 is a structural and functional homolog of the tumor suppressor p53. Cell (2000)
- et al. A brief history of Drosophila's contributions to genome research. Science (2000)
- et al. The genome sequence of Drosophila melanogaster. Science (2000)
- et al. A whole-genome assembly of Drosophila. Science (2000)
- et al. A BAC-based physical map of the major autosomes of Drosophila melanogaster. Science (2000)
- et al. A physical map of the polytenized region (101EF-102F) of chromosome 4 in Drosophila melanogaster. Genetics (2000)
- Comparative genomics of the eukaryotes. Science
- The Drosophila genome sequence: implications for biology and medicine. Science
- A Drosophila complementary DNA resource. Science
- Life with 6000 genes. Science
- A biologist's view of the Drosophila genome annotation assessment project. Genome Res
- A generalized hidden Markov model for the recognition of human genes in DNA. Intelligent Systems Mol Biol
- Genie-gene finding in Drosophila melanogaster. Genome Res
- Gene ontology: tool for the unification of biology. Nat Genet
- Pre-messenger RNA processing factors in the Drosophila genome. J Cell Biol
- The Drosophila genome: translation factors and RNA binding proteins. J Cell Biol