Defining the expression of piRNA and transposable elements in Drosophila ovarian germline stem cells and somatic support cells

Comprehensive transcriptional characterization of mRNA and small RNA in early Drosophila germline stem cells reveals novel piRNA clusters, transposon dynamics, and alternative splicing events.


Introduction
In animals, germ cells are dedicated to faithfully transmitting the genome from generation to generation. The genome contains many heterochromatic regions, which are rich in transposable elements (TEs), including both DNA transposons and retrotransposons. Mobilized TEs can mutate protein-coding genes, regulatory regions, and impair genome stability in germ cells. Piwi-interacting RNAs (piRNAs) are generated to repress TE activity and maintain genome stability, given that disrupting piRNA production causes infertility in Drosophila and mice because of DNA damage-induced blockage of germ cell differentiation Thomson & Lin, 2009;Khurana & Theurkauf, 2010;Saito & Siomi, 2010;Banisch et al, 2012). The Drosophila ovary contains germline stem cells (GSCs) that provide a continuous supply of differentiated germ cells and eventually mature oocytes throughout their lifetimes (Spradling et al, 2011;Xie, 2013). Although piRNA components are required intrinsically for maintaining GSCs in the Drosophila ovary (Ma et al, 2014(Ma et al, , 2017, piRNAs in GSCs and their immediate progeny have yet to be characterized. This study uses the Drosophila ovary as a model to reveal that piRNA expression levels in GSCs and early progeny are higher than in terminally differentiated germ cells and discovers previously unidentified piRNA clusters. The adult Drosophila ovary contains 12-16 ovarioles with each carrying 2-3 GSCs in its germarium at the tip. GSCs continuously generate cystoblasts (CBs) via asymmetric cell division; Bam and Bgcn function as key differentiation regulators driving CBs to form connected 2-cell, 4-cell, 8-cell, and 16-cell cysts via synchronous division as evidenced by mutations in bam and bgcn completely blocking further CB differentiation and causing accumulation of GSC-/CB-like cells (McKearin & Spradling, 1990;Ohlstein & McKearin, 1997). bam is repressed in GSCs by niche-activated BMP signaling but is then expressed in CBs and dividing cysts (Xie & Spradling, 1998;Chen & McKearin, 2003;Song et al, 2004). Constitutive BMP signaling causes the accumulation of GSC-like cells outside the GSC niche (Xie & Spradling, 1998;Chen & McKearin, 2003;Casanueva & Ferguson, 2004;Song et al, 2004). This study investigates piRNA and TE transcriptional profiles using constitutive BMP signaling and bam/bgcn mutations to enrich GSCs and CBs, respectively.
Two distinct piRNA pathways exist in the Drosophila ovary, the one in the germline uses three PIWI family proteins (Piwi,Aub,and Ago3) and the other in the soma requires only Piwi (Saito et al, 2006;Brennecke et al, 2007;Yin & Lin, 2007;Li et al, 2009;). Soma-derived piRNAs originate from uni-stranded clusters, whereas most germline-derived piRNAs generally originate from dual-stranded clusters. piRNAs are produced by distinct transcriptional mechanisms and processing machineries. In the soma of the Drosophila ovary, the piRNA clusters are transcribed from only one DNA strand through the use of specific PolII promoters. These transcripts are then transported to the Yb body and further processed by Piwi and Zucchini into individual uni-stranded piRNAs (Ross et al, 2014;Huang et al, 2017;Yamashiro & Siomi, 2018). In germ cells (presumably nurse cells), large piRNA clusters in heterochromatin regions exhibit an enrichment for H3K9me3, which can recruit the RDC complex and subsequently the Moon-TRF2-TFIIA-S complex to initiate transcription from both strands in a PolIIdependent, but promoter-independent manner (Pane et al, 2011;Zhang et al, 2012Zhang et al, , 2014Mohn et al, 2014;Chen et al, 2016;Andersen et al, 2017). Cuff, a major component of the RDC complex, has been shown to be necessary for the productive expression of piRNA precursors from dual-stranded clusters by interfering with the recruitment of CPSF and preventing transcript degradation by the exonuclease Rat1 (Chen et al, 2016). These transcripts from both DNA strands are then transported to the perinuclear structure known as the nuage, where they are processed to generate primary piRNAs by Piwi and Zucchini (Saito et al, 2006;Brennecke et al, 2007;Qi et al, 2011;Han et al, 2015;Mohn et al, 2015;Hayashi et al, 2016). Aub-bound primary piRNAs then recognize the sense transcripts, and these sense transcripts can be subsequently cleaved by Ago3 at the 10th nucleotide position complementary with the sense piRNAs. Ago3-bound sense piRNAs target complementary antisense transcripts, thus, resulting in further processing that generates antisense piRNAs. This feed-forward piRNA amplification loop is known as the "ping-pong" cycle (Brennecke et al, 2007;Gunawardane et al, 2007). Besides transcripts from piRNA clusters, those from active TEs also serve as templates for the piRNA processing machinery resulting in the generation of mature piRNAs and, thus, simultaneous silencing of TEs.
Consistent with critical roles of piRNAs in selectively silencing TEs and safeguarding genome integrity, mutations in the genes required for piRNA production lead to elevated DNA damage and subsequent checkpoint activation in germ cells, ultimately resulting in female sterility (Chen et al, 2007;Klattenhoff et al, 2007). In addition, piRNA profiling experiments have been performed on whole ovaries, which are enriched in late differentiated germ cells as GSCs and CBs represent a minority of the total population of germ cells, and furthermore, early functional studies of piRNAs have been focused on late germ cells, nurse cells, and oocyte (Brennecke et al, 2007;Klattenhoff et al, 2007;Li et al, 2009;Zhang et al, 2012). In the Drosophila ovary, GSCs continuously generate differentiating CBs, which ultimately give rise to oocytes, and mutations in a GSC will be passed on to all of its future oocytes. In contrast, mutations occurring in late germ cells only affect individual oocytes. Thus, one might expect tighter control over TE expression in GSCs, given the importance of avoiding the continuous transfer of deleterious mutations into future offspring.
Although the piRNA pathway is demonstrated to be critical for GSC maintenance (Ma et al, 2014(Ma et al, , 2017, the composition of piRNAs in GSCs is yet to be defined. In this study, we used cultured GSCs, niche cells, and early differentiation-defective bam-and bgcn-mutant ovaries to define piRNA clusters and composition in GSCs and their early progeny. Interestingly, the levels of TE-targeting piRNAs are significantly higher in GSCs and CBs than those in the whole ovary. To this end, we developed an algorithm that sequentially scans the genome and finds regions with characteristics that identify them as potential piRNA clusters. Finally, we also used RNA sequencing of purified GSCs and enriched CBs in bam and bgcn mutants to probe TE expression in GSCs and their early progeny. Surprisingly, most of the known soma-specific TEs are also highly expressed in GSCs and early progeny. Therefore, our profiling of piRNAs and transposons in GSCs and their early progeny provides novel insight into the maintenance of genome integrity in GSCs and opens the door for more focused studies in the future.

Results
piRNAs are more abundant in GSCs and early GSC progeny than late differentiated germ cells In the Drosophila ovary, two or three GSCs and CBs can be identified by the presence of a spherical spectrosome, and mitotic cysts can be recognized by the presence of a branched fusome (Lin et al, 1994;de Cuevas & Spradling, 1998) (Fig 1A). Both the spectrosome and the fusome are the same germ cell-specific organelles expressing cytoskeletal proteins, including hu li-tai shao (Hts) (Fig 1A). GSCs and CBs can be easily and reliably distinguished from each other: GSCs directly contact cap cells and express pMad but CBs do not (Xie, 2013) (Fig 1A and A'). nanos-gal4-driven germ cell-specific expression of the constitutively active BMP type I receptor thickvein (tkv M1 ) results in the accumulation of spectrosome-containing GSClike cells, which are also positive for pMad (Casanueva & Ferguson, 2004) (Fig 1B). Because bam and bgcn work in a mutually dependent manner to drive the differentiation of CBs into mitotic cysts (McKearin & Spradling, 1990;Ohlstein et al, 2000), bam-and bgcnmutant germaria accumulate many more undifferentiated pMadnegative and spectrosome-containing CB-like cells in addition to two or three pMad-positive GSCs (Fig 1C and D). Thus, this study used tkv M1 -overexpressing ovaries and bam-/bgcn-mutant ovaries to enrich GSCs and CBs, respectively, for small RNA and TE analysis using deep RNA sequencing.
To compare the levels of piRNAs and TEs in GSCs, CBs, and whole ovaries, total RNA samples were collected from the wild-type, tkv M1expressing, bam-mutant, and bgcn-mutant ovaries in triplicates. Small RNAs were gel-purified to select for species smaller than 50 nt and libraries were prepared for deep sequencing (see the Materials and Methods section). All small RNA reads were aligned uniquely to the genome (allowing only a single mismatch). For the reads mapped to the Drosophila genome, those reads that matched to the sequences of tRNAs, ribosomal RNAs, small nucleolar RNAs, and small nuclear RNAs were excluded from further analysis ( Fig S1A). The remaining mapped reads were then classified into three different categories based on their sizes and previous studies Wen et al, 2014): siRNAs, miRNAs, and piRNAs ( Fig S1A).
In Drosophila, RNA hairpin structures trimmed by the Drosha/ Pasha complex are loaded onto the Dicer-1/Loqs complex to produce miRNAs if forming imperfect RNA duplexes, or the Dicer-2/ R2D2 complex to produce siRNAs if forming perfect RNA duplexes (Ghildiyal & Zamore, 2009). In our analyses, miRNA reads were identified by aligning the 24-nt or shorter reads to the annotated pre-miRNA locations (Ensembl v87 genome annotations), whereas siRNA reads were identified by aligning 22-nt or shorter reads to the annotated endo-siRNA locations or aligning exactly 21-nt reads to transposons (Wen et al, 2014). The remaining 23-30-nt reads that do not align to other small RNA features were considered as piRNAs. Overall miRNA expression levels in wild-type, tkv M1 -expressing, and bgcn-mutant ovaries are comparable, but those in bam-mutant ovaries appear to be lower (Fig S1B-F). Those miRNAs also exhibit similar size distribution with 22-nt-long miRNAs being the most abundant species in wild-type, tkv M1 -expressing, bam-, and bgcnmutant ovaries (Fig S1B-F). It is worth noting that 23-and 24-ntlong miRNAs are more abundant in bgcn-mutant ovaries than in tkv M1 -expressing or bam-mutant ovaries. In the future, it will be interesting to investigate if Bam and Bgcn are involved in miRNA processing in germ cells ( Fig S1E). For siRNAs, tkv M1 -expressing and bam-and bgcn-mutant ovaries appear to have less overall siRNAs than wild-type ovaries, and 21-nt-long siRNAs are the most dominant species present in wild-type and mutant ovaries (Fig S1B-E and G). These results suggest that wild-type ovaries have higher overall expression levels of siRNAs than differentiationdefective mutant ovaries, but overall miRNAs are at comparable levels among wild-type and mutant ovaries, excluding bam mutants.
The 23-30-nt-long piRNA reads exhibit the known piRNA characteristics (hereafter referred to as piRNAs) with the most abundant lengths of 24-27 nt (Juliano et al, 2011;Banisch et al, 2012;Weick & Miska, 2014;Huang et al, 2017) (Fig 1E). Comprehensive details of the genomic piRNA are provided in the Supplementary Materials (Supplemental Data 1). Although the 23-30-nt-long RNA reads mapped to TEs are more abundant than those mapped to unique genomic locations, the size distribution of piRNAs in wild-type and mutant ovaries is quite similar (Fig 1F and G) (Brennecke et al, 2007). In addition, these 23-30-nt-long RNAs are biased toward a U nucleotide at their first position ( Fig 1H). One caveat of this study is that we did not perform a Piwi/Aub/Ago3 pull-down or a detection of 29-O-methylation at the 39 end to confirm the authenticity of our identified piRNAs, which we identified based on the length. Three normalization methods, to total small RNA reads, miRNA reads, or siRNA reads, yield consistent results (Figs 1I, S1H and I, and S2G). Interestingly, these tkv M1 -expressing ovaries containing GSC-like cells have significantly more piRNAs than the wild-type ovaries, suggesting that GSCs have more piRNAs than germ cells at later developmental stages (Figs 1I, S1H, and S2G). In addition, CB-rich bam-and bgcn-mutant ovaries also express higher piRNA levels than wild-type ovaries, suggesting that early GSC progeny also have more piRNAs than germ cells at later developmental stages (Figs 1I, S1H and I, and S2G). These results, taken together, suggest that GSCs and early progeny express higher piRNA levels than further differentiated later germ cells.

Cultured GSCs express significantly more piRNAs than somatic support cells (SCs)
Because these tkv M1 -overexpressing and bam-/bgcn-mutant ovaries also contain various types of somatic cells, including follicle cells, we used our cultured GSCs for the identification of small RNAs. As reported previously (Ma et al, 2017), those GSCs can be expanded in vitro, with ovarian somatic cells, to a large number for molecular analysis, and can also be induced to differentiate into cysts upon Bam expression. Because those cultured GSCs, but not co-cultured somatic cells, express vas-GFP, we used FACS to separate GFPpositive GSCs from the co-cultured somatic cells for sequencing mRNAs and small RNAs. Based on normalized mRNA counts (comparable expression levels of mRps7 and mRpL47 encoding mitochondrial ribosomal proteins), purified GFP-positive GSCs highly express germ cell-specific genes, vasa and nos, but the GFP-negative SCs also express low levels of vasa and nos,~2-7% of those in germ cells, indicating that there are low levels of GSC contamination in SCs ( Fig S2A). In addition, those GSCs also express put and tkv (encoding type II and type I BMP receptors, respectively, for receiving BMP signal from the somatic niche) and shg (encoding E-cadherin for GSC interaction with somatic niche cells) like GSCs in vivo. The in vitro GSCsupporting SCs express the markers for escort cells (such as tkv and fax), cap cells (such as shg, lamC, and dpp) or both (such as ptc and vkg), suggesting that they resemble somatic progenitor cells, having the properties of both cap cells and escort cells (Fig 2A). In addition, we also examined the expression of various somatic and germline piRNA pathway components in GSCs and SCs. Known germline-specific piRNA components, aub, ago3, vas, qin, rhi, and cuff, are highly expressed in GSCs, whereas krimp, spn-E, tud, piwi, vret, mael, armi, and Yb are expressed in both GSCs and SCs ( Fig 2B). Surprisingly, somatic piRNA component Yb is also expressed in GSCs, which needs further in vivo confirmation (Ross et al, 2014). These results indicate that piRNA pathway components are properly expressed in cultured GSCs and SCs.
Similar to small RNAs in the ovary, more than 95% of small RNA reads from the cultured GSCs and SCs belong to piRNAs, miRNAs, and siRNAs ( Fig S2B). In this set of experiments, synthetic spike-in RNA standards were added to normalize the small RNA library sizes. miRNAs in GSCs and SCs are overall shifted to smaller sizes compared than those in the ovaries of wild-type and mutants, although 22-nt-long miRNAs are still the dominant size (Fig S2C and D). In addition, GSCs have less miRNAs and siRNAs than SCs (Fig S2E  and F). piRNA size distributions in GSCs and SCs are similar to those in the wild-type and mutant ovaries (Fig 2C-E). Both TE-derived and unique genome-derived piRNAs are more abundant in GSCs than in SCs, further supporting the notion that piRNAs have more important roles in germ cells than in somatic cells (Fig 2D and E). Those purified GSCs have more abundant piRNAs than the wild-type and mutant ovaries ( Fig S2F). Consistently, the piRNAs in these cultured GSCs have the bias toward U at the first nucleotide. Taken together, our results suggest that GSCs and early progeny have more abundant piRNAs, and thus, might also exert tighter piRNA-mediated repression of TEs than late germ cells and SCs.

Canonical piRNA clusters show distinct expression patterns in GSCs and SCs
Given the observed higher piRNA expression levels in GSC-/CBenriched ovaries and cultured GSCs, we decided to investigate how changes in overall piRNA expression levels could be related to changes in individual piRNA clusters. To do this, we examined the expression levels of canonical piRNA clusters as previously defined (Brennecke et al, 2007(Brennecke et al, , 2008. All the known major piRNA clusters show lower expression in the SCs than the wild-type and GSC/CB tumorous ovaries except cluster 8/flamenco (Fig 3). Consistent with the idea that the flamenco cluster is primarily transcribed in the ovarian soma, its expression is higher in the SCs than GSCs, but surprisingly, it is also expressed at moderate levels in purified GSCs, suggesting that it is not truly soma-specific ( Fig  3A). Most of other canonical piRNA clusters, including 1/42AB, 2/ 20A, 5/38C, 6/80E-F, 9/20B, and 11/100F, are primarily expressed in GSCs (Fig 3B-J). Taken together, these results suggest that transcripts from known piRNA clusters are differentially expressed in GSCs and SCs.
In germ cells, piRNAs can be amplified, and TEs can be effectively silenced through the "ping-pong" cycle (Brennecke et al, 2007;Gunawardane et al, 2007). We used a recently developed software package to scan for potential piRNA-derived ping-pong signals across the genome (Uhrig & Klein, 2018). For all the canonical piRNA clusters, the overall number of ping-pong signatures is high in GSCs, and is extremely low in SCs, potentially due to low-level GSC contamination, indicating that the ping-pong amplification cycle is only active in GSCs (Fig 3K). Among them, cluster5/38C and cluster1/ 42AB show the highest ping-pong signatures in GSCs, whereas the cluster2/20A exhibits the lowest ping-pong signature in GSCs (Fig  3K'). These results further support the notion that the ping-pong amplification pathway is active in GSCs but not in SCs.

Identification of new piRNA clusters in GSCs and SCs
Although known piRNA clusters contribute to a global increase of piRNAs in GSCs and CBs, we wanted to investigate if other previously un-annotated clusters also contribute to this up-regulation. To this end, we developed a sliding window algorithm that scanned the genome for regions that were at least 2,000 nt in size, more than 1,000 nt away from neighboring clusters, and that exhibited high densities of unique piRNAs. Our method successfully identified piRNA clusters that overlapped with more than 97% of the previously identified clusters, validating the effectiveness of our new algorithm (Brennecke et al, 2007;). We also provide herein detailed information about the genomic coordinates of all piRNA clusters identified and their overlaps with previous datasets (Table S1). Interestingly, these newly identified piRNA clusters are distributed throughout the genome but show some bias toward heterochromatic regions such as telomeres and centromeres ( Fig S3). Our algorithm has successfully identified 3,365 potential new piRNA clusters, 2,829 of which show relatively high expression levels in wildtype ovaries, mutant tumorous ovaries, GSCs, and SCs ( Fig 4A). Among  the newly identified clusters, 18, 28, 10, and 331 new piRNA clusters are uniquely expressed in wild-type ovaries, mutant tumorous ovaries, GSCs, and SCs, respectively (Fig 4A'). Of the newly identified piRNA clusters, 81 are up-regulated, whereas 58 are downregulated, in the ovaries carrying only GSC-/CB-like cells compared with the wild-type ovaries (false discovery rate [FDR] ≤ 0.05 and fold-change ≥ 2) (Fig 4). This proportion of up-regulated clusters versus down-regulated clusters is similar to the ratio encountered for the canonical clusters. Only 429 of the 3,356 new clusters are located in the regions that do not overlap with any annotated genes. Of these clusters, 195 are significantly upregulated in GSCs relative to SCs, indicating that most of the newly identified piRNA clusters exhibit higher expression in GSCs relative to SCs (FDR ≤ 0.05 and fold-change ≥ 2). Similar to the canonical piRNA clusters, the overall number of ping-pong signatures for our newly defined piRNA clusters is much higher in GSCs than in SCs (Fig 4B). Among them, most of the new clusters with ping-pong signatures are primarily expressed in GSCs ( Fig  4B'). For example, clusters 1,069, 1,077, 3,838, and 1,017 are expressed in GSCs at higher levels than in SCs (Fig 4C-F). Therefore, this study has identified more than 3,000 new piRNA clusters and shown that new piRNA clusters exhibiting more ping-pong signatures tend to have higher expression in GSCs relative to SCs.
TE transcripts appear to be more abundantly expressed in GSCs/ CBs than late germ cells Because the major function of piRNAs is to repress TEs in the cell, we then investigated if GSCs/CBs have less TE transcripts than late stages of germ cells by sequencing mRNAs from GSC/CB tumorous ovaries, wild-type ovaries, cultured GSCs, and SCs. Surprisingly, the GSC/CB tumorous ovaries exhibit significantly higher levels of all TE transcripts compared with wild-type ovaries (Fig 5A). Among the mutant ovaries, the bgcn-mutant ovaries show less up-regulation of TE transcripts than the bam-mutant and tkv M1 -overexpressing ones (Fig 5A). We then further investigated if higher TE transcript levels in GSCs/CBs can be attributed to particular TEs by examining the expression of individual TE elements in the wild-type and mutant ovaries. The results from the heat map show that all the known germline-enriched TEs are dramatically up-regulated in the mutant ovaries compared with those in the control, and similarly sense piRNAs produced by these TEs are also up-regulated, suggesting that higher piRNA expression levels in GSCs and CBs are due to high TE expression levels ( Fig 5B and B'). In addition, the TEs known to be expressed in both germline and soma, as well as their corresponding piRNA levels, are also drastically up-regulated in the mutant ovaries ( Fig 5B and B'). Surprisingly, the previously defined soma-specific TEs are also up-regulated in the mutant ovaries, as are their corresponding piRNA expression levels ( Fig 5B and B').
These results indicate that higher piRNA expression levels in GSCs and CBs are likely a consequence of higher TE expression levels, but not related to an increase in TE repression. Then, we examined the mRNA expression levels of TEs and piRNAs in cultured GSCs and SCs. Overall, cultured GSCs also express significantly higher TE transcripts than the wild-type ovaries, whereas SCs express even higher levels of TEs than GSCs (Fig 5A). The elevated levels of TE transcript in SCs can mostly be attributed to nine elements, including ZAM, gypsy, copia, diver, 1731, mdg3, and mdg1 (Fig 5C). In contrast, copia, 3S18, ZAM, accord, and gypsy12 are the most abundantly expressed TEs in cultured GSCs (Fig 5C). Finally, piRNA expression levels of both sense and antisense piRNAs for individual TEs are not correlated with their corresponding mRNA expression in either GSCs or SCs (Fig 5C and C"). Interestingly, the known soma-specific TEs, gypsy and ZAM, are also expressed in GSCs in addition to being highly expressed in SCs, whereas the known germline-specific TEs, including accord, Doc, TART-C, and copia, are also expressed in SCs in addition to germ cells (Fig 5D-K). It is likely that Doc, jockey, and copia are expressed at comparable levels in germ cells of different developmental stages, whereas TART-C, gypsy-2, accord, ZAM, and blood are expressed at higher levels in GSCs and early progeny than germ cells of advanced differentiated stages (Fig 5D-K). Along with earlier observations, these results show that GSCs/CBs have higher TE transcripts and piRNA levels than late stage germ cells.

piRNAs derived from 39 UTRs might be involved in the regulation of SC-and GSC-specific gene expression
It has been previously reported that piRNAs can be produced from 39 UTRs of protein-coding mRNAs, such as traffic jam (tj) Saito et al, 2009). Interestingly, 454 gene transcripts can produce piRNAs from their 39 UTRs in GSCs and SCs: 261, 169, and 24 gene transcripts produce sense, antisense, and both sense and antisense piRNAs, respectively (Fig 6A-F'). For example, CG15628 and ari-1 express 39 UTR-derived antisense piRNAs in GSCs and SCs ( Fig  6A and B), whereas CG3812 express 39 UTR-derived sense piRNAs in GSCs (Fig 6C). Among the 261 gene transcripts producing sense piRNAs, 75 of them show significantly higher expression levels in GSCs than in SCs (Fig 6C and Table S2). Among the 75 gene transcripts showing differential sense piRNA expression between GSCs and SCs, 39 of them have transposon fragments in their 39 UTRs, and the other 36 do not (Fig 6G). Interestingly, for the gene transcripts producing antisense piRNAs, 109 of them exhibit differential piRNA expression between GSCs and SCs: 108 of them show higher piRNA expression levels in GSCs, whereas only a single gene exhibits higher piRNA expression levels in SCs (Fig 6A and B and Table S3). Among the 109 gene transcripts showing differential antisense piRNA expression between GSCs and SCs, 63 of them have transposon fragments in their 39 UTRs, and the remaining 46 do not (Fig 6G). For the gene transcripts producing both sense and antisense piRNAs, 12 of them show significantly higher piRNA expression levels in GSCs than in SCs ( Fig  6D-F'). These results show that piRNA production from 39 UTRs is regulated in a cell type-dependent manner.
Next, we examined if there was any correlation between piRNA and mRNA levels in GSCs and SCs. For the gene transcripts producing only antisense piRNAs, 32 of them show differential mRNA expression between GSCs and SCs ( Fig 6H and Table S4). Interestingly, 9 and 23 genes show the expression changes in the same direction or the opposite direction for piRNAs and mRNAs, respectively (Tables S3 and S4). For the gene transcripts producing only sense piRNAs, 18 of them exhibit differential expression between GSCs and SCs: 10 and 8 show the expression changes in the same direction or the opposite direction for piRNAs and mRNAs, respectively (Fig 6H and Tables S2 and S4). These results suggest that the differential piRNA production via 39 UTRs might regulate germline-or soma-specific gene expression via complex mechanisms.
Alternative promoter usage and mRNA splicing are two mechanisms for modulating the production of different protein isoforms in GSCs and SCs Although microarray-based gene profiling for purified GSCs has been performed (Kai et al, 2005), RNA-seq has not been previously performed on purified GSCs. Our RNA sequencing results have further reaffirmed the results of previous gene expression studies (based on gene ontology [GO] term enrichment analysis for the genes expressed in GSCs). Here, we compared gene expression differences between the cultured GSCs and SCs. Interestingly, SCenriched genes are often involved in the regulation of cell-cell adhesion, extracellular matrix and cell-cell communications, further underscoring the importance of somatic cells supporting the development of GSCs and progeny (Fig S4A). In contrast, GSCenriched genes include the genes important for signal reception, RNA regulation, and chromosomal organization, further supporting the notion that the development of GSCs and their progeny is dependent on both signals from SCs as well as transcriptional and posttranscriptional regulation (Fig S4B). The enrichment of the P granules, pole plasm, and ribonucleoprotein GO terms supports this conclusion as they are related to the regulation of RNA stability and translation.
Among the genes expressed in both GSCs and SCs, 102 genes use alternative promoters to produce the different transcripts with distinct 59 ends in their mRNAs in GSCs and SCs, which can result in different 59 UTRs and/or protein N termini (Table S5). For example, tkv and nuf use different promoters in GSCs and SCs to generate mRNAs with distinct 59 ends, producing proteins with longer N termini in GSCs than in SCs (Fig 7A and B). Based on pMad and Dad-lacZ reporter, BMP signaling is only active in GSCs and newly formed CBs, but not in somatic niche cells, which also express all BMP receptors and downstream components. Excitingly, our results here show that GSCs and SCs express different tkv isoforms: the Tkv expressed by GSCs contains a signal peptide, whereas the Tkv expressed in SCs lacks most of it. The signal peptide is known to be important for proper transmembrane receptor presentation on the cytoplasmic membrane. The different Tkv isoforms in GSCs and SCs could offer an explanation as to why only GSCs, but not SCs, are responsive to the niche-derived BMP signal. Our results suggest that alternative promoter usage in GSCs and SCs could function as a mechanism for regulating gene expression.
Alternative splicing can also result in the generation of multiple isoforms from a single gene locus with different coding sequences or mRNAs with different 39 UTRs (allowing for the control of gene expression). Our results show that approximately 634 genes exhibit different splicing patterns in GSCs and SCs (Table S5). Interestingly, GSCs sometimes produce mRNAs encoding longer protein isoforms than SCs. For example, hts and pop2 generate proteins with longer C termini in GSCs than in SCs via alternative splicing (Fig 7C and D). The SC-expressed 718-aa-long Hts is known to be localized on the follicle cell membrane, whereas the GSC-expressed Hts with a long C terminus has been shown to be cleaved into different parts and is localized to fusomes and ring canals in germ cells (Whittaker et al, 1999) (Fig 7C). Pop2 has recently been shown to be important for GSC maintenance and lineage differentiation by functioning as a component in the deadenylase CCR4-NOT complex (Fu et al, 2015;Newton et al, 2015). The GSC-specific Pop2 isoform has a longer C terminus than SC-expressing one (Fig 7D). For RpL11, the SC-expressing mRNA has a longer 39 UTR than the GSC-expressing one, but they include the same coding sequence (Fig 7E). eIF4G2 and CG33111 share the first two exons but encode completely different proteins. Because of alternative splicing, only GSCs express the eIF4G2 protein, whereas SCs express the CG33111 protein (Fig 7F). eIF4G2 has been shown to be important for spermatogenesis via translational regulation (Baker & Fuller, 2007;Ghosh & Lasko, 2015). Therefore, alternative splicing functions as a mechanism to modulate both gene function and regulation in the Drosophila ovary, especially in regards to GSCs and niche cells.

Discussion
piRNAs have recently been subjected to extensive studies in the Drosophila ovary for their roles in repressing TEs in germ cells and somatic cells (Khurana & Theurkauf, 2010;Juliano et al, 2011;Banisch et al, 2012;Handler et al, 2013;Weick & Miska, 2014;Huang et al, 2017;Yamashiro & Siomi, 2018). Although piRNA pathway components have been shown to be critical for GSC maintenance (Ma et al, 2014(Ma et al, , 2017, piRNA expression in GSCs has never been compared with other stages of germ or niche cells. In this study, our results have shown that GSCs and their early progeny express higher levels of piRNAs while simultaneously exhibiting higher levels of TE transcripts than germ cells of later stages. In addition, we developed a new algorithm to identify piRNA clusters. Furthermore, we revealed that GSCs and SCs express different levels of 39 UTR-derived piRNAs. Finally, we have also shown that GSCs and SCs use alternative promoters and splicing modulation to control differential gene expression (DGE). Therefore, our results suggest that GSCs and their early progeny regulate piRNA production and/or function differently from late differentiated germ cells as well as somatic niche cells. The datasets generated by this study are a valuable resource for further investigation of piRNA function, TE repression, and DGE in GSCs, early progeny, and somatic niche cells.

GSCs and their progeny express both higher levels of piRNAs and TE transcripts than late germ cells in the Drosophila ovary
In this study, we sequenced small RNAs from wild-type ovaries, GSC-/CB-rich ovaries and cultured GSCs and niche cells. As expected, most of the abundant piRNAs come from the previously defined canonical clusters. Interestingly, most canonical clusters are up-regulated in GSCs and early progeny in comparison with late germ cells in the Drosophila ovary, including flamenco. Although flamenco has been previously defined as a somatic cell-specific cluster, our results also show that it is also expressed at moderate levels in GSCs and perhaps even in early progeny. Our new algorithm has successfully identified the canonical piRNA clusters and 3,356 new piRNA clusters. Similarly, most of the newly identified piRNA clusters exhibit significantly up-regulated expression in GSCs and early progeny compared with late germ cells. In addition, although most of them are expressed in cultured niche cells and GSCs, some of them are only restricted to germ cells or somatic cells. Based on the important role of piRNAs in repressing TEs, we might expect that GSCs and early progenitor cells would have lower levels of TE transcripts than late germ cells. Surprisingly, GSCs and early progeny have higher levels of TE transcripts than late germ cells. One plausible explanation is that GSCs and CBs have a more open chromatin than late germ cells, which results in elevated TE expression and consequently increased piRNA production. Overall, our results establish an interesting baseline for future studies into piRNA dynamics in GSCs and reveal a distinct pattern of piRNA and corresponding TE levels that has not been previously described.
This study has also identified piRNAs derived from 39 UTRs in both GSCs and SCs. Like transposon-derived piRNAs, 39 UTR-derived piRNAs can come from sense, antisense, or both strands. Unlike the previous model that 39 UTR-derived piRNA production is negatively correlated with mRNA expression, our results do not show a clear trend for the relationship between 39 UTR-derived piRNAs and mRNA levels. In addition, many 39 UTR-derived piRNAs are differentially expressed in cultured GSCs and SCs. For the 39 UTR generating both sense and antisense piRNAs, sense and antisense piRNAs do not always correlate in GSCs and SCs. Our results suggest that piRNA production from different 39 UTRs are regulated differently in GSCs and SCs.

Alternative promoter usage and splicing likely play an important role in regulating the functions of GSC/progeny and SCs
The germline segregates from the soma during early embryogenesis, and each expresses a unique set of genes that is specifically important for their corresponding lineage development. GSCs and niche cells develop from embryonic germ cells and somatic cells, respectively (Zhu & Xie, 2003;Asaoka & Lin, 2004). As expected, our RNA-seq results have identified GSC-specific and SC-specific genes that fulfill the unique functions of stem cells and niche cells. For example, the piRNA pathway genes implicated in the ping-pong piRNA amplification cycle are only expressed in GSCs, but not in SCs. In addition, we have identified the genes that are expressed in both GSCs and SCs but with different mRNA isoforms. These different isoforms can be generated from the same genes using different promoters in GSCs and SCs. Our RNA-seq results have identified many genes using different promoters in GSCs and niche cells, such as tkv and nuf. Some of these mRNAs still have the same coding region but with different 59 UTRs, which might be important for controlling protein expression levels, whereas other mRNAs encode proteins with different N-terminal sequences, which may provide an additional level of regulation. Furthermore, we have also identified hundreds of genes that exhibit different splicing patterns in GSCs and SCs, yielding mRNAs with different 59 UTRs, coding regions, and/or 39 UTRs. The mRNAs with different 59 UTRs or 39 UTRs could exhibit changes in mRNA stability, translation, or both. The mRNAs with different coding regions could also give rise to proteins with different functions. Therefore, our RNA-seq results in GSCs and SCs provide a rich resource for other investigators to study the roles of alternative promoter usage and splicing in the regulation of stem cell development.
In summary, this study has used small RNA sequencing to survey the expression of small RNAs, including siRNAs, miRNAs, and piRNAs, in GSCs, early progeny, and SCs. In addition, we have also used mRNA sequencing to carefully examine the expression of TEs and mRNAs in GSCs, early progenitor cells, and SCs. This study has revealed that GSCs and early progeny exhibit more abundant piRNAs and TEs than late germ cells.
The vasa-GFP; hs-bam bam Δ86 /bam Δ86 GSC line was established according to the published procedures (Niki et al, 2006). The GSCs cultured in 150-mm culture dishes were dissociated in StemPro Accutase Cell Dissociation Reagent (#A11105-01; Thermo Fisher Scientific) for 5 min and collected by centrifugation at 700g, 4°C for 5 min. The cell pellet was resuspended in PBS, passed through a 22 1 /2 G needle five times, and filtered with 70-μm Filcon (340605; BD). GFPpositive GSCs and GFP-negative SCs were purified by sorting dissociated single cells at 20 ψ with 100 μm tip (InFlux; BD) immediately into TRIzol LS Reagent (10296028; Thermo Fisher Scientific) for total RNA isolation. Before RNA isolation, spike-in RNA standards from ExiSEQ NGS spike-in kit (800100; ExiSEQ) were added to the samples.

Small RNA sequencing and mapping
Total RNAs from Drosophila ovaries and purified GSCs and SCs were isolated using Trizol, and further purified by organic extraction followed by isopropanol precipitation according to the manufacturer's manual. Small RNA-seq libraries (targeting small RNAs of~20 nt) were generated from 1 μg high-quality total RNA, as assessed using the Agilent 2100 Bioanalyzer. Libraries were made according to the manufacturer's directions for the TruSeq Small RNA Sample Preparation Kit (RS-200-0012; Illumina). Resulting polyacrylamide gel size-selected libraries were checked for quality and quantity using the Bioanalyzer and Qubit Fluorometer (Life Technologies). Equal molar libraries were pooled, requantified, and sequenced as 50-bp single read on the Illumina HiSeq 2500 instrument using HiSeq Control Software 2.0.10.0, 2.0.12.0, or 2.2.58. After trimming adapters from the reads, remaining reads longer than 15 nt were mapped to the canonical Drosophila melanogaster TE sequences downloaded from FlyBase (http://flybase.org). The alignment was performed using Bowtie2 v2.2.0 with default parameters (Langmead & Salzberg, 2012). Unmapped reads were subsequently mapped to the dm6 genome using Bowtie v1.0.0 with the options "-v0 -k1 -m1" (Langmead et al, 2009). Multi-mapped reads were retained exclusively for the cluster discovery step. However, for downstream analysis, multi-mapped genomic reads were discarded.
The counts of aligned reads corresponding to putative piRNAs (reads with lengths between 23 and 30 nt) overlapping unique genomic features (e.g., clusters, UTRs, etc.) were computed in R using the Bioconductor package GenomicRanges (Lawrence et al, 2013). Reads were associated with different biotypes depending on where they mapped in the genome. In the analysis, miRNA reads were identified by overlapping sequencing reads that were 24 nt or shorter to the annotated pre-miRNA locations, whereas siRNA reads were identified by overlapping 22-nt or shorter reads to the annotated endo-siRNA locations, or in the case of transposons any reads that were exactly 21 nt in length (Wen et al, 2014). The remaining 23-30-nt reads that did not align to either ribosomal RNA, small nucleolar RNA, small nuclear RNA, or tRNA features were considered piRNAs.
For the small RNA bar plots of piRNAs, 39 UTRs, and piRNA clusters, the CPM-mapped read values were normalized to the total number of mapped miRNAs, similar to previous publications (Qi et al, 2011;Hayashi et al, 2016), or for the GSC and SC samples, to the effective library size, unless otherwise stated. For the coverage plots of 39 UTRs and piRNA clusters, a pseudo-count of one was added to the counts in the coverage plots before log-transformation and loess smoothing. mRNA mapping mRNA-seq libraries were generated from 500 ng of high-quality total RNA, as assessed using the Agilent 2100 Bioanalyzer. Libraries were made according to the manufacturer's directions for the TruSeq Stranded mRNA LT Kit (RS-122-2101; Illumina). Resulting short fragment libraries were checked for quality and quantity using the Bioanalyzer or LabChip GX (Perkin Elmer) and Qubit Fluorometer (Life Technologies). Equal molar libraries were pooled, requantified, and sequenced as 50-bp single read on the Illumina HiSeq 2500 instrument using HiSeq Control Software 2.0.10.0, 2.0.12.0, or 2.2.58. After sequencing, Illumina Primary Analysis version RTA 1.18.64 or 1.17.21.3 and Secondary Analysis version CASAVA-1.8.2 or bcl2fastq2v2.17 were run to demultiplex reads for all libraries and generate FASTQ files. Reads were also first aligned to the FlyBase canonical TE sequences. Unmapped reads were subsequently mapped to the dm6 genome (Ensembl release 87) using TopHat v2.0.9 (with options "-v 2 -a -best -strata"). Gene-level reads were generated by tabulating the number of reads that uniquely overlapped with the collapsed set of exons for each gene. For the mRNA bar plots of transposons and gene expression, the Reads Per Kilobase of transcript, per Million mapped reads (RPKM) or CPM values were computed using the total number of reads mapped to transposons and exonic regions in the genome.

DGE analysis
Unless otherwise stated, all differentially expressed features were defined using the Bioconductor package edgeR using the generalized linear model approach. For mRNA analysis, features exhibiting a CPM greater than 10 in fewer than three samples were excluded from the analysis. The CPM cutoff for excluding features in the small RNA analysis was 20, 700, and, 1,000 for piRNA clusters, antisense 39 UTRs, and sense 39 UTRs, respectively. For the mutant whole ovary samples, the normalization factors of small RNA-seq reads were calculated based on miRNA counts per library to account for the high potential of global shifts in piRNA abundance between samples. The two most differentially expressed miRNA were excluded from these counts. For the GSC and SC samples, using edgeR, the normalization factors of small RNA-seq reads were calculated based on small RNA spike-ins, present in each library. In these cases, the total library size further used in generating the bar plots was the effective library size (i.e., the total number of mapped reads multiplied by normalization factors). Significantly differentially expressed features were defined as those exhibiting an absolute fold-change of at least two and a FDR of less than or equal to 0.05. The FDR was calculated using the Benjamini-Hochberg procedure (Benjamini, 1995).

Cluster analysis
Genomic locations of canonical piRNA clusters were downloaded from previous publications (Brennecke et al, 2007;. Genomic coordinates of each were converted to the current genome (dm6) assembly using the UCSC liftOver tool. Cluster fragments were collapsed and those separated by less than 25 kb were merged and only the largest remaining fragment was kept. Only clusters found on the standard chromosomes were used for downstream analysis. Furthermore, in the DGE analysis, only clusters that did not overlap with any known genomic features were used.
To discover novel clusters, we developed a sliding window algorithm that scanned the genome for regions that contained putative piRNAs, reads ranging in size from 23 to 35 nt. Identified regions within 250 nt of one another were merged. The resulting regions were filtered, keeping only those with at least three unique piRNA reads per kilobase. Of the remaining clusters, neighboring regions within 1,000 nt of one another were merged. Finally, only clusters greater than 2,000 nt in size were retained for analysis.

Sequence logos
piRNA sequence logos were generated using the R package ggseqlogo. For each experiment, the starting position of reads identified as piRNA were collapsed and only unique start sites were retained. These sites were extended by 10 nt and their underlying DNA sequences were used to generate each logo.

GO enrichment
GO Term analysis was performed using the Bioconductor GOstats package (Falcon & Gentleman, 2007). A maximum of 20 GO terms were retained for each comparison.

Alternative splicing
Changes in exon usage were investigated using the Bioconductor DEXSeq package (Anders et al, 2012). Significant events were identified as those exhibiting an adjusted P-value of less than or equal to 0.01 and an absolute fold-change of greater than or equal to log2(3) when comparing GSC and SC samples. Sashimi plots were generated using rmats2sashimiplot.py (https://github.com/Xinglab/ rmats2sashimiplot).

Data Availability
All the Drosophila stocks are available upon request, and the GEO accession number for all the sequencing raw data is GSE119862.