Abstract
Essential genes have been studied through copy number variants and deletions, both of which are associated with introns. The premise of our work is that introns of essential genes have distinct characteristic properties. We provide support for this by training a deep learning model and demonstrating that introns alone can be used to classify essentiality. When limited to first introns, the model performs even better, implicating first introns in essentiality. We identify unique properties of introns of essential genes, finding that their structure protects against deletion and intron-loss events, especially centered on the first intron. We show that GC density is increased in the first introns of essential genes, allowing for increased enhancer activity, protection against deletions, and improved splice site recognition. We find that first introns of essential genes are remarkably smaller than their nonessential counterparts and that, to protect against common 3′ end deletion events, essential genes carry an increased number of (smaller) introns. To demonstrate the importance of the seven features we identified, we train a feature-based model using only these features and achieve high performance.
Introduction
Essential genes, those for which a single gene knockout results in lethality or severe loss of fitness, have been well studied in many bacterial genomes to develop therapeutic targets for pathogens. Now, stemming from the discovery that the loss of a nearby essential gene can introduce latent vulnerabilities specific to cancer cells, the study of human essential genes has become of practical importance (1). This importance is magnified because essential genes for cancer cell growth are found to be located close to target deletion genes (1). Therefore, identifying properties of essential genes can further therapeutic development.
Older genes, with earlier phyletic origin, are more likely to be essential, as are genes that are hubs in major protein–protein interaction (PPI) networks (2, 3, 4). Essential genes are highly connected with many protein systems, and thus consistent transcription timing, maintenance of transcript length, and conservation of gene regulation are of high importance (5). Identification of human essential genes has been approached through single-gene knockouts, high-throughput mutagenesis, RNAi, and, in the most recent work, CRISPR–Cas9 editing (6).
However, moving toward in vivo analysis of gene essentiality to lend more practical therapeutic insight, studies have focused on the close link between duplication and gene essentiality (7, 8).
Duplication is a biological mechanism used throughout evolution to generate new genetic material (7). A positive association between singleton, highly expressed, developmental genes and essentiality has been observed, suggesting that essential genes resist duplication events (7, 9). Stemming from these results, copy number variants, which result from unequal crossing-over, retroposition, or chromosomal duplication, were included in efforts to identify essential human genes (1, 10). Intron loss, which occurs at a markedly higher rate after gene duplication, is the most frequent copy number variant in humans, suggesting a likely link between introns and gene essentiality (11, 12).
Introns, which make up more than half of the noncoding genome, have important regulatory and evolutionary functions. Intron losses and deletions can modulate gene expression patterns and even alter gene function (11). Typically occurring at the 3′ end of a gene, losses and deletions arise from recombination of a gene with its reverse-transcribed RNA during duplication events or through irregular splice sites (10, 13). Furthermore, intron deletions are most common in longer introns (12). Intron 1, typically the longest intron, undergoes frequent deletions (30.4% of all known deletions), which are especially serious because the first intron preferentially contains regulatory regions and exhibits the highest density of chromatin marks allowing for gene expression (13, 14, 15). GC patterns in intronic sequences are associated with an increase in enhancer activity, correct splice site recognition, and protection from intronic deletions (12, 16, 17).
It has been suggested that in highly expressed genes, selection has resulted in smaller introns that reduce transcriptional cost, which agrees with reports of shorter introns in essential genes (12, 18). Adding to the apparent importance of introns in essential genes, intron deletions in three essential yeast genes drastically decreased RNA levels and caused major growth defects (19).
Owing to the capability of intron losses and deletions to alter gene duplication, expression, and transcription timing, we hypothesize that essential genes, which demand consistency, have developed systems to minimize these events. Our study therefore aims to (i) identify whether essential gene introns differ from those of nonessential genes and (ii) characterize the unique properties of essential gene introns to allow for later therapeutic developments.
Results
We extracted 2,135 introns from 165 human essential genes, 74,147 introns from 6,716 human conditional genes, and 115,089 introns from 12,449 human nonessential genes from the Ensembl database (20, 21). Human gene essentiality data were gathered from the Online GEne Essentiality (OGEE) database, which gathers experimental data from 18 large-scale experiments to classify genes by essentiality (3, 6). Conditional genes are genes for which experiments have disagreed on essentiality.
We trained a convolutional neural network, based on DeepBind, to predict gene essentiality from recurring base-pair motifs in 1,000-bp intronic sequence inputs (22) (Fig 1A). We set aside 20% of the data as a test set and selected hyperparameters by grid search with threefold cross-validation, balancing the training and validation sets between essential and nonessential classes and keeping all introns of a gene in the same set so that no gene-specific information affects the validity of our test-set accuracy (see the Training procedure section). We trained two separate models using the first and last 1,000 bp of introns and combined them in a double classifier that averages the essentiality scores given by both models across all introns of a gene. The double classifier optimizes the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which quantifies the diagnostic ability of the model. Because the neural network predicts either essentiality or nonessentiality, we classified conditional genes from the database as essential if more than 50% of experiments agreed on essentiality. If introns of essential and nonessential genes had no markedly characteristic properties, we would expect an AUC of 0.5. Instead, our double classifier achieved an AUC of 0.846 (Fig 1B).
Our results demonstrate that introns of essential and nonessential genes have unique properties. To identify these properties, we used a computational approach. We also found that the model performs better at classifying introns of strictly essential or nonessential genes, suggesting that conditional genes do not fit well into either the essential or the nonessential motifs. Therefore, we now include all OGEE-classified "conditional genes" as separate entities in our computational characterization of intron properties by essentiality.
Although introns of essential genes differ from introns of nonessential genes by size and number (Fig 2A–C), they also differ by base-specific traits. GC motif density is significantly greater in the first introns of essential genes (Fig 3A and B). To account for CpG island presence as a potential confounder of GC density, we show the distributions of CpG island presence in introns split by first versus later introns and by essentiality. These results support that CpG island presence is not a confounder of the GC density results, especially as essential first introns are found to contain a CpG island less frequently than nonessential first introns, whereas conditional first introns contain them most frequently, a pattern that does not parallel the distribution of GC density among the six classes. Furthermore, we show that GC content, with GC motif density subtracted, is significantly greater in the first introns of essential genes than in the first introns of nonessential genes and than in later introns of both essential and nonessential genes. However, essential later introns have significantly lower GC content than nonessential later introns. GC content has a remarkably similar distribution to GC density (Fig 3A and B).
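For concreteness, the sketch below shows one way the two sequence statistics discussed here could be computed. It is a minimal illustration under our reading of the text, in which "GC motif density" is the number of GC dinucleotide occurrences per base and "GC content" is the fraction of G and C nucleotides; the exact definitions used in the analysis may differ.

```python
# Minimal sketch (assumed definitions, not necessarily the exact ones used in the analysis):
# "GC motif density" = occurrences of the dinucleotide "GC" per base of intron;
# "GC content"       = fraction of G and C bases in the sequence.

def gc_motif_density(seq: str) -> float:
    """Occurrences of the 'GC' dinucleotide per base of sequence."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "GC")
    return hits / len(seq)

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

# Example usage on a toy intron fragment
print(gc_motif_density("ATGCGCGTTAGCCGCATA"), gc_content("ATGCGCGTTAGCCGCATA"))
```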
Eukaryotic intron 5′ and 3′ splice sites for pre-mRNA processing, the 5′-GU…AG-3′ boundaries, are highly conserved, although some minor classes of introns have different boundaries (23). We report that the first introns of essential genes are less likely to have an unusual 5′ or 3′ splice site than the first introns of conditional and nonessential genes. The same trend holds, to a lesser degree, for the 5′ splice site in later introns (Fig 3C and D).
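As an illustration, a check for non-canonical boundaries could look like the sketch below, assuming introns are supplied as genomic DNA sequences in the sense orientation, so that the canonical boundaries are GT…AG at the DNA level (GU…AG in the pre-mRNA). This is a simplified check rather than the exact procedure used.

```python
def has_unusual_splice_sites(intron_seq: str):
    """Return (unusual_5prime, unusual_3prime) flags for non-canonical boundary dinucleotides."""
    seq = intron_seq.upper()
    unusual_5p = not seq.startswith("GT")   # canonical donor site: GT (GU in RNA)
    unusual_3p = not seq.endswith("AG")     # canonical acceptor site: AG
    return unusual_5p, unusual_3p
```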
Having identified the above features that differentiate introns of essential genes from introns of nonessential genes, we trained a feature-driven deep learning model to predict gene essentiality so as to determine the importance of the identified features. By training the model only on the seven features we identified (average intron size, number of introns in the gene, intronic bp in the gene, GC density [first intron], GC density [later introns], GC count [not including GC motifs] [first intron], and GC count [not including GC motifs] [later introns]), we can assess how much these features contribute to essentiality. Each feature vector corresponds to one gene, where the later-intron features are the mean of the value over all later introns. We trained an ensemble deep learning model that combines the results of multiple (10) neural network models so as to reduce variance and generalization error. The final AUC obtained was 0.787. This provides strong evidence for the importance of the seven identified features in characterizing essential and nonessential genes.
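A hedged sketch of how the seven per-gene features could be assembled is shown below, reusing the GC helpers sketched above. The handling of single-intron genes and the reading of "GC count (not including GC motifs)" as GC content with GC motif density subtracted are our assumptions for illustration.

```python
import numpy as np

def gene_feature_vector(introns):
    """Build the seven-feature vector for one gene from its ordered list of intron sequences."""
    lengths = [len(s) for s in introns]
    first = introns[0]
    later = introns[1:] if len(introns) > 1 else [first]   # assumption: fall back to the first intron

    def gc_minus_motifs(s):
        # one reading of "GC count, not including GC motifs"
        return gc_content(s) - gc_motif_density(s)

    return np.array([
        np.mean(lengths),                                   # average intron size
        len(introns),                                       # number of introns in gene
        np.sum(lengths),                                    # intronic bp in gene
        gc_motif_density(first),                            # GC density, first intron
        np.mean([gc_motif_density(s) for s in later]),      # GC density, mean over later introns
        gc_minus_motifs(first),                             # GC count (excl. GC motifs), first intron
        np.mean([gc_minus_motifs(s) for s in later]),       # GC count (excl. GC motifs), later introns
    ])
```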
Discussion
Although essentiality is not wholly an intrinsic property of a gene, the ability of our model to predict essentiality or nonessentiality from intronic sequences alone, and of our later model from just seven features, suggests that there exist characteristic motifs unique to introns of essential genes. The first model's accuracy in identifying essential introns increases when only the first intron is tested, as demonstrated by the greater AUC. This suggests that the first introns of essential genes have especially unique characteristics when compared with the first introns of nonessential genes. We followed up on these results with computational analysis of intronic sequences of essential, conditional, and nonessential genes with regard to all introns, only first introns, and only later introns. The primary findings can be summarized as follows: (i) first introns of essential genes are much shorter than first introns of nonessential genes, (ii) essential genes have more introns per gene, but these later introns are markedly shorter than the later introns found in nonessential genes, (iii) essential first introns have a greater GC density than first introns of nonessential genes as well as later essential introns, and (iv) essential first introns, and essential later introns slightly less so, infrequently have unusual 5′ or 3′ splice sites compared with the first introns of nonessential genes.
Using these features to train a feature-based deep learning model to predict gene essentiality, we provide evidence that these seven features are major contributing differences between introns of essential and nonessential genes.
From these results, essential genes appear to exhibit intronic characteristics that protect their first introns from losses and deletions. The first intron is crucial for regulation of gene expression; for essential genes, which are central to PPI hubs, any deletion in the first intron has the potential to disrupt an entire network (3, 14). Because deletions occur at much higher frequency in longer introns, it is notable that we found essential first introns to be, on average, less than one-third the size of nonessential first introns (12). First introns of essential genes were also found to have a greater GC motif density, which allows for an increase in enhancer activity, correct splice site recognition, and protection from intron deletions (12, 16, 17). The distribution of GC count closely resembled that of GC motif density, potentially suggesting that essential first introns have undergone a marked increase in GC content so as to increase GC motif frequencies for the purposes discussed above. Similarly, as unusual splice sites can allow for alternative splicing, introns of essential genes, especially first introns, have the lowest frequency of unusual 5′ and 3′ splice sites (24). Furthermore, as most deletions occur at the 3′ end, essential genes have an increased number of introns; these later introns, however, are smaller than the average nonessential intron, limiting the long introns in essential genes that would otherwise invite intron losses or deletions. Because deletions in introns of essential genes would alter transcript length and thus interrupt the timing of a complex molecular network, the unique properties of essential introns appear to have been selected to avoid intron losses and deletions.
Whereas we classify essential genes based on intronic patterns, other methods of predicting essential genes have been based on support vector machines (SVMs), information-theoretic statistics, and PPI network leverage (25, 26, 27). Information-theoretic approaches, trained and tested on the same organism, have shown AUC scores of 0.73–0.90 with an average of 0.84, slightly lower than our double classifier AUC of 0.846 (26). However, these information-theoretic approaches were not applied to human genes and show lowered accuracy when applied across organisms. SVM approaches to human essential genes based on 800 selected features report a mean AUC of 0.8347 and a highest AUC of 0.8854, thus with accuracy similar to our intron-based model and only slightly higher than our seven feature–based deep learning model (28). Models leveraging PPI networks combined with feature information have shown success: DeepHe, which uses PPI network information and 89 sequence-derived features, reported an AUC of 0.94, outperforming SVM, Naive Bayes, Random Forest, and AdaBoost models (29). However, that our intron-sequence model achieves an AUC of 0.846 and our seven feature–based model an AUC of 0.787 is a surprising result because of the previously unknown role of introns in gene essentiality. With further characterization of introns in essential genes, future feature-based models may rival current PPI approaches. Future studies can apply such algorithms to identify unrecognized essential genes and verify the findings in the laboratory. It would be of particular interest to determine the shared characteristics of false positives, that is, nonessential genes whose introns are classified as essential. Such an analysis would further uncover what our model has learned, further clarifying the role of introns in gene essentiality.
Conditional genes share characteristics of both essential and nonessential genes, suggesting a middle ground between gene stability and alteration of gene functionality. This middle ground is necessary for successful evolution of the genome. We hypothesize that this reflects pressure on the genome both to innovate its genes and to conserve its most essential genes. Whereas selecting for deletion-averse essential intron systems promotes basic network stability, selecting for long first introns in nonessential genes allows deletions to alter the regulation of nonessential genes and even innovate gene function.
The results presented here introduce the concept that essential genes have introns characteristically distinct from those of nonessential genes, and we identify seven features that characterize this difference. These differences, as outlined above, may eventually be exploited to target tumors by disrupting nearby essential genes (1). Interrupting the complex safety net around the first intron can alter regulation and thereby disrupt a network necessary for tumor growth. Similarly, targeted CRISPR–Cas9 therapies that force deletions of introns within carefully selected essential genes could stunt cancers.
Materials and Methods
Model
Our deep learning model is a convolutional neural network based on DeepBind, a predictive model that has shown state-of-the-art performance in predicting sequence specificities of DNA- and RNA-binding proteins (22). Our model predicts the essentiality of the gene of an intronic sequence “s” by calculating an essentiality score f(s) = net(pool(rect(conv(s)))). Fig 1A depicts our model architecture. Our model accepts 1,000-bp sequences encoded as one-hot vectors. The convolutional layer (conv) contains multiple filters that detect motifs within the intronic sequence. We apply the ReLU activation function (rect), then the pooling layer (pool) averages each filter’s response across the sequence to determine the cumulative presence of motifs. The resulting values are fed into a small neural network (net) consisting of a fully connected layer followed by a two-value softmax output layer corresponding to the probabilities of the parent gene being essential or nonessential. The fully connected layer also uses the ReLU activation function, and the softmax function is applied to the output to normalize prediction probabilities. We prevent our model from overfitting by using L1 and L2 regularization as well as dropout (30).
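A minimal Keras sketch of this architecture is given below. The filter count, window size, hidden-layer size, and regularization strengths are placeholders (the actual values were chosen by grid search; see the Training procedure section), so the sketch illustrates the layer structure rather than the final model.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_intron_cnn(seq_len=1000, n_filters=16, window=24,
                     dense_units=32, dropout=0.5, l1=1e-6, l2=1e-4):
    inputs = layers.Input(shape=(seq_len, 4))                     # one-hot encoded intron segment
    x = layers.Conv1D(n_filters, window, activation="relu",       # conv + rect: motif detectors
                      kernel_regularizer=regularizers.l1_l2(l1=l1, l2=l2))(inputs)
    x = layers.GlobalAveragePooling1D()(x)                        # pool: average each filter's response
    x = layers.Dense(dense_units, activation="relu",              # net: fully connected layer
                     kernel_regularizer=regularizers.l1_l2(l1=l1, l2=l2))(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(2, activation="softmax")(x)            # P(essential), P(nonessential)
    return models.Model(inputs, outputs)
```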
Data
Human DNA sequences and annotations were collected from the Ensembl genome database project (20, 21, 31). We used the transcript for each gene corresponding to its longest coding sequence, as this has been suggested, in recent work, to be the most accurate and most biologically relevant choice (32 Preprint). However, both the deep learning results and the computationally identified feature results were extraordinarily similar when using introns from the longest transcript of each gene. We used the provided annotations to separate out intronic sequences. Before training, the intronic sequences are transformed using one-hot encoding such that each sequence of length L is represented as an L × 4 matrix. For our analysis of CpG island presence, we used the algorithm of Takai and Jones to detect CpG islands in a nucleotide sequence (33, 34).
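The sketch below illustrates the one-hot encoding step. Encoding ambiguous bases (e.g., N) as all-zero rows and zero-padding sequences shorter than the 1,000-bp window are conventions we assume here; the original preprocessing may differ in these details.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str, length: int = 1000) -> np.ndarray:
    """Encode a sequence as an L x 4 matrix (columns for A, C, G, T)."""
    seq = seq.upper()[:length]
    mat = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(seq):
        j = BASE_INDEX.get(base)
        if j is not None:                 # ambiguous bases (e.g., N) stay all zeros
            mat[i, j] = 1.0
    return mat                            # shorter sequences are zero-padded
```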
We assign labels using gene essentiality information from OGEE, which classifies genes by essentiality based on experimental data (3, 6). OGEE gathers data from 18 databases of large-scale experiments to provide a reference of how many studies found a gene essential or nonessential (3, 6). For the model, because of the ambiguity of conditional genes, we discard all conditional genes that have been found to be essential in less than half of the studies. Genes are assigned binary labels of essential or nonessential, where the remaining conditional genes are grouped with the essential genes.
We trained two models, one on the first 1,000 bp of introns, and one on the last 1,000 bp. This includes the 5′ splice site in the first 1,000 bp, as well as the 3′ splice site and the branch site in the last 1,000 bp. These are the three best characterized regions of eukaryotic introns and are the sites that are most directly involved in spliceosomal modification of the transcript to form mRNA.
Training procedure
We set aside 20% of the data for the test set and use a three-way random split on the training data to perform threefold cross-validation for hyperparameter selection. We selected our model's hyperparameters by performing a grid search over the dropout rate, convolutional layer window size, activation function, and L2 regularization strength. We assessed 36 potential models based on average validation performance across the three folds. The best hyperparameters are then used to train the final model on the entire training set. At training time, we balance our training and validation sets by equally sampling from the essential and nonessential classes. We also ensure that all the introns of a gene lie in the same set so that no gene-specific information affects the validity of our accuracy on the test set.
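The sketch below illustrates the two constraints described above, using scikit-learn's group-aware splitters so that all introns of a gene stay in one set, and downsampling the majority class for balance (the original balancing scheme may instead sample equally at training time). Variable names (X, y, gene_ids) are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

def gene_aware_split(X, y, gene_ids, test_size=0.2, seed=0):
    """Hold out 20% of genes (and all of their introns) as the test set."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=gene_ids))
    return train_idx, test_idx

def gene_aware_cv_folds(X, y, gene_ids, n_folds=3):
    """Threefold cross-validation folds that never split a gene's introns across folds."""
    return list(GroupKFold(n_splits=n_folds).split(X, y, groups=gene_ids))

def balance_classes(indices, y, seed=0):
    """Sample equally from the essential (1) and nonessential (0) classes."""
    rng = np.random.default_rng(seed)
    idx = np.asarray(indices)
    pos, neg = idx[y[idx] == 1], idx[y[idx] == 0]
    n = min(len(pos), len(neg))
    return np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
```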
We trained all models using Adam gradient descent and a cross-entropy loss minimization objective (35 Preprint). The model is trained for 30 epochs with a batch size of 64. We implemented our model using the Keras library running on TensorFlow and trained on an NVIDIA Tesla M60 GPU.
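A minimal sketch of this training configuration, using the build_intron_cnn sketch above, is shown below; the learning rate and any callbacks are not specified in the text, so Keras defaults are assumed.

```python
def train_intron_model(X_train, y_train, X_val, y_val):
    """Train the CNN sketch with Adam and cross-entropy for 30 epochs at batch size 64."""
    model = build_intron_cnn()
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",   # matches the two-value softmax output
                  metrics=["accuracy", "AUC"])
    model.fit(X_train, y_train,                      # y_* one-hot encoded with shape (n, 2)
              validation_data=(X_val, y_val),
              epochs=30, batch_size=64)
    return model
```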
Prediction and evaluation
We evaluate our model on our test set using the area under the curve (AUC) of the ROC curve, which measures how well our model distinguishes between essential and nonessential classes. The model produces an essentiality score corresponding to the predicted confidence in the essentiality of the gene of an intron, and the ROC curve is generated by measuring the sensitivity and specificity of the model at varying prediction thresholds of the essentiality score. We also took advantage of both of our models to better classify an intron by averaging the scores produced by our two models on the first and last 1,000 bp of the intron.
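For illustration, evaluation could look like the sketch below, where the per-intron scores from the two models are averaged before computing the ROC curve and AUC with scikit-learn; array names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_auc(y_true, scores_first_1000bp, scores_last_1000bp):
    """AUC of the ROC curve for averaged per-intron essentiality scores."""
    scores = 0.5 * (np.asarray(scores_first_1000bp) + np.asarray(scores_last_1000bp))
    fpr, tpr, _ = roc_curve(y_true, scores)
    return roc_auc_score(y_true, scores), fpr, tpr
```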
Our model can be extended to classify entire genes with even better accuracy. Rather than classifying the essentiality of individual introns, we classify whether an entire gene is essential or nonessential by combining information from all of its introns. To classify in this manner, we introduce a majority classification method. We accept the list of all intronic sequences of a specific gene and run each individual intron through the model to get the essentiality score of each intron. Then we calculate a gene’s essentiality score as the mean of the essentiality scores of its introns.
We attained our highest AUC using a double majority classifier which uses both the first 1,000 and last 1,000 bp of each intron to classify a gene. We run the first and last 1,000 bp from each intron through the models trained on the first and last 1,000 bp of each intron, respectively. Then we similarly calculate a gene’s essentiality score as the mean of the essentiality scores of its introns from both models.
By combining information from multiple parts of multiple introns, the double majority classifier achieves the highest accuracy.
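A hedged sketch of this gene-level aggregation is given below: each gene's essentiality score is the mean of the per-intron scores produced by both models. The input format (gene ID paired with the two models' scores for each intron) is an assumption for illustration.

```python
from collections import defaultdict
import numpy as np

def gene_essentiality_scores(intron_scores):
    """intron_scores: iterable of (gene_id, score_from_first_1000bp_model, score_from_last_1000bp_model)."""
    per_gene = defaultdict(list)
    for gene_id, s_first, s_last in intron_scores:
        per_gene[gene_id].extend([s_first, s_last])          # pool scores from both models
    return {gene: float(np.mean(scores)) for gene, scores in per_gene.items()}
```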
Feature model
Our final feature-based model used seven features extracted from each gene. The normalized feature vectors are used to train a neural network consisting of several fully connected layers with a binary output. The architecture and hyperparameters of this network were selected by a grid search over the number of hidden layers, the number of nodes in each hidden layer, the dropout rate, and the L2 regularization strength. Models were evaluated via threefold cross-validation, just as with the convolutional model, with the best hyperparameters used to train the final model on the entire training set. We trained all feature models using Adam gradient descent and a cross-entropy loss minimization objective (35 Preprint). The model is trained for 30 epochs with a batch size of 64. For final evaluation, we trained an ensemble deep learning model that combines results from multiple (10) neural network models so as to reduce variance and generalization error.
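A hedged sketch of the feature-based ensemble is shown below: ten small fully connected networks are trained on the normalized seven-feature vectors and their predictions are averaged. Layer sizes, dropout, and regularization strength are placeholders for illustration, not the grid-searched values.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_feature_net(n_features=7, hidden=(32, 16), dropout=0.3, l2=1e-4):
    """Small fully connected network with a binary essentiality output."""
    inputs = layers.Input(shape=(n_features,))
    x = inputs
    for units in hidden:
        x = layers.Dense(units, activation="relu",
                         kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model

def ensemble_predict(X_train, y_train, X_test, n_models=10):
    """Average predictions over an ensemble of independently initialized networks."""
    preds = []
    for seed in range(n_models):
        tf.keras.utils.set_random_seed(seed)      # vary initialization across ensemble members
        net = build_feature_net()
        net.fit(X_train, y_train, epochs=30, batch_size=64, verbose=0)
        preds.append(net.predict(X_test, verbose=0).ravel())
    return np.mean(preds, axis=0)                 # averaged essentiality scores
```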
Data Availability
All data were obtained from publicly available data sources as described in the methodology. All the code used for data processing, figure generation, and model training, as well as the weights of our final models, are provided at https://github.com/evendrow/Intron-Essentiality/.
Acknowledgments
Author Contributions
E Schonfeld: conceptualization, data curation, formal analysis, supervision, investigation, visualization, methodology, and writing—original draft, review, and editing.
E Vendrow: data curation, software, investigation, visualization, methodology, and writing—review and editing.
J Vendrow: software, investigation, methodology, and writing—review and editing.
E Schonfeld: conceptualization, formal analysis, investigation, visualization, and writing—review and editing.
Conflict of Interest Statement
The authors declare that they have no conflict of interest.
- Received November 1, 2020.
- Revision received April 12, 2021.
- Accepted April 15, 2021.
- © 2021 Schonfeld et al.
This article is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by/4.0/).