Abstract
Essential genes have been studied through copy number variants and deletions, both of which are associated with introns. The premise of our work is that introns of essential genes have distinct characteristic properties. We provide support for this by training a deep learning model and demonstrating that introns alone can be used to classify essentiality. When limited to first introns, the model performs even better, implicating first introns in essentiality. We identify unique properties of introns of essential genes, finding that their structure protects against deletion and intron-loss events, especially centered on the first intron. We show that GC density is increased in the first introns of essential genes, allowing for increased enhancer activity, protection against deletions, and improved splice site recognition. We find that first introns of essential genes are remarkably smaller than their nonessential counterparts and that, to protect against common 3′ end deletion events, essential genes carry an increased number of (smaller) introns. To demonstrate the importance of the seven features we identified, we train a feature-based model using only these features and achieve high performance.
Introduction
Essential genes, those for which a single gene knockout results in lethality or severe loss of fitness, have been well studied in many bacterial genomes to develop therapeutic targets for pathogens. Now, stemming from the discovery that the loss of a nearby essential gene can introduce latent vulnerabilities specific to cancer cells, the study of human essential genes has become of practical importance (1). This importance is magnified because essential genes for cancer cell growth are found to be located close to target deletion genes (1). Therefore, identifying properties of essential genes can further therapeutic development.
Older genes, with earlier phyletic origin, are more likely to be essential, as are genes that are hubs in major protein–protein interaction (PPI) networks (2, 3, 4). Essential genes are highly connected with many protein systems, and thus consistent transcription timing, maintenance of transcript length, and conservation of gene regulation are of high importance (5). Identification of human essential genes has been approached through single-gene knockouts, high-throughput mutagenesis, RNAi, and, in the most recent work, CRISPR–Cas9 editing (6).
However, moving toward in vivo analysis of gene essentiality to lend more practical therapeutic insight, studies have focused on the close link between duplication and gene essentiality (7, 8).
Duplication is a biological mechanism used throughout evolution to generate new genetic material (7). A positive association between singleton, highly expressed, developmental genes and essentiality has been observed, suggesting that essential genes resist duplication events (7, 9). Stemming from these results, copy number variants, which result from unequal crossing-over, retroposition, or chromosomal duplication, were included in efforts to identify essential human genes (1, 10). Intron loss, which occurs at a markedly higher rate after gene duplication, is the most frequent copy number variant in humans, suggesting a likely link between introns and gene essentiality (11, 12).
Introns, which make up more than half of the noncoding genome, have important regulatory and evolutionary functions. Intron losses and deletions can modulate gene expression patterns and even alter gene function (11). Typically occurring at the 3′ end of a gene, losses and deletions arise from recombination of a gene with its reverse-transcribed RNA during duplication events or through irregular splice sites (10, 13). Furthermore, intron deletions are most common in longer introns (12). Intron 1, typically the longest intron, undergoes frequent deletions (30.4% of all known deletions), which are especially serious because the first intron preferentially contains regulatory regions and exhibits the highest density of chromatin marks allowing for gene expression (13, 14, 15). GC patterns in intronic sequences are associated with an increase in enhancer activity, correct splice site recognition, and protection from intronic deletions (12, 16, 17).
It has been suggested that in highly expressed genes, selection has resulted in smaller introns that reduce transcriptional cost, which agrees with reports of shorter introns in essential genes (12, 18). Adding to the apparent importance of introns in essential genes, intron deletions in three essential yeast genes drastically decreased RNA levels and caused major growth defects (19).
Owing to the capability of intron losses and deletions to alter gene duplication, expression, and transcription timing, we hypothesize that essential genes, which demand consistency, have developed systems to minimize these events. Our study therefore aims to (i) identify whether essential gene introns differ from those of nonessential genes and (ii) characterize the unique properties of essential gene introns to allow for later therapeutic developments.
Results
We extracted 2,135 introns from 165 human essential genes, 74,147 introns from 6,716 human conditional genes, and 115,089 introns from 12,449 human nonessential genes from the Ensembl database (20, 21). Human gene essentiality data were gathered from the Online GEne Essentiality (OGEE) database, which gathers experimental data from 18 large-scale experiments to classify genes by essentiality (3, 6). Conditional genes are genes for which experiments have disagreed on essentiality.
We trained a convolutional neural network, based on DeepBind, to predict gene essentiality from recurring base-pair motifs in 1,000-bp intronic sequence inputs (22) (Fig 1A). We set aside 20% of the data as a test set and selected hyperparameters by grid search with threefold cross-validation, balancing the training and validation sets between essential and nonessential classes and keeping all introns of a gene in the same set so that no gene-specific information affects the validity of our test-set accuracy (see the Training procedure section). We trained two separate models using the first and last 1,000 bp of introns and combined them in a double classifier that averages the essentiality scores given by both models across all introns of a gene. The double classifier optimizes the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which quantifies the diagnostic ability of the model. Because the neural network predicts either essentiality or nonessentiality, we classified conditional genes from the database as essential if more than 50% of experiments agreed on essentiality. If introns of essential and nonessential genes had no markedly characteristic properties, we would expect an AUC of 0.5. Instead, our double classifier achieved an AUC of 0.846 (Fig 1B).
Our results demonstrate that introns of essential and nonessential genes have unique properties. To identify these properties, we used a computational approach. We also found that the model performs better at classifying introns of strictly essential or nonessential genes, suggesting that conditional genes do not fit well into either the essential or the nonessential motifs. Therefore, we now include all OGEE-classified "conditional genes" as separate entities in our computational characterization of intron properties by essentiality.
Although introns of essential genes differ from introns of nonessential genes by size and number (Fig 2A–C), they also differ by base-specific traits. GC motif density is significantly greater in the first introns of essential genes (Fig 3A and B). To account for CpG island presence as a potential confounder of GC density, we show the distributions of CpG island presence in introns split by first versus later introns and by essentiality. These results support that CpG island presence is not a confounder of the GC density results, especially as essential first introns are found to contain a CpG island less frequently than nonessential first introns, whereas conditional first introns contain them most frequently, a pattern that does not parallel the distribution of GC density among the six classes. Furthermore, we show that GC content, with GC motif density subtracted, is significantly greater in the first introns of essential genes than in the first introns of nonessential genes and than in later introns of both essential and nonessential genes. However, essential later introns have significantly lower GC content than nonessential later introns. GC content has a remarkably similar distribution to GC density (Fig 3A and B).
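For concreteness, the sketch below shows one way the two sequence statistics discussed here could be computed. It is a minimal illustration under our reading of the text, in which "GC motif density" is the number of GC dinucleotide occurrences per base and "GC content" is the fraction of G and C nucleotides; the exact definitions used in the analysis may differ.

```python
# Minimal sketch (assumed definitions, not necessarily the exact ones used in the analysis):
# "GC motif density" = occurrences of the dinucleotide "GC" per base of intron;
# "GC content"       = fraction of G and C bases in the sequence.

def gc_motif_density(seq: str) -> float:
    """Occurrences of the 'GC' dinucleotide per base of sequence."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "GC")
    return hits / len(seq)

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

# Example usage on a toy intron fragment
print(gc_motif_density("ATGCGCGTTAGCCGCATA"), gc_content("ATGCGCGTTAGCCGCATA"))
```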
Eukaryotic intron 5′ and 3′ splice sites for pre-mRNA processing, the 5′-GU…AG-3′ boundaries, are highly conserved, although some minor classes of introns have different boundaries (23). We report that the first introns of essential genes are less likely to have an unusual 5′ or 3′ splice site than the first introns of conditional and nonessential genes. The same trend holds, to a lesser degree, for the 5′ splice site in later introns (Fig 3C and D).
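As an illustration, a check for non-canonical boundaries could look like the sketch below, assuming introns are supplied as genomic DNA sequences in the sense orientation, so that the canonical boundaries are GT…AG at the DNA level (GU…AG in the pre-mRNA). This is a simplified check rather than the exact procedure used.

```python
def has_unusual_splice_sites(intron_seq: str):
    """Return (unusual_5prime, unusual_3prime) flags for non-canonical boundary dinucleotides."""
    seq = intron_seq.upper()
    unusual_5p = not seq.startswith("GT")   # canonical donor site: GT (GU in RNA)
    unusual_3p = not seq.endswith("AG")     # canonical acceptor site: AG
    return unusual_5p, unusual_3p
```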
Having identified the above features that differentiate introns of essential genes from introns of nonessential genes, we trained a feature-driven deep learning model to predict gene essentiality so as to determine the importance of the identified features. By training the model only on the seven features we identified (average intron size, number of introns in the gene, intronic bp in the gene, GC density [first intron], GC density [later introns], GC count [not including GC motifs] [first intron], and GC count [not including GC motifs] [later introns]), we can assess how much these features contribute to essentiality. Each feature vector corresponds to one gene, where the later-intron features are the mean of the value over all later introns. We trained an ensemble deep learning model that combines the results of multiple (10) neural network models so as to reduce variance and generalization error. The final AUC obtained was 0.787. This provides strong evidence for the importance of the seven identified features in characterizing essential and nonessential genes.
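A hedged sketch of how the seven per-gene features could be assembled is shown below, reusing the GC helpers sketched above. The handling of single-intron genes and the reading of "GC count (not including GC motifs)" as GC content with GC motif density subtracted are our assumptions for illustration.

```python
import numpy as np

def gene_feature_vector(introns):
    """Build the seven-feature vector for one gene from its ordered list of intron sequences."""
    lengths = [len(s) for s in introns]
    first = introns[0]
    later = introns[1:] if len(introns) > 1 else [first]   # assumption: fall back to the first intron

    def gc_minus_motifs(s):
        # one reading of "GC count, not including GC motifs"
        return gc_content(s) - gc_motif_density(s)

    return np.array([
        np.mean(lengths),                                   # average intron size
        len(introns),                                       # number of introns in gene
        np.sum(lengths),                                    # intronic bp in gene
        gc_motif_density(first),                            # GC density, first intron
        np.mean([gc_motif_density(s) for s in later]),      # GC density, mean over later introns
        gc_minus_motifs(first),                             # GC count (excl. GC motifs), first intron
        np.mean([gc_minus_motifs(s) for s in later]),       # GC count (excl. GC motifs), later introns
    ])
```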
Discussion
Although essentiality is not wholly an intrinsic property of a gene, the ability of our model to predict essentiality or nonessentiality from intronic sequences alone, and of our later model from just seven features, suggests that there exist characteristic motifs unique to introns of essential genes. The first model's accuracy in identifying essential introns increases when only the first intron is tested, as demonstrated by the greater AUC. This suggests that the first introns of essential genes have especially unique characteristics when compared with the first introns of nonessential genes. We followed up on these results with computational analysis of intronic sequences of essential, conditional, and nonessential genes with regard to all introns, only first introns, and only later introns. The primary findings can be summarized as follows: (i) first introns of essential genes are much shorter than first introns of nonessential genes, (ii) essential genes have more introns per gene, but these later introns are markedly shorter than the later introns found in nonessential genes, (iii) essential first introns have a greater GC density than first introns of nonessential genes as well as later essential introns, and (iv) essential first introns, and essential later introns slightly less so, infrequently have unusual 5′ or 3′ splice sites compared with the first introns of nonessential genes.
Using these features to train a feature-based deep learning model to predict gene essentiality, we provide evidence that these seven features are major contributing differences between introns of essential and nonessential genes.
From these results, essential genes appear to exhibit intronic characteristics that protect their first introns from losses and deletions. The first intron is crucial for regulation of gene expression; for essential genes, which are central to PPI hubs, any deletion in the first intron has the potential to disrupt an entire network (3, 14). Because deletions occur at much higher frequency in longer introns, it is notable that we found essential first introns to be, on average, less than one-third the size of nonessential first introns (12). First introns of essential genes were also found to have a greater GC motif density, which allows for an increase in enhancer activity, correct splice site recognition, and protection from intron deletions (12, 16, 17). The distribution of GC count closely resembled that of GC motif density, potentially suggesting that essential first introns have undergone a marked increase in GC content so as to increase GC motif frequencies for the purposes discussed above. Similarly, as unusual splice sites can allow for alternative splicing, introns of essential genes, especially first introns, have the lowest frequency of unusual 5′ and 3′ splice sites (24). Furthermore, as most deletions occur at the 3′ end, essential genes have an increased number of introns; these later introns, however, are smaller than the average nonessential intron, limiting the long introns in essential genes that would otherwise invite intron losses or deletions. Because deletions in introns of essential genes would alter transcript length and thus interrupt the timing of a complex molecular network, the unique properties of essential introns appear to have been selected to avoid intron losses and deletions.
Whereas we classify essential genes based on intronic patterns, other methods of predicting essential genes have been based on support vector machines (SVMs), information-theoretic statistics, and PPI network leverage (25, 26, 27). Information-theoretic approaches, trained and tested on the same organism, have shown AUC scores of 0.73–0.90 with an average of 0.84, slightly lower than our double classifier AUC of 0.846 (26). However, these information-theoretic approaches were not applied to human genes and show lowered accuracy when applied across organisms. SVM approaches to human essential genes based on 800 selected features report a mean AUC of 0.8347 and a highest AUC of 0.8854, thus with accuracy similar to our intron-based model and only slightly higher than our seven feature–based deep learning model (28). Models leveraging PPI networks combined with feature information have shown success: DeepHe, which uses PPI network information and 89 sequence-derived features, reported an AUC of 0.94, outperforming SVM, Naive Bayes, Random Forest, and AdaBoost models (29). However, that our intron-sequence model achieves an AUC of 0.846 and our seven feature–based model an AUC of 0.787 is a surprising result because of the previously unknown role of introns in gene essentiality. With further characterization of introns in essential genes, future feature-based models may rival current PPI approaches. Future studies can apply such algorithms to identify unrecognized essential genes and verify the findings in the laboratory. It would be of particular interest to determine the shared characteristics of false positives, that is, nonessential genes whose introns are classified as essential. Such an analysis would further uncover what our model has learned, further clarifying the role of introns in gene essentiality.
Conditional genes share characteristics of both essential and nonessential genes, suggesting a middle ground between gene stability and alteration of gene functionality. This middle ground is necessary for successful evolution of the genome. We hypothesize that this reflects pressure on the genome both to innovate its genes and to conserve its most essential genes. Whereas selecting for deletion-averse essential intron systems promotes basic network stability, selecting for long first introns in nonessential genes allows deletions to alter the regulation of nonessential genes and even innovate gene function.
The results presented here introduce the concept that essential genes have introns characteristically distinct from those of nonessential genes, and we identify seven features that characterize this difference. These differences, as outlined above, may eventually be exploited to target tumors by disrupting nearby essential genes (1). Interrupting the complex safety net around the first intron can alter regulation and thereby disrupt a network necessary for tumor growth. Similarly, targeted CRISPR–Cas9 therapies that force deletions of introns within carefully selected essential genes could stunt cancers.
Materials and Methods
Model
Our deep learning model is a convolutional neural network based on DeepBind, a predictive model that has shown state-of-the-art performance in predicting sequence specificities of DNA- and RNA-binding proteins (22). Our model predicts the essentiality of the gene of an intronic sequence “s” by calculating an essentiality score f(s) = net(pool(rect(conv(s)))). Fig 1A depicts our model architecture. Our model accepts 1,000-bp sequences encoded as one-hot vectors. The convolutional layer (conv) contains multiple filters that detect motifs within the intronic sequence. We apply the ReLU activation function (rect), then the pooling layer (pool) averages each filter’s response across the sequence to determine the cumulative presence of motifs. The resulting values are fed into a small neural network (net) consisting of a fully connected layer followed by a two-value softmax output layer corresponding to the probabilities of the parent gene being essential or nonessential. The fully connected layer also uses the ReLU activation function, and the softmax function is applied to the output to normalize prediction probabilities. We prevent our model from overfitting by using L1 and L2 regularization as well as dropout (30).
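A minimal Keras sketch of this architecture is given below. The filter count, window size, hidden-layer size, and regularization strengths are placeholders (the actual values were chosen by grid search; see the Training procedure section), so the sketch illustrates the layer structure rather than the final model.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_intron_cnn(seq_len=1000, n_filters=16, window=24,
                     dense_units=32, dropout=0.5, l1=1e-6, l2=1e-4):
    inputs = layers.Input(shape=(seq_len, 4))                     # one-hot encoded intron segment
    x = layers.Conv1D(n_filters, window, activation="relu",       # conv + rect: motif detectors
                      kernel_regularizer=regularizers.l1_l2(l1=l1, l2=l2))(inputs)
    x = layers.GlobalAveragePooling1D()(x)                        # pool: average each filter's response
    x = layers.Dense(dense_units, activation="relu",              # net: fully connected layer
                     kernel_regularizer=regularizers.l1_l2(l1=l1, l2=l2))(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(2, activation="softmax")(x)            # P(essential), P(nonessential)
    return models.Model(inputs, outputs)
```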
Data
Human DNA sequences and annotations were collected from the Ensembl genome database project (20, 21, 31). We used the transcript for each gene corresponding to its longest coding sequence, as this has been suggested, in recent work, to be the most accurate and most biologically relevant choice (32 Preprint). However, both the deep learning results and the computationally identified feature results were extraordinarily similar when using introns from the longest transcript of each gene. We used the provided annotations to separate out intronic sequences. Before training, the intronic sequences are transformed using one-hot encoding such that each sequence of length L is represented as an L × 4 matrix. For our analysis of CpG island presence, we used the algorithm of Takai and Jones to detect CpG islands in a nucleotide sequence (33, 34).
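The sketch below illustrates the one-hot encoding step. Encoding ambiguous bases (e.g., N) as all-zero rows and zero-padding sequences shorter than the 1,000-bp window are conventions we assume here; the original preprocessing may differ in these details.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str, length: int = 1000) -> np.ndarray:
    """Encode a sequence as an L x 4 matrix (columns for A, C, G, T)."""
    seq = seq.upper()[:length]
    mat = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(seq):
        j = BASE_INDEX.get(base)
        if j is not None:                 # ambiguous bases (e.g., N) stay all zeros
            mat[i, j] = 1.0
    return mat                            # shorter sequences are zero-padded
```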
We assign labels using gene essentiality information from OGEE, which classifies genes by essentiality based on experimental data (3, 6). OGEE gathers data from 18 databases of large-scale experiments to provide a reference of how many studies found a gene essential or nonessential (3, 6). For the model, because of the ambiguity of conditional genes, we discard all conditional genes that have been found to be essential in less than half of the studies. Genes are assigned binary labels of essential or nonessential, where the remaining conditional genes are grouped with the essential genes.
We trained two models, one on the first 1,000 bp of introns, and one on the last 1,000 bp. This includes the 5′ splice site in the first 1,000 bp, as well as the 3′ splice site and the branch site in the last 1,000 bp. These are the three best characterized regions of eukaryotic introns and are the sites that are most directly involved in spliceosomal modification of the transcript to form mRNA.
Training procedure
We set aside 20% of the data for the test set and use a three-way random split on the training data to perform threefold cross-validation for hyperparameter selection. We selected our model's hyperparameters by performing a grid search over the dropout rate, convolutional layer window size, activation function, and L2 regularization strength. We assessed 36 potential models based on average validation performance across the three folds. The best hyperparameters are then used to train the final model on the entire training set. At training time, we balance our training and validation sets by equally sampling from the essential and nonessential classes. We also ensure that all the introns of a gene lie in the same set so that no gene-specific information affects the validity of our accuracy on the test set.
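The sketch below illustrates the two constraints described above, using scikit-learn's group-aware splitters so that all introns of a gene stay in one set, and downsampling the majority class for balance (the original balancing scheme may instead sample equally at training time). Variable names (X, y, gene_ids) are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

def gene_aware_split(X, y, gene_ids, test_size=0.2, seed=0):
    """Hold out 20% of genes (and all of their introns) as the test set."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=gene_ids))
    return train_idx, test_idx

def gene_aware_cv_folds(X, y, gene_ids, n_folds=3):
    """Threefold cross-validation folds that never split a gene's introns across folds."""
    return list(GroupKFold(n_splits=n_folds).split(X, y, groups=gene_ids))

def balance_classes(indices, y, seed=0):
    """Sample equally from the essential (1) and nonessential (0) classes."""
    rng = np.random.default_rng(seed)
    idx = np.asarray(indices)
    pos, neg = idx[y[idx] == 1], idx[y[idx] == 0]
    n = min(len(pos), len(neg))
    return np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
```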
We trained all models using Adam gradient descent and a cross-entropy loss minimization objective (35 Preprint). The model is trained for 30 epochs with a batch size of 64. We implemented our model using the Keras library running on TensorFlow and trained on an NVIDIA Tesla M60 GPU.
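A minimal sketch of this training configuration, using the build_intron_cnn sketch above, is shown below; the learning rate and any callbacks are not specified in the text, so Keras defaults are assumed.

```python
def train_intron_model(X_train, y_train, X_val, y_val):
    """Train the CNN sketch with Adam and cross-entropy for 30 epochs at batch size 64."""
    model = build_intron_cnn()
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",   # matches the two-value softmax output
                  metrics=["accuracy", "AUC"])
    model.fit(X_train, y_train,                      # y_* one-hot encoded with shape (n, 2)
              validation_data=(X_val, y_val),
              epochs=30, batch_size=64)
    return model
```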
Prediction and evaluation
We evaluate our model on our test set using the area under the curve (AUC) of the ROC curve, which measures how well our model distinguishes between essential and nonessential classes. The model produces an essentiality score corresponding to the predicted confidence in the essentiality of the gene of an intron, and the ROC curve is generated by measuring the sensitivity and specificity of the model at varying prediction thresholds of the essentiality score. We also took advantage of both of our models to better classify an intron by averaging the scores produced by our two models on the first and last 1,000 bp of the intron.
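For illustration, evaluation could look like the sketch below, where the per-intron scores from the two models are averaged before computing the ROC curve and AUC with scikit-learn; array names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_auc(y_true, scores_first_1000bp, scores_last_1000bp):
    """AUC of the ROC curve for averaged per-intron essentiality scores."""
    scores = 0.5 * (np.asarray(scores_first_1000bp) + np.asarray(scores_last_1000bp))
    fpr, tpr, _ = roc_curve(y_true, scores)
    return roc_auc_score(y_true, scores), fpr, tpr
```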
Our model can be extended to classify entire genes with even better accuracy. Rather than classifying the essentiality of individual introns, we classify whether an entire gene is essential or nonessential by combining information from all of its introns. To classify in this manner, we introduce a majority classification method. We accept the list of all intronic sequences of a specific gene and run each individual intron through the model to get the essentiality score of each intron. Then we calculate a gene’s essentiality score as the mean of the essentiality scores of its introns.
We attained our highest AUC using a double majority classifier which uses both the first 1,000 and last 1,000 bp of each intron to classify a gene. We run the first and last 1,000 bp from each intron through the models trained on the first and last 1,000 bp of each intron, respectively. Then we similarly calculate a gene’s essentiality score as the mean of the essentiality scores of its introns from both models.
By combining information from multiple parts of multiple introns, the double majority classifier achieves the highest accuracy.
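A hedged sketch of this gene-level aggregation is given below: each gene's essentiality score is the mean of the per-intron scores produced by both models. The input format (gene ID paired with the two models' scores for each intron) is an assumption for illustration.

```python
from collections import defaultdict
import numpy as np

def gene_essentiality_scores(intron_scores):
    """intron_scores: iterable of (gene_id, score_from_first_1000bp_model, score_from_last_1000bp_model)."""
    per_gene = defaultdict(list)
    for gene_id, s_first, s_last in intron_scores:
        per_gene[gene_id].extend([s_first, s_last])          # pool scores from both models
    return {gene: float(np.mean(scores)) for gene, scores in per_gene.items()}
```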
Feature model
Our final feature-based model used seven features extracted from each gene. The normalized feature vectors are used to train a neural network consisting of several fully connected layers with a binary output. The architecture and hyperparameters of this network were selected by a grid search over the number of hidden layers, the number of nodes in each hidden layer, the dropout rate, and the L2 regularization strength. Models were evaluated via threefold cross-validation, just as with the convolutional model, with the best hyperparameters used to train the final model on the entire training set. We trained all feature models using Adam gradient descent and a cross-entropy loss minimization objective (35 Preprint). The model is trained for 30 epochs with a batch size of 64. For final evaluation, we trained an ensemble deep learning model that combines results from multiple (10) neural network models so as to reduce variance and generalization error.
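A hedged sketch of the feature-based ensemble is shown below: ten small fully connected networks are trained on the normalized seven-feature vectors and their predictions are averaged. Layer sizes, dropout, and regularization strength are placeholders for illustration, not the grid-searched values.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_feature_net(n_features=7, hidden=(32, 16), dropout=0.3, l2=1e-4):
    """Small fully connected network with a binary essentiality output."""
    inputs = layers.Input(shape=(n_features,))
    x = inputs
    for units in hidden:
        x = layers.Dense(units, activation="relu",
                         kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model

def ensemble_predict(X_train, y_train, X_test, n_models=10):
    """Average predictions over an ensemble of independently initialized networks."""
    preds = []
    for seed in range(n_models):
        tf.keras.utils.set_random_seed(seed)      # vary initialization across ensemble members
        net = build_feature_net()
        net.fit(X_train, y_train, epochs=30, batch_size=64, verbose=0)
        preds.append(net.predict(X_test, verbose=0).ravel())
    return np.mean(preds, axis=0)                 # averaged essentiality scores
```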
Data Availability
All data were obtained from publicly available data sources as described in the methodology. All the code used for data processing, figure generation, and model training, as well as the weights of our final models, are provided at https://github.com/evendrow/Intron-Essentiality/.
Acknowledgments
Author Contributions
E Schonfeld: conceptualization, data curation, formal analysis, supervision, investigation, visualization, methodology, and writing—original draft, review, and editing.
E Vendrow: data curation, software, investigation, visualization, methodology, and writing—review and editing.
J Vendrow: software, investigation, methodology, and writing—review and editing.
E Schonfeld: conceptualization, formal analysis, investigation, visualization, and writing—review and editing.
Conflict of Interest Statement
The authors declare that they have no conflict of interest.
- Received November 1, 2020.
- Revision received April 12, 2021.
- Accepted April 15, 2021.
- © 2021 Schonfeld et al.
This article is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by/4.0/).