Hierarchical dinucleotide distribution in genome along evolution and its effect on chromatin packing

This study describes how the hierarchical CpG distribution in the genomes of different species changes during evolution and how it correlates with chromatin structure.

Shachi Bhatt, Ph.D. Executive Editor Life Science Alliance https://www.lsajournal.org/ Tweet @SciBhatt @LSAjournal
--An editable version of the final text (.DOC or .DOCX) is needed for copyediting (no PDFs).
--High-resolution figure, supplementary figure and video files uploaded as individual files: See our detailed guidelines for preparing your production-ready images, https://www.life-sciencealliance.org/authors
--Summary blurb (enter in submission system): A short text summarizing in a single sentence the study (max. 200 characters including spaces). This text is used in conjunction with the titles of papers, hence should be informative and complementary to the title and running title. It should describe the context and significance of the findings for a general readership; it should be written in the present tense and refer to the work in the third person. Author names should not be mentioned.

B. MANUSCRIPT ORGANIZATION AND FORMATTING:
Full guidelines are available on our Instructions for Authors page, https://www.life-sciencealliance.org/authors
We encourage our authors to provide original source data, particularly uncropped/-processed electrophoretic blots and spreadsheets for the main figures of the manuscript. If you would like to add source data, we would welcome one PDF/Excel-file per figure for this information. These files will be linked online as supplementary "Source Data" files.
***IMPORTANT: It is Life Science Alliance policy that if requested, original data images must be made available. Failure to provide original images upon request will result in unavoidable delays in publication. Please ensure that you have access to all original microscopy and blot data images before submitting your revision.***

In this manuscript, the authors explore how the CpG distribution changes along evolution and its role in chromatin structure and gene expression. They find that, along evolution, the CpG density decreases while its variability increases, and that the correlation of the CpG distribution with 3D chromatin organization and with gene expression increases. Overall, this paper provides some new insights into how the 1D genomic sequence influences the 3D structure and gene expression. This work will inspire more researchers to explore the genomic sequence basis for chromatin organization.
Major issues:
The CpG variability assessment
In the first part of this paper, the authors define and calculate the variability of the CpG density based on the whole genome sequence. They find that the variability along the genome increases in higher species. It would be more convincing to evaluate the variability per chromosome among species. If this increasing variability is a robust evolutionary phenomenon, I would expect a general increase among the chromosomes of higher species.
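The per-chromosome evaluation requested here can be illustrated with a minimal sketch. The review does not state how the manuscript defines CpG variability, so this example assumes it is the coefficient of variation of the CpG density over fixed-size, non-overlapping windows; the function names, the window size and the toy sequences are all hypothetical.

```python
from statistics import mean, pstdev

def windowed_cpg_density(seq, window):
    """CpG dinucleotides per base in consecutive, non-overlapping windows.

    CpGs that straddle a window boundary are not counted (a simplification).
    """
    seq = seq.upper()
    return [
        seq[start:start + window].count("CG") / window
        for start in range(0, len(seq) - window + 1, window)
    ]

def cpg_variability(seq, window):
    """Coefficient of variation of the windowed CpG density (assumed metric)."""
    densities = windowed_cpg_density(seq, window)
    avg = mean(densities)
    return pstdev(densities) / avg if avg > 0 else 0.0

def per_chromosome_variability(chromosomes, window):
    """Map each chromosome name to its CpG variability."""
    return {name: cpg_variability(seq, window) for name, seq in chromosomes.items()}

# Toy sequences standing in for real chromosomes (hypothetical data).
toy_chromosomes = {
    "uniform": "ACGT" * 50,                 # evenly spread CpGs
    "mosaic": ("CG" * 20 + "AT" * 20) * 3,  # alternating CpG-rich/poor blocks
}
variability = per_chromosome_variability(toy_chromosomes, window=20)
```

With such a per-chromosome table for each species, the median across chromosomes could then be compared between taxonomic groups.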
The wavelet transform analyses
The authors use the wavelet transform of the CpG density, structure vector and expression levels to explore the similarity among them. These analyses may be a little difficult for most biologists to interpret. It would be helpful to present these results in more intuitive ways. Specifically,
(1) Figure 5A shows the higher correlation between the CpG island and the structure vector in higher species. An additional presentation could show how consistent the CpG-rich/poor domains are with A/B compartments and how this consistency changes along evolution.
(2) Figure 5B shows the higher correlation between the CpG island and the chromatin organization at the length scale of compartments. An additional presentation could show the respective consistency of CpG-rich/poor domain boundaries with TAD boundaries and with A/B compartment boundaries.
(3) Figure 6 shows the higher correlation between the CpG island and the expression levels in higher species. An additional presentation could group the genomic regions (preferably genes) by CpG density, compare the expression levels among groups and compare the correlation among species.
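The grouping suggested in point (3) might look like the following minimal sketch. It assumes each gene is summarized by one CpG density value and one expression value, and that groups are formed by sorting on CpG density and splitting into nearly equal-sized bins; the function name, the number of groups and the toy values are hypothetical and not taken from the manuscript.

```python
from statistics import mean

def expression_by_cpg_group(genes, n_groups):
    """Sort (cpg_density, expression) pairs by density and split them into
    n_groups nearly equal-sized groups; return (mean density, mean expression)
    for each non-empty group, from lowest to highest CpG density."""
    ranked = sorted(genes, key=lambda gene: gene[0])
    size, extra = divmod(len(ranked), n_groups)
    summary, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        chunk = ranked[start:end]
        if chunk:
            densities = [density for density, _ in chunk]
            levels = [level for _, level in chunk]
            summary.append((mean(densities), mean(levels)))
        start = end
    return summary

# Hypothetical genes: (CpG density, expression level).
toy_genes = [(0.04, 6.0), (0.01, 1.0), (0.09, 5.0),
             (0.02, 2.0), (0.08, 5.5), (0.03, 5.0)]
groups = expression_by_cpg_group(toy_genes, n_groups=3)
```

Plotting the second element of each pair against the first, per species, would give the kind of group-wise comparison the comment asks for.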
Chromatin structure segregation analysis
In page 16, first paragraph, the chromatin contact frequency decay analyses demonstrate a more segregated chromatin structure of human and mouse at short length scales (< 1 Mb) and, in contrast, more long-range contacts (> 1 Mb) in other species. The size of A/B compartments is on the ~Mb scale, and interactions with compartments are also on the ~Mb scale. In the previous presentation (Figures 5A-5C), the results indicate that the CpG density is better correlated with the structure vector at the compartment scale in human and mouse. These two results seem contradictory. If the authors want to explore whether the chromatin structure is more segregated in higher species, checking the interaction segregation between A/B compartments (inter-compartment interactions divided by intra-compartment interactions) or CpG-rich/poor domains may be a better way.

Structure vectors
In page 13, first paragraph, the authors use nonnegative matrix factorization (NMF) to decompose the Hi-C matrix into two structure vectors. It is quite smart to choose NMF, as it allows only positive coefficients. However, several points need to be carefully considered.
(1) The structural interpretation of the structure vectors is unclear. The authors should evaluate the consistency between structure vectors and chromatin structures (such as compartments).
(2) NMF focuses more on local sub-structures compared with PCA. Previous studies used NMF to detect local interaction domains (Hu et al., 2016; Lee and Roy, 2020). Considering the differences between matrix factorization methods, the authors should evaluate whether the correlation between CpG density and chromatin structure (Figures 4 and 5) is robust to the matrix factorization method.

Reviewer #2 (Comments to the Authors (Required)):
The authors analyzed the dinucleotide distribution in 38 publicly available genomes. Their main focus is on the distributions of CpGs since this particular dinucleotide seems to have the largest variability. The authors correlate the CpG distributions with Hi-C data (indicating chromatin organization) for a subset of their species and find a correlation between CpG distribution and chromatin structure. They can show that the CpG distribution patterns are quite different depending on the respective species group. They conclude that CpG density is a better function and structure indicator than the more general G+C content. They also report the interesting finding that the temperature of the living environment seems to have a noticeable effect on the genomic CpG density variability.
Major points:
1) Within the manuscript the authors frequently refer to lower (plants, invertebrates, fish) and higher (mammals, birds) species. They also state "along evolution, the CpG density decreases" or "In early evolution, the CpG desnity decreases with little change of the mosaicity and in the later stage of evolution, the CpG density remains largely constant but the mosaicity increases." Such statements indicate a very outdated model of evolution. The terms "lower" and "higher" species should not be used. The classification is anyway very arbitrary since, for example, fish are much more similar to mammals and birds than to the other groups since they are all vertebrates. Separating birds and reptiles seems also unjustified since birds are reptiles. The terms "lower" and "higher species" should be removed altogether. The authors should just directly state which taxonomic groups they are referring to, e.g. amniotes (reptiles and mammals). Similarly, the authors should not refer to an early evolution and a later stage of evolution if they are referring to extant species. Early evolution refers to what happened in the past during the evolutionary history of a certain species or group, but not to extant species. There is no direction in evolution, starting from bacteria and leading to mammals. Instead of using "early" and "late" evolution the authors should, again, clearly refer to which taxonomic group they mean.
Minor points:
1) Occasionally, abbreviations are not explained when they are used for the first time, e.g. TAD in the introduction. In general, I would recommend writing an abbreviation out in full again when it is first used in a section (introduction, results, ...).
2) The English of the paper is in general quite good. There are some typos throughout the manuscript, so it should be proofread again. Some examples:
"and lack of noncoding elements"->"and lacks noncoding elements"
"DNA sequence appear to"->"DNA sequence appears to"
"compatments"->"compartments"
"storng correlation exsits"->"strong correlation exists"
3) In the introduction the term "CGI(gene)" is used. I'm not sure if the "(gene)" is there by accident. If not, an explanation of its meaning might be necessary.
Aside from the mentioned major point I think that the data analysis is appropriate and the results are a useful addition to the field. Therefore, I recommend acceptance with minor revisions. Nevertheless the acceptance should heavily depend on the successful response to the major point.

1st Authors' Response to Reviewers
May 19, 2021
Dear Dr. Shachi Bhatt:
Thank you for your letter and for the reviewers' comments concerning our manuscript entitled "Hierarchical dinucleotide distribution in genome along evolution and its effect on chromatin packing" (#LSA-2021-01028-T).
Those comments are very helpful for revising and improving our paper. We have studied the comments carefully and have made corrections accordingly. The revisions are marked in the manuscript. Our response to the reviewers' comments is listed in the following:

Reviewer 1:
In this manuscript, the authors explore how the CpG distribution changes along evolution and its role in chromatin structure and gene expression. They find that, along evolution, the CpG density decreases while its variability increases, and that the correlation of the CpG distribution with 3D chromatin organization and with gene expression increases. Overall, this paper provides some new insights into how the 1D genomic sequence influences the 3D structure and gene expression. This work will inspire more researchers to explore the genomic sequence basis for chromatin organization.

Major issues:
The CpG variability assessment
In the first part of this paper, the authors define and calculate the variability of the CpG density based on the whole genome sequence. They find that the variability along the genome increases in higher species. It would be more convincing to evaluate the variability per chromosome among species. If this increasing variability is a robust evolutionary phenomenon, I would expect a general increase among the chromosomes of higher species.

Response:
We thank the reviewer for these valuable suggestions. We have now evaluated the variability for each individual chromosome in the different species and added the results in Figure S4. Species can be effectively clustered into different groups by the median value of their CpG variabilities. The increasing trend of CpG variability among species can also be observed in this figure.

The wavelet transform analyses
The authors use the wavelet transform of the CpG density, structure vector and expression levels to explore the similarity among them. These analyses may be a little difficult for most biologists to interpret. It would be helpful to present these results in more intuitive ways. Specifically,
(1) Figure 5A shows the higher correlation between the CpG island and the structure vector in higher species. An additional presentation could show how consistent the CpG-rich/poor domains are with A/B compartments and how this consistency changes along evolution.

Response:
We have added the Pearson correlation between CpG-rich/poor domains and A/B compartments for each species in Table S9. This correlation is higher for human, mouse and chicken than for A.thaliana and zebrafish. We have also shown a comparison of the CpG-rich/poor domain index and the A/B compartment index for human and A.thaliana in Figure S9. We have included related discussions in the text (L273: Consistently, we also compared the correlation between CpG-rich/poor domains and compartments in different species and found that it is higher in human, mouse and chicken than in A.thaliana and zebrafish (Table S9).)

(2) Figure 5B shows the higher correlation between the CpG island and the chromatin organization at the length scale of compartments. An additional presentation could show the respective consistency of CpG-rich/poor domain boundaries with TAD boundaries and with A/B compartment boundaries.

Response:
We have added the distance distribution of TAD boundaries and A/B compartment boundaries to CpG-rich/poor domain boundaries in Figure S10. We have also included related discussions in the main text (L284: Consistently, the overlap between A/B compartments and CpG-rich/poor domains is more significant than the one between TADs and CpG-rich/poor domains.)

(3) Figure 6 shows the higher correlation between the CpG island and the expression levels in higher species. An additional presentation could group the genomic regions (preferably genes) by CpG density, compare the expression levels among groups and compare the correlation among species.

Response:
We have added comparisons of the expression levels of genes of different CpG densities in human, zebrafish and rice in Figure S13. In human, the expression level of genes generally increases with the CpG density at low CpG densities and then remains largely constant (with a slight decrease) at high CpG densities.
The expression level of rice genes is low at both high and low CpG densities. The expression level of genes also shows a maximum at median CpG densities in zebrafish. The expression level of genes thus appears to be more positively correlated with CpG density in human than in zebrafish and in rice.
We have included a related discussion in the revised text (L326: Furthermore, we grouped genes by CpG density and compared the expression levels of genes of different CpG densities. The dependence of gene expression level on CpG density is seen to be different for different species (Figure S13). For example, the expression levels of zebrafish and rice genes show a more pronounced peak at median CpG densities than that of human.)

Chromatin structure segregation analysis
In page 16, first paragraph, the chromatin contact frequency decay analyses demonstrate a more segregated chromatin structure of human and mouse at short length scales (< 1 Mb) and, in contrast, more long-range contacts (> 1 Mb) in other species. The size of A/B compartments is on the ~Mb scale, and interactions with compartments are also on the ~Mb scale. In the previous presentation (Figures 5A-5C), the results indicate that the CpG density is better correlated with the structure vector at the compartment scale in human and mouse. These two results seem contradictory. If the authors want to explore whether the chromatin structure is more segregated in higher species, checking the interaction segregation between A/B compartments (inter-compartment interactions divided by intra-compartment interactions) or CpG-rich/poor domains may be a better way.

Response:
Chromatin structure of human and mouse shows a strong segregation at a length scale similar to the length of one compartment. The strong contacts at the sub-Mb scale in human and mouse chromatin correspond to their well-resolved compartment formation. On the other hand, for human and mouse, the CpG density correlates well with the structure vector, indicating that the fluctuations of CpG density and structure vector along the genome resemble each other at the compartment scale. Such a result indicates a significant contact difference between regions of different CpG density, i.e. interactions are preferentially formed between compartments of the same type (A-A or B-B) over compartments of different types (A-B). The two results mentioned by the reviewer thus reflect the intensity of contacts at the intra- and inter-compartment levels, respectively. We thank the reviewer for pointing out this possible confusion in the previous presentation.
Following the reviewer's suggestion, we have added an analysis of the interaction segregation ratio between A/B compartments (interactions between compartments of the same type divided by interactions between compartments of different types) in Table S10. It can be seen that the interaction segregation ratio of human and mouse is higher than that of other species.
We have also included a related discussion (L314: we also calculated the interaction segregation ratio of compartments A/B for different species (Table S10). The interaction segregation ratio of human and mouse is higher than other species.)
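As an illustration of the ratio described in this response, the following minimal sketch divides the mean contact between bins of the same compartment type by the mean contact between bins of different types. The response does not specify whether sums or means are used or whether contacts are normalized for genomic distance, so the choice of plain means without distance normalization is an assumption; the function name and the toy matrix are hypothetical.

```python
def segregation_ratio(contacts, labels):
    """Mean same-type contact (A-A, B-B) divided by mean different-type
    contact (A-B), using the upper triangle of a symmetric contact matrix.

    Assumption: plain means, no correction for genomic distance.
    """
    same, different = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            target = same if labels[i] == labels[j] else different
            target.append(contacts[i][j])
    if not same or not different or sum(different) == 0:
        raise ValueError("need non-zero contacts of both kinds")
    return (sum(same) / len(same)) / (sum(different) / len(different))

# Hypothetical 4-bin contact matrix with two A bins followed by two B bins.
toy_labels = ["A", "A", "B", "B"]
toy_contacts = [
    [0, 8, 2, 1],
    [8, 0, 3, 2],
    [2, 3, 0, 6],
    [1, 2, 6, 0],
]
ratio = segregation_ratio(toy_contacts, toy_labels)
```

A ratio above 1 means that bins of the same type contact each other more than bins of different types, which is the sense in which a higher value indicates stronger segregation.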

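Structure vectors
In page 13, first paragraph, the authors use nonnegative matrix factorization (NMF) to decompose the Hi-C matrix into two structure vectors. It is quite smart to choose NMF, as it allows only positive coefficients. However, several points need to be carefully considered.
(1) The structural interpretation of the structure vectors is unclear. The authors should evaluate the consistency between structure vectors and chromatin structures (such as compartments).

Response: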
We calculated the Pearson correlation between the structure vectors and the compartment vector. We found that one of the structure vectors is positively correlated with the compartment vector, and the other is negatively correlated with it. We have listed, for each species, the maximum absolute value of the Pearson correlation of the two structure vectors with the compartment vector in Table S8. We also show the compartment and structure vectors of chicken in Figure S5, which can be seen to be closely correlated.
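The comparison described in this response can be sketched as follows, assuming the two structure vectors and the compartment vector are given as equal-length lists with one value per genomic bin. The helper names and the toy vectors are hypothetical; the sketch only reproduces the stated summary, i.e. the correlation with the largest absolute value.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def strongest_correlation(structure_vectors, compartment_vector):
    """Return (correlation with the largest absolute value, all correlations)."""
    correlations = [pearson(v, compartment_vector) for v in structure_vectors]
    return max(correlations, key=abs), correlations

# Hypothetical per-bin values: positive compartment values stand for A, negative for B.
toy_compartment = [1.0, 0.8, -0.9, -1.1, 0.7]
toy_structure = [
    [0.9, 0.8, 0.1, 0.0, 0.7],  # large where the compartment vector is positive
    [0.1, 0.2, 0.9, 1.0, 0.2],  # large where the compartment vector is negative
]
best, correlations = strongest_correlation(toy_structure, toy_compartment)
```

In this toy case one vector correlates positively and the other negatively with the compartment vector, mirroring the pattern reported in the response.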
(2) NMF focuses more on local sub-structures compared with PCA. Previous studies used NMF to detect local interaction domains (Hu et al., 2016; Lee and Roy, 2020). Considering the differences between matrix factorization methods, the authors should evaluate whether the correlation between CpG density and chromatin structure (Figures 4 and 5) is robust to the matrix factorization method.

Response:
The two references mentioned by the reviewer applied balanced nonnegative matrix factorization (BNMF) and graph regularized nonnegative matrix factorization (GRNMF) in their studies, respectively. As pointed out by the reviewer, these methods were used to interrogate local chromatin structures such as TADs.
We have added a comparison of the structure vectors factorized by NMF, BNMF and GRNMF in chicken, zebrafish and A.thaliana, with the results given in Figure S6. Structure vectors factorized by the different matrix factorization methods are very similar to each other in trend, showing that the calculation of the structure vector is robust to the matrix factorization method.
We have also included related discussions (L258: We tested the robustness of structure vectors and found structure vectors factorized by different Nonnegative Matrix Factorization methods to be similar to each other after scaling (Figure S6).).

Response:
We thank the reviewer for these corrections. We have corrected these mistakes in the manuscript.

Reviewer 2:
The authors analyzed the dinucleotide distribution in 38 publicly available genomes. Their main focus is on the distributions of CpGs since this particular dinucleotide seems to have the largest variability. The authors correlate the CpG distributions with Hi-C data (indicating chromatin organization) for a subset of their species and find a correlation between CpG distribution and chromatin structure. They can show that the CpG distribution patterns are quite different depending on the respective species group. They conclude that CpG density is a better function and structure indicator than the more general G+C content. They also report the interesting finding that the temperature of the living environment seems to have a noticeable effect on the genomic CpG density variability.

Major points:
1) Within the manuscript the authors frequently refer to lower (plants, invertebrates, fish) and higher (mammals, birds) species. They also state "along evolution, the CpG density decreases" or "In early evolution, the CpG desnity decreases with little change of the mosaicity and in the later stage of evolution, the CpG density remains largely constant but the mosaicity increases." Such statements indicate a very outdated model of evolution. The terms "lower" and "higher" species should not be used. The classification is anyway very arbitrary since, for example, fish are much more similar to mammals and birds than to the other groups since they are all vertebrates. Separating birds and reptiles seems also unjustified since birds are reptiles. The terms "lower" and "higher species" should be removed altogether. The authors should just directly state which taxonomic groups they are referring to, e.g. amniotes (reptiles and mammals). Similarly, the authors should not refer to an early evolution and a later stage of evolution if they are referring to extant species. Early evolution refers to what happened in the past during the evolutionary history of a certain species or group, but not to extant species. There is no direction in evolution, starting from bacteria and leading to mammals. Instead of using "early" and "late" evolution the authors should, again, clearly refer to which taxonomic group they mean.

Response:
We thank the reviewer for these good suggestions. We have modified the corresponding statements in the manuscript.
(For example,
L229: Different from CpG density, the CpG variabilities of invertebrates and plants are all generally very similar and small. In contrast, the variabilities of mammals and birds are much higher than those of other species.
L232: During the evolution of invertebrates and plants, the CpG density decreases with little change of the genome mosaicity and during the evolution of mammals and birds, the CpG density remains largely constant but the genome mosaicity increases.
L390: The sequence properties of bacteria, plants, invertebrates, fishes and reptiles are distinctly different from birds and mammals in terms of the dinucleotide distribution.)

Thank you for submitting your revised manuscript entitled "Hierarchical dinucleotide distribution in genome along evolution and its effect on chromatin packing". We would be happy to publish your paper in Life Science Alliance pending final revisions necessary to meet our formatting guidelines.
Please also attend to the following,
-please upload both your main and supplementary figures as single files, not as part of the main text
-please add your main, supplementary figure, and table legends to the main manuscript text after the references section;
-please add ORCID ID for the corresponding author; you should have received instructions on how to do so
-we encourage you to revise the figure legends for figures S11 such that the figure panels are introduced in an alphabetical order
-please be sure to add callouts for all main and supplementary figures (including panels) to your main manuscript text. For example, please be sure to add callouts for S1A-D or Figure 3A, B...etc.
-please add the methods from the supplementary text document to the main manuscript, the supplementary tables and supplementary figures as separate files to the system, and the legends for supplementary tables and figures to the main manuscript
If you are planning a press release on your work, please inform us immediately to allow informing our production team and scheduling a release date.
To upload the final version of your manuscript, please log in to your account: https://lsa.msubmit.net/cgi-bin/main.plex You will be guided to complete the submission of your revised manuscript and to fill in all necessary information. Please get in touch in case you do not know or remember your login name.
To avoid unnecessary delays in the acceptance and publication of your paper, please read the following information carefully.
A. FINAL FILES: These items are required for acceptance.
--An editable version of the final text (.DOC or .DOCX) is needed for copyediting (no PDFs).
--High-resolution figure, supplementary figure and video files uploaded as individual files: See our detailed guidelines for preparing your production-ready images, https://www.life-sciencealliance.org/authors
--Summary blurb (enter in submission system): A short text summarizing in a single sentence the study (max. 200 characters including spaces). This text is used in conjunction with the titles of papers, hence should be informative and complementary to the title. It should describe the context and significance of the findings for a general readership; it should be written in the present tense and refer to the work in the third person. Author names should not be mentioned.

B. MANUSCRIPT ORGANIZATION AND FORMATTING:
Full guidelines are available on our Instructions for Authors page, https://www.life-sciencealliance.org/authors
We encourage our authors to provide original source data, particularly uncropped/-processed electrophoretic blots and spreadsheets for the main figures of the manuscript. If you would like to add source data, we would welcome one PDF/Excel-file per figure for this information. These files will be linked online as supplementary "Source Data" files.
**Submission of a paper that does not conform to Life Science Alliance guidelines will delay the acceptance of your manuscript.**
**It is Life Science Alliance policy that if requested, original data images must be made available to the editors. Failure to provide original images upon request will result in unavoidable delays in publication. Please ensure that you have access to all original data images prior to final submission.**
**The license to publish form must be signed before your manuscript can be sent to production. A link to the electronic license to publish form will be sent to the corresponding author only. Please take a moment to check your funder requirements.**
**Reviews, decision letters, and point-by-point responses associated with peer-review at Life Science Alliance will be published online, alongside the manuscript. If you do want to opt out of having the reviewer reports and your point-by-point responses displayed, please let us know immediately.**
Thank you for your attention to these final processing requirements. Please revise and format the manuscript and upload materials within 7 days.
Thank you for this interesting contribution, we look forward to publishing your paper in Life Science Alliance.