CNVs with adaptive potential in Rangifer tarandus: genome architecture and new annotated assembly

Next-generation sequencing of three caribou ecotypes aligned to a new annotated genome assembly revealed divergent CNVs, including genes with annotations in line with adaptation.

In the last decade, the study of CNVs has largely been limited to specific genes of interest and to model organisms. Few studies to date have examined the role of CNVs in non-model species, despite recent analytical advances that allow their detection. This paper demonstrates that broad surveys of CNVs can provide substantial information about genetic variability among wild ecotypes and about the role of structural variants in local adaptation.
I think the manuscript is well written and easy to follow, and it will make a very valuable contribution to structural variant resources in wild species. I think it is very close to being ready for publication, but I would suggest the following small additions/corrections:
[1] Introduction : I enjoyed reading the introduction. I think that you could also mention that genome-wide CNV datasets can be used for population structure analyses, and that CNVs can sometimes support more genetic structure than SNPs (see Dorant et al., 2020).
[2] Results : In the section « Genome annotation inferred from RNAseq de novo assembly », second paragraph: « Transcripts were more numerous in the Transabyss transcriptome assembly (1,711,588) than in a5 one (223,597). This process resulted in the identification of 17,172 annotated genes based on the a5 assembly and 30,731 based on the Transabyss assembly. Overlap between both assemblies resulted in 20,419 corroborated annotated gene structures that were distributed over 2,759 genome assembly scaffolds. » I do not fully understand how the overlap between 17,172 genes and 30,731 genes can result in 20,419 genes. Could you please reword this sentence to clarify this result?
In the section « Large CNVs clustered in hotspots and encompassed coding sequences », end of paragraph 1: « CNVs were not randomly distributed over the genome assembly but clustered into 31 hotspots including 227 CNVs (KS test; D = 0.047 and p = 0.001; Fig. 5). » This result is very interesting, and I wonder whether the authors also noted that many hotspots are located at the scaffolds' boundaries (clearly visible in the yellow distribution). Is there an explanation for this? Could there be a link between CNV hotspots and their chromosomal position? I think it would be interesting to test statistically whether the position of hotspots at scaffold boundaries is "random" or not. This thought comes from another study focused on CNVs in balsam poplar (Prunier et al., 2018).
I noted that the authors performed a DAPC to identify putative CNV loci related to caribou ecotypes (i.e. boreal sedentary and migrating ecotypes). However, I did not find any information about the DAPC procedure itself. Did the authors use a DAPC with prior information? Did they use cross-validation to select the number of substantial PCs? How many PCs did they use? I think that this information should be added to the supplementary material.
On another note, to assess the putative association between CNV data and population ecotypes, I would suggest testing it with a Redundancy Analysis (RDA), which is specifically built to test the correlation between genetic and ecological data. Moreover, RDA allows users to define an FDR threshold to identify candidate loci involved in genetic adaptation. It would be easy and not too long (one month?) to compare the CNV matrix with the ecotype vector in such an analysis. A well-written tutorial is available online at: https://popgen.nescent.org/2018-03-27_RDA_GEA.html
[3] Discussion : The discussion is well written, and the authors argue well about the possible biases or lack of data considering their dataset and their interpretations.
In the section « CNVs signatures related to adaptation in wild mammal ecotypes », I would suggest the study conducted by , which assesses CNVs and adaptation in another circumpolar mammal species (i.e. bears). It may contain interesting information in terms of ecological adaptation. Note that this is just a literature suggestion.
[4] Material and methods : In the parts « Whole genome long-read sequencing » and « Transcriptome analyses », please add the number of samples used for each approach.
In the part « Whole-genome short-read sequencing of the various ecotypes », the number of samples used and the sex information have been indicated. Perfect! Could the authors add a short part about the identification of putative adaptive CNVs?

1st Authors' Response to Reviewers
October 29, 2021

Dear editor,

We thank the anonymous reviewer for the interesting insights and suggestions. We addressed all of the reviewer's comments and corrected the manuscript accordingly. Our detailed responses to the reviewer follow, but the main improvements and modifications are:
-Integration of all publications suggested by the reviewer in both the introduction and the discussion, which substantially improved the manuscript.
-Clarification of the numbers for the transcriptome assembly.
-Addition of details regarding the DAPC analysis and the identification of putative adaptive CNVs in the "Material&Methods" section.
We hope this new version of the manuscript will meet your expectations.
Best Regards, Julien Prunier, Claude Robert, on behalf of all co-authors.
Reviewer: I think the manuscript is well written and easy to follow, and it will make a very valuable contribution to structural variant resources in wild species. I think it is very close to being ready for publication, but I would suggest the following small additions/corrections: [1] Introduction : I enjoyed reading the introduction. I think that you could also mention that genome-wide CNV datasets can be used for population structure analyses, and that CNVs can sometimes support more genetic structure than SNPs (see Dorant et al., 2020).

Answer:
We agree and this is now mentioned in the first paragraph of the introduction section.

Reviewer: [2] Results : In the section « Genome annotation inferred from RNAseq de novo assembly », second paragraph: « Transcripts were more numerous in the Transabyss transcriptome assembly (1,711,588) than in a5 one (223,597). This process resulted in the identification of 17,172 annotated genes based on the a5 assembly and 30,731 based on the Transabyss assembly. Overlap between both assemblies resulted in 20,419 corroborated annotated gene structures that were distributed over 2,759 genome assembly scaffolds. » I do not fully understand how the overlap between 17,172 genes and 30,731 genes can result in 20,419 genes. Could you please reword this sentence to clarify this result?
Answer: We clarified the sentence by correcting the numbers. The numbers of annotated genes from the a5 and Transabyss assemblies were 20,419 and 30,731, respectively, while the final gff3 file includes 17,394 annotated gene models.

Reviewer: In the section « Large CNVs clustered in hotspots and encompassed coding sequences », end of paragraph 1: « CNVs were not randomly distributed over the genome assembly but clustered into 31 hotspots including 227 CNVs (KS test; D = 0.047 and p = 0.001; Fig. 5). » This result is very interesting, and I wonder whether the authors also noted that many hotspots are located at the scaffolds' boundaries (clearly visible in the yellow distribution). Is there an explanation for this? Could there be a link between CNV hotspots and their chromosomal position? I think it would be interesting to test statistically whether the position of hotspots at scaffold boundaries is "random" or not. This thought comes from another study focused on CNVs in balsam poplar (Prunier et al., 2018).
Answer: Indeed, hotspots appeared to be located at scaffold boundaries. It is logical to assume that such boundaries are caused by the presence of repeated elements, which hinder assembly. In turn, repeated elements, whether found locally or elsewhere in the genome, can also be instrumental in the generation of CNVs, for instance by favouring non-allelic homologous recombination. However, following the reviewer's comment, we looked at the distribution of CNV hotspot positions along scaffolds (taking scaffold length into account) and it was relatively homogeneous. In contrast to the CNV work in balsam poplar (Prunier et al. 2018), where a genome assembly resolved at the chromosome level was available, our data do not permit testing for relative position with respect to telomeres or centromeres.
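As an illustration of the kind of check described in this answer, the sketch below rescales each hotspot position by the length of its scaffold and compares the resulting fractions with a uniform distribution using a one-sample Kolmogorov-Smirnov statistic. This is a minimal sketch under stated assumptions, not the procedure used in the manuscript: the response does not say which test was applied, the function names and input layout are hypothetical, and the p-value relies on the standard asymptotic approximation, which is only rough for small samples.

```python
import math

def relative_positions(hotspots):
    """Convert (position, scaffold_length) pairs into fractions in [0, 1].

    The (position, scaffold_length) layout is a hypothetical input format.
    """
    return [position / length for position, length in hotspots]

def ks_uniform(fractions):
    """One-sample Kolmogorov-Smirnov test against Uniform(0, 1).

    Returns (D, approximate p-value). The p-value uses the asymptotic
    Kolmogorov series with a small-sample correction of the statistic.
    """
    xs = sorted(fractions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Largest gap between the empirical CDF and the uniform CDF F(x) = x,
        # checked just after and just before each observation.
        d = max(d, (i + 1) / n - x, x - i / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical example: ten hotspots spread evenly along 1,000-bp scaffolds.
even = relative_positions([((i + 0.5) * 100, 1000) for i in range(10)])
d_even, p_even = ks_uniform(even)
```

Hotspots piled up at scaffold ends would give a large D and a small p-value, whereas a homogeneous spread, as reported in the answer, would give a small D.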
Reviewer: I noted that the authors performed a DAPC to identify putative CNV loci related to caribou ecotypes (i.e. boreal sedentary and migrating ecotypes). However, I did not find any information about the DAPC procedure itself. Did the authors use a DAPC with prior information? Did they use cross-validation to select the number of substantial PCs? How many PCs did they use? I think that this information should be added to the supplementary material.

Answer:
We applied the DAPC with the known ecotype of each individual as prior information. The number of substantial PCs was chosen according to the cumulative proportion of variance explained, so that the retained PCs accounted for 80% of the total variance. Only one discriminant function was retained to discriminate between the two groups. This information was added to the "material and methods" section of the new manuscript version ("CNV detection and characterization" subsection).
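The selection rule described in this answer can be sketched as follows. This is only an illustration under stated assumptions: the eigenvalues are made-up numbers, the function names are hypothetical, and the code does not reproduce the authors' actual DAPC run. It shows how a cumulative-variance threshold of 80% translates into a number of retained PCs, and why two groups leave a single discriminant function (a discriminant analysis yields at most one fewer function than there are groups).

```python
def n_pcs_for_variance(eigenvalues, threshold=0.80):
    """Smallest number of PCs whose cumulative variance reaches `threshold`.

    `eigenvalues` are the PCA variances, in any order.
    """
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)

def n_discriminant_functions(n_groups, n_pcs):
    """Upper bound on discriminant functions: min(groups - 1, retained PCs)."""
    return min(n_groups - 1, n_pcs)

# Hypothetical eigenvalues: cumulative proportions are 0.4, 0.7, 0.9, 1.0.
example_pcs = n_pcs_for_variance([4.0, 3.0, 2.0, 1.0])
# Two ecotypes (forest and migrant) leave a single discriminant function.
example_functions = n_discriminant_functions(2, example_pcs)
```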
Reviewer: On another note, to assess the putative association between CNV data and population ecotypes, I would suggest testing it with a Redundancy Analysis (RDA), which is specifically built to test the correlation between genetic and ecological data. Moreover, RDA allows users to define an FDR threshold to identify candidate loci involved in genetic adaptation. It would be easy and not too long (one month?) to compare the CNV matrix with the ecotype vector in such an analysis.
A well-written tutorial is available online at: https://popgen.nescent.org/2018-03-27_RDA_GEA.html

Answer: Following the reviewer's suggestion, we looked into applying a Redundancy Analysis as described online (https://popgen.nescent.org/2018-03-27_RDA_GEA.html). This approach is an interesting alternative for Genotype-Environment Association (GEA) analysis, as it delineates a combination of genetic markers and a combination of environmental factors presenting the maximum correlation. We found two limitations that make the RDA not ideal in our case. First, the number of samples (N=19), spread over two groups (forest and migrant ecotypes), was somewhat limited for this kind of statistical analysis. Second, we actually have only one environmental variable, and a categorical one at that, while an RDA is meant to summarize variation over several variables (mostly numerical ones, as shown in the online workflow). As a matter of fact, we wanted to apply something similar to GEA before choosing to apply a DAPC. In a DAPC, a categorical variable is used to cluster individuals into groups, and the variants with high loading scores on a discriminant function between groups are the ones that vary the most among all variants. Thus, a DAPC appears more appropriate for our sampling design. Similarly to what is described in the online tutorial for RDA, we selected outlier variants (outlier loading scores) as candidates for ecotype divergence.
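The outlier selection mentioned at the end of this answer can be illustrated with a short sketch. It flags variants whose loading on the discriminant function lies more than a chosen number of standard deviations from the mean loading. The cut-off of three standard deviations, the variant names, and the loading values are assumptions made for the example; the response does not report the threshold the authors used.

```python
import statistics

def outlier_variants(loadings, z=3.0):
    """Return the variants whose loading lies beyond mean +/- z * SD.

    `loadings` maps a variant identifier to its loading score on the
    discriminant function. `z` is an illustrative cut-off, not a value
    taken from the manuscript.
    """
    values = list(loadings.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - z * sd, mean + z * sd
    return sorted(name for name, score in loadings.items()
                  if score < lower or score > upper)

# Hypothetical loadings: thirty unremarkable CNVs and one strong contributor.
example_loadings = {f"cnv{i}": 0.01 * ((i % 5) - 2) for i in range(30)}
example_loadings["cnv_big"] = 1.0
candidates = outlier_variants(example_loadings)
```

A stricter or looser cut-off changes the size of the candidate list, so the value of `z` should be reported alongside the candidates.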

Reviewer: [3] Discussion
The discussion is well written, and the authors argue well about the possible biases or lack of data considering their dataset and their interpretations.
In the section « CNVs signatures related to adaptation in wild mammal ecotypes », I would suggest the study conducted by , which assesses CNVs and adaptation in another circumpolar mammal species (i.e. bears). It may contain interesting information in terms of ecological adaptation. Note that this is just a literature suggestion.
Answer: As suggested, we referred to this publication in the mentioned section, highlighting similarities in gene functions related to circumpolar adaptation in both genera despite the evolutionary distance between Ursus and Rangifer. Two interesting terms appeared in both genera: molecular functions related to fatty acid metabolism, which is known to mitigate hypothermia, and UV response, which is likely related to life at high latitudes.

Reviewer: [4] Material and methods : In the parts « Whole genome long-read sequencing » and « Transcriptome analyses », please add the number of samples used for each approach.
In the part « Whole-genome short-read sequencing of the various ecotypes », the number of samples used and the sex information have been indicated. Perfect!

Answer: The whole genome long-read sequencing and the transcriptome analyses were both performed on one female sample. This information was added to the "Material&Methods" section ("Whole genome long-read sequencing" subsection).
Reviewer: Could the authors add a short part about the identification of putative adaptive CNVs?

Answer: As mentioned above, we added a few lines to the "material and methods" section ("CNV detection and characterization" subsection) to better describe the procedure we applied to identify putative adaptive CNVs.

Thank you for submitting your revised manuscript entitled "CNVs with adaptive potential in Rangifer tarandus: genome architecture and new annotated assembly". We would be happy to publish your paper in Life Science Alliance pending final revisions necessary to meet our formatting guidelines.
Along with the points mentioned below, please attend to the following:
-please upload your main manuscript text as an editable doc file
-please upload your main and supplementary figures as single files
-please add the Twitter handle of your host institute/organization as well as your own or/and one of the authors in our system
-please consult our manuscript preparation guidelines https://www.life-science-alliance.org/manuscript-prep and make sure your manuscript sections are in the correct order and labeled correctly
-please upload your Tables in editable .doc or excel format
-please add your main, supplementary figure, and table legends to the main manuscript text after the references section
-please add an Author Contributions section to your main manuscript text
-please add a conflict of interest statement to your main manuscript text
-please add callouts for Figure 6A-B to your main manuscript text
-please add a credit line for the photos in Figure 6

If you are planning a press release on your work, please inform us immediately to allow informing our production team and scheduling a release date.
LSA now encourages authors to provide a 30-60 second video where the study is briefly explained. We will use these videos on social media to promote the published paper and the presenting author (for examples, see https://twitter.com/LSAjournal/timelines/1437405065917124608). Corresponding or first-authors are welcome to submit the video. Please submit only one video per manuscript. The video can be emailed to contact@life-science-alliance.org

To upload the final version of your manuscript, please log in to your account: https://lsa.msubmit.net/cgi-bin/main.plex
You will be guided to complete the submission of your revised manuscript and to fill in all necessary information. Please get in touch in case you do not know or remember your login name.
To avoid unnecessary delays in the acceptance and publication of your paper, please read the following information carefully.
A. FINAL FILES: These items are required for acceptance.
--An editable version of the final text (.DOC or .DOCX) is needed for copyediting (no PDFs).
--High-resolution figure, supplementary figure and video files uploaded as individual files: See our detailed guidelines for preparing your production-ready images, https://www.life-science-alliance.org/authors
--Summary blurb (enter in submission system): A short text summarizing in a single sentence the study (max. 200 characters including spaces). This text is used in conjunction with the titles of papers, hence should be informative and complementary to the title. It should describe the context and significance of the findings for a general readership; it should be written in the present tense and refer to the work in the third person. Author names should not be mentioned.