Phylogenetic profiling resolves early emergence of PRC2 and illuminates its functional core

This study strengthens the support for PRC2 emergence before the diversification of eukaryotes, detects a common presence of E(z) and ESC, indicating a conserved core, identifies diverse VEFS-Box Su(z)12 candidate proteins, and proposes a substrate specificity shift during E(z) evolution.

1. The authors developed a sequence homology-based computational automated tool (PcG-finder) for the search of PRC2 core components in genomes and transcriptomes, using Drosophila melanogaster PRC2 protein sequences. I wonder how robust and sensitive this tool is. Could the authors validate their tool by showing that they manage to recover all the known PRC2 core components in the organisms where PRC2 composition has been established experimentally: Arabidopsis, C. elegans, Neurospora crassa, Cryptococcus neoformans, Paramecium tetraurelia, Tetrahymena thermophila?
2. Along the same line, SUZ12 is not always found in PRC2 complexes. In the fungus Cryptococcus neoformans (Dumesic, 2015) and in C. elegans (Bender 2004; Xu 2001), no obvious SUZ12 homologs were identified, yet Cryptococcus Bnd and C. elegans MES3, which are required for the catalytic activity of the complex, might be functional homologs of SUZ12, as suggested before for MES3 (Schuettengruber, 2017). Do the authors identify those proteins with their tool? Pull-down experiments recently revealed the existence of putative SUZ12 components in PRC2 in the ciliates Paramecium tetraurelia (Miro Pina 2021 https://doi.org/10.1101/2021.08.12.456067) and Tetrahymena thermophila (Xu 2021 https://doi.org/10.1093/nar/gkaa1262) that could not be identified by homology-based search because of sequence divergence. The conclusion that there is evidence of the absence of SUZ12 in ciliates (Figure 1, Figure 5, line 143) should thus be corrected. In addition, the limitations of the approach presented in the manuscript should be discussed. It might also be wise not to draw any conclusions on the dispensability of SUZ12 based on its absence. Instead, I suggest considering the alternative possibility that SUZ12 might evolve more rapidly than the other components. The authors should rephrase the abstract (lines 21-22) and the results section (line 143 and lines 279-281).
Minor comments:
-Line 13: "maintaining" not "establishing transcriptionally silent chromatin states"
-Lines 50-53: regarding the evidence for H3K27me3 in unicellular eukaryotes, some references are missing (e.g. Carlier 2021; Jamieson 2013; Lhuillier-Akakpo 2014; Liu 2007)
-Lines 68-69: The association between H3K27me3 and genomic repeats could be supported by a more complete list of references and/or a review.
-Lines 90-91: The sentence is incomplete: "suggesting the evidence..."
-Lines 132-134: The statement regarding NURF55 function outside PRC2 is correct. It would be better to state that NURF55 belongs to several complexes earlier in the introduction.
-Line 220: reference to E(z) phylogeny in ciliates should be corrected. It is not Frapporti 2019 but Lhuillier-Akakpo 2014.
-Lines 329-332: The sentence is unclear. Why is this a good model for studying the evolution of PRC2 catalytic function?
-Lines 387-389: Regarding the structural modeling of SET domains of E(z) proteins belonging to the 5 clades, the authors state that clade II is different from the others: "some of the loops/turns near the cofactor binding pocket are slightly disordered when compared to other models, which might be the reason for the binding difference". Could the authors be more precise: I don't understand from their description what exactly differs in the clade II enzyme and how this is linked to substrate specificity. The clade II Paramecium tetraurelia SET domain has been modeled in a previous study (Frapporti 2019). How similar/different are the two modeled structures and conclusions?
-Lines 392-394: add the reference.
-Lines 415/416: what do the authors mean by "biological significance of PRC2": its biological functions?
-Figure 5: A "parasitism lifestyle" cannot be used to describe Ciliophora/Alveolata, as indicated in the legend. This should be corrected.
Reviewer #2 (Comments to the Authors (Required)):
Polycomb Repressive Complex 2 (PRC2) has been shown to play crucial roles in cell differentiation and cell fate maintenance during the development of many metazoan models by generating specific histone modifications and establishing a poised gene expression program. In the article, "Phylogenetic profiling resolves early emergence of PRC2 and illuminates its functional core", the authors performed comprehensive phylogenetic analysis to study the distribution of PRC2 components across eukaryotes. Upon surveying the sequence information of 283 species, the authors identified a putative ortholog of Ez in a subset of Discoba, an earlier eukaryotic lineage. The authors did not observe Ez in Metamonada, although both Discoba and Metamonada belong to the Excavates superfamily. After studying distribution, the authors focused on the Ez component, which harbors the enzymatic SET domain and thus forms the catalytic part of PRC2. The authors classified full-length Ez into five groups and identified a group where Discoba is located with other species (group II). Then, the authors zoomed into the SET domain and further compared domain structures across the five groups. In a structure model, the histone substrate-binding pocket of group II showed more disordered features than other groups.
Overall, this study represents comprehensive phylogenetic profiling of the PRC2 subunits across the entire Eukaryota. My main concern is the presentation: Figures 1-3 in particular need to be simplified and made more to the point.
1. Figure 1: Manuscript (written part of Figure 1) is relatively straightforward, but Figure 1 itself is too complicated to understand by looking at it. Figure legend is poorly described and contains little information. What are the main messages of the figure? Where is that info? There are many unnecessary parts of the figure as well. For example, what is the added value of the numbers next to the species? Species names are barely readable on a 100% scale. Do you need all those names and numbers in the main figure? The authors should move the current version into supplementary and provide a simple figure. Also, it would be helpful to add a photo or image of each representative species used in their analysis.
2. Among the PRC2 components, NURF55 shows the most substantial conservation across eukaryotes. Nevertheless, the authors focused on Ez. The rationale for Ez selection has to be apparent from the beginning of the manuscript.
3. Figures 2 and 3 have the same issue as Figure 1. The numbers in triangles and branches are too small to recognize and may not be necessary for the main figure. Figure 2A: Information about different species in each clade is an interesting biological outcome, but it is hard to get from the figure. Figure 2B: it is unclear what the take-home message of this figure is. The written part (page 7; 251-272) is also confusing.
4. I understand the challenge of assembling and organizing all info after sequencing analysis. However, the current version of the figure set seems premature. These figures show the crude output of the research (species names, conservation scores, etc.), and the authors made little effort to deliver biological meaning to a broader, more general readership.
5. Figure 4 delivers the biological message excellently. Figure 5 is useful for getting a summary view of Figures 1-3. It would be helpful to add info about the percentage value of the existing PRC2 components. For instance, most Alveolata they observed do not contain Suz12, ESC, and Ez.
Reviewer #3 (Comments to the Authors (Required)):
The Polycomb repressive complexes are important chromatin regulators found across eukaryotes. Although their function, composition and mechanisms of action have been intensively studied in many major model systems, the evolution of the Polycomb system is not yet well understood. The authors here study the evolution of Polycomb Repressive Complex 2 (PRC2) components across eukaryotes. Their findings support the hypothesis that PRC2 emerged early in eukaryote evolution, likely being present in the last common eukaryotic ancestor, and provide some additional support to the idea that E(z) may have first evolved as a dual K9/K27 methyltransferase. Although neither of these ideas is new and both have been proposed by other studies, this study is certainly valuable in providing additional evidence to support them. I think this manuscript is certainly worthy of publication but I have a number of issues and suggestions which I think should be addressed before acceptance:
1. My biggest concern with the manuscript is the quality of the data upon which the authors rely to make conclusions about presence/absence in different species and groups. The authors use BUSCO as a measure of the completeness of the genomic information they use. It is not clear to me, however, why having more than 50% completeness would be considered good. In fact, I would think by looking at the numbers for many of the genomes used that the BUSCO assessment shows that those genomes are very incomplete, making it quite likely that potential homologs could be missed in those species. I think the authors should use some cut-off in deciding whether to include a species in their analysis at all. Some species have such a low score that they should not be included, but unfortunately for some groups this would remove most of the species. This is particularly important when the authors want to make claims about which species or clades have no ortholog. Any decision on this will always be arbitrary but I would say that 50% is a reasonable cutoff. This is not a major problem for the overall conclusions of the paper per se, as even in groups where some genomes are of low quality the fact that homologs are or are not found across the clade is good evidence. To put this another way, even if all the genomes are of low quality, the chance that it would be missed in many of them is low. It does however call into question some of the more specific claims made about presence/absence in specific species/clades. I think the authors should shy away from making major claims, particularly based on absence in species/clades where the quality of genomes is low, and should rather stick to more high-level claims. For example, there are only 2 species in the Metamonada which have a BUSCO of higher than 50 and most have very low values, which would lead to any claims from this group being less robust than for other groups. You see that within the Discoba the only two species which have an ESC or E(z) are the species with BUSCO greater than 50. I would be interested to see, if the authors were to recreate Figure 1 including only species with BUSCO > 50 or also with BUSCO > 75, what conclusions can then be drawn convincingly from the data.
2. I think the paper could be substantially shorter and many of the major points could be made in a more succinct way. At several points the authors describe in detail the presence/absence of homologs in various species, which is totally unnecessary. This is particularly unimportant given the uncertainty over the quality of data for these species. It would make the paper much more readable if the authors were to make shorter, more definitive statements of presence/absence in clades rather than including too much detail.
3. Throughout the manuscript the authors use the term "higher eukaryotes". This term is somewhat outdated and can be misleading. It is hard to define what higher means and it often suggests some relationship between groups that does not exist. I would try to replace such terms with more concise descriptions of the clades/groups that you want to refer to.
4. Line 60: There is no need to use the term multicellular when referring to animals.
5. In several places the authors use the term "homolog" where it would be more appropriate to use the term "ortholog".
6. In a lot of places in the manuscript the writing is sloppy and it is unclear what the authors are talking about. For example, on line 90 a sentence ends "suggesting the existence.", and it seems the end of the sentence is missing. Similarly, on line 92 the authors write "... a separate domain of life representing, ..." and again it seems like some text is missing, as it is unclear what should come after representing. Some attention should be paid to the writing to make the paper more readable.
7. Although the authors focus on PRC2, a short description of what PRC1 and PRC2 are and what is known about PRC1 evolution would help orient the reader in the field. Several studies have addressed PRC1 evolution recently (Berke and Snel, 2015; Chen et al., 2016; Gahan et al., 2020; Schuettengruber et al., 2017) and a short description of what is known there may be useful. It also seems PRC2 is more conserved/older than PRC1 (or, to be specific, several components of PRC1) and this may be worth discussing.
8. In Figure 1 the authors use a very definite phylogeny of eukaryotes even though there has been considerable debate in this area. The authors should either state the source of the phylogeny they use and/or include some detail about possible alternative phylogenies.
9. Figure 1: In a couple of cases the colored icons have moved and are located outside of the circle.
10. The authors state that Su(z)12 is absent in several groups. One of the species in which they claim it is missing is Tetrahymena; however, it has been shown that there is indeed a Suz12 in this species but it is substantially shorter than that found in other species (Xu et al., 2021). I assume that this was not counted as it did not meet the criteria set by the authors to assign orthology. Firstly, this should be discussed, albeit briefly, but secondly it also highlights an important point: in cases where loss is seen it may be that the protein has simply lost some domain or changed in some other way so as to be difficult to recognize as a true ortholog without additional experiments. This is a problem in all such studies and not something that the authors can help, but a description of such possibilities would be helpful.
11. Line 137: you cannot start a new paragraph with "however", and in this case it makes no sense either way, so please rephrase.
12. Line 151-152: The authors state their ESC phylogeny is the "most robust ESC tree to date". This statement must somehow be backed up by data. It is not clear how many other ESC trees there are that the authors are talking about. These should be referenced. The authors must also state what they mean by "robust" and by which measurements they came to this conclusion. If this is simply the opinion of the authors, I would remove this statement entirely.
13. Line 152-154: The authors state "Similarly, unresolved relationships within microalgae species were reported (Zhao et al. 2020), whereas other studies suggest a monophyletic origin for ESC homologs found in most eukaryotes (Huang et al. 2017)." It is not clear to me, however, how the ESC phylogeny here fits with these reports. If the authors want to mention them here they should state whether they find the same or different things to these two studies.
14. Line 157: The jump to NURF55 here is a bit abrupt and could do with starting as a new paragraph.
15. Line 160: The reference here to possible horizontal gene transfer events is vague and not well supported. There are other explanations as to why a single gene family phylogeny may not recapitulate the phylogeny of the species used. The authors should be more reserved in any claims of horizontal gene transfer unless they have additional evidence for it. This is also the case later in the manuscript, and the authors should be careful not to make claims about HGT without additional support from outside of the phylogeny.
16. Line 164-169: It is unclear to me what the authors are trying to say in this section. It looks as if some sentences are missing which would make sense of it, but as it is it doesn't fit together at all.
17. Line 212-243: This entire paragraph discusses in detail the clades into which E(z) homologs fall but in the end doesn't really reveal anything of importance to the paper and should be shortened down to a single statement.
18. Line 256: Just because sequences are found in species close to the root does not necessarily suggest this has anything to do with the ancestral sequence, and such statements should be avoided.
19. Line 299-301: The authors state: "Our results show that the SET-domain proteins are less conserved in bacteria and archaea than what was reported previously (Alvarez-Venegas 2014), likely due to a lower number of representative species per group in our analysis." Given that the authors admit that the discrepancy between these studies is likely due to fewer species in their analysis, they cannot conclude that "SET-domain proteins are less conserved".
20. Line 393: The authors must cite the study which showed that Paramecium E(z) is capable of methylating H3K9.

Dear Dr. Novella Guidi, PhD
I am happy to submit a revised version of our manuscript "Phylogenetic profiling resolves early emergence of PRC-2 and illuminates its functional core" (#LSA-2021-01271-T) for consideration by Life Science Alliance.
We have addressed all suggestions of the reviewers and provide a point-by-point response to the comments raised by the reviewers. To ease the revision of the new version of the manuscript, all changes are mentioned in this rebuttal letter. In addition to the reviewers' suggestions, the manuscript format has also been changed to fit the LSA required format.
On behalf of all co-authors, let me thank you for your and the reviewers' efforts invested into our manuscript. We are looking forward to hearing from you.

With best wishes
Abdoallah Sharaf

Reviewer #1:
The manuscript by Sharaf et al examines the evolutionary origin of the Polycomb Repressive Complex 2 (PRC2) using a phylogenetic approach. PRC2 deposits the histone mark H3K27me3 on silent protein-coding genes. The composition of the complex has been characterized in very few species and its enzymatic activity was experimentally tested in a handful of organisms. Here, the authors survey the genomes/transcriptomes of 283 species to search for 4 PRC2 core components: E(z), the catalytic subunit, ESC, SUZ12, and Nurf55. They identified E(z) and Esc homologs in species that support the emergence of PRC2 prior to the diversification of eukaryotes. Their analysis also suggests that the E(z) enzyme may have shifted substrate specificity, from K9 to K27 methylation, in the course of evolution.
Response: Thank you for your thorough review and comments that have helped to improve our manuscript. The major changes involve the addition of PcG-finder validation and additional analyses of VEFS-Box proteins across the studied species to consider higher diversification of Su(z)12 orthologs.
Major comments:
1. The authors developed a sequence homology-based computational automated tool (PcG-finder) for the search of PRC2 core components in genomes and transcriptomes, using Drosophila melanogaster PRC2 protein sequences. I wonder how robust and sensitive this tool is. Could the authors validate their tool by showing that they manage to recover all the known PRC2 core components in the organisms where PRC2 composition has been established experimentally: Arabidopsis, C. elegans, Neurospora crassa, Cryptococcus neoformans, Paramecium tetraurelia, Tetrahymena thermophila?
2. Along the same line, SUZ12 is not always found in PRC2 complexes. In the fungus Cryptococcus neoformans (Dumesic, 2015) and in C. elegans (Bender 2004; Xu 2001), no obvious SUZ12 homologs were identified, yet Cryptococcus Bnd and C. elegans MES3, which are required for the catalytic activity of the complex, might be functional homologs of SUZ12, as suggested before for MES3 (Schuettengruber, 2017). Do the authors identify those proteins with their tool? Pull-down experiments recently revealed the existence of putative SUZ12 components in PRC2 in the ciliates Paramecium tetraurelia (Miro Pina 2021 https://doi.org/10.1101/2021.08.12.456067) and Tetrahymena thermophila (Xu 2021 https://doi.org/10.1093/nar/gkaa1262) that could not be identified by homology-based search because of sequence divergence. The conclusion that there is evidence of the absence of SUZ12 in ciliates (Figure 1, Figure 5, line 143) should thus be corrected. In addition, the limitations of the approach presented in the manuscript should be discussed. It might also be wise not to draw any conclusions on the dispensability of SUZ12 based on its absence. Instead, I suggest considering the alternative possibility that SUZ12 might evolve more rapidly than the other components.

Response to comments 1 and 2:
We performed the validation of the computational pipeline in the first steps of our work but did not include the results in the original version of the manuscript, which was a clear mistake on our side. In the revised version, we newly include a first chapter of results entitled "Specificity and sensitivity of PcG-finder, an automated protein homolog identification pipeline" and Table S1, where we summarize the results of the pipeline validation. The PcG-finder tool has been validated by comparing its predictions of all the PRC2 core components in 8 organisms from different eukaryotic lineages, including the species mentioned by the reviewer. We find that the pipeline can selectively identify sequence homologs in all the organisms except for the Su(z)12 proteins in N. crassa and T. thermophila and ESC in ciliates, which carry limited sequence conservation. Importantly, all the known E(z) orthologs are faithfully recovered by the pipeline. Being a sequence-homology-based pipeline, PcG-finder will not detect PRC2 core subunits that are functional but not sequence homologs and that do not carry characteristic protein domains, i.e. Su(z)12 functional homologs in C. elegans, C. neoformans or P. tetraurelia. In the revised version, we discuss the limitations of the approach to avoid misinterpretation of homolog absence.
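A validation of this kind can be summarized as a recovery rate over experimentally established (species, subunit) pairs. The following is only a minimal sketch of that bookkeeping, not the PcG-finder code: the function name `recovery_summary` and the placeholder species labels are hypothetical, and the toy sets are illustrative rather than the values reported in Table S1.

```python
from typing import Dict, Set, Tuple

# A (species, subunit) pair, e.g. ("species_A", "E(z)").
Pair = Tuple[str, str]


def recovery_summary(known: Set[Pair], predicted: Set[Pair]) -> Dict[str, object]:
    """Compare pipeline predictions with experimentally established subunits.

    Returns the fraction of known pairs that were recovered (sensitivity)
    and the sorted list of known pairs that the pipeline missed.
    """
    recovered = known & predicted
    missed = known - predicted
    sensitivity = len(recovered) / len(known) if known else float("nan")
    return {"sensitivity": sensitivity, "missed": sorted(missed)}


# Hypothetical placeholder data, NOT the results from the manuscript.
known = {
    ("species_A", "E(z)"),
    ("species_A", "ESC"),
    ("species_A", "Su(z)12"),
    ("species_B", "E(z)"),
}
predicted = {
    ("species_A", "E(z)"),
    ("species_A", "ESC"),
    ("species_B", "E(z)"),
}

summary = recovery_summary(known, predicted)
print(summary)
```

Listing the missed pairs explicitly, rather than reporting a single rate, keeps divergent subunits such as Su(z)12 visible instead of averaging them away.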
We would like to thank the reviewer for highlighting the divergence instead of loss of the Su(z)12 subunit. Based on this comment and not having identified the VEFS-Box-containing Su(z)12 orthologs in N. crassa and T. thermophila, we included a domain-based protein search to identify all VEFS-Box proteins in the species studied and the results have been included in the revised version of the manuscript. Using this approach, we identify the Su(z)12 orthologs in the abovementioned species, indeed justifying the approach to search for more divergent Su(z)12-like proteins. We are now basing our conclusions and discussion on the combined results of the two approaches. Therefore, all conclusions on the dispensability of Su(z)12 based on its absence have been removed or rephrased to accommodate the possibility of more divergent VEFS-Box proteins or even unrelated proteins fulfilling the function of Su(z)12. We think that expanding the manuscript by providing the data on VEFS-Box proteins (Fig 1 and fully listed in Table S8) is a useful extension for the scientific community, although proteins not involved in PRC2 (false positives) are likely to be identified this way. In this version of the manuscript, we have distinguished Su(z)12 orthologs with sufficient protein sequence similarity to be identified by PcG-finder, Su(z)12 orthologs that can only be identified based on the existence of the VEFS domain (e.g. T. thermophila, N. crassa) and proteins without sequence or functional domain conservation that may be functional homologs of Su(z)12, e.g. Bnd1 in C. neoformans or MES3 in C. elegans, which cannot be identified by homology search. We call these "functional homologs" of Su(z)12 and do not include them among Su(z)12 or VEFS-Box proteins.
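The three categories described above amount to a simple precedence rule. The sketch below illustrates that rule only; the function name, argument names and category labels are hypothetical and are not taken from PcG-finder or the manuscript.

```python
def classify_suz12_candidate(
    found_by_homology: bool,
    has_vefs_box: bool,
    functional_evidence: bool,
) -> str:
    """Assign a candidate protein to one of the categories used in the response.

    Precedence (an assumption of this sketch): a sequence-homology hit
    outranks a domain-only hit, which outranks purely experimental evidence.
    """
    if found_by_homology:
        # Sufficient sequence similarity for a homology-based search.
        return "Su(z)12 ortholog (sequence homology)"
    if has_vefs_box:
        # Divergent candidate recognised only through the VEFS-Box domain.
        return "VEFS-Box protein (domain-based candidate)"
    if functional_evidence:
        # No sequence or domain conservation; known only from experiments.
        return "functional homolog (not counted as Su(z)12 or VEFS-Box)"
    return "not detected"


# Example: a protein with a VEFS-Box but no full-length homology hit.
print(classify_suz12_candidate(False, True, False))
```

Keeping the functional homologs in a separate category mirrors the statement above that they are not counted among Su(z)12 or VEFS-Box proteins.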
Based on the additional results, we have rephrased the Su(z)12-relevant sections as follows:
Abstract: FROM: "Analyzing 283 species, we robustly detect a common presence of E(z) and ESC, suggesting that Su(z)12 may have emerged later and/or maybe dispensable from the evolutionarily conserved functional core of PRC2." TO: "Full-length Su(z)12 orthologs were identified in some lineages and species only, indicating, non-exclusively, high divergence of VEFS-Box-containing Su(z)12-like proteins, functional convergence of sequentially unrelated proteins or Su(z)12 dispensability."
Original line 143: FROM: "Our results suggest the Su(z)12 was lost in Alveolata at the root of Myzozoa." TO: "Su(z)12 is generally absent in Alveolata except for the recently described Colponemidia clade (Tikhonenkov et al. 2020), suggesting divergence of the Su(z)12 at the root of Myzozoa."
Original lines 279-281: FROM: "Su(z)12 may have emerged later and its secondary loss seems more frequent than loss of ESC, suggesting its dispensability." TO: Lines 328-342: "We have not found conserved Su(z)12 orthologs in multiple lineages including Discoba and Metamonada (Fig 1 and Supplementary file 1). VEFS-Box proteins are nevertheless present in most of these lineages (Fig 1 and Table S8) and they may represent more divergent Su(z)12-like proteins. VEFS-Box proteins are found in alveolates where E(z) homologs seem missing (Fig 1). We found non-Su(z)12 VEFS-Box proteins in model species with well-described Su(z)12 orthologs, such as human, D. melanogaster or A. thaliana, indicating potential engagement of VEFS-Box proteins in other complexes. Experimental evidence will be needed to associate VEFS-Box proteins with PRC2. It is interesting to note that the currently described examples of PRC2 where other than VEFS-box proteins take the function of Su(z)12 come from species (C. elegans, C. neoformans and P. tetraurelia) where no VEFS-Box proteins were identified (Table S8) and that contain ancestral E(z) of Clades I or II. Similarly, highly divergent Su(z)12 (or even other VEFS-Box proteins) are involved in PRC2 in fungi and ciliate T. thermophila within Clade I and/or Clade II E(z). This may indicate later evolution and engagement of VEFS-box proteins and canonical Su(z)12 in PRC2, but this hypothesis must be addressed experimentally."
Minor comments:
-Line 13: "maintaining" not "establishing transcriptionally silent chromatin states"
Response: Thank you, it is a valid criticism and the word "establishing" has been replaced by "maintaining".
-Lines 68-69: The association between H3K27me3 and genomic repeats could be supported by a more complete list of references and/or a review.
Response: Thank you for the comment; we lost some crucial references while shortening the text. We have newly added references to reviews that better reflect and discuss H3K27me3 in constitutive heterochromatin (Deleris et al. 2021; Vijayanathan et al. 2022).
-Lines 90-91: The sentence is incomplete: "suggesting the evidence..."
Response: Thank you for noticing and please excuse the negligence. The sentence has been completed.
-Lines 132-134: The statement regarding NURF55 function outside PRC2 is correct. It would be better to state that NURF55 belongs to several complexes earlier in the introduction.

Response: We have included the following statement in the introduction:
Line 73-74: "NURF55 is involved in multiple chromatin-related protein complexes (Hennig et al. 2005; Suganuma et al. 2008)."
-Line 220: reference to E(z) phylogeny in ciliates should be corrected. It is not Frapporti 2019 but Lhuillier-Akakpo 2014.
Response: Thank you for the correction, the reference has been changed. The whole passage is newly included in Supplementary file 1.
-Lines 329-332. The sentence is unclear. Why is this a good model for studying the evolution of PRC2 catalytic function?

Response:
We have removed the statement. Current version: Lines 369-371: "The E(z) subfamily of SET-domain proteins therefore seems to have diverged after the emergence of eukaryotes but prior to their expansion."
-Lines 387-389: Regarding the structural modeling of SET domains of E(z) proteins belonging to the 5 clades, the authors state that clade II is different from the others: "some of the loops/turns near the cofactor binding pocket are slightly disordered when compared to other models, which might be the reason for the binding difference". Could the authors be more precise: I don't understand from their description what exactly differs in the clade II enzyme and how this is linked to substrate specificity. The clade II Paramecium tetraurelia SET domain has been modeled in a previous study (Frapporti 2019). How similar/different are the two modeled structures and conclusions?

Response:
We have rephrased the paragraph. Lines 419-422: "The representative protein model for clade II (P. tetraurelia) is comparable to the previously computed model (Frapporti et al., 2019), although the amino acids comprising the C-terminal post-SET domain are not included in our model. Importantly, structural folds are conserved, and retain similar structural topology and substrate binding patterns (Fig S8)." In our model (Fig 4), the N-terminal alpha-helix appears shorter than in the model of Frapporti et al. 2019. In our study, we used the SET-domain of EZH1 (PDB ID: 7KSR; 38.66% identity) as a reference structure, while Frapporti et al. used the SET-domain of EZH2. To respond to this comment, we modelled the P. tetraurelia SET-domain using EZH2 (PDB ID: 5hyn.1.A; 36.97% identity) as a reference template. In this case, the helix is longer and the model resembles the published one. We do not see the post-SET domain structure as this sequence was not included in our modelled sequence. Importantly, the substrate orientation is similar in both models, showing K27 oriented in proximity of the cofactor.

Response:
We have revised the sentence. Although genomic and biological functions clearly need to be addressed, this is not specifically highlighted by the presented work. We have therefore rephrased to: Lines 447-450: "However, our findings highlight important future questions that need to be addressed experimentally, including PRC2 catalytic activity, E(z) substrate specificity, or involvement of VEFS-Box proteins and other functional homologs of Su(z)12."
-Figure 5: A "parasitism lifestyle" cannot be used to describe Ciliophora/Alveolata, as indicated in the legend. This should be corrected.
Response: Thank you, the lifestyle of Ciliophora has been corrected.

Reviewer #2:
Polycomb Repressive Complex 2 (PRC2) has been shown to play crucial roles in cell differentiation and cell fate maintenance during the development of many metazoan models by generating specific histone modifications and establishing a poised gene expression program. In the article, "Phylogenetic profiling resolves early emergence of PRC2 and illuminates its functional core", the authors performed comprehensive phylogenetic analysis to study the distribution of PRC2 components across eukaryotes. Upon surveying the sequence information of 283 species, the authors identified a putative ortholog of Ez in a subset of Discoba, an earlier eukaryotic lineage. The authors did not observe Ez in Metamonada, although both Discoba and Metamonada belong to the Excavates superfamily. After studying distribution, the authors focused on the Ez component, which harbors the enzymatic SET domain and thus forms the catalytic part of PRC2. The authors classified full-length Ez into five groups and identified a group where Discoba is located with other species (group II). Then, the authors zoomed into the SET domain and further compared domain structures across the five groups. In a structure model, the histone substrate-binding pocket of group II showed more disordered features than other groups.
Overall, this study represents comprehensive phylogenetic profiling of the PRC2 subunits across the entire Eukaryota. My main concern is the presentation: Figures 1-3 in particular need to be simplified and made more to the point.

Response:
Thank you for your thoughtful and thorough review. Major changes involve modification of figures and rearrangement of main text and supplementary figures. We have simplified the font (sans-serif) for all figures and only kept the species names (avoiding accession numbers) to increase legibility of Figures 2 and 3. For both Figures 2 and 3, we created a new version of the figure, collapsing leaves including species from the major supergroups where possible. This way, the association of the clades with the major supergroups is more obvious. Based on comments of Reviewer 1 and Reviewer 3, we have added a section on PcG-finder validation and analyses of the distribution of VEFS-Box proteins in the studied species.
1. Figure 1: Manuscript (written part of Figure 1) is relatively straightforward, but Figure 1 itself is too complicated to understand by looking at it. Figure legend is poorly described and contains little information. What are the main messages of the figure? Where is that info? There are many unnecessary parts of the figure as well. For example, what is the added value of the numbers next to the species? Species names are barely readable on a 100% scale. Do you need all those names and numbers in the main figure? The authors should move the current version into supplementary and provide a simple figure. Also, it would be helpful to add a photo or image of each representative species used in their analysis.
Response: Thank you for these suggestions. We fully agree with the reviewer that the original image was complex. We have tried to simplify the image: we have simplified the font (sans-serif) and kept only the species names (omitting accession numbers) to increase legibility. Only species with a BUSCO score of 50 or higher are now plotted (species with a BUSCO score of 75 or higher are indicated in bold), and the complete data are available in the supplement. We believe that the new version of the figure is more legible. At the same time, we would prefer to maintain the information contained in the current version of the image. Although we have considered options for collapsing the groups, we feel that this may result in misleading information due to the variability within the groups. The summary of the findings is provided in Figure 5 (including representative species images), and we would prefer to keep this revised version of Fig 1 as an overview of the study, if possible.
2. Among the PRC2 components, NURF55 shows the most substantial conservation across eukaryotes. Nevertheless, the authors focused on Ez. The rationale for selecting Ez has to be apparent from the beginning of the manuscript.
Response: Thank you, the introduction has been modified in the following ways: Lines 73-76: "NURF55 is involved in multiple chromatin-related protein complexes (Hennig et al. 2005; Suganuma et al. 2008). Thus, only E(z), ESC, and Su(z)12 subunit were used to infer the phylogeny of the PRC2 complex to bypass confusing interpretations (Huang et al. 2017; Shaver et al. 2010)." and Lines 86-87: "The catalytic activity of PRC2 is carried out by E(z), which is therefore the defining functional subunit."
3. Figures 2 and 3 have the same issue as Figure 1. The numbers in triangles and branches are too small to recognize and may not be necessary for the main figure. Figure 2A - Information about different species in each clade is an interesting biological outcome, but it is hard to get from the figure. Figure 2B - it is unclear what is the take-home message of this figure. The written part (page 7; 251-272) is also confusing.
Response: Thank you. We have simplified the font (sans-serif) for all figures and kept only the species names (omitting accession numbers) to increase legibility in Figures 2 and 3. For both Figures 2 and 3, we created new versions of the figures, collapsing leaves that include species from the major supergroups where possible. This way, the association of the clades with the major supergroups is more obvious. In addition, we have removed panel A depicting the domain organization and kept it as supplementary Fig S5. Finally, the written part (original page 7; 251-272) has been simplified to: Lines 314-319: "Clade I and II proteins contain either only the SET domain or sequentially arranged CXC and SET domains (Fig S5 and Table S10). The SANT domain is present in clades III-IV N-terminally of the arranged canonical domains (CXC and SET) (Fig S5 and Table S10). More complex domain architectures were observed within proteins of clade V. This clade contains all plant and animal sequences, including eumetazoan orthologs (EZH1 and EZH2) (Fig S5 and Table S10)."
4. I understand the challenge of assembling and organizing all the information after sequence analysis. However, the current version of the figure set seems premature. These figures show the crude output of the research (species names, conservation scores, etc.), and the authors made little effort to convey biological meaning to a broader, more general readership.
Response: Thank you, this was a valid and valuable comment that prompted us to modify the figures to hopefully better convey the main message. We tried our best to simplify without losing information that we believe is important.
Figure 4 delivers the biological message excellently. Figure 5 is useful for getting a summary view of Figures 1-3. It would be helpful to add information about the percentage of species in which the PRC2 components are present. For instance, most of the Alveolata they observed do not contain Suz12, ESC, and Ez.

Response:
To this end, we have added values representing the percentage of species where PRC2 components were found next to the symbols for the respective subunits. In the legend, we further refer to the relevant supplementary table S2 where the full list and numbers of species studied can be found.

Reviewer #3
The Polycomb repressive complexes are important chromatin regulators found across eukaryotes. Although their function, composition and mechanisms of action have been intensively studied in many major model systems, the evolution of the Polycomb system is not yet well understood. The authors here study the evolution of Polycomb Repressive Complex 2 (PRC2) components across eukaryotes. Their findings support the hypothesis that PRC2 emerged early in eukaryote evolution, likely being present in the last common eukaryotic ancestor, and provide some additional support for the idea that E(z) may have first evolved as a dual K9/K27 methyltransferase. Although neither of these ideas is new and both have been proposed by other studies, this study is certainly valuable in providing additional evidence to support them. I think this manuscript is certainly worthy of publication, but I have a number of issues and suggestions which I think should be addressed before acceptance:
Response: Thank you for your insightful comments and thorough review. We have tried to address each of the comments as detailed below. The major changes involve modification of the figures and rearrangement of the information included in the main text vs. the supplement. Based on the comments of Reviewer 1 and Reviewer 3, we have added a section on PcG-finder validation and analyses of the distribution of VEFS-Box proteins in the studied species.
1. My biggest concern with the manuscript is the quality of the data upon which the authors rely to make conclusions about presence/absence in different species and groups. The authors use BUSCO as a measure of the completeness of the genomic information they use. It is not clear to me, however, why having more than 50% completeness would be considered good. In fact, looking at the numbers for many of the genomes used, I would think that the BUSCO assessment shows that those genomes are very incomplete, making it quite likely that potential homologs could be missed in those species. I think the authors should use some cut-off to decide whether to include a species in their analysis at all. Some species have such a low score that they should not be included, but unfortunately for some groups this would remove most of the species. This is particularly important when the authors want to make claims about species or clades having no ortholog. Any decision on this will always be arbitrary, but I would say that 50% is a reasonable cutoff. This is not a major problem for the overall conclusions of the paper per se, as even in groups where some genomes are of low quality, the fact that homologs are or are not found across the clade is good evidence. To put this another way, even if all the genomes are of low quality, the chance that a homolog would be missed in many of them is low. It does, however, call into question some of the more specific claims made about presence/absence in specific species/clades. I think the authors should shy away from making major claims, particularly based on absence in species/clades where the quality of the genomes is low, and should rather stick to more high-level claims. For example, there are only 2 species in the Metamonada which have a BUSCO score higher than 50, and most have very low values, which would make any claims about this group less robust than those for other groups. You see that within the Discoba the only two species which have an ESC or E(z) are the species with a BUSCO score greater than 50.
I would be interested to see what conclusions can be drawn convincingly from the data if the authors were to recreate Figure 1 including only species with BUSCO > 50, or only species with BUSCO > 75.
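The comparison requested here can be illustrated with a minimal Python sketch that recomputes, per supergroup, the percentage of species in which each subunit is found after applying a BUSCO cutoff. The records, species names, and the function `presence_by_group` are hypothetical placeholders and are not taken from the manuscript or its supplementary tables; the cutoff is treated as inclusive ("50 or higher"), which is an assumption.

```python
from collections import defaultdict

SUBUNITS = ("E(z)", "ESC", "Su(z)12")

# Hypothetical records, for illustration only:
# (species, supergroup, BUSCO completeness in %, subunits detected)
RECORDS = [
    ("species_A", "Discoba", 82.0, {"E(z)", "ESC"}),
    ("species_B", "Discoba", 61.0, {"E(z)"}),
    ("species_C", "Discoba", 12.0, set()),
    ("species_D", "Metamonada", 55.0, set()),
    ("species_E", "Metamonada", 20.0, set()),
]


def presence_by_group(records, min_busco):
    """Return, per supergroup, the number of species passing the BUSCO
    cutoff and the percentage of those species with each subunit."""
    kept = defaultdict(list)
    for _name, group, busco, found in records:
        if busco >= min_busco:  # inclusive cutoff (assumption)
            kept[group].append(found)

    summary = {}
    for group, found_sets in kept.items():
        n = len(found_sets)
        row = {"n_species": n}
        for subunit in SUBUNITS:
            hits = sum(subunit in found for found in found_sets)
            row[subunit] = 100.0 * hits / n
        summary[group] = row
    return summary


if __name__ == "__main__":
    for cutoff in (0, 50, 75):
        print(f"BUSCO >= {cutoff}")
        for group, row in sorted(presence_by_group(RECORDS, cutoff).items()):
            print(f"  {group}: {row}")
```

Groups with no species passing the cutoff simply drop out of the summary, which makes it visible when an absence claim for a clade rests on very few, or no, sufficiently complete genomes.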