CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data

A systematic comparison of batch effect metrics for single cell data is performed. The new cell-specific mixing score from the R/Bioconductor CellMixS package performs well across various tasks.

--An editable version of the final text (.DOC or .DOCX) is needed for copyediting (no PDFs).
--High-resolution figure, supplementary figure and video files uploaded as individual files: See our detailed guidelines for preparing your production-ready images, https://www.life-sciencealliance.org/authors
--Summary blurb (enter in submission system): A short text summarizing in a single sentence the study (max. 200 characters including spaces). This text is used in conjunction with the titles of papers, hence should be informative and complementary to the title and running title. It should describe the context and significance of the findings for a general readership; it should be written in the present tense and refer to the work in the third person. Author names should not be mentioned.

B. MANUSCRIPT ORGANIZATION AND FORMATTING:
Full guidelines are available on our Instructions for Authors page, https://www.life-sciencealliance.org/authors
We encourage our authors to provide original source data, particularly uncropped/-processed electrophoretic blots and spreadsheets for the main figures of the manuscript. If you would like to add source data, we would welcome one PDF/Excel-file per figure for this information. These files will be linked online as supplementary "Source Data" files.
***IMPORTANT: It is Life Science Alliance policy that if requested, original data images must be made available. Failure to provide original images upon request will result in unavoidable delays in publication. Please ensure that you have access to all original microscopy and blot data images before submitting your revision.***

Reviewer #1 (Comments to the Authors (Required)):
The manuscript presented by Lutge and colleagues describes a new metric, cell-specific mixing score (csm), designed to detect and quantify batch effects in scRNA-seq data. The metric is based on the Anderson-Darling test, which is used to test the null hypothesis of "no batch effect".
In single-cell data analysis, it is important to detect batch effects and there are already several methods designed for this task. In some cases, the idea behind these methods is close to the one proposed here. However, it is probably true that a systematic comparison of batch mixing metrics in key tasks is relevant and has not been conducted yet.
Overall, this study is well constructed and presented, the manuscript well written and easy to follow. Real and simulated datasets are used to evaluate the csm method in distinct scenarios.
Here my comments: Authors evaluated the variance contribution of batch effects related to distinct sources, such as batch, cell type and interaction between them. In figure 1, they observed that most of the gene variances are attributed to batches and cell types and a lower percentage is related to their interaction. Given that, how can the authors motivate the fact that csm does not account for cell type assignment? Did authors evaluate if the cell type identity of the cell in the neighbourhood can affect the csm metric?
If distances are derived from the PCA space, the structure of the data is partially retrieved. In this way, the cells of the same type will be closer if we consider a scenario with no batch effect at a specific neighbourhood. In contrast, if a neighbourhood is affected by cell type-specific batch effects, cells of random cell type composition could be in that neighbourhood even if they are from distinct batches. Have authors considered this aspect?
About the comparison of metric scores for the task 1, authors say: "Most cell-specific metrics showed a plateau in their score towards higher batch strength, suggesting a maximal score has been reached and thus they cannot further discern the strength of a batch effect". I believe this is not optimal. However, authors have not commented on that. In principle, if the proportion of DE genes is very high and a plateau is observed, this is desirable. However, here the plateau seems to start quite early (~15%). This would mean that any of these metrics is able to reflect the real strength of the batch effect. I think this should be clarified. I understand that probably it is more important having high sensitivity for detecting batch effects. Authors should address this point and help the reading of these graphics.
In task 3 authors evaluate the sensitivity in detecting batch effects and, similarly to task 1, they also compare the ability of each metric to reflect the strength of the batch effect by increasing the batch logFC. Are these features equally important in a real context? Could the authors address this point. What is the impact of the scaling in a real scenario?
Are these tasks independent? Does task 3 (scaling) reflect task 1?
Minor comments:
The links to Table 1 (row 32), Figure S1 and Table S1 (row 56), Figure S2 (row 74) don't work for me.
Plots in Figures 1B and 1C require more explanations. I find it difficult to understand the meaning of the dotted lines and their corresponding percentage values.
In figure 3, there are many annotations at each tool-specific plot and this makes their understanding difficult. Probably, the lack of both x-axes does not help.

Reviewer #2 (Comments to the Authors (Required)): The authors discuss metrics / measures by which to assess / quantify the degree of batch effects affecting single cell RNA-seq experiments.
The authors suggest a new metric themselves. They also compare existing measures. They do this to a degree that leaves no further questions open. I therefore consider the paper a great guideline when trying to get control of batch effects that affect different runs of scRNA-seq experiments.
I only have (very) minor comments, and recommend to accept this paper.

MINOR:
Results:
* Figure 1A: would be preferable to have hca, cellbench, pancreas in one row, and so on
Discussion:
* Citation Crowell2019a broken

CellMixS reviewer's comments

Reviewer #1 (Comments to the Authors (Required)):
The manuscript presented by Lutge and colleagues describes a new metric, cell-specific mixing score (csm), designed to detect and quantify batch effects in scRNA-seq data. The metric is based on the Anderson-Darling test, which is used to test the null hypothesis of "no batch effect".
In single-cell data analysis, it is important to detect batch effects and there are already several methods designed for this task. In some cases, the idea behind these methods is close to the one proposed here. However, it is probably true that a systematic comparison of batch mixing metrics in key tasks is relevant and has not been conducted yet.
Overall, this study is well constructed and presented, the manuscript well written and easy to follow. Real and simulated datasets are used to evaluate the csm method in distinct scenarios.
Here my comments: Authors evaluated the variance contribution of batch effects related to distinct sources, such as batch, cell type and interaction between them. In figure 1, they observed that most of the gene variances are attributed to batches and cell types and a lower percentage is related to their interaction. Given that, how can the authors motivate the fact that csm does not account for cell type assignment? Did authors evaluate if the cell type identity of the cell in the neighbourhood can affect the csm metric?
If distances are derived from the PCA space, the structure of the data is partially retrieved. In this way, the cells of the same type will be closer if we consider a scenario with no batch effect at a specific neighbourhood. In contrast, if a neighbourhood is affected by cell type-specific batch effects, cells of random cell type composition could be in that neighbourhood even if they are from distinct batches. Have authors considered this aspect?
1st Authors' Response to Reviewers, February 2, 2021

We developed cms with the goal to provide a metric that was independent of any cell type assignment, motivated by the fact that cell type assignment can be affected by the batch effect itself. So, cms is cell type independent to prevent bias related to the method of cell type assignment or possibly misclassified cells. For example, a cell type dependent metric could give different results on the same data depending on the clustering method and parameters used.
As pointed out by the reviewer, a cell type-specific batch effect could lead to mixing of cells from different cell types and batches, which would not be detected by a metric without considering cell type labels. To address this concern we added the following sentence (bold) to the Discussion section: While cell type-specific metrics also provide local information, they depend on clustering and cell type assignment, which themselves can be affected by the batch effect; thus, it is desirable to have batch effect assessments that are independent of cell type assignment.

If cell type information exists, cell-specific metrics can also be run independently for each pre-determined cell type to assess interference of batch and cell type effects.
In Task 1, we tested metrics on datasets with batch effects varying in the cell type-specificity of the batch effect. Based on these results, we did not observe an advantage to including cell type information in the metric.
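The per-cell-type use of a cell-specific metric described above can be sketched as follows. This is only an illustration under stated assumptions: `neighbour_mixing` is a hypothetical stand-in metric (the fraction of a cell's k nearest neighbours that come from a different batch), not cms and not any function from the CellMixS package, and the coordinates, batch labels and cell type labels are made-up toy values.

```python
from collections import defaultdict
from math import dist


def neighbour_mixing(coords, batches, k):
    """Hypothetical stand-in metric (NOT cms): for each cell, the fraction
    of its k nearest neighbours that belong to a different batch."""
    scores = []
    for i, point in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: dist(point, coords[j]))
        nearest = others[:k]
        scores.append(sum(batches[j] != batches[i] for j in nearest) / len(nearest))
    return scores


def per_cell_type(coords, batches, cell_types, k):
    """Run the stand-in metric separately within each pre-determined cell type.
    Returns {cell_type: {cell_index: score}}; groups with < 2 cells are skipped."""
    groups = defaultdict(list)
    for idx, cell_type in enumerate(cell_types):
        groups[cell_type].append(idx)
    result = {}
    for cell_type, members in groups.items():
        if len(members) < 2:
            continue  # no neighbours available within this cell type
        sub_scores = neighbour_mixing(
            [coords[i] for i in members],
            [batches[i] for i in members],
            min(k, len(members) - 1),
        )
        result[cell_type] = dict(zip(members, sub_scores))
    return result


# Toy data (assumed): cell type "T" is well mixed, cell type "B" is batch-separated.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
batches = ["A", "B", "B", "A", "A", "A", "B", "B"]
cell_types = ["T", "T", "T", "T", "B", "B", "B", "B"]
scores_by_type = per_cell_type(coords, batches, cell_types, k=1)
```

Comparing the per-type scores with scores computed on all cells together is one way to see whether a batch effect interferes with the cell type structure.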
About the comparison of metric scores for the task 1, authors say: "Most cell-specific metrics showed a plateau in their score towards higher batch strength, suggesting a maximal score has been reached and thus they cannot further discern the strength of a batch effect". I believe this is not optimal. However, authors have not commented on that. In principle, if the proportion of DE genes is very high and a plateau is observed, this is desirable. However, here the plateau seems to start quite early (~15%). This would mean that any of these metrics is able to reflect the real strength of the batch effect. I think this should be clarified. I understand that probably it is more important having high sensitivity for detecting batch effects. Authors should address this point and help the reading of these graphics.
We agree that the plateau shown by cell-specific metrics in Task 1 is an important observation that should be clarified. We added the following paragraph (bold) to the Results section for Task 1 to give the reader more context for these results: Most cell-specific metrics showed a plateau in their score towards higher batch strength, suggesting a maximal score has been reached and thus they cannot further discern the strength of a batch effect.
In Figure S2,

While cell-specific metrics that only consider each cell's neighbourhood get saturated at their nominal minimum (from the csf_patient dataset onwards), their summarized score still reflects the overall order of datasets based on batch strength measures.
We also changed the caption of Figure 3 to help describe the corresponding graphic to the following (changes in bold):

Task 1 - Reflection of batch characteristics: Metric scores versus (surrogate) batch strength across the real datasets. Summarized metric scores (y-axis) are compared to the proportion of DE genes (top x-axis, solid line) and the mean PVE-Batch (bottom x-axis, dashed line) per dataset. Datasets with a stronger batch effect (high percentage of DE genes/mean PVE-Batch) are expected to show a higher overall metric score than datasets with mild batch effects (low percentage of DE genes/mean PVE-Batch). Spearman correlation coefficients of metrics against the two batch strength measures are shown (R_PVE-Batch, R_DE) in the text boxes for each metric and evaluated in Task 1. Metric scores were standardized by subtraction of their minimum and division of their range (maximum - minimum) across datasets. Directions were adjusted when necessary, such that all scores increase with batch strength.
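The standardization and rank correlation described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function names, the `higher_means_stronger` flag and the example numbers are assumptions introduced here.

```python
def standardize(scores, higher_means_stronger=True):
    """Min-max standardize scores across datasets: (score - min) / (max - min).
    If the raw metric decreases with batch strength, flip the direction so
    that all standardized scores increase with batch strength."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in scores]  # constant metric carries no ordering
    scaled = [(s - lo) / span for s in scores]
    if not higher_means_stronger:
        scaled = [1.0 - s for s in scaled]
    return scaled


def ranks(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Made-up example: a metric whose raw score decreases with batch strength,
# compared against a made-up percentage of DE genes per dataset.
raw_scores = [0.9, 0.7, 0.4, 0.1]
pct_de_genes = [2.0, 5.0, 15.0, 40.0]
standardized = standardize(raw_scores, higher_means_stronger=False)
rho = spearman(standardized, pct_de_genes)
```

Because Spearman correlation depends only on ranks, the min-max step does not change its magnitude; the standardization matters for plotting metrics on a common axis, and the direction flip sets the sign.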
In task 3 authors evaluate the sensitivity in detecting batch effects and, similarly to task 1, they also compare the ability of each metric to reflect the strength of the batch effect by increasing the batch logFC. Are these features equally important in a real context? Could the authors address this point. What is the impact of the scaling in a real scenario?
Are these tasks independent? Does task 3 (scaling) reflect task 1?
We agree that a metric's ability to scale with the strength of a batch effect and its sensitivity are complementary aspects and both should be considered when interpreting a metric's result.
While it is desirable to have a sensitive metric that detects any bias related to a batch effect, not every structure related to the batch is a real confounder of the signal of interest. For example, a mild batch effect might confound the within cell type structure, but not cell type clusters themselves. Thus, the batch effect does not need to be considered for cell type assignment, but becomes relevant at the level of clustering to cell identity. As the relevance of a batch effect is context dependent, it is important for a metric to be sensitive, but also interpretable with regard to the severity of the batch effect. In Task 1, we evaluate whether the metrics reflect batch strength related characteristics across datasets and in Task 3, we evaluate the metrics' ability to scale within the same dataset. The latter is particularly relevant in benchmarks for batch correction methods or when a batch effect before and after correction is compared.
To expand upon these considerations about sensitivity and scaling of metrics, we edited (bold) the following paragraphs in the manuscript:
Results section "Comparison of batch mixing metrics": Altogether, we designed 5 benchmark tasks to cover the most relevant use cases of these metrics (see Table 2 for short descriptions). One major application of these metrics is to assess the severity of a batch effect and thus reflect the level of confounding. For example, a larger score should result from a stronger batch effect across datasets (Task 1).

Test whether metrics reflect batch strength / confounding across datasets
Results section "Task 1: Reflection of batch characteristics": In this task, we tested a metric's ability to discriminate between a strong and a mild batch effect across datasets. This is an important feature of these metrics as the impact of a batch effect is context-specific and depends on how strongly interesting data characteristics are confounded. To test this, we used the batch characteristics and datasets explored above.
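Minor comments: The links to Table 1 (row 32), Figure S1 and Table S1 (row 56), Figure S2 (row 74) don't work for me.
Thanks a lot for pointing this out. It seems to be related to the conversion of the .tex files. We will pay attention when uploading the improved version.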
Plots in Figures 1B and 1C require more explanations. I find it difficult to understand the meaning of the dotted lines and their corresponding percentage values.
We changed the caption of Figure 1 to: … B,C) Batch logFC distribution by cell type and batch effect in the cellbench and hca datasets, respectively. Each column represents a density plot of the estimated logFCs for a batch / cell type combination. Dotted lines indicate the mean and the 25th, 50th and 75th percentiles.
In figure 3, there are many annotations at each tool-specific plot and this makes their understanding difficult. Probably, the lack of both x-axes does not help.
We added axis lines and marks to Figure 3.

Reviewer #2 (Comments to the Authors (Required)):
The authors discuss metrics / measures by which to assess / quantify the degree of batch effects affecting single cell RNA-seq experiments.
The authors suggest a new metric themselves. They also compare existing measures.
They do this to a degree that leaves no further questions open.
I therefore consider the paper a great guideline when trying to get control of batch effects that affect different runs of scRNA-seq experiments.
I only have (very) minor comments, and recommend to accept this paper.

MINOR:
Results:
* Figure 1A: would be preferable to have hca, cellbench, pancreas in one row, and so on
As suggested, we changed the order of datasets in Figure 1A to a more meaningful order with batches related to the same origin (sequencing protocols, patients, media storage) in the same row.
Discussion:
* Citation Crowell2019a broken
Thanks for pointing this out. We fixed the citation.

Thank you for submitting your revised manuscript entitled "CellMixS: quantifying and visualizing batch effects in single cell RNA-seq data". We would be happy to publish your paper in Life Science Alliance pending final revisions necessary to meet our formatting guidelines.
Along with the points listed below, please also attend to the following:
-please consult our manuscript preparation guidelines https://www.life-sciencealliance.org/manuscript-prep and make sure your manuscript sections are in the correct order
-please make sure the author order in your manuscript and our system match, add all Contributing Authors in our system
-please add ORCID ID for secondary corresponding author; they should have received instructions on how to do so
-please add a Category for your manuscript in our system
-please upload your main and supplementary figures as single files
-please add callouts for Figures S4A,B
If you are planning a press release on your work, please inform us immediately to allow informing our production team and scheduling a release date.
To upload the final version of your manuscript, please log in to your account: https://lsa.msubmit.net/cgi-bin/main.plex You will be guided to complete the submission of your revised manuscript and to fill in all necessary information. Please get in touch in case you do not know or remember your login name.
To avoid unnecessary delays in the acceptance and publication of your paper, please read the following information carefully.

A. FINAL FILES:
These items are required for acceptance.
--An editable version of the final text (.DOC or .DOCX) is needed for copyediting (no PDFs).
--High-resolution figure, supplementary figure and video files uploaded as individual files: See our detailed guidelines for preparing your production-ready images, https://www.life-sciencealliance.org/authors
--Summary blurb (enter in submission system): A short text summarizing in a single sentence the study (max. 200 characters including spaces). This text is used in conjunction with the titles of papers, hence should be informative and complementary to the title. It should describe the context and significance of the findings for a general readership; it should be written in the present tense and refer to the work in the third person. Author names should not be mentioned.

B. MANUSCRIPT ORGANIZATION AND FORMATTING:
Full guidelines are available on our Instructions for Authors page, https://www.life-sciencealliance.org/authors
We encourage our authors to provide original source data, particularly uncropped/-processed electrophoretic blots and spreadsheets for the main figures of the manuscript. If you would like to add source data, we would welcome one PDF/Excel-file per figure for this information. These files will be linked online as supplementary "Source Data" files.
**Submission of a paper that does not conform to Life Science Alliance guidelines will delay the acceptance of your manuscript.**
**It is Life Science Alliance policy that if requested, original data images must be made available to the editors. Failure to provide original images upon request will result in unavoidable delays in publication. Please ensure that you have access to all original data images prior to final submission.**
**The license to publish form must be signed before your manuscript can be sent to production. A link to the electronic license to publish form will be sent to the corresponding author only. Please take a moment to check your funder requirements.**
**Reviews, decision letters, and point-by-point responses associated with peer-review at Life Science Alliance will be published online, alongside the manuscript. If you do want to opt out of having the reviewer reports and your point-by-point responses displayed, please let us know immediately.**
Thank you for your attention to these final processing requirements. Please revise and format the manuscript and upload materials within 7 days.
Thank you for this interesting contribution, we look forward to publishing your paper in Life Science