Transmembrane protease serine 2 (TMPRSS2) rs75603675, comorbidity, and sex are the primary predictors of COVID-19 severity

The TMPRSS2 rs75603675 genotype (OR = 0.586), dyslipidemia (OR = 2.289), sex (OR = 0.586), and the Charlson Comorbidity Index (OR = 1.126) were identified as the main predictors of COVID-19 severity in 817 patients who attended Hospital Universitario de La Princesa during March and April 2020.


Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic started in Wuhan, China, in December 2019. This virus causes coronavirus disease 2019 (COVID-19). By the end of December 2021, more than 271 million cases and 5.3 million deaths had been reported (1). Although vaccination is a proven effective strategy for pandemic control, it is not yet equally available in all countries of the world (2). Even in some developed countries, the slowdown in vaccination hinders the achievement of herd immunity (3,4). Hence, reaching worldwide herd immunity seems highly unlikely in the medium to long term, and the virus is expected to remain a health problem in the coming months and years. Therefore, the authorization of effective therapies and the identification of prognostic biomarkers remain crucial to manage COVID-19 patients more rationally. This is of particular importance because new strains continue to emerge that may be more infectious, could cause more severe disease and, above all, could escape the protection of vaccines. This could be the case for the emerging omicron strain (B.1.1.529), a novel variant of concern (5).
Although several studies evaluating genetic biomarkers associated with COVID-19 severity have been published to date, most were exploratory and showed heterogeneous results, and a clinically relevant genetic biomarker has not yet been described. The first candidates were genes involved in viral entry into the host cell, such as the angiotensin-converting enzyme 2 gene (ACE2) and the transmembrane serine protease 2 gene (TMPRSS2). Different research groups have suggested several candidate variants of ACE2, namely rs2285666 or rs4646116 (6,7). For TMPRSS2, variants such as rs2298659, rs17854725, rs12329760, and rs75603675 were found to differ in frequency in populations more affected by the disease (8,9). However, a later genome-wide association study (GWAS) of severe COVID-19 with respiratory failure reported two gene clusters associated with two different polymorphisms: rs11385942 in the leucine zipper transcription factor like 1 gene (LZTFL1) and rs657152 in the ABO gene (10). Because of the disparity of the observed findings, additional confirmatory and exploratory studies are warranted. The aim of this work was to review the single-nucleotide polymorphisms (SNPs) related to COVID-19 prognosis or severity published by the end of 2020 and to evaluate them in an independent validation cohort. For this purpose, we genotyped 817 patients managed at Hospital Universitario de La Princesa for a panel of 120 SNPs selected based on an extensive literature search.

Results
The population consisted of 817 patients, 453 (55.45%) males and 364 (44.55%) females. Age ranged from 19 to 97 yr, with a mean of 60 yr. The baseline characteristics of the study population are shown in Table 1. The biogeographical origin of patients was inferred from their country of birth: 636 were European, 161 were American, 7 were East Asian, 6 were Near Eastern, and 1 was Central/South Asian. At the first hospital visit, most patients were symptomatic and required hospitalization with oxygen supplementation (WHOCS-1 = 4, 51.38%), followed by asymptomatic or mild patients (WHOCS-1 = 1 and 2, 27.92%) and by symptomatic patients without need for oxygen supplementation (WHOCS-1 = 3, 19.74%). As for the severest clinical situation, 77 died or required ICU admission (WHOCS-2 = 6-7, 9.27%), and the remaining severity groups were distributed similarly to WHOCS-1. Most patients presented a CCI of 2-8 (93.50%), patients with CCI = 1 accounted for 3.97%, and patients with a CCI between 9 and 13 accounted for 2.53% of the population.
Treatments received before the first emergency room visit and during hospital admission are described in Table 2. The most frequently prescribed treatments before the first emergency room visit were ACE inhibitors (ACEIs) and angiotensin II receptor blockers (ARA-II), received by 13.22% and 15.06% of patients, respectively. The most frequently used treatments for the management of the disease were hydroxychloroquine or chloroquine (74.30%), heparin (56.18%), the lopinavir/ritonavir combination (49.69%), and corticoids (43.33%).
The univariate analysis of severity 1 and 2 variables is shown in Table S1, including a summary of nominally significant variables and those that reached a corrected P < 0.05, which were included in the multivariate analysis. Biogeographical group was not significant. Sex (females compared with males, OR = 0.586), a higher CCI (OR = 1.126) (covariates), dyslipidemia (OR = 2.289), and TMPRSS2 rs75603675 (C/C or C/A diplotypes, compared with the A/A diplotype) (OR = 2.140) were significantly related to severity 1 and 2 status after multivariate analysis and Bonferroni correction for multiple comparisons (Fig 1). The univariate analysis of WHOCS-1 and WHOCS-2 is shown in Table S1, including a summary of nominally significant variables and those that reached a corrected P < 0.05, which were included in the multivariate analysis. The same variables identified in the multivariate analysis of severities 1 and 2 were also observed in the multivariate analysis of WHOCS-1 and WHOCS-2 (Table 3).
Based on the estimates obtained from the multivariate analysis, the equations shown in Table 4 were proposed to calculate WHOCS-1 and WHOCS-2 scores in infected patients.
The remaining variables, that is, drugs used before COVID-19 infection, drugs used for the treatment of the disease, the remaining polymorphisms, etc., were unrelated to disease severity in both analyses.

Discussion
The scientific community has made a strenuous effort since December 2019 to investigate biomarkers that predict the risk and severity of infection. A large number of studies have been published in this regard, including reviews and systematic reviews (11). Although there is some consensus on which biomarkers can track disease progression (mainly pro-inflammatory cytokines, ferritin, etc.), there is no biomarker that can predict disease progression from baseline. Baseline health status and demographic characteristics, including sex, age, and comorbidities, have been described as the main predisposing factors (12,13,14,15,16). However, a percentage of severity is not explained by these factors (12,13). Genetic polymorphisms may explain part of this susceptibility, which has resulted in dozens of publications proposing SNPs and other genetic alterations associated with susceptibility to COVID-19 infection and severity (Table 5). However, to our knowledge, very few studies have validated these associations and their potential clinical relevance. Our intention was, therefore, to design a panel of polymorphisms and validate their usefulness in an independent set of patients.
To prevent bias, first, we proposed a very strict statistical analysis to avoid obtaining spurious results. Second, we proposed to correct for confounding factors in all the statistical tests performed and therefore decided to consider the CCI, which includes known COVID-19 severity predictors such as age and obesity, and sex as covariates. Third, we decided not to analyze some variables dependent on the pandemic situation at the time of recruitment. For example, we did not consider ICU admission as a valid measure of COVID-19 severity because admission was restricted by hospital collapse. In other words, some patients who reached sufficient severity to merit admission to an intensive care unit were not admitted because no bed was available.
Our findings on the predictors of COVID-19 severity are consistent with previous publications. The CCI was previously related to COVID-19 prognosis (89,90), which is consistent with the correlation observed in this work between a higher CCI and higher WHOCS-1 and -2 scores. Furthermore, men become infected, require ICU admission and mechanical ventilation, and die more frequently than women (91), which is consistent with the protective effect observed here for the female sex regarding WHOCS-1 and, more intensely, WHOCS-2. Additional studies are required to determine the underlying differences behind this sexual dimorphism. Moreover, dyslipidemia was previously related to severe COVID-19 prognosis (92), which is consistent with our findings, where the presence of dyslipidemia was related to WHOCS-1 and WHOCS-2 scores that were 0.579 and 0.551 points higher, respectively. Finally, TMPRSS2 rs75603675 C/C or C/A diplotypes, compared with the A/A diplotype, were related to WHOCS-1 and WHOCS-2 scores that were 0.591 and 0.405 points higher, respectively. The contribution of this polymorphism requires further discussion.
The transmembrane serine protease 2, encoded by the TMPRSS2 gene, participates in several physiological and pathological processes and is up- and down-regulated by several hormonal processes. It is used by several viruses to enter host cells, including the influenza virus and the human coronaviruses HCoV-229E, MERS-CoV, SARS-CoV, and SARS-CoV-2 (93,94). Genetic polymorphisms of this gene have been described to affect disease severity. In particular, the p.Val197Met (rs12329760) variant is defined as deleterious and was previously reported to have a protective effect on patients (95). Here, this variant had no effect on WHOCS-1 or -2 variability, whereas the p.Gly8Val (rs75603675) missense variant (C > A) was significantly associated with WHOCS-1 and WHOCS-2 scores. In contrast, in one study, the prevalence of the TMPRSS2 rs75603675 A allele was similar in infected and uninfected individuals (96). Unfortunately, that article provides no information about infection severity. Therefore, to the best of our knowledge, this is the first work to propose that this variant has a significant impact on COVID-19 prognosis and severity. The association is probably explained by down-regulation of the protein in TMPRSS2 rs75603675 A allele carriers, or by the expression of a protein with a structural change that causes a less specific or impaired interaction with viral proteins and therefore a less efficient internalization of viruses into human cells. This is congruent with the observations of Latini et al: the TMPRSS2 rs75603675 (C > A) A allele was found at a significantly lower frequency in populations that suffered more severe cases of COVID-19 (57), which suggested a possible protective effect against COVID-19 infection.
Other nominally significant associations were established between WHOCS scales and IFNL4 rs12979860, ACEIs, obesity, ARA-II, HCP5 rs2395029, ACE rs1799752, DPP4 rs17574, HMOX1 rs2071746, IL10 rs1800896, IL6 rs1818879, NFKB1 rs28362491, and MX1 rs469390. Although some of these associations were previously observed, others were also persistently rejected. For instance, the use of renin-angiotensin-aldosterone system (RAAS) inhibitors was proposed to cause the up-regulation of ACE2, the receptor for the SARS-CoV-2 S protein, which enables cell infection; therefore, the use of these drugs was related to a higher risk of COVID-19 infection and worse prognosis (97,98,99,100,101,102,103). However, a bias was identified in this assumption, as it was the conditions associated with the use of RAAS inhibitors (e.g., hypertension or heart disease) that actually led to a higher risk of worse prognosis (104). Consequently, we decided to use a sufficiently strict statistical approach to control for this bias. Hence, we considered all the associations mentioned above as negative and not worth discussing, because doing so would contribute to confusion. One noteworthy disparity with our study is the GWAS by the Severe COVID-19 GWAS Group (18), which reports genetic markers different from ours. The explanation we can offer for this discrepancy is that the outcome variable is different: our analysis focuses on all COVID-19 severity stages, whereas theirs focuses on the most severe cases. Thus, our findings may explain the susceptibility to infection in the early stages of the disease, whereas other factors such as the 3p21.31 gene cluster and the ABO blood-group system may determine the terminal status of COVID-19. More research is clearly needed to clarify the true effect of the TMPRSS2 rs75603675 polymorphism on the outcome of this disease.
With this work, we integrate multiple clinical and genetic factors into COVID-19 severity prediction. The CCI, a variable that captures the comorbidity of patients well, is used, and the presence or absence of dyslipidemia complements it. With this information alone, for instance, significantly higher WHOCS-1 and -2 scores can be predicted for a patient with dyslipidemia and a CCI of 5 than for a patient without dyslipidemia and a CCI of 1. Although it was previously known that patients with more comorbidities have a worse COVID-19 prognosis, our scale represents a clinically relevant tool to better manage patients because it is clear and numerical. Furthermore, sex and TMPRSS2 rs75603675 stand as additional clinically relevant predictors of COVID-19 severity, which were included in the scales. Continuing with the previous example, if the first patient carried the C/C diplotype and was male and the second patient was female and carried the A/A diplotype, the predicted WHOCS-1 and WHOCS-2 scores would be 3.637 and 4.153 for the first patient and 1.737 and 2.302 for the second. These predictions are related to specific clinical requirements (e.g., hospitalization or oxygen supplementation). This quantitative measurement could help in the optimization of clinical resources.
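The arithmetic behind this example can be made explicit. As a sketch only, assume that the equations in Table 4 are additive linear predictors with the structure below, and that the dyslipidemia and genotype coefficients equal the score differences reported earlier in the Discussion (0.579 and 0.591 for WHOCS-1; 0.551 and 0.405 for WHOCS-2). The β symbols and the 0/1 indicator coding I(·) are notation introduced here for illustration and are not taken from Table 4.

```latex
\begin{align*}
\widehat{\mathrm{WHOCS}} &= \beta_0 + \beta_{\mathrm{CCI}}\,\mathrm{CCI}
  + \beta_{\mathrm{sex}}\,I(\text{male})
  + \beta_{\mathrm{dys}}\,I(\text{dyslipidemia})
  + \beta_{\mathrm{geno}}\,I(\text{C/C or C/A}) \\
\Delta_{\text{WHOCS-1}} &= 3.637 - 1.737 = 1.900
  = 4\,\beta_{\mathrm{CCI}} + \beta_{\mathrm{sex}} + 0.579 + 0.591
  \;\Rightarrow\; 4\,\beta_{\mathrm{CCI}} + \beta_{\mathrm{sex}} = 0.730 \\
\Delta_{\text{WHOCS-2}} &= 4.153 - 2.302 = 1.851
  = 4\,\beta_{\mathrm{CCI}} + \beta_{\mathrm{sex}} + 0.551 + 0.405
  \;\Rightarrow\; 4\,\beta_{\mathrm{CCI}} + \beta_{\mathrm{sex}} = 0.895
\end{align*}
```

Under these assumptions, the four additional CCI points and male sex together would account for roughly 0.73 (WHOCS-1) and 0.90 (WHOCS-2) scale points of the gap between the two patients, with dyslipidemia and genotype accounting for the rest. The individual CCI and sex coefficients cannot be separated from this single example.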
Despite the merits of this work, it presents limitations that should be considered. First, patients were recruited during the first COVID-19 wave, although this may be considered a strength rather than a limitation. As this work was carried out with patients infected during March-April 2020, the circulating strains were different from those circulating now. In this sense, the conclusions regarding TMPRSS2 rs75603675 should be confirmed in the currently circulating strains. Nonetheless, new strains, such as the omicron variant, could be more infectious or pathogenic, with the underlying mechanism of such pathogenicity being an enhanced interaction between viral antigens and TMPRSS2. This interaction could be affected by genetic variants of this gene and cause an even greater difference in severity with new strains. In any case, new studies should confirm the relevance of TMPRSS2 rs75603675 in new circulating variants. Another limitation is that hospital protocols were severely affected by the emergency healthcare situation of the first COVID-19 wave, and adequacy of the therapeutic effort was needed. This, in combination with the retrospective nature of the study, produced a relative scarcity of severe and asymptomatic individuals. Consequently, the severity distribution might be skewed. In addition, the analysis of WHOCS-1 and WHOCS-2 by means of the generalized linear model assumes a normal distribution of these variables. To avoid bias, we performed a strict statistical analysis, in exchange for accepting a greater type II error. Moreover, the fact that the exact same variables were identified in the analysis of severity 1 and 2 variables is reassuring, along with the strict control for type I error.

Conclusions
The TMPRSS2 rs75603675 C/C or C/A genotypes compared with A/A, males compared with females, the presence of dyslipidemia, and a higher CCI score were associated with more severe COVID-19, at the first hospital visit and at the most severe point of disease progression. The integration of all these variables into the proposed equations could be a useful clinical tool for the rational management of patients with COVID-19. To our knowledge, this is the first work to propose a similar tool that integrates genetic data capable of predicting the prognosis of COVID-19.

Study design and participants
This study was designed with an observational and retrospective approach. A total of 817 patients with COVID-19 who attended the emergency department of the Hospital Universitario de La Princesa between 29 March and 29 April 2020 were recruited. Both inpatients and outpatients were considered. Patients were recruited consecutively according to their first visit to the emergency department. To avoid imbalances in the proportions of the severity groups because of the retrospective and consecutive nature of recruitment, checkpoints were applied to prioritize the selection of under-represented patients, that is, severe and mild patients. The first checkpoint was performed at the beginning of May 2021, with 617 patients recruited: 83 (13.5%) mild, 466 (75.5%) moderate, and 68 (11%) severe; the second was performed at the end of May 2021, with 743 patients: 159 (21.5%) mild, 502 (67.5%) moderate, and 82 (11%) severe. The Ethics Research Committee of Hospital Universitario de La Princesa approved the study protocol. All subjects provided informed consent, except for the deceased. Living participants were scheduled for sampling at the Department of Internal Medicine of Hospital Universitario de La Princesa; stored samples were retrieved for the deceased patients. The research complied with Spanish and European legislation on biomedical research and with the revised Declaration of Helsinki.

Variables
Hospital and primary care medical records were used to retrieve disease and clinical information. The main outcome (dependent variable) was a modified version of the 7-point World Health Organization (WHO) COVID-19 severity scale (WHOCS) (105). Briefly, individuals are classified according to the following COVID-19 severity groups: (1) infected, asymptomatic; (2) symptomatic not requiring hospitalization; (3) COVID-19 requiring hospitalization without oxygen supplementation; (4) oxygen supplementation with mask or nasal prongs; (5) noninvasive ventilation or high flow oxygen; (6) intubation and mechanical ventilation in an intensive care unit (ICU); and (7) death. This scale was evaluated at the time of the first hospital examination (WHOCS-1) and at the point of greatest severity (WHOCS-2). For this work, WHOCS levels were grouped as follows: levels 1-2 were considered mild severity, levels 3-4 were considered moderate severity, and levels 5 or greater were considered severe COVID-19. The resulting variables were named "severity-1" and "severity-2," that is, transformed variables of the severity at admission and at the worst severity status.
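The grouping of WHOCS levels into the severity variables can be written as a simple lookup. The following Python sketch is illustrative only (the study's analysis was carried out in R); the function name `severity_group` and the string labels are assumptions introduced here.

```python
def severity_group(whocs: int) -> str:
    """Map a WHOCS level (1-7) to the grouped severity category.

    Levels 1-2 -> "mild", 3-4 -> "moderate", 5 or greater -> "severe",
    following the grouping described in the text.
    """
    if not 1 <= whocs <= 7:
        raise ValueError("WHOCS level must be an integer between 1 and 7")
    if whocs <= 2:
        return "mild"
    if whocs <= 4:
        return "moderate"
    return "severe"


# Applying the same mapping to WHOCS-1 and WHOCS-2 yields severity-1 and severity-2.
patient = {"whocs_1": 4, "whocs_2": 6}  # hypothetical patient
severity_1 = severity_group(patient["whocs_1"])
severity_2 = severity_group(patient["whocs_2"])
```

For this hypothetical patient, severity-1 is moderate (oxygen supplementation at the first visit) and severity-2 is severe (mechanical ventilation at the worst point).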

Genotyping
Articles evaluating SNPs and COVID-19 published from January 2020 to December 2020 were searched. Moreover, articles evaluating polymorphisms in important genes related to COVID-19 (e.g., ACE2, ACE, IL-6, or IFNs) or the coagulation cascade (e.g., F11, CRP) were screened, even if they were not published during the pandemic. Finally, variants related to the safety and effectiveness of the drugs used for the treatment of the disease were included, that is, pharmacogenetic variants. Sample collection occurred after discharge of the patients, from January to May 2021. Patients who gave informed consent provided 5 ml of blood collected in an EDTA K2 tube. For deceased patients, samples were retrieved from the stored collection. Genomic DNA was extracted from peripheral blood samples with a Maxwell RSC automated DNA extractor (Promega) following the manufacturer's instructions. For genotype analysis, a QuantStudio 12K flex thermal cycler along with an OpenArray thermal block was used (Thermo Fisher Scientific). A customized genotyping array was designed with the variants shown in Table 5, which also includes the references justifying their inclusion.

Statistical analysis
An online tool for sample size calculation was used (Sample Size Calculator available at https://riskcalc.org/samplesize/). The study was a retrospective observational cohort study. The α or type I error rate was set at 0.05, and the power, or 1-β, was set at 0.9. Exposure was defined as the presence or absence of one or several SNPs related to severe COVID-19. The event to be analyzed was severe COVID-19, defined as WHO COVID-19 severities 5, 6, and 7. The k value (unexposed to exposed ratio) was set depending on the SNP prevalence. For SNPs with low prevalence (e.g., 5%), k = 20. For SNPs with 20% prevalence, k = 4. For SNPs with 40% prevalence, k = 1.5. Assuming a probability of the event of 0.4 (40%) in the unexposed group (P0) and of 0.55 (55%) in the risk group (P1), the following sample sizes were required: 2,667 for k = 20 (127 exposed and 2,540 unexposed patients), 760 for k = 4 (152 exposed and 608 unexposed patients), and 510 for k = 1.5 (204 exposed and 306 unexposed). Therefore, a sample size between 510 and 2,667 was considered. Assuming greater differences (e.g., P0 = 0.3 and P1 = 0.7), considerably smaller sample sizes were required. Finally, the sample size was determined according to the capacity of the genotyping platform, the available budget, and the estimations above. A genotyping array containing 120 SNPs with a capacity for 920 samples was designed and purchased. The full 920-sample genotyping capacity could not be used because of failures or the need for repetitions.
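The reported sample sizes can be checked with a short calculation. The formula used by the online tool is not stated in the text; the Python sketch below assumes the standard two-proportion cohort formula with Fleiss continuity correction, which appears to match the totals reported (510, 760, and 2,667). The function name `cohort_sample_size` and its parameters are introduced here for illustration.

```python
import math
from statistics import NormalDist


def cohort_sample_size(p0, p1, k, alpha=0.05, power=0.9):
    """Exposed and unexposed group sizes for a two-sided comparison of proportions.

    p0: event probability in the unexposed group
    p1: event probability in the exposed group
    k:  ratio of unexposed to exposed patients
    Assumes the Fleiss formula with continuity correction (an assumption,
    not confirmed as the online tool's method).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    delta = abs(p1 - p0)
    p_bar = (p1 + k * p0) / (1 + k)  # pooled event probability

    # Uncorrected size of the exposed group
    n = (
        z_alpha * math.sqrt((1 + 1 / k) * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p0 * (1 - p0) / k)
    ) ** 2 / delta ** 2

    # Fleiss continuity correction
    n_cc = n / 4 * (1 + math.sqrt(1 + 2 * (k + 1) / (n * k * delta))) ** 2

    exposed = math.ceil(n_cc)
    unexposed = math.ceil(k * exposed)
    return exposed, unexposed


for k in (1.5, 4, 20):
    exposed, unexposed = cohort_sample_size(p0=0.4, p1=0.55, k=k)
    print(f"k = {k}: {exposed} exposed + {unexposed} unexposed = {exposed + unexposed}")
```

With P0 = 0.4 and P1 = 0.55, this gives 204 + 306 = 510 for k = 1.5, 152 + 608 = 760 for k = 4, and 127 + 2,540 = 2,667 for k = 20.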
Statistical analysis was conducted in R (107). To analyze severity 1 and 2 variables, a univariate analysis was initially performed by ordinal logistic regression with the MASS package (108). The following independent variables were explored: genetic variables (all SNPs were transformed into dummy variables, that is, grouping heterozygous subjects with the most frequent homozygous diplotype), biogeographic group, previous treatments, and other comorbidities (e.g., dyslipidemia or tobacco use). CCI and sex were included as covariates, and the level of significance was adjusted based on the Bonferroni correction for multiple comparisons. Those variables with a corrected P < 0.05 were introduced as independent variables in a multiple logistic regression analysis of severities 1 and 2, that is, the multivariate analysis. In this analysis, significance was similarly adjusted based on the Bonferroni correction for multiple comparisons.
A secondary analysis of WHOCS-1 and WHOCS-2 was performed to provide a predictive equation of disease severity. For the univariate analysis, a generalized linear model was fitted by means of individual linear regressions, with the CCI and sex as covariates and applying the Bonferroni correction for multiple comparisons. Those variables with a corrected P < 0.05 were introduced as independent variables in the multivariate analysis, in this case by means of a generalized linear model, again with CCI and sex as covariates.
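The dummy coding of SNPs and the Bonferroni adjustment described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' R code: genotypes are assumed to be biallelic and written as "X/Y" strings, and the function names `dummy_code` and `bonferroni` as well as the example data are hypothetical.

```python
from collections import Counter


def dummy_code(genotypes):
    """Code each genotype as 0 or 1.

    0: most frequent homozygous diplotype, plus heterozygotes (grouped with it)
    1: the other homozygous diplotype
    Assumes a biallelic SNP with genotypes written like "C/A".
    """
    homozygous = [g for g in genotypes if len(set(g.split("/"))) == 1]
    reference = Counter(homozygous).most_common(1)[0][0]
    ref_allele = reference.split("/")[0]
    # Any genotype carrying the reference allele falls in the reference group.
    return [0 if ref_allele in g.split("/") else 1 for g in genotypes]


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical genotypes for one SNP: C/C is the most frequent homozygote,
# so C/C and C/A are grouped (0) and compared with A/A (1).
codes = dummy_code(["C/C", "C/A", "A/A", "C/C", "A/C"])

# Hypothetical nominal P values from three univariate tests.
corrected = bonferroni([0.001, 0.04, 0.5])
retained = [p < 0.05 for p in corrected]
```

In this example only the first variable keeps a corrected P below 0.05 and would be carried into the multivariate model; the second is nominally significant but does not survive correction.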

Availability of Data and Materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
grants of the year 2020, Fondo Supera COVID-19 from Banco de Santander and CRUE (grant Predinmun-COVID) to I de los Santos, I González-Álvaro, and E Fernández-Ruiz and Instituto de Salud Carlos III (ISCIII) from the Spanish Ministry of Science Innovation and Universities and the European Regional Development Fund (ISCIII-FEDER) "A way to achieve Europe." G Villapalos-García is supported by a PFIS predoctoral grant (FI20/00090), and P Zubiaur's contract with CIBERehd is financed by the "Infraestructura de Medicina de Precisión asociada a la Ciencia y Tecnología (IMPaCT, IMP/00009)" (ISCIII). S Fernández de Córdoba-Oñate and P Delgado-Wicke were supported by Predinmun-COVID and PI19/00096 grants, respectively.

Author Contributions
G Villapalos-Garcia: conceptualization, resources, data curation, formal analysis, supervision, investigation, visualization, methodology, and writing-original draft, review, and editing. P Zubiaur: conceptualization, resources, data curation, formal analysis, supervision, investigation, visualization, methodology, project administration, and writing-original draft, review, and editing. R Rivas-Duran: data curation, investigation, and writing-review and editing. P Campos-Norte: data curation, investigation, and writing-review and editing. C Arevalo-Roman: data curation, investigation, and writing-review and editing. M Fernandez-Rico: data curation, investigation, and writing-review and editing. L Garcia-Fraile Fraile: conceptualization, data curation, visualization, methodology, and writing-review and editing. P Fernadez-Campos: investigation and writing-review and editing. P Soria-Chacartegui: investigation and writing-review and editing. S Fernandez de Cordoba-Onate: resources and writing-review and editing. P Delgado-Wicke: resources and writing-review and editing. E Fernandez-Ruiz: resources and writing-review and editing. I Gonzalez-Alvaro: resources, methodology, and writing-review and editing. J Sanz: resources and writing-review and editing. F Abad-Santos: conceptualization, resources, supervision, funding acquisition, investigation, visualization, methodology, project administration, and writing-review and editing. I de los Santos: conceptualization, resources, supervision, funding acquisition, investigation, visualization, methodology, project administration, and writing-review and editing.

Conflict of Interest Statement
F Abad-Santos has been a consultant or investigator in clinical trials sponsored by the following pharmaceutical companies: Abbott, Alter, Chemo, Cinfa, FAES, Farmalíder, Ferrer, GlaxoSmithKline, Galenicum, Gilead, Italfarmaco, Janssen-Cilag, Kern, Normon, Novartis, Servier, Silver Pharma, Teva, and Zambon. I de los Santos has received grants from Gilead, ViiV, and Janssen. The remaining authors declare no conflicts of interest.