Life Science Alliance
Resource
Transparent Process
Open Access

Systematic assessment of structural variant annotation tools for genomic interpretation

Xuanshi Liu, Lei Gu, Chanjuan Hao, Wenjian Xu, Fei Leng, Peng Zhang, Wei Li
Xuanshi Liu
1Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Genetics and Birth Defects Control Center, National Center for Children’s Health; Beijing Children’s Hospital, Capital Medical University, Beijing, China
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Visualization, Writing—original draft, Writing—review and editing
Lei Gu
2Epigenetics Laboratory, Max-Planck Institute for Heart and Lung Research, Cardiopulmonary Institute, Bad Nauheim, Germany
Roles: Visualization, Methodology, Writing—review and editing
Chanjuan Hao
1Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Genetics and Birth Defects Control Center, National Center for Children’s Health; Beijing Children’s Hospital, Capital Medical University, Beijing, China
Roles: Supervision, Project administration, Interpretation of the data
Wenjian Xu
1Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Genetics and Birth Defects Control Center, National Center for Children’s Health; Beijing Children’s Hospital, Capital Medical University, Beijing, China
Roles: Methodology, Interpretation of the data
Fei Leng
1Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Genetics and Birth Defects Control Center, National Center for Children’s Health; Beijing Children’s Hospital, Capital Medical University, Beijing, China
Roles: Interpretation of the data
Peng Zhang
1Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Genetics and Birth Defects Control Center, National Center for Children’s Health; Beijing Children’s Hospital, Capital Medical University, Beijing, China
Roles: Validation, Visualization, Interpretation of the data
Wei Li
1Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Genetics and Birth Defects Control Center, National Center for Children’s Health; Beijing Children’s Hospital, Capital Medical University, Beijing, China
Roles: Supervision, Funding acquisition, Project administration, Writing—review and editing
  • For correspondence: liwei@bch.com.cn
Published 10 December 2024. DOI: 10.26508/lsa.202402949

Abstract

Structural variants (SVs) over 50 base pairs play a significant role in phenotypic diversity and are associated with various diseases, but their analysis is complex and resource-intensive. Numerous computational tools have been developed for SV prioritization, yet their effectiveness in biomedicine remains unclear. Here we benchmarked eight widely used SV prioritization tools, categorized into knowledge-driven (AnnotSV, ClassifyCNV) and data-driven (CADD-SV, dbCNV, StrVCTVRE, SVScore, TADA, XCNV) groups in accordance with the ACMG guidelines. We assessed their accuracy, robustness, and usability across diverse genomic contexts, biological mechanisms, and computational efficiency using seven carefully curated independent datasets. Our results revealed that both groups of methods exhibit comparable effectiveness in predicting SV pathogenicity, although performance varies among tools, emphasizing the importance of selecting the appropriate tool based on specific research purposes. Furthermore, we identified potential improvements for extending these tools to future applications. Our benchmarking framework provides a crucial evaluation method for SV analysis tools, offering practical guidance for biomedical research and facilitating the advancement of better genomic research tools.

Introduction

Structural variants (SVs), namely genetic alterations exceeding 50 base pairs (bp), significantly contribute to phenotypic diversity and underlie the mechanisms of a wide spectrum of human disorders, from rare diseases such as thrombocytopenia-absent radius syndrome (Klopocki et al, 2007) to common ones like autism spectrum disorder (Zhang et al, 2023) and cancer (Li et al, 2020). However, SVs represent a diverse spectrum of genomic changes containing deletions, duplications, inversions, insertions, translocations, and more complex variations (Collins et al, 2020), which present significant challenges for detection and analysis.

Detecting SVs using short-read sequencing poses challenges due to difficulties in aligning reads and accurately determining the full genomic span affected by an SV, especially when breakpoints occur within tandem repeats or involve sequences absent from the reference genome. Although long-read sequencing can mitigate some of these challenges by providing longer and more contiguous reads, it is often constrained by higher costs, lower throughput, and increased error rates compared with short-read sequencing. In addition, the vast number of SVs detected, thousands through short-read and up to 20,000 through long-read whole genome sequencing (WGS) (Collins et al, 2020; Beyter et al, 2021), complicates their analysis and interpretation.

The functional impact of SVs is complex, directly influencing gene function and indirectly affecting regulatory regions through long-range interactions (Lupianez et al, 2015). Moreover, a significant portion of SVs is found in noncoding regions, where our understanding is still evolving. Traditional methods for assessing the functionality or causality of SVs, such as association studies and eQTL analysis, require extensive cohorts, high-throughput sequencing, and sophisticated data analysis (Brandler et al, 2018). Family-based studies, while valuable, are resource-intensive and require specialized expertise (Pagnamenta et al, 2023).

Given the complexity and the high number of SVs, computational tools for their prioritization have become essential. Since 2015, more than two dozen tools have been introduced, predominantly in the last 3 yr. However, no study has yet compared the performance of these SV prioritization tools. To fill this gap, we selected eight tools for benchmarking based on their availability, periodic updates, ability to handle various SV types without additional information or manual work, and computational efficiency in terms of resource usage and compatibility with standard pipelines (Table S1).

Table S1. SV prioritization methods to be evaluated (Erikson et al, 2015; McLaren et al, 2016; Geoffroy et al, 2018; Huynh & Hormozdiari, 2019; Spector & Wiita, 2019; Kumar et al, 2020; Nieboer & de Ridder, 2020; Bhattacharya et al, 2021; Fan et al, 2021; Fino et al, 2021; Minoche et al, 2021; Requena et al, 2021; Yang et al, 2022; Danis et al, 2022; Ding et al, 2023; Macnee et al, 2023).

These eight tools are categorized into two types: the first type, or knowledge-driven, such as AnnotSV (Geoffroy et al, 2021) and ClassifyCNV (Gurbich & Ilinsky, 2020), is based on established clinical evaluation guidelines from the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen), which serve as the gold standard for the clinical evaluation and etiological diagnosis of genetic disorders (Richards et al, 2015). The second type, or data-driven, including tools such as CADD-SV (Kleinert & Kircher, 2022), dbCNV (Lv et al, 2023), StrVCTVRE (Sharo et al, 2022), SVScore (Ganel et al, 2017), TADA (Hertzberg et al, 2022), and XCNV (Zhang et al, 2021), employs machine learning models such as random forest, gradient boosted trees, and XGBoost to estimate SV effects, differing in features or training sets.

The knowledge-driven approaches incorporated the databases described in the ACMG guidelines, stratified by SV type, and considered factors such as protein-coding or other functionally important elements, gene numbers, haploinsufficiency, benign regions, and inheritance patterns. In contrast, data-driven approaches based their training sets and features on gold standard datasets, including ClinVar (Landrum et al, 2016), DECIPHER (Firth et al, 2009), DGV (MacDonald et al, 2014), GnomAD (Collins et al, 2020), and 1 KG (1000 Genomes Project Consortium et al, 2015), with a focus on specific aspects of SV analysis. For example, CADD-SV used training sets derived from human and chimpanzee SVs as neutral proxies, whereas dbCNV incorporated diverse gold standard datasets within its scoring models. StrVCTVRE focused on molecular functions overlapping exons, SVScore aggregated scores from individual SNPs, TADA considered long-range hypotheses from 3D genomic data, and XCNV integrated a broad spectrum of population genomic information.

In this study, we evaluated the accuracy, robustness, and usability of the eight SV prioritization approaches across various genomic contexts and biological backgrounds. We hope to provide a comprehensive evaluation to assist researchers and clinicians in choosing the most appropriate tools for their study purposes or datasets. Furthermore, we discuss the future directions of SV prioritization approaches, offering insights into the field to facilitate the development of tools.

Results

Description of benchmarking pipeline

In our systematic evaluation (Table S1), we identified eight computational approaches developed between 2017 and 2023: AnnotSV, CADD-SV, ClassifyCNV, dbCNV, StrVCTVRE, SVScore, TADA, and XCNV (Table 1). The knowledge-driven approaches, AnnotSV and ClassifyCNV, which included scoring metrics, demanded considerable expertise for implementation based on ACMG criteria. In contrast, data-driven approaches primarily generated scores to prioritize SVs.

Table 1.

Overview of the approaches evaluated in this work.

Our benchmarking used six datasets constructed from seven different data sources, with GnomAD serving as a negative control set (Tables 2 and S2). The datasets encompassed a total of 489 germline SVs from ClinVar, six noncoding SVs, 12 long-range SVs, 456 somatic SVs from COSMIC (Sondka et al, 2024), 32 GWAS SVs, and 72 eQTL SVs. The performance of these approaches was assessed based on three key criteria: accuracy, robustness, and usability (Fig 1). Accuracy, defined as the ability to identify pathogenic SVs, was evaluated using the AUC metric on the ClinVar dataset. Robustness was examined in the context of genomic and biological variability. Usability was measured by computational efficiency and the user-friendliness of the tools, including the quality of documentation, ease of installation, requirements for preinstalled datasets, complexity of input files, and the presence of an online webserver.

Table 2.

Summary of seven independent datasets used in this study.

Table S2. Detailed description of seven independent data sources used in this study.

Figure 1. Overview of study workflow for SV prioritization benchmarking.

This workflow illustrates the evaluation process for eight SV prioritization tools, categorized into knowledge-driven and data-driven approaches. These tools were benchmarked across seven independent and curated datasets using three main criteria: (1) accuracy in pathogenicity prediction, (2) robustness in diverse genomic and biological contexts, and (3) usability, focusing on user accessibility and computational performance.

Benchmarking performance evaluation of accuracy

Our comprehensive evaluation revealed significant variability in the predictive concordance among the eight SV prioritization approaches. Spearman rank correlation coefficients indicated a higher degree of consistency for the negative set compared with the positive set, with weak correlations (R < 0.3) prevalent among the approaches (Fig 2A). This observation suggests a lack of consensus in predictive capabilities, underscoring the necessity for a thorough comparative assessment.
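The concordance measure can be illustrated with a minimal sketch of the Spearman rank correlation between the scores that two tools assign to the same SVs. The function names and the score values are hypothetical and are not taken from the study; ties receive average ranks, which is a common convention and an assumption here.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores from two tools on the same five SVs (not study data).
tool_a = [0.91, 0.15, 0.78, 0.40, 0.66]
tool_b = [0.80, 0.30, 0.20, 0.55, 0.60]
rho = spearman(tool_a, tool_b)
```

Because only ranks enter the calculation, tools that score on different scales can be compared directly without rescaling.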

Figure 2. Comparative performance of eight SV prioritization approaches.

(A) Correlation analysis between positive (pathogenic) and negative (benign) variant sets across the eight approaches, indicating the differentiation ability of each tool. (B) Distribution of pathogenicity scores for positive and negative sets, showing score separation across the tools. (C) Performance summary across all germline variants from ClinVar, measured by area under the curve.

In assessing accuracy using the AUC metric against gold standard datasets, StrVCTVRE stood out with an AUC of 0.96, demonstrating exceptional performance (Fig 2B, Table S3). Within the data-driven approaches, XCNV, CADD-SV, TADA, and SVScore also exhibited commendable AUCs ranging from 0.91 to 0.83. Conversely, dbCNV showed a notably lower performance with an AUC of 0.50. The distributions of pathogenicity scores (PS) for positive and negative sets were distinctly separable in most data-driven methods, whereas dbCNV showed overlapping distributions (Fig 2C). Knowledge-driven models, AnnotSV and ClassifyCNV, also performed relatively well, with AUCs of 0.93 and 0.70, respectively. These results highlight the competitive performance of both knowledge-driven and data-driven models, particularly StrVCTVRE and AnnotSV.
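As a reading aid for the AUC values, the following minimal sketch computes AUC as the probability that a randomly chosen pathogenic SV receives a higher score than a randomly chosen benign SV (the Mann-Whitney formulation). The function name and the scores are hypothetical and do not reproduce the study's pipeline.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores (not study data): pathogenic vs. benign SVs.
pathogenic = [0.9, 0.8, 0.6, 0.4]
benign = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(pathogenic, benign)

# A scorer that cannot separate the classes yields 0.5.
uninformative = roc_auc([0.5, 0.5], [0.5])
```

Under this definition an AUC of 0.50 means the scores carry no ranking information about pathogenicity, while 1.0 means every pathogenic SV outranks every benign one.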

Table S3. Performances over all approaches.

Performance evaluation of the robustness on genomic features

We conducted a robustness evaluation of the approaches based on genomic features, including SV types, lengths, and gene contents. According to ACMG guidelines, deletions and duplications were assessed separately. The performance of most approaches was found to be similar for both SV types, which aligned with the distribution of PS (Fig 3A). StrVCTVRE, AnnotSV, XCNV, CADD-SV, and ClassifyCNV demonstrated AUCs above 0.71 for deletions and 0.63 for duplications (Fig 3B, Table S4). However, TADA, SVScore, and dbCNV were less consistent, especially for duplications, where their AUCs were considerably lower.

Figure 3. Performance of SV prioritization tools across genomic contexts.

SV type, length, and gene content: (A) Distribution of pathogenicity scores for deletion and duplication sets, illustrating score separation across tools by SV type. (B) Performance of each tool in deletions and duplications among germline ClinVar variants, evaluated by area under the curve (AUC). (C) Length distributions of deletions and duplications within the dataset. (D) AUC performance over three length ranges (L1 < 6 × 10³, L2: 6 × 10³ to 10⁵, L3 > 10⁵) for deletions and duplications. (E) Distribution differences in protein-coding gene coverage between negative (benign) and positive (pathogenic) SV sets. (F) AUC comparison by gene context (disease-related, functional genes) for deletions and duplications, further categorized by SVs covering zero genes (No. genes = 0) and one or more genes (No. genes ≥ 1). AUC, area under the curve; SV, structural variant.

Table S4. Performance across approaches in two SV types.

When assessing SV performance across different length ranges (< 6 × 10³, 6 × 10³ to 10⁵, and > 10⁵ bp) (Fig 3C), StrVCTVRE, AnnotSV, XCNV, CADD-SV, and TADA maintained high and consistent performance (AUCs > 0.80) in deletions across all size groups (Fig 3D, Table S5). In contrast, ClassifyCNV and dbCNV showed relatively poor performance, and SVScore displayed a lower AUC (AUC = 0.65) for lengths greater than 10⁵ bp. For duplications, a decline in performance with increasing length was observed, particularly for TADA, which showed a decline in AUC from 0.98 for shorter duplications to 0.94 for longer ones. This trend may be attributed to the size-match strategy used in TADA’s training set construction. ClassifyCNV and SVScore showed less promising performance for longer duplications, whereas dbCNV failed to distinguish between positive and negative sets. These results demonstrated that CADD-SV, AnnotSV, StrVCTVRE, and XCNV had high efficacy across various SV length groups for both deletions and duplications, whereas TADA, SVScore, and ClassifyCNV exhibited variable performance, particularly for longer lengths (Fig 3D, Table S5).
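The length-stratified comparison can be sketched as follows: each SV is assigned to one of the three length groups and AUC is computed within each group. The group boundaries follow the Fig 3D legend; how the exact boundary values are assigned, the function names, and the example records are assumptions of this sketch rather than details from the study.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Length-group boundaries in bp, as given in the Fig 3D legend; which group
# the exact boundary values fall into is an assumption of this sketch.
L1_MAX = 6_000
L2_MAX = 100_000


def length_group(length_bp: int) -> str:
    """Assign an SV length to L1 (< 6e3), L2 (6e3 to 1e5) or L3 (> 1e5)."""
    if length_bp < L1_MAX:
        return "L1"
    if length_bp <= L2_MAX:
        return "L2"
    return "L3"


def roc_auc(pos: List[float], neg: List[float]) -> float:
    """Pairwise AUC; ties count as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_length(records: List[Tuple[int, float, bool]]) -> Dict[str, float]:
    """records: (length_bp, score, is_pathogenic). Returns AUC per length group.

    Groups lacking either class are skipped because AUC is undefined there.
    """
    pos: Dict[str, List[float]] = defaultdict(list)
    neg: Dict[str, List[float]] = defaultdict(list)
    for length_bp, score, is_pathogenic in records:
        (pos if is_pathogenic else neg)[length_group(length_bp)].append(score)
    return {g: roc_auc(pos[g], neg[g])
            for g in ("L1", "L2", "L3") if pos[g] and neg[g]}


# Hypothetical records (not study data).
records = [
    (2_000, 0.9, True), (3_500, 0.2, False), (5_000, 0.4, False),
    (20_000, 0.7, True), (50_000, 0.8, False), (90_000, 0.1, False),
    (250_000, 0.6, True),  # L3 has no benign SV here, so it is skipped
]
per_group = auc_by_length(records)
```

Stratifying before scoring makes it visible when a tool's overall AUC is driven by one length range, which a pooled AUC would hide.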

Table S5. Performance across approaches in different length groups.

From the comparison of SV types and length groups, we observed that the distribution of PS in the duplication set was generally less distinct. This may be due to the smaller number of duplications in the training set and feature selection processes that were more tailored to deletions. The predominance of shorter deletions over duplications in pathogenic status requires careful consideration. For example, the commonly used training set ClinVar includes 11,946 germline pathogenic deletions with an average length of 122,698 bp and 1,391 duplications with an average length of 131,202 bp, with over 13% and 15% of deletions and duplications being longer than the average length, respectively (Fig 3C).

Gene content analysis revealed significant differences between the number of protein-coding genes covered by SVs in negative and positive sets (Figs 3E and S1). CADD-SV, AnnotSV, StrVCTVRE, and XCNV consistently showed superior performance across different gene content categories, irrespective of the number of genes involved, with AUCs exceeding 0.85 (Fig 3F, Table S6). In contrast, TADA, SVScore, and ClassifyCNV performed better in deletions not associated with any disease or functional genes. Notably, deletions without disease or functional genes were longer, although not significantly so (mean length: 20,948.65 bp). For duplications, the performance of CADD-SV, AnnotSV, StrVCTVRE, and XCNV remained high in groups intersecting with at least one disease or functional gene (Fig 3F, Table S6). In summary, our study indicates that while most approaches exhibit improved performance in the absence of disease or functional genes in deletions, their efficacy varies in duplications.

Figure S1. Distribution and intersection analyses of disease-relevant and functionally relevant genes.

(A) Upset plot showing the distribution and overlap of disease-relevant gene sets across three databases: ACMG (American College of Medical Genetics), ClinGen (Clinical Genome Resource), and Orphanet. Vertical bars indicate the size of each intersection, while horizontal bars represent the total number of genes in each database. The largest group is unique to ClinGen (3,686 genes), with additional smaller overlaps across combinations of the three datasets. (B) Upset plot of functionally relevant genes across four experimental and phenotypic databases: cell culture, pTriplo (probability of triplosensitivity), MGI (Mouse Genome Informatics), and pHaplo (probability of haploinsufficiency). The highest number of unique genes is in cell culture (1,351 genes), with other notable intersections among database combinations. Vertical bars show intersection sizes, while horizontal bars display total gene counts in each dataset.

Table S6. Performance across approaches in different groups of genes.

Collectively, StrVCTVRE, AnnotSV, CADD-SV, and XCNV have demonstrated superior performance across various metrics, including SV types, length groups, and gene contents, indicating their robustness in predicting SV pathogenicity.

Performance evaluation of the robustness on biological mechanisms

Our investigation into the robustness of computational approaches for SV analysis extended to examining biological mechanisms. We systematically curated five distinct datasets representing a spectrum of genomic variations, including noncoding SVs, long-range SVs, somatic SVs, disease-associated SVs, and functionally relevant SVs.

In noncoding SVs, we observed that TADA, SVScore, and AnnotSV were the top performers, demonstrating high AUC values of 0.92, 0.86, and 0.83, respectively (Fig 4). These tools showed strong alignment between AUC and other performance metrics such as accuracy, sensitivity, and specificity (Table S3). However, it is important to note that CADD-SV and StrVCTVRE were not applicable for noncoding SVs due to their focus on protein-coding genes. For long-range SVs, TADA and XCNV were the standout performers, with AUCs of 0.98 and 0.95, respectively (Fig 4). StrVCTVRE, AnnotSV, and CADD-SV also exhibited robust performance, although StrVCTVRE’s results were partially obscured by an 11% variant missing rate (Table S3). The discordance between SVScore’s high AUC and its lower MCC suggested a high false positive rate, likely due to its default scoring system for long SVs.
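How a high AUC can coexist with a low Matthews correlation coefficient (MCC) is illustrated in the sketch below. AUC depends only on the ranking of scores, whereas MCC depends on the binary calls made at a chosen cutoff. The scores and both cutoffs are hypothetical; the sketch does not reproduce the behavior of SVScore or of any other evaluated tool.

```python
import math
from typing import Sequence


def mcc(labels: Sequence[bool], calls: Sequence[bool]) -> float:
    """Matthews correlation coefficient from true labels and binary calls."""
    tp = sum(truth and call for truth, call in zip(labels, calls))
    tn = sum((not truth) and (not call) for truth, call in zip(labels, calls))
    fp = sum((not truth) and call for truth, call in zip(labels, calls))
    fn = sum(truth and (not call) for truth, call in zip(labels, calls))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a marginal is empty and MCC is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical scores (not study data). Every pathogenic SV outranks every
# benign one, so the ranking is perfect (AUC = 1.0), but most benign SVs
# still receive high absolute scores.
labels = [True, True, True, False, False, False, False]
scores = [0.99, 0.97, 0.95, 0.93, 0.92, 0.91, 0.30]

calls_default = [s >= 0.5 for s in scores]   # hypothetical default cutoff
calls_tuned = [s >= 0.94 for s in scores]    # cutoff placed between classes

mcc_default = mcc(labels, calls_default)     # three false positives
mcc_tuned = mcc(labels, calls_tuned)         # no misclassification
```

With the hypothetical default cutoff, three of four benign SVs are called pathogenic and MCC drops to about 0.35 even though the ranking is perfect, which is why reporting both metrics is informative.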

Figure 4. Performance across approaches covering various biological mechanisms including noncoding SVs, long range SVs, somatic SVs, GWAS SV and eQTL SV.

AUC, area under the curve; SV, structural variant.

Somatic SVs were assessed, with TADA and AnnotSV leading the way with AUCs of 0.77 and 0.74, respectively (Fig 4). Regarding MCC, the knowledge-driven approaches, particularly AnnotSV and ClassifyCNV, showed a slight advantage over data-driven methods. When assessing disease-associated SVs from large cohort studies, all methods except SVScore showed comparable AUCs and MCCs (Fig 4, Table S3). However, CADD-SV, StrVCTVRE, and TADA failed to score some variants (missing variants), indicating a need for improvement in handling these variants.

Finally, the analysis of functionally relevant SVs revealed varying performance among data-driven approaches. XCNV emerged as the top performer with an AUC of 0.71, followed by CADD-SV and TADA (Fig 4). In contrast, dbCNV and StrVCTVRE lagged behind, highlighting challenges in accurately predicting these SVs.

In summary, our assessment confirmed the potential of these tools to identify novel biological mechanisms from germline variants. Although the tools did not reach the highest AUCs for somatic, GWAS, and eQTL SVs, their performance provides a foundation for further refinement. Our findings show the importance of selecting the appropriate tool based on the specific characteristics of the SVs and highlight the potential for improvement across various genomic contexts.

Usability evaluation of computational efficiency and user-friendliness

The usability of the approaches was assessed. We focused on computational efficiency and user-friendliness, which significantly impact user experience and practical applicability (Table 3). In terms of computational efficiency, knowledge-driven approaches generally outperformed data-driven approaches, completing tasks within an average of 15 s. ClassifyCNV was notably efficient, while StrVCTVRE and TADA led among data-driven methods. Notably, the use of default hyperparameter settings during testing influenced method efficiency. For instance, CADD-SV provides multicore operational capability, which may influence efficiency.

Table 3.

Summary of computational efficiency and user-friendliness over all approaches.

Regarding the quality of tutorials and code, we found that most methods adequately met the basic requirements of users, ensuring straightforward installation and use. Several approaches offered support through conda environments and Docker images, which greatly facilitated the setup process. However, the need to install datasets was a rate-limiting step that depended on internet connection stability. Our analysis also considered the complexity of input files, including supported genome builds, file types, and SV types. We noted that most tools supported at least two SV types. The hg19 genome build was commonly accepted, though hg38 is increasingly adopted. The bed format, specifying chromosome, start position, end position, and SV type, emerged as the standard among the evaluated tools. In addition, four of the eight approaches, namely AnnotSV, CADD-SV, StrVCTVRE, and XCNV, provided an online version, enhancing accessibility and user convenience.
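A minimal sketch of reading the four-column bed layout described above (chromosome, start, end, SV type) is given below. It assumes the standard BED convention of 0-based, half-open coordinates, so that length equals end minus start. The class name, the parser, and the type labels are hypothetical, and individual tools may expect different labels or additional columns.

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # exclusive (BED convention)
    sv_type: str

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed(text: str) -> List[SV]:
    """Parse tab-separated BED lines with a fourth column for SV type."""
    svs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blanks, comments and BED header lines
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected at least 4 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        svs.append(SV(chrom, start, end, fields[3].upper()))
    return svs


# Hypothetical input; coordinates and type labels are illustrative only.
example = "chr1\t1000\t7500\tDEL\nchr2\t200000\t350000\tdup\n"
svs = parse_bed(example)
```

Validating intervals at load time catches swapped or 1-based coordinates early, before they reach a scoring tool that may fail silently on malformed input.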

Discussion

In the landscape of genomic sequencing, computational methods have become indispensable for deciphering the functional relevance and clinical significance of SVs. We created seven datasets from diverse biological backgrounds and conducted an extensive benchmarking of eight available approaches, categorized into knowledge-driven and data-driven, focusing on accuracy, robustness, and usability. Our findings reveal that both categories of tools demonstrate comparable effectiveness in identifying pathogenic germline SVs. This study systematically evaluates and compares the performance of SV prioritization tools, offering important insights to the biomedical and clinical communities.

Our evaluations yielded several key insights. First, the comparable effectiveness in identifying pathogenic germline SVs across different methods suggests that the choice between these approaches should be guided by the specific context and objectives of the analysis rather than any inherent superiority. This underscores the need for a contextual approach in selecting SV analysis tools. In addition, our benchmarking study highlights the strengths and limitations of both knowledge-driven and data-driven techniques. Future tools could benefit from a hybrid approach. Knowledge-driven techniques, which leverage existing knowledge and frameworks such as the ACMG guidelines, are essential for determining the pathogenicity of SVs. Incorporating data-driven techniques can be highly beneficial in identifying novel or potentially pathogenic SVs that may not be well understood yet. Integrating both approaches can lead to more comprehensive and accurate SV prioritization, especially for novel or complex regions.

Second, the capacity of these methods to integrate new knowledge and generate new hypotheses is critical. For example, the importance of small variants in noncoding regions is well-established, as illustrated by examples such as a variant in the promoter region of GATA1 affecting a transcription factor binding site, leading to hereditary persistence of fetal hemoglobin (Martyn et al, 2019), or a variant disrupting upstream open reading frames of the NF2 gene causing neurofibromatosis type 2 (Whiffin et al, 2020). With WGS uncovering hundreds of thousands of SVs, primarily impacting noncoding regions, the ability of these tools to accommodate emerging data is essential for scientific discovery.

Third, the applicability of these methods to variants beyond germline SVs is highly significant. The performances are acceptable for initial screening and can be particularly useful in data generation or in settings where a broader filter is applied to capture potential variants of interest. Recently, several studies have focused on discovering somatic variants from whole exome sequencing data from the UK Biobank (Bernstein et al, 2024). As the understanding of the role of somatic and other non-germline variants in disease grows, tools capable of analyzing a broader spectrum of variants become increasingly important.

Despite these advancements, challenges persist in generating unified SV sets across all types, especially from short-read WGS. Most existing approaches concentrate on deletions and duplications, often overlooking other SV types. This limitation may stem from the developing status of ACMG guidelines and the scarcity of gold standard datasets for certain SV types. The increasing accessibility of long-read sequencing opens up new opportunities for SV detection. This technique is particularly effective for identifying complex SVs, resolving repetitive regions, and resolving large structural changes that short-read technologies fail to capture. However, it also faces challenges. The newly accessible regions will require updated annotations and retraining of data-driven models to handle the unique properties of long-read data. Moreover, integrating long-read sequencing data with the existing short-read data and annotations poses another challenge. There is a need for tools that can efficiently combine information from multiple sequencing platforms and provide a unified annotation framework.

Understanding the underlying biological mechanisms also necessitates integrating cell-type specific information and phenotype data (Liu et al, 2023; Sanchez-Gaya & Rada-Iglesias, 2023). Promisingly, recent methodologies have begun to incorporate phenotype-specific characteristics (Althagafi et al, 2022; Xu et al, 2023), recognizing their significance in assessing SV pathogenicity. A particular challenge lies in interpreting the biological significance of SVs within noncoding regions, where their impact often depends on disruptions to regulatory elements such as enhancer–promoter interactions and topologically associating domain (TAD) boundaries. Tools that incorporate 3D genomic context could improve noncoding SV interpretation (Hertzberg et al, 2022; Poszewiecka et al, 2022).

Finally, CHM13/T2T represents a major improvement in genome completeness, especially in difficult regions like centromeres and telomeres. Combining it with updated annotations and resources could be a promising direction for tool development, benefiting future clinical and biological studies. As the identification of pathogenic SVs increases, comprehensive annotation of the noncoding genome, a deeper understanding of SVs in disease etiology, and advancements in bioinformatics technologies will undoubtedly spur the development of additional tools. Our future work will compare these emerging tools as they become available.

In conclusion, our study provides a critical evaluation of computational tools for prioritizing SVs, highlighting their accuracy, robustness, and usability. The findings emphasize the importance of selecting tools based on the specific analysis context and objectives. As genomics continues to evolve, the adaptability of these tools to new knowledge and data generation will be crucial for advancing our understanding of the genomic basis of disease.

Materials and Methods

Dataset curation

To ensure the robustness and reproducibility of our benchmarking assessments, we required an independent dataset that does not overlap with any variants used in the training datasets of the software under evaluation. Our pipeline created seven distinct datasets; the first six served as positive datasets to evaluate specific aspects of software performance, and the last served as a negative control: (1) germline SVs from ClinVar; (2) SVs in noncoding regions (noncoding SVs); (3) SVs involved in long-range interactions (long-range SVs); (4) somatic SVs from COSMIC (https://cancer.sanger.ac.uk/cosmic); (5) validated SVs from GWAS (Auwerx et al, 2024); (6) functionally relevant SVs from eQTL studies (Scott et al, 2021); (7) population SVs from GnomAD version 4.1 (https://gnomad.broadinstitute.org/) (Fig 1, Tables 2 and S2). Using a two-part strategy, we ensured that our test datasets were accessed after the publication dates of the evaluated software, and we eliminated any overlapping SVs among the datasets.
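The overlap-elimination step can be illustrated with a minimal sketch. The text does not state the overlap criterion, so the version below assumes that two SVs overlap when they have the same type, lie on the same chromosome, and share at least one base; the `SV` record and the function names are hypothetical and are not taken from the authors' pipeline.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical SV record; coordinates assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL" or "DUP"


def overlaps(a: SV, b: SV) -> bool:
    """True if two SVs of the same type share at least one base (assumed criterion)."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and a.start <= b.end
        and b.start <= a.end
    )


def drop_overlapping(candidates: list[SV], reference: list[SV]) -> list[SV]:
    """Keep only candidates that overlap no SV in the reference set."""
    return [sv for sv in candidates if not any(overlaps(sv, r) for r in reference)]


# Toy example: one candidate overlaps a variant already present elsewhere.
test_set = [SV("chr1", 100, 500, "DEL"), SV("chr2", 1000, 2000, "DUP")]
training_set = [SV("chr1", 400, 900, "DEL")]
independent = drop_overlapping(test_set, training_set)
```

A stricter rule, such as a minimum reciprocal overlap fraction, could replace `overlaps` without changing the rest of the sketch. The quadratic scan is adequate for small curated sets; an interval index would be preferable for population-scale data.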

The first benchmark dataset, derived from ClinVar in March 2024, focused on CNVs, including deletions and duplications, classified according to ACMG guidelines. We retained CNVs exceeding 50 bp in length and meeting specific classification criteria (“pathogenic” and “likely pathogenic” as positive labels, “likely benign” and “benign” as negative labels). To address the limited size of the negative set, we randomly selected deletions and duplications with allele frequencies less than 1% from GnomAD, confirming that they did not overlap with GnomAD V2 or with pathogenic SVs in ClinVar.
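As an illustration of the retention and labeling rule described above, the following sketch assigns a binary label to a ClinVar CNV. It assumes 1-based inclusive coordinates (so length is `end - start + 1`) and a case-insensitive match on the classification string; the function name and the numeric labels 1 and 0 are hypothetical conventions, not the authors' code.

```python
from typing import Optional

POSITIVE_LABELS = {"pathogenic", "likely pathogenic"}
NEGATIVE_LABELS = {"benign", "likely benign"}
MIN_LENGTH_BP = 50  # CNVs must exceed this length to be retained


def label_cnv(start: int, end: int, significance: str) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None if the CNV is excluded."""
    length = end - start + 1  # assumes 1-based inclusive coordinates
    if length <= MIN_LENGTH_BP:
        return None
    sig = significance.strip().lower()
    if sig in POSITIVE_LABELS:
        return 1
    if sig in NEGATIVE_LABELS:
        return 0
    return None  # e.g. uncertain or conflicting classifications


labels = [
    label_cnv(1000, 5000, "Pathogenic"),
    label_cnv(1000, 1020, "Benign"),
    label_cnv(1, 1000, "Uncertain significance"),
]
```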

The second to fifth benchmark datasets were curated to reflect disease relevance with diverse biological origins (Table 2). Noncoding SVs with established pathogenicity were identified from peer-reviewed publications, emphasizing the role of noncoding regions in genetic pathology (Gordon et al, 2014; Bieth et al, 2015; Turner et al, 2016; Cappuccio et al, 2019) (Table S7). Long-range SVs were sourced from studies demonstrating their impact on the three-dimensional genome architecture (Kouwenhoven et al, 2010; Ellaway et al, 2013; Tayebi et al, 2014; Lupianez et al, 2015; Franke et al, 2016; D’Haene et al, 2019; Long et al, 2020) (Table S7). Somatic SVs were derived from COSMIC, and we constructed matched positive and negative SV datasets following the approach of Wang et al (2023) and OncoKB (Chakravarty et al, 2017). Disease-associated SVs from GWAS were included based on validation and significance thresholds.

Table S7. Curated datasets from publications for biological mechanisms.

The sixth dataset, from eQTL studies, aimed to connect molecular and clinical phenotypes, focusing on rare SVs associated with aberrant gene expression across multiple tissues (Scott et al, 2021). The consequence of each SV with respect to its outlier gene was either a complete or a partial dosage change.

The final dataset, comprising population SVs, served as negative controls, with additional rare GnomAD variants added for comprehensiveness. All variants were lifted over to hg19 using the UCSC liftOver tool. We restricted our analysis to autosomes in the hg19 genome build.
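The autosome restriction can be sketched as a simple filter applied after coordinate conversion. The example assumes variants are represented as `(chrom, start, end)` tuples and that chromosome names may or may not carry a `chr` prefix; both are assumptions for illustration, and the liftover step itself is not reproduced here.

```python
# Human autosomes are chromosomes 1-22.
AUTOSOMES = {str(i) for i in range(1, 23)}


def is_autosomal(chrom: str) -> bool:
    """True for chromosomes 1-22, with or without a 'chr' prefix."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return name in AUTOSOMES


# Toy variants as (chrom, start, end) tuples; this layout is hypothetical.
variants = [("chr1", 100, 500), ("chrX", 100, 500), ("22", 10, 90), ("chrM", 1, 50)]
autosomal = [v for v in variants if is_autosomal(v[0])]
```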

Feature selection

Genomic content, crucial for evaluating the disease or functional relevance of SVs, was systematically compiled. We collected three groups of genes: protein-coding genes, disease-associated genes, and functionally relevant genes. Protein-coding genes were sourced from GENCODE. Disease-associated genes comprised genes from Orphanet (https://www.orpha.net/consor/cgi-bin/index.php), dosage-sensitive genes from ClinGen (https://search.clinicalgenome.org/kb/gene-dosage/cnv), and ACMG-approved genes (V3.0). Functionally relevant genes comprised essential genes from cell culture studies (Hart et al, 2017), genes lethal in mouse models (Motenko et al, 2015), and genes with predicted dosage sensitivity (probability of haploinsufficiency >= 0.9 or probability of triplosensitivity >= 0.9) (Collins et al, 2022). Annotations were based on distinct feature types in the hg19 genome build (Liu et al, 2023).
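The dosage-sensitivity criterion stated above can be written as a short predicate. This is a sketch only: the gene identifiers and score values below are placeholders rather than real data, and the parameter names `p_haplo` and `p_triplo` are assumed labels for the two probabilities.

```python
DOSAGE_THRESHOLD = 0.9  # cutoff stated in the text for both probabilities


def is_predicted_dosage_sensitive(
    p_haplo: float, p_triplo: float, threshold: float = DOSAGE_THRESHOLD
) -> bool:
    """True if either probability reaches the threshold (inclusive)."""
    return p_haplo >= threshold or p_triplo >= threshold


# Placeholder genes with (haploinsufficiency, triplosensitivity) probabilities.
gene_scores = {
    "GENE_A": (0.95, 0.10),
    "GENE_B": (0.20, 0.90),
    "GENE_C": (0.50, 0.89),
}
sensitive_genes = sorted(
    gene
    for gene, (p_haplo, p_triplo) in gene_scores.items()
    if is_predicted_dosage_sensitive(p_haplo, p_triplo)
)
```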

Workflow building and evaluation method

All methods, except dbCNV, generated pathogenic scores (PS) for SV prioritization, with lower scores indicating non-pathogenicity and higher scores suggesting pathogenicity. For dbCNV, the five-tier classification was converted into numeric indicators. All methods were run with default parameters, and the resulting PS values were min–max normalized.
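A minimal sketch of the score harmonization follows. Min–max normalization is as named in the text. The tier-to-number mapping is purely hypothetical: the text does not give the numeric indicators used for dbCNV, and the tier names here are assumed to follow the common five-tier ACMG terminology.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale scores linearly to [0, 1]; constant input maps to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical numeric indicators for a five-tier classification (assumption).
TIER_TO_SCORE = {
    "benign": 0.0,
    "likely benign": 0.25,
    "uncertain significance": 0.5,
    "likely pathogenic": 0.75,
    "pathogenic": 1.0,
}

raw_scores = [2.0, 4.0, 6.0]
normalized = min_max_normalize(raw_scores)
tier_scores = [TIER_TO_SCORE[t] for t in ("benign", "pathogenic")]
```

Normalizing each tool's scores to a common range makes thresholds comparable across tools, although it does not make the score distributions themselves equivalent.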

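To evaluate performance across approaches, we used a suite of metrics including accuracy (Equation (1)), sensitivity (Equation (2)), specificity (Equation (3)), positive predictive value (PPV) (Equation (4)), false positive rate (FPR) (Equation (5)), F1-score (Equation (6)), Matthews correlation coefficient (MCC) (Equation (7)), and area under the curve (AUC), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Data analysis and visualization were performed using R and custom scripts.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$

$$\text{PPV} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{FPR} = \frac{FP}{FP + TN} \tag{5}$$

$$\text{F1-score} = \frac{2 \times \text{PPV} \times \text{Sensitivity}}{\text{PPV} + \text{Sensitivity}} \tag{6}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7}$$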
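For readers who wish to reproduce the metrics in Equations (1)–(7), a minimal Python sketch is given below; the original analysis was carried out in R, so this is an illustration and not the authors' code. It uses the standard MCC definition, with a square root in the denominator, and returns NaN where a denominator is zero. The AUC is assumed here to be the area under the ROC curve, computed as the probability that a positive variant scores higher than a negative one, with ties counted as one half. The function names are hypothetical.

```python
import math


def _ratio(num: float, den: float) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return num / den if den else float("nan")


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Equations (1)-(7) from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "ppv": ppv,
        "fpr": _ratio(fp, fp + tn),
        "f1": _ratio(2 * ppv * sensitivity, ppv + sensitivity),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }


def roc_auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Rank-based ROC AUC: P(positive score > negative score), ties count 0.5."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


metrics = confusion_metrics(tp=40, fp=10, tn=40, fn=10)
auc = roc_auc([0.9, 0.8], [0.1, 0.85])
```

The pairwise AUC is quadratic in the number of variants, which is acceptable for benchmark sets of modest size; a sort-based implementation would scale better.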

Computational resource

The computational resources for testing all approaches included an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30 GHz with 144 cores and 1 TB of memory, running CentOS Linux release 7.7.1908.

Data Availability

The data accessed in this article are available in ClinVar (accessed on 2024-Mar-20), GnomAD (v4.1), and COSMIC (accessed on 2024-Mar-20).

Acknowledgements

This work was supported by the Ministry of Science and Technology of China [2019YFA0802104] and the National Natural Science Foundation of China [32293204] to W Li, and by the Beijing Natural Science Foundation [5222007] to X Liu.

Author Contributions

  • X Liu: conceptualization, data curation, formal analysis, funding acquisition, investigation, visualization, and writing—original draft, review, and editing.

  • L Gu: visualization, methodology, and writing—review and editing.

  • C Hao: supervision, project administration, and interpretation of the data.

  • W Xu: methodology and interpretation of the data.

  • F Leng: interpretation of the data.

  • P Zhang: validation, visualization, and interpretation of the data.

  • W Li: supervision, funding acquisition, project administration, and writing—review and editing.

Conflict of Interest Statement

The authors declare that they have no conflict of interest.

  • Received July 17, 2024.
  • Revision received November 30, 2024.
  • Accepted December 2, 2024.
  • © 2024 Liu et al.

This article is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by/4.0/).

References

  1.
    1. 1000 Genomes Project Consortium,
    2. Auton A,
    3. Brooks LD,
    4. Durbin RM,
    5. Garrison EP,
    6. Kang HM,
    7. Korbel JO,
    8. Marchini JL,
    9. McCarthy S,
    10. McVean GA, et al.
    (2015) A global reference for human genetic variation. Nature 526: 68–74. doi:10.1038/nature15393
  2.
    1. Althagafi A,
    2. Alsubaie L,
    3. Kathiresan N,
    4. Mineta K,
    5. Aloraini T,
    6. Al Mutairi F,
    7. Alfadhel M,
    8. Gojobori T,
    9. Alfares A,
    10. Hoehndorf R
    (2022) DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning. Bioinformatics 38: 1677–1684. doi:10.1093/bioinformatics/btab859
    1. Audano PA,
    2. Sulovari A,
    3. Graves-Lindsay TA,
    4. Cantsilieris S,
    5. Sorensen M,
    6. Welch AE,
    7. Dougherty ML,
    8. Nelson BJ,
    9. Shah A,
    10. Dutcher SK, et al.
    (2019) Characterizing the major structural variant alleles of the human genome. Cell 176: 663–675.e19. doi:10.1016/j.cell.2018.12.019
  3.
    1. Auwerx C,
    2. Joeloo M,
    3. Sadler MC,
    4. Tesio N,
    5. Ojavee S,
    6. Clark CJ,
    7. Magi R, Estonian Biobank Research Team,
    8. Reymond A,
    9. Kutalik Z
    (2024) Rare copy-number variants as modulators of common disease susceptibility. Genome Med 16: 5. doi:10.1186/s13073-023-01265-5
  4.
    1. Bernstein N,
    2. Spencer Chapman M,
    3. Nyamondo K,
    4. Chen Z,
    5. Williams N,
    6. Mitchell E,
    7. Campbell PJ,
    8. Cohen RL,
    9. Nangalia J
    (2024) Analysis of somatic mutations in whole blood from 200,618 individuals identifies pervasive positive selection and novel drivers of clonal hematopoiesis. Nat Genet 56: 1147–1155. doi:10.1038/s41588-024-01755-1
  5.
    1. Beyter D,
    2. Ingimundardottir H,
    3. Oddsson A,
    4. Eggertsson HP,
    5. Bjornsson E,
    6. Jonsson H,
    7. Atlason BA,
    8. Kristmundsdottir S,
    9. Mehringer S,
    10. Hardarson MT, et al.
    (2021) Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat Genet 53: 779–786. doi:10.1038/s41588-021-00865-4
    1. Bhattacharya S,
    2. Barseghyan H,
    3. Delot EC,
    4. Vilain E
    (2021) nanotatoR: a tool for enhanced annotation of genomic structural variants. BMC Genomics 22: 10. doi:10.1186/s12864-020-07182-w
  6.
    1. Bieth E,
    2. Eddiry S,
    3. Gaston V,
    4. Lorenzini F,
    5. Buffet A,
    6. Conte Auriol F,
    7. Molinas C,
    8. Cailley D,
    9. Rooryck C,
    10. Arveiler B, et al.
    (2015) Highly restricted deletion of the SNORD116 region is implicated in Prader-Willi Syndrome. Eur J Hum Genet 23: 252–255. doi:10.1038/ejhg.2014.103
  7.
    1. Brandler WM,
    2. Antaki D,
    3. Gujral M,
    4. Kleiber ML,
    5. Whitney J,
    6. Maile MS,
    7. Hong O,
    8. Chapman TR,
    9. Tan S,
    10. Tandon P, et al.
    (2018) Paternally inherited cis-regulatory structural variants are associated with autism. Science 360: 327–331. doi:10.1126/science.aan2261
  8.
    1. Cappuccio G,
    2. Attanasio S,
    3. Alagia M,
    4. Mutarelli M,
    5. Borzone R,
    6. Karali M,
    7. Genesio R,
    8. Mormile A,
    9. Nitsch L,
    10. Imperati F, et al.
    (2019) Microdeletion of pseudogene chr14.232.a affects LRFN5 expression in cells of a patient with autism spectrum disorder. Eur J Hum Genet 27: 1475–1480. doi:10.1038/s41431-019-0430-5
  9.
    1. Chakravarty D,
    2. Gao J,
    3. Phillips SM,
    4. Kundra R,
    5. Zhang H,
    6. Wang J,
    7. Rudolph JE,
    8. Yaeger R,
    9. Soumerai T,
    10. Nissan MH, et al.
    (2017) OncoKB: A precision oncology knowledge base. JCO Precis Oncol 2017: PO.17.00011. doi:10.1200/PO.17.00011
  10.
    1. Collins RL,
    2. Brand H,
    3. Karczewski KJ,
    4. Zhao X,
    5. Alfoldi J,
    6. Francioli LC,
    7. Khera AV,
    8. Lowther C,
    9. Gauthier LD,
    10. Wang H, et al.
    (2020) A structural variation reference for medical and population genetics. Nature 581: 444–451. doi:10.1038/s41586-020-2287-8
  11.
    1. Collins RL,
    2. Glessner JT,
    3. Porcu E,
    4. Lepamets M,
    5. Brandon R,
    6. Lauricella C,
    7. Han L,
    8. Morley T,
    9. Niestroj LM,
    10. Ulirsch J, et al.
    (2022) A cross-disorder dosage sensitivity map of the human genome. Cell 185: 3041–3055.e25. doi:10.1016/j.cell.2022.06.036
  12.
    1. D’Haene E,
    2. Bar-Yaacov R,
    3. Bariah I,
    4. Vantomme L,
    5. Van Loo S,
    6. Cobos FA,
    7. Verboom K,
    8. Eshel R,
    9. Alatawna R,
    10. Menten B, et al.
    (2019) A neuronal enhancer network upstream of MEF2C is compromised in patients with Rett-like characteristics. Hum Mol Genet 28: 818–827. doi:10.1093/hmg/ddy393
    1. Danis D,
    2. Jacobsen JOB,
    3. Balachandran P,
    4. Zhu Q,
    5. Yilmaz F,
    6. Reese J,
    7. Haimel M,
    8. Lyon GJ,
    9. Helbig I,
    10. Mungall CJ, et al.
    (2022) SvAnna: Efficient and accurate pathogenicity prediction of coding and regulatory structural variants in long-read genome sequencing. Genome Med 14: 44. doi:10.1186/s13073-022-01046-6
    1. Ding Q,
    2. Somerville C,
    3. Manshaei R,
    4. Trost B,
    5. Reuter MS,
    6. Kalbfleisch K,
    7. Stanley K,
    8. Okello JBA,
    9. Hosseini SM,
    10. Liston E, et al.
    (2023) SCIP: Software for efficient clinical interpretation of copy number variants detected by whole-genome sequencing. Hum Genet 142: 201–216. doi:10.1007/s00439-022-02494-1
  13.
    1. Ellaway CJ,
    2. Ho G,
    3. Bettella E,
    4. Knapman A,
    5. Collins F,
    6. Hackett A,
    7. McKenzie F,
    8. Darmanian A,
    9. Peters GB,
    10. Fagan K, et al.
    (2013) 14q12 microdeletions excluding FOXG1 give rise to a congenital variant Rett syndrome-like phenotype. Eur J Hum Genet 21: 522–527. doi:10.1038/ejhg.2012.208
    1. Erikson GA,
    2. Deshpande N,
    3. Kesavan BG,
    4. Torkamani A
    (2015) SG-ADVISER CNV: Copy-number variant annotation and interpretation. Genet Med 17: 714–718. doi:10.1038/gim.2014.180
    1. Fan C,
    2. Wang Z,
    3. Sun Y,
    4. Sun J,
    5. Liu X,
    6. Kang L,
    7. Xu Y,
    8. Yang M,
    9. Dai W,
    10. Song L, et al.
    (2021) AutoCNV: A semiautomatic CNV interpretation system based on the 2019 ACMG/ClinGen technical standards for CNVs. BMC Genomics 22: 721. doi:10.1186/s12864-021-08011-4
    1. Fino J,
    2. Marques B,
    3. Dong Z,
    4. David D
    (2021) SVInterpreter: A comprehensive topologically associated domain-based clinical outcome prediction tool for balanced and unbalanced structural variants. Front Genet 12: 757170. doi:10.3389/fgene.2021.757170
  14.
    1. Firth HV,
    2. Richards SM,
    3. Bevan AP,
    4. Clayton S,
    5. Corpas M,
    6. Rajan D,
    7. Van Vooren S,
    8. Moreau Y,
    9. Pettett RM,
    10. Carter NP
    (2009) Decipher: Database of chromosomal imbalance and phenotype in humans using ensembl resources. Am J Hum Genet 84: 524–533. doi:10.1016/j.ajhg.2009.03.010
  15.
    1. Franke M,
    2. Ibrahim DM,
    3. Andrey G,
    4. Schwarzer W,
    5. Heinrich V,
    6. Schopflin R,
    7. Kraft K,
    8. Kempfer R,
    9. Jerkovic I,
    10. Chan WL, et al.
    (2016) Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 538: 265–269. doi:10.1038/nature19800
  16.
    1. Ganel L,
    2. Abel HJ, FinMetSeq Consortium,
    3. Hall IM
    (2017) SVScore: An impact prediction tool for structural variation. Bioinformatics 33: 1083–1085. doi:10.1093/bioinformatics/btw789
  17.
    1. Geoffroy V,
    2. Guignard T,
    3. Kress A,
    4. Gaillard JB,
    5. Solli-Nowlan T,
    6. Schalk A,
    7. Gatinois V,
    8. Dollfus H,
    9. Scheidecker S,
    10. Muller J
    (2021) AnnotSV and knotAnnotSV: A web server for human structural variations annotations, ranking and analysis. Nucleic Acids Res 49: W21–W28. doi:10.1093/nar/gkab402
    1. Geoffroy V,
    2. Herenger Y,
    3. Kress A,
    4. Stoetzel C,
    5. Piton A,
    6. Dollfus H,
    7. Muller J
    (2018) AnnotSV: An integrated tool for structural variations annotation. Bioinformatics 34: 3572–3574. doi:10.1093/bioinformatics/bty304
  18.
    1. Gordon CT,
    2. Attanasio C,
    3. Bhatia S,
    4. Benko S,
    5. Ansari M,
    6. Tan TY,
    7. Munnich A,
    8. Pennacchio LA,
    9. Abadie V,
    10. Temple IK, et al.
    (2014) Identification of novel craniofacial regulatory domains located far upstream of SOX9 and disrupted in Pierre Robin sequence. Hum Mutat 35: 1011–1020. doi:10.1002/humu.22606
  19.
    1. Gurbich TA,
    2. Ilinsky VV
    (2020) ClassifyCNV: A tool for clinical annotation of copy-number variants. Sci Rep 10: 20375. doi:10.1038/s41598-020-76425-3
  20.
    1. Hart T,
    2. Tong AHY,
    3. Chan K,
    4. Van Leeuwen J,
    5. Seetharaman A,
    6. Aregger M,
    7. Chandrashekhar M,
    8. Hustedt N,
    9. Seth S,
    10. Noonan A, et al.
    (2017) Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3 (Bethesda) 7: 2719–2727. doi:10.1534/g3.117.041277
  21.
    1. Hertzberg J,
    2. Mundlos S,
    3. Vingron M,
    4. Gallone G
    (2022) TADA-a machine learning tool for functional annotation-based prioritisation of pathogenic CNVs. Genome Biol 23: 67. doi:10.1186/s13059-022-02631-z
    1. Huynh L,
    2. Hormozdiari F
    (2019) TAD fusion score: Discovery and ranking the contribution of deletions to genome structure. Genome Biol 20: 60. doi:10.1186/s13059-019-1666-7
  22.
    1. Kleinert P,
    2. Kircher M
    (2022) A framework to score the effects of structural variants in health and disease. Genome Res 32: 766–777. doi:10.1101/gr.275995.121
  23.
    1. Klopocki E,
    2. Schulze H,
    3. Strauss G,
    4. Ott CE,
    5. Hall J,
    6. Trotier F,
    7. Fleischhauer S,
    8. Greenhalgh L,
    9. Newbury-Ecob RA,
    10. Neumann LM, et al.
    (2007) Complex inheritance pattern resembling autosomal recessive inheritance involving a microdeletion in thrombocytopenia-absent radius syndrome. Am J Hum Genet 80: 232–240. doi:10.1086/510919
  24.
    1. Kouwenhoven EN,
    2. van Heeringen SJ,
    3. Tena JJ,
    4. Oti M,
    5. Dutilh BE,
    6. Alonso ME,
    7. de la Calle-Mustienes E,
    8. Smeenk L,
    9. Rinne T,
    10. Parsaulian L, et al.
    (2010) Genome-wide profiling of p63 DNA-binding sites identifies an element that regulates gene expression during limb development in the 7q21 SHFM1 locus. PLoS Genet 6: e1001065. doi:10.1371/journal.pgen.1001065
    1. Kumar S,
    2. Harmanci A,
    3. Vytheeswaran J,
    4. Gerstein MB
    (2020) SVFX: A machine learning framework to quantify the pathogenicity of structural variants. Genome Biol 21: 274. doi:10.1186/s13059-020-02178-x
  25.
    1. Landrum MJ,
    2. Lee JM,
    3. Benson M,
    4. Brown G,
    5. Chao C,
    6. Chitipiralla S,
    7. Gu B,
    8. Hart J,
    9. Hoffman D,
    10. Hoover J, et al.
    (2016) ClinVar: Public archive of interpretations of clinically relevant variants. Nucleic Acids Res 44: D862–D868. doi:10.1093/nar/gkv1222
  26.
    1. Li Y,
    2. Roberts ND,
    3. Wala JA,
    4. Shapira O,
    5. Schumacher SE,
    6. Kumar K,
    7. Khurana E,
    8. Waszak S,
    9. Korbel JO,
    10. Haber JE, et al.
    (2020) Patterns of somatic structural variation in human cancer genomes. Nature 578: 112–121. doi:10.1038/s41586-019-1913-9
  27.
    1. Liu X,
    2. Xu W,
    3. Leng F,
    4. Zhang P,
    5. Guo R,
    6. Zhang Y,
    7. Hao C,
    8. Ni X,
    9. Li W
    (2023) NeuroCNVscore: A tissue-specific framework to prioritise the pathogenicity of CNVs in neurodevelopmental disorders. BMJ Paediatr Open 7: e001966. doi:10.1136/bmjpo-2023-001966
  28.
    1. Long HK,
    2. Osterwalder M,
    3. Welsh IC,
    4. Hansen K,
    5. Davies JOJ,
    6. Liu YE,
    7. Koska M,
    8. Adams AT,
    9. Aho R,
    10. Arora N, et al.
    (2020) Loss of extreme long-range enhancers in human neural crest drives a craniofacial disorder. Cell Stem Cell 27: 765–783.e14. doi:10.1016/j.stem.2020.09.001
  29.
    1. Lupianez DG,
    2. Kraft K,
    3. Heinrich V,
    4. Krawitz P,
    5. Brancati F,
    6. Klopocki E,
    7. Horn D,
    8. Kayserili H,
    9. Opitz JM,
    10. Laxova R, et al.
    (2015) Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 161: 1012–1025. doi:10.1016/j.cell.2015.04.004
  30.
    1. Lv K,
    2. Chen D,
    3. Xiong D,
    4. Tang H,
    5. Ou T,
    6. Kan L,
    7. Zhang X
    (2023) dbCNV: deleteriousness-based model to predict pathogenicity of copy number variations. BMC Genomics 24: 131. doi:10.1186/s12864-023-09225-4
  31.
    1. MacDonald JR,
    2. Ziman R,
    3. Yuen RK,
    4. Feuk L,
    5. Scherer SW
    (2014) The database of genomic variants: A curated collection of structural variation in the human genome. Nucleic Acids Res 42: D986–D992. doi:10.1093/nar/gkt958
    1. Macnee M,
    2. Pérez-Palma E,
    3. Brünger T,
    4. Klöckner C,
    5. Platzer K,
    6. Stefanski A,
    7. Montanucci L,
    8. Bayat A,
    9. Radtke M,
    10. Collins RL, et al.
    (2023) CNV-ClinViewer: Enhancing the clinical interpretation of large copy-number variants online. Bioinformatics 39: btad290. doi:10.1093/bioinformatics/btad290
  32.
    1. Martyn GE,
    2. Wienert B,
    3. Kurita R,
    4. Nakamura Y,
    5. Quinlan KGR,
    6. Crossley M
    (2019) A natural regulatory mutation in the proximal promoter elevates fetal globin expression by creating a de novo GATA1 site. Blood 133: 852–856. doi:10.1182/blood-2018-07-863951
    1. McLaren W,
    2. Gil L,
    3. Hunt SE,
    4. Riat HS,
    5. Ritchie GRS,
    6. Thormann A,
    7. Flicek P,
    8. Cunningham F
    (2016) The ensembl variant effect predictor. Genome Biol 17: 122. doi:10.1186/s13059-016-0974-4
    1. Minoche AE,
    2. Lundie B,
    3. Peters GB,
    4. Ohnesorg T,
    5. Pinese M,
    6. Thomas DM,
    7. Zankl A,
    8. Roscioli T,
    9. Schonrock N,
    10. Kummerfeld S, et al.
    (2021) ClinSV: Clinical grade structural and copy number variant detection from whole genome sequencing data. Genome Med 13: 32. doi:10.1186/s13073-021-00841-x
  33.
    1. Motenko H,
    2. Neuhauser SB,
    3. O’Keefe M,
    4. Richardson JE
    (2015) MouseMine: A new data warehouse for MGI. Mamm Genome 26: 325–330. doi:10.1007/s00335-015-9573-z
    1. Nieboer MM,
    2. de Ridder J
    (2020) svMIL: predicting the pathogenic effect of TAD boundary-disrupting somatic structural variants through multiple instance learning. Bioinformatics 36: i692–i699. doi:10.1093/bioinformatics/btaa802
  34.
    1. Pagnamenta AT,
    2. Camps C,
    3. Giacopuzzi E,
    4. Taylor JM,
    5. Hashim M,
    6. Calpena E,
    7. Kaisaki PJ,
    8. Hashimoto A,
    9. Yu J,
    10. Sanders E, et al.
    (2023) Structural and non-coding variants increase the diagnostic yield of clinical whole genome sequencing for rare diseases. Genome Med 15: 94. doi:10.1186/s13073-023-01240-0
  35.
    1. Poszewiecka B,
    2. Pienkowski VM,
    3. Nowosad K,
    4. Robin JD,
    5. Gogolewski K,
    6. Gambin A
    (2022) TADeus2: a web server facilitating the clinical diagnosis by pathogenicity assessment of structural variations disarranging 3D chromatin structure. Nucleic Acids Res 50: W744–W752. doi:10.1093/nar/gkac318
    1. Requena F,
    2. Abdallah HH,
    3. García A,
    4. Nitschké P,
    5. Romana S,
    6. Malan V,
    7. Rausell A
    (2021) CNVxplorer: A web tool to assist clinical interpretation of CNVs in rare disease patients. Nucleic Acids Res 49: W93–W103. doi:10.1093/nar/gkab347
  36.
    1. Richards S,
    2. Aziz N,
    3. Bale S,
    4. Bick D,
    5. Das S,
    6. Gastier-Foster J,
    7. Grody WW,
    8. Hegde M,
    9. Lyon E,
    10. Spector E, et al.
    (2015) Standards and guidelines for the interpretation of sequence variants: A joint consensus recommendation of the American College of medical genetics and genomics and the association for molecular pathology. Genet Med 17: 405–424. doi:10.1038/gim.2015.30
  37.
    1. Sanchez-Gaya V,
    2. Rada-Iglesias A
    (2023) Postre: A tool to predict the pathological effects of human structural variants. Nucleic Acids Res 51: e54. doi:10.1093/nar/gkad225
  38.
    1. Scott AJ,
    2. Chiang C,
    3. Hall IM
    (2021) Structural variants are a major source of gene expression differences in humans and often affect multiple nearby genes. Genome Res 31: 2249–2257. doi:10.1101/gr.275488.121
  39.
    1. Sharo AG,
    2. Hu Z,
    3. Sunyaev SR,
    4. Brenner SE
    (2022) StrVCTVRE: A supervised learning method to predict the pathogenicity of human genome structural variants. Am J Hum Genet 109: 195–209. doi:10.1016/j.ajhg.2021.12.007
  40.
    1. Sondka Z,
    2. Dhir NB,
    3. Carvalho-Silva D,
    4. Jupe S,
    5. Madhumita,
    6. McLaren K,
    7. Starkey M,
    8. Ward S,
    9. Wilding J,
    10. Ahmed M, et al.
    (2024) COSMIC: A curated database of somatic variants and clinical data for cancer. Nucleic Acids Res 52: D1210–D1217. doi:10.1093/nar/gkad986
    1. Spector JD,
    2. Wiita AP
    (2019) ClinTAD: A tool for copy number variant interpretation in the context of topologically associated domains. J Hum Genet 64: 437–443. doi:10.1038/s10038-019-0573-9
  41.
    1. Tayebi N,
    2. Jamsheer A,
    3. Flottmann R,
    4. Sowinska-Seidler A,
    5. Doelken SC,
    6. Oehl-Jaschkowitz B,
    7. Hulsemann W,
    8. Habenicht R,
    9. Klopocki E,
    10. Mundlos S, et al.
    (2014) Deletions of exons with regulatory activity at the DYNC1I1 locus are associated with split-hand/split-foot malformation: Array CGH screening of 134 unrelated families. Orphanet J rare Dis 9: 108. doi:10.1186/s13023-014-0108-6
  42.
    1. Turner TN,
    2. Hormozdiari F,
    3. Duyzend MH,
    4. McClymont SA,
    5. Hook PW,
    6. Iossifov I,
    7. Raja A,
    8. Baker C,
    9. Hoekzema K,
    10. Stessman HA, et al.
    (2016) Genome sequencing of autism-affected families reveals disruption of putative noncoding regulatory DNA. Am J Hum Genet 98: 58–74. doi:10.1016/j.ajhg.2015.11.023
  43.
    1. Wang Z,
    2. Zhao G,
    3. Li B,
    4. Fang Z,
    5. Chen Q,
    6. Wang X,
    7. Luo T,
    8. Wang Y,
    9. Zhou Q,
    10. Li K, et al.
    (2023) Performance comparison of computational methods for the prediction of the function and pathogenicity of non-coding variants. Genomics Proteomics Bioinformatics 21: 649–661. doi:10.1016/j.gpb.2022.02.002
  44.
    1. Whiffin N,
    2. Karczewski KJ,
    3. Zhang X,
    4. Chothani S,
    5. Smith MJ,
    6. Evans DG,
    7. Roberts AM,
    8. Quaife NM,
    9. Schafer S,
    10. Rackham O, et al.
    (2020) Characterising the loss-of-function impact of 5’ untranslated region variants in 15,708 individuals. Nat Commun 11: 2523. doi:10.1038/s41467-019-10717-9
  45.
    1. Xu Z,
    2. Li Q,
    3. Marchionni L,
    4. Wang K
    (2023) PhenoSV: Interpretable phenotype-aware model for the prioritization of genes affected by structural variants. Nat Commun 14: 7805. doi:10.1038/s41467-023-43651-y
    1. Yang Y,
    2. Wang X,
    3. Zhou D,
    4. Wei DQ,
    5. Peng S
    (2022) SVPath: An accurate pipeline for predicting the pathogenicity of human exon structural variants. Brief Bioinform 23: bbac014. doi:10.1093/bib/bbac014
  46.
    1. Zhang L,
    2. Shi J,
    3. Ouyang J,
    4. Zhang R,
    5. Tao Y,
    6. Yuan D,
    7. Lv C,
    8. Wang R,
    9. Ning B,
    10. Roberts R, et al.
    (2021) X-CNV: Genome-wide prediction of the pathogenicity of copy number variations. Genome Med 13: 132. doi:10.1186/s13073-021-00945-4
  47.
    1. Zhang Y,
    2. Li Y,
    3. Guo R,
    4. Xu W,
    5. Liu X,
    6. Zhao C,
    7. Guo Q,
    8. Xu W,
    9. Ni X,
    10. Hao C, et al.
    (2023) Genetic diagnostic yields of 354 Chinese ASD children with rare mutations by a pipeline of genomic tests. Front Genet 14: 1108440. doi:10.3389/fgene.2023.1108440
Systematic assessment of structural variant annotation tools for genomic interpretation
Benchmarking SV prioritization tools
Xuanshi Liu, Lei Gu, Chanjuan Hao, Wenjian Xu, Fei Leng, Peng Zhang, Wei Li
Life Science Alliance Dec 2024, 8 (3) e202402949; DOI: 10.26508/lsa.202402949

Volume 8, No. 3
March 2025
ISSN: 2575-1077
© 2025 Life Science Alliance LLC

Life Science Alliance is registered as a trademark in the U.S. Patent and Trade Mark Office and in the European Union Intellectual Property Office.