A comparative study of structural variant calling in WGS from Alzheimer’s disease families

We developed a flexible protocol to generate a high-quality deletion call set and a truth set of Sanger sequencing–validated deletions with precise breakpoints between 1 and 17,000 bp from whole-genome sequencing data in multiplex families with Alzheimer’s disease.

A variety of SV/indel types and sizes can be detected using highthroughput short-read whole-genome sequencing (WGS).Multiple large-scale SV detection studies have been performed such as the 1000 Genomes Project (Sudmant et al, 2015), the Cancer Genome Atlas Project (Cancer Genome Atlas Research et al, 2013;Fredriksson et al, 2014), Genome of the Netherlands (Genome of the Netherlands Consortium, 2014), the UK 10K Project ( UK10K Consortium et al, 2015), gnomAD (Collins et al, 2020), and CCDG (Abel et al, 2020).However, SV/indel calling using short-read sequence data continues to be challenging.Multiple algorithms and programs (e.g., Breakdancer, CNVnator, DELLY, Genome Analysis Toolkit [GATK: 3.2] Haplotype Caller, Lumpy, Pindel, Scalpel, and SWAN) (Chen et al, 2009;Ye et al, 2009;McKenna et al, 2010;Abyzov et al, 2011;Rausch et al, 2012;Layer et al, 2014;Narzisi et al, 2014;Xia et al, 2016) are available, but many factors continue to hinder accurate and comprehensive identification of SV/indels in sequence data.These confounding factors include complex sequence structure, variability in read depth and coverage across the genome, sequencing bias and artifacts, biological contamination, and mapping and alignment errors or artifacts.Also, computational demands can limit the use of some SV/indel calling programs.Furthermore, SV/indel calling in large samples lacks standards for calling procedures, call set merging, and quality control (QC).These challenges become even more daunting when merging SV/indel calls from samples sequenced at multiple centers that use different sequencing library designs and protocols.The quality and characteristics of sequence data may vary considerably among samples within and across centers and can affect SV/indel calling sensitivity and specificity (Guo et al, 2014).We present results from analyses of WGS data generated by the Alzheimer's Disease Sequencing Project (ADSP) for 578 members of 111 families.We focused on deletions because deletion detection is more reliable than other SV types.In addition, we spiked in structural variants in existing sequences for benchmarking and evaluating the performance of several of these tools and pipelines.Spiking in known structural variants allows for a systematic evaluation of a tool's sensitivity and specificity.Spiking in variants of different sizes and complexities allowed us to benchmark tools across a spectrum of genomic alterations, providing a more comprehensive evaluation.In addition, it also facilitated the evaluation of a tool's ability to distinguish true variants from background noise.
Our work showed that no single caller can accurately detect a broad range of deletion sizes.We developed two systematic approaches for evaluating the sensitivity and specificity of different callers and deletions identified from data generated by different platforms and sequencing centers.Finally, we validated a comprehensive strategy for calling, merging, QC, and genotyping deletions that had high sensitivity and minimized false-positive calls.

Results
We generated deletion calls for the ADSP discovery phase WGS using eight different programs (GATK Haplotype Caller, Scalpel, Breakdancer, CNVnator, Lumpy, Pindel, Swan, and DELLY) (Methods: Deletion variant calling protocol, Fig 1).These programs use different sequence features and analyze different event sizes (Tables 1, 2, and S1).To determine the properties of the data generated by each program, we systematically evaluated sensitivity and specificity.Because the sequence data were generated at three different Large-Scale Sequencing and Analysis Center (LSAC) sites using libraries with different characteristics (Methods: Subjects and generation of WGS data), we evaluated data from each site.We also benchmarked the computational resources needed for each program.

Sensitivity analysis
Sensitivity was evaluated by inserting deletions and insertions into WGS data generated at each LSAC (Methods: Sensitivity analysis using simulated spike-in data).Sensitivity for detecting the inserted deletions varied among callers and, to a lesser extent, the source of the sequence data, and was dependent on the size of the deletion (Fig 2).For short deletions (30-500 bp), Scalpel showed the best sensitivity (~85%) and was closely followed by Pindel.Pindel showed good sensitivity up to 1,000 bp.GATK Haplotype Caller showed a sensitivity of ~75% for events up to 100 bp but fell off rapidly above this size range.For larger events, Lumpy and SWAN both showed good performance up to 5,000 bp.DELLY showed reasonable sensitivity in the 500-to 5,000-bp range, but when compared to other programs, its results were more influenced by the source of the data.For example, DELLY had lower sensitivity when calling genomes sequenced by BCM in the 200-to 500-bp bin as compared to those from WashU and BI.Overall, SWAN was the most sensitive caller across all sizes and sequencing centers, perhaps because it accounts for various sequencing characteristics such as multiple insert-size libraries and soft-clipped reads (Xia et al, 2016;Zhang et al, 2016).CNVnator and Breakdancer showed worrisome sensitivity for all size ranges.Our results show that sensitivity varies considerably among callers and for different size ranges but is relatively insensitive to the sequencing site.

Specificity analysis
We assessed caller specificity using the D-score method (Methods: D-score: a metric for evaluating SV/indel caller specificity in family studies).LUMPY was the best-performing program with D-scores between 5 and 10 for deletions from 30 to 10,000 bp (Fig 3).The results were independent of the sequencing center.Scalpel also yielded highly specific calls, particularly in the 200-to 1,000-bp range with D-scores ranging from 5 to 8. The median D-scores for deletion calls from SWAN, Pindel, and Breakdancer were between 3 and 5, but the results were dependent on the sequencing center.
Other programs yielded calls with lower specificity that were greatly influenced by the sequence source.We also applied the kinship coefficient to evaluate and calibrate the quality of deletion calls and measure the impact of QC steps on call specificity (Fig 4A).Before data cleaning, the kinship coefficient was much greater than the expected value of 0.25 in siblings for events ranging from 21 to 350 bp, suggesting that that Scalpel is overcalling variants in this size range.False positives often occur because of mapping issues with the allele frequencies approaching 50%.This results in higherthan-expected heterozygous genotypes under the Hardy-Weinberg equilibrium.After removing deletions showing excess heterozygosity, the kinship coefficient of the Scalpel genotypes approached 0.25 for all deletion sizes (Fig 4A).Comparison of kinship coefficient metrics also showed that the quality of GATK Haplotype calls decreased as the deletion size increased and the coefficient was 0 for deletions ≥50 bp (Fig 4B).In contrast, a kinship coefficient of 0.25 was maintained for Scalpel calls for deletion sizes between 20 and 400 bp, showing that the Scalpel calls are more reliable in this size range.This work shows that the specificity of calls from different programs varies depending on the size of the event detected and can be influenced by the source of the sequence data.

Assessment of SV/indel caller computational requirements
We measured computational performance metrics for seven of the eight callers used in this study (Methods: Computational performance of SV/indel callers, Fig 5, Table S1).Scalpel was excluded from performance benchmarking because of its extreme central processing unit (CPU) demands and total runtime.To generate these benchmark metrics, we processed 10 BAM files (mean size of 209.05 MB) from the ADSP's discovery (disc) phase and 10 BAM files (mean size of 54.58 MB) from the discovery extension (disc + ext) phase.Among the tested callers, SWAN had the highest memory demands and required more than 10 times greater runtime compared with other programs.Breakdancer was the second longest running SV caller evaluated.DELLY, Lumpy, GATK, and SWAN all had similar CPU demands.Although Scalpel and SWAN ranked high in terms of sensitivity and specificity, the runtime computational requirements preclude the use of these programs on large datasets.

Generating an ADSP deletion call set
All 584 samples were called in parallel via two independent production pipelines, Scalpel + GenomeSTRiP and the Parliament Column 1 provides the caller evaluated.The second column provides the software version used for each caller."Sequence Feature" provides the method used to determine events such as read-pair (RP), split-read (SR), read-depth (RD), soft-clip (SC), and local assembly (AS).Columns 5 and 6 provide whether each caller supplies precise genotypes and breakpoints.
Toolkit (Fig 1).Given that the sensitivity and quality of the GATK Haplotype Caller dropped off significantly with deletions of size greater than 20 bp, the pipelines focused on deletions greater than that size range.Localized assembly and breakpoint refinement on gapped alignments were performed with Scalpel to increase the calling accuracy of deletions as large as ~900 bp.Of the 123,581 deletions detected by Scalpel and genotyped with GenomeSTRiP, 100,678 sites remained after the removal of deletions with excess heterozygotes (N = 17,286), homozygous reference (N = 5,014), and call rates less than 90% (N = 603).The number of deletions called dropped off exponentially as the deletion size increased except for a spike in the number of deletions related to Alu retrotransposons (Fig 6).The frequency of these events peaked around 350 bp, which corresponds to the lengths of most Alu transposons, and this size distribution is expected and consistent with that observed in other studies (Collins et al, 2020).The Parliament pipeline genotyped more than 14 million SVs from the eight callers listed in Table 2.The mean number of calls per program was slightly greater than 1.8 million.Because of computational requirements, the sites genotyped were limited to those greater than or equal to 100 bp.A total of 32,122 calls remained post-QC, and the size distribution of these calls shows the Alu peak at ~350 bp (Fig 7A and B).The distribution of functional annotations of these variants is shown in Table 3.A comparison of the deletion calls generated by the two pipelines in the size ranges that overlapped (100-900 bp) identified 3,401 deletions (mean size = 330 bp, range 207-620 bp) that shared a base location for at least one breakpoint (Fig S1) in the size bin with deletions common to both callers.

Laboratory validation of deletion calls
To validate deletion calls, we performed Sanger sequencing on putative deletions (Methods: Laboratory validation of deletion calls).We sequenced 106 deletions called by Scalpel ranging in size Sensitivity rates were derived for all eight callers using the in silico variant spike-in on nine sample replicates.Sensitivity is provided for all three centers (Baylor, Broad, and WashU).Biological replicates are three individuals in one family that were sequenced at the three centers.Sensitivity rates are provided across a large range of event sizes (30 bp-10 kb).Sensitivity rates are largely consistent across centers.
from 2 to 900 bp (Tables 4 and S3).When smaller deletions were randomly selected, 87.5% of events between 2 and 100 bp were validated by Sanger sequencing (100% of the events under 20 bp and 80% of events between 80 and 100 bp were confirmed by Sanger sequencing).For loss-of-function deletions and those near AD genes (±500 kb, Table S2) in this size range, slightly higher validation rates were observed (average 93% and 95%, respectively).For randomly selected large events (between 101 and 900 bp), the validation rate fell to 17%.This size range includes several transposable elements (e.g., Alu) in the genome that are susceptible to higher false-positive rates in SV calling across most calling algorithms.Although Scalpel detected deletions up to 900 bp, we found the spiked-in sensitivity drops off at 500 bp (Fig 2) as Scalpel's window size was set to 600 bp.For many of the nonvalidated deletions (Table 4), the sequencing did find an alternate SV that was not a deletion (e.g., repetitive low complexity regions [LCR], ALU insertion).However, when large SV/indel calls were prescreened to remove deletion sequences found at multiple regions of the genome, the validation rate increased to 50%.Deletions near AD genes and LOF variants had a higher validation rate (83% and 75%, respectively).For Parliament pipeline calls, 20% of randomly selected deletions in the 101-to 900-size range could be validated.For Parliament calls near AD genes and LOF variants, calls were validated at a higher rate (33% and 83%, respectively).For larger SV calls, the validation rate ranged from 73 to 83%.When we examined calls made by both Parliament and Scalpel, all deletions tested (n = 11) could be validated.The mean D-score for validated deletions (8.12, sd = 10.98,n = 114) was significantly greater than the mean for deletions that were not validated (2.52, sd = 4.98, n = 19, P = 0.0075).This Sanger sequencing validation of deletions demonstrates that the variants called by Scalpel, particularly within the 2-to 100-bp size range, are highly reliable and are suitable for genetic association studies.Specificity rates are provided for all eight callers from 30 bp to 10 kbp using our Dscore method.D-scores were calculated for each of the three sequencing centers (Baylor, Broad, and WashU).D-scores are quite consistent across centers.

Deletions within/near AD genes
To detect possible AD-associated pathogenic variants, we looked for deletions in a ±500-kb window bracketing candidate AD genes, focusing on deletions in gene functional units (coding regions, 59 and 39UTRs, promoters, and splice junctions).This window was selected to capture genes regulated by cis-acting elements impacted by peak GWAS variants that influence the expression of causal AD genes.We identified deletions in the vicinity of 24 AD candidate genes (Table S4) that could be validated by Sanger sequencing.One pathogenic deletion identified using Scalpel was a 44-bp deletion in exon 14 of ABCA7 (rs142076058, p.Arg578 fs).Subsequent work in a larger sample showed that the deletion was associated with AD in African American populations (Cukier et al, 2016).For the remaining confirmed SVs, we tested the segregation of the SVs in the families by requiring that at least 75% of the patients with LOAD and WGS data in the families were carriers.We found segregation in six SVs in at least one family near IQCK, FBXL7, INPP5D, SPDYE3, and SERPINB1 (Table S5).A 21-bp coding deletion was identified (rs527464858) in GIGYF2, a gene that encodes GRB10interacting GYF protein 2. This protein regulates tyrosine kinase receptor signaling.The GIGYF2 deletion is in an imperfect "CAG" repeat sequence and is ~270 kb from rs10933431, the top SNV for INPP5D (P = 3.4 × 10 −9 , OR = 0.91, CI: 0.88-0.97)(Kunkle et al, 2019).This deletion was observed in 46 cases and 3 controls in both NHW and CH populations.Note that in our study, there were more cases (n = 498) than controls (n = 86) and some of these subjects are related (n = 111 families).This deletion was observed in 10 CH and 13 NHW families.Co-segregation showed that the variant segregated with the AD status in three NHW families and one Hispanic family.A number of studies have found variants in GIGYF2 potentially associated with an autosomal dominant form of Parkinson's disease (Lautier et al, 2008;Ruiz-Martinez et al, 2015;Cristina et al, 2020), particularly in European populations but not in Asian cohorts (Lautier et al, 2008;Ruiz-Martinez et al, 2015;Zhang et al, 2015;Cristina et al, 2020).Although several SNVs in GIGYF2 may be associated with PD, most studies have not confirmed an association between this gene and PD (Ruiz-Martinez et al, 2015), and a large meta-analysis did not find that PD was associated with the poly-Q region deletion described here (Zhang et al, 2015).

Discussion
We developed novel approaches for detecting deletions and evaluating sensitivity, specificity, and validity.These methods were applied to WGS data obtained from 578 participants of the ADSP.We evaluated eight SV/indel callers on data generated at three sequencing centers, each of which generated sequence libraries using different protocols.Although sequencing library heterogeneity did not appreciably influence results obtained with most programs, call validity (deletion detection by an orthogonal method) varied by size and calling program.Our results revealed that no single calling program could reliably and accurately detect deletions in all size ranges.Ultimately, we effectively detected and genotyped deletions in the WGS dataset using a combination of SV and indel callers, applying several QC filters, and validating calls by Sanger sequencing.We evaluated the sensitivity of multiple SV/indel callers by in silico insertion of deletions and insertions into ADSP biological replicate sequence data.This simulation exercise suggests that Scalpel has the highest sensitivity for deletions in the 30-to 500-bp range.Scalpel's sensitivity performance was closely matched by Pindel and the GATK Haplotype Caller, but the latter only for smaller events.Also, the specificity of calls made by Pindel was much less than Scalpel because of the excess number of events called by this program.We measured the specificity of the SV/indel programs using the D-score, a measure that compares deletion sharing between related and unrelated individuals, and the kinship coefficient, which allows a comparison of the observed number of deletion calls with the number of expected calls among individuals with a defined degree of relationship.Scalpel and Lumpy showed the best specificity across a broad size range from 30 to 1,000 bp and were relatively insensitive to sequence library differences.In contrast, the output from other callers was more sensitive to the source of the sequence data.
We developed a comprehensive pipeline for calling, merging, QC, genotyping, and the breakpoint refinement of deletions using Scalpel and GenomeSTRiP (Fig 1).As expected, the most common deletions were small and we observed an excess of deletions of ~350 bp in length, many of which are likely Alu repeat sequences (Figs 5 and 6).For the size bin of 20-100 bp, Sanger sequencing validated more than 87.5% of randomly selected deletions and 90.1% of all deletions (random, near AD genes, LOF variants) (Table 4).This size bracket included 82,180 deletions and accounted for 88.7% of all deletions detected by Scalpel (n = 92,659 total deletions, Table S6).In addition, the Scalpel dataset had a kinship coefficient near the expected value of 0.25 for siblings after the removal of sites with excess heterozygosity.
Our study has several noteworthy strengths.First, we developed a method for evaluating deletion specificity in family-based studies (Dscore).This allowed us to directly compare different methods of deletion calling directly using study sequence data.Also, the D-score can be used to prioritize SVs for targeted validation.Second, we used the kinship coefficient metric as a method to measure the overall quality of the call set genotypes and evaluate quality control measures applied to family-based data.Third, we generated spikedin datasets that allowed for the evaluation of sensitivity in the  Breakdown of genomic functional annotation terms provided by SnpEff.There are slightly more annotation terms than loci as some loci overlap more than one region.
sequence data used in this study.Fourth, an orthogonal method (Sanger sequencing) was used to validate candidate deletions and to identify the characteristics of true calls.Fifth, the high-quality deletion calls from Scalpel, particularly those under 100 bp, can be used as a gold standard for comparison with calls from other programs that are computationally less intensive.Sixth, we cataloged deletion sites with precise breakpoints that can be directly genotyped in WGS CRAMs using other genotyping tools such as Graphtyper and Paragraph.Last, we detected a deletion in ABCA7 that was subsequently shown to be pathogenic.This illustrates the validity of our approach to identifying AD-related deletions.
Our conclusions and recommendations for deletion calling have some limitations.Although the D-score and kinship coefficient are useful specificity measures, they require family-based data.Also, because the D-score method relies on a comparison of the deletion frequency in the general population (i.e., unrelated individuals) versus related individuals, it does not perform well for deletions that are very rare (less than 20 instances in a dataset) or very common with allele frequencies approaching 50%.A minimum of two SVs are needed to compute a D-score.In both cases, the resulting D-score will be close to zero.Second, computational requirements need to be considered.Scalpel, while yielding highquality calls, is not practical when applied to WGS datasets containing more than a few thousand subjects because this program is computationally intensive.However, the Scalpel calls generated here can be used as a benchmark for evaluating the sensitivity and specificity of other programs such as more recent versions of GATK Haplotype Caller (unpublished data).The utility of callers with longer runtimes can be improved by splitting larger chromosomes and processing them in parallel.However, the cost of using some a "Sequenced" are deletions where PCR products were produced that could be sequenced.
b "Validated" is the number of deletions where Sanger sequencing yielded the predicted deletion.
c "Failed" is the number of confirmed false-positive calls.
d "Alternate Events" indicates that a deletion other than the predicted event was observed.
e "No PCR Product" is the number of events that could not be amplified and thus could not be tested. f The AD gene list used is in Table S2.Deletions tested were within ±500-kb bp of the target gene.
g "Cleaned" indicates that the BLAT was used to exclude events that mapped to multiple places in the genome.
h "In common" was randomly selected from a list of identical deletions called by both the Scalpel and Parliament pipelines.
i "Totals"-for total Scalpel and Parliament, calls "in common" events were included in the final total for each pipeline.
programs such as Scalpel and SWAN may be prohibitive when applied to datasets much larger than the one used in this study.
Another limitation of our study is that for associations of deletions with AD, our study is underpowered.Thus, we can nominate deletions as candidate pathogenic variants (e.g., Table S4) but will need larger follow-up studies to confirm true associations (e.g., the ABCA7 deletion).Finally, we only evaluated deletions in this study because of the poor performance of the callers used to detecting insertions and other types of events.Future studies will use other programs that better detect insertions, rearrangements, and copynumber changes.
Findings from this study have multiple, important implications.Small deletions represent a substantial portion of genetic variation (1000Genomes Project Consortium et al, 2015;Collins et al, 2020).Larger deletions are rarer and account for a small fraction of total genetic variability but are more likely to be deleterious because they may alter large portions of one or more genes.Given the challenges of accurate SV/indel detection and genotyping, SV/indels larger than a few base pairs are typically not included in genetic association studies.Accurately called and genotyped indels/SVs can increase the scope of both hypothesis-driven and genome-wide association studies.Moreover, similar to SNVs, SV/indels in the context of a large WGS or WES dataset can be imputed reliably into GWAS datasets derived from SNP arrays.Studies of SV/indels in the future will likely increase and improve our understanding of the genetic architecture of many diseases as more reliable and efficient calling algorithms are developed and validated.

Materials and Methods
Subjects and generation of WGS data WGS data were obtained from the ADSP, a collaboration between the National Institute on Aging (NIA), the National Human Genome Research Institute (NHGRI), and the Alzheimer's disease research community (Beecham et al, 2017).Details of subject selection and WGS data generation and processing are described elsewhere (Beecham et al, 2017;Leung et al, 2019).In brief, the sample included 498 AD cases and 84 cognitively normal elderly controls from 44 non-Hispanic Caucasian and 67 Caribbean Hispanic families.All studies involved were approved by their respective University Institutional Review Boards (IRBs), and the overall study was approved by the University of Pennsylvania IRB.WGS data were generated using Illumina's 2500 HiSeq platform by the NHGRI's LSACs at the Baylor College of Medicine (BCM), the Broad Institute (BI), and the McDonnell Genome Institute at Washington University (WashU).BCM provided 166 samples with a mean template size of 370 bp (SD = 12.4 bp).For the BI, 232 samples were sequenced with a mean template size of 335 bp (SD = 1.4 bp).WU provided 186 samples with three library preparations targeted at insert sizes of 200, 400, and 550 bp.These three library sizes were chosen to increase SV calling accuracy by incorporating longer reads; however, there was considerable size heterogeneity in the 550-bp read group.Three samples from one family were sequenced at all three LSACs as triplicates for evaluating and adjusting for center-specific sequencing effects.

Deletion variant calling protocol
Two complementary pipelines for deletion calling, merging, genotyping, and reassembly were implemented (Fig 1).In one approach, each genome was divided into 47 regions (two regions per autosome and one each for X, Y, and mitochondrial chromosomes) excluding telomeres and centromeres and called in parallel using Scalpel (Narzisi et al, 2014) to reduce processing time across the entire genome.Scalpel reassembles gapped alignments using the de Bruijn graph method to increase calling specificity in regions characterized by complex repeat structures.Scalpel was also used to generate precise breakpoints via local assembly within a 1,000bp capture window for the whole genome.GenomeSTRiP (McCarroll et al, 2006) was used to perform joint genotyping and provide missing genotype information to further refine calls.The second deletion calling pipeline was based on Parliament (English et al, 2015), which created a unified project-level variant call file by combining and filtering calls based on consensus and quality metrics from eight indel/SV callers including Scalpel (Table 1).Parliament also provided gene annotation, genotyping, and local hybrid assembly.Because Parliament's breakpoint detection process is computationally intensive, we limited the analysis to deletions >100 bp.The functional annotation of each variant was determined using SnpEff (Cingolani et al, 2012).

Sensitivity analysis using simulated spike-in data
We evaluated sensitivity by "spiking-in" SV/indels using BAMSurgeon (Ewing et al, 2015) into triplicated samples (three samples sequenced at all three LSACs).First, we generated a list of predefined SV/indels, including 4,040 deletions and insertions, and 1,560 inversions and tandem duplications, totaling 11,200 events.SV/indels ranged in size from 2 to 5,000 bp and were spiked into all autosomes for the three sample replicates (nine files in total).Half of the spike-in events were inserted as heterozygotes and half as homozygotes.BAMSurgeon failed to add in a small fraction (2.92%) of the attempted spike-in events, and those sites were excluded from sensitivity analysis.For sites where BAMSurgeon succeeded, there were minor discrepancies in the exact breakpoints of the actual spike-in location as compared to its targeted location.These minor breakpoint discrepancies did not affect the results because we applied a 50% reciprocal overlap for detecting spiked-in events.Finally, SV/indels were called for the nine spiked-in samples to measure the sensitivity of each caller across the full range of sizes.Because the true events were known or spiked-in, the sensitivity (Equation (1)) of each SV/indel caller was estimated as follows: D-score: a metric for evaluating SV/indel caller specificity in family studies To ascertain the specificity of deletion calls, we developed the following family-based metric called the deletion or D-score.It represents the log-likelihood ratio of the probability of sharing a variant by siblings assuming that (a) the variant is true and (b) the variant is a false call.The variant sharing probabilities among siblings depend on the caller sensitivity.
f = overall call rate (proportion of samples where a call is made).f sib = P 1 ðcall jcall in sibÞ, for a true variant.
For reproducible false calls, E 0 ½f observed sib = f = population call rate.For true variants, E 1 ½f observed sib = f sib .β HET , β HOM = the sensitivity of the caller for heterozygous and homozygous variants, specific to each caller and each library design for the sequencing sites.Caller sensitivity vectors were calculated from the spike-in study results.
Thus, given β HET and β HOM (which can be estimated from the spike-in data), we can compute f sib .
True variants have higher sharing across sibs.
However, false calls are random across samples.
For true calls, f (unrelated sharing frequency) is smaller than f sib (sharing frequency among siblings), resulting in larger positive values of the D-score.The D-score metric does not require genotype information and therefore can be used to evaluate caller specificity in the absence of genotyped calls.

Kinship coefficient
To assess overall call set quality, a kinship coefficient was calculated using KING (Manichaikul et al, 2010) for all sibling pairs with genotype information of SVs/indels.Because a kinship coefficient of 0.25 is expected for the pooled set of heterozygous jointgenotyped calls, departure from this value indicates systematic errors in SV/indel calling.Because multigenerational data are usually not available in family studies of AD, the kinship coefficient has greater utility than a check for Mendelian inconsistencies and is useful for measuring the overall quality of the genotypes.

Quality control
Many false positives are the result of poor mapping quality between two or more sites and are characterized by excess heterozygosity.Therefore, a Hardy-Weinberg equilibrium P-value threshold of 5 × 10 −8 was applied to filter calls with excess heterozygosity.The BLAST-like Alignment Tool (BLAT) (Kent, 2002) was used to filter deletions with a low predicted mapping quality or that map to many sites (N > 100) in the genome.Finally, deletions with an alternate allele count of less than five were removed from the final call set.
Parliament's consensus and QC strategy proved to be useful in improving call quality by combining call set metrics and applying heuristics to reduce false positives.

Computational performance of SV/indel callers
Computational performance benchmarks were obtained for the eight SV/indel programs based on the analysis of 20 randomly selected subjects.Performance benchmarks were derived using automated scripts and included total runtime, peak CPU usage, peak memory usage, and processing core hours.All data were processed using an © Amazon's Elastic Cloud 2 (EC2) extra-large instance with © Intel © Xeon 2.4 GHz CPUs.Scalpel benchmarking results were excluded from this analysis because of its extreme computational demands for processing WGS data.

Laboratory validation of deletion calls
Subsets of Scalpel-and Parliament-derived deletions of different sizes were selected for validation based on three methods: randomly selected events within specified size bins, predicted LOF, and proximity to 74 candidate AD loci with strong genome-wide association signals.These candidate AD loci were curated from GWAS, candidate gene studies, and multiple family-based studies (Goate et al, 1991;Corder et al, 1993;Levy-Lahad et al, 1995;Sherrington et al, 1995;Patel & David, 1997;Lambert et al, 2013aLambert et al, , 2013b;;Cruchaga et al, 2013;Beecham et al, 2014;Escott-Price et al, 2014;Jun et al, 2014Jun et al, , 2016Jun et al, , 2017;;Logue et al, 2014;Ruiz et al, 2014;Wetzel-Smith et al, 2014;Steinberg et al, 2015;Tosto et al, 2015;Herold et al, 2016;Jakobsdottir et al, 2016;Kohli et al, 2016;Deming et al, 2017;Mez et al, 2017;Sims et al, 2017;Marioni et al, 2018;Zhou et al, 2018;Baker et al, 2019;Jansen et al, 2019;Kunkle et al, 2019;Zhang et al, 2019;Bis et al, 2020).Validation was performed by PCR across the deletion with customdesigned primers followed by Sanger sequencing.For validation, we tested three SV carriers and one non-carrier for each SV.For the Scalpel-derived deletions, the variants were binned by base pair length (2-19, 20-40, 41-60, 61-80, 81-100, and 101-900 bp).The size ranges examined for the Parliament-derived deletions were 101-900, 901-1,000, and 1,001-17,000 bp.The BLAT from the University of California, Santa Cruz Genome Browser (Kent 2002) was used to search and align variant sequences and surrounding sequences to the human reference genome.Because the BLAT had a minimum requirement of 20 bp, sequences smaller than 20 bp were queried by adding flanking sequences upstream and downstream of the test sequence to bring the length up to 20 bp.Both University of California, Santa Cruz HG19 and HG38 reference genomes were queried using the BLAT.In addition, for each deletion, 100-bp sequences flanking the either side of the event were also queried against the BLAT as a contiguous 200-bp sequence (i.e., variant deletion sequence removed).The BLAT alignment allowed for the visualization of the deletion and surrounding sequence in terms of proximity to genes and repeat sequence and facilitated the identification of instances of clear mis-mapping.The sequence surrounding the variants was extracted from HG38 and used for primer design.For variants where a PCR product of ≤1,200 bp was expected (including the variant sequence), primers were designed outside of the breakpoints to amplify across the deletion sequence.
For deletions where the reference allele was too large to be amplified by a 1,200-bp PCR product, a double PCR approach was used.For the first PCR, one primer was designed within the putative deletion sequence, whereas the other primer was placed external to the deletion breakpoint.The samples containing the reference allele and not containing a deletion would yield a product with this PCR.For the second PCR, both primers flanked the putative deletion.Only samples, which contained the deletion, would yield a product for this PCR.The samples from the three individuals reported as heterozygous or homozygous deletions were used for sequence validation, as well as the one control (or reference) sample.When possible, samples from multiple families were used for validation.

Data Access
BAM files and variant calls on build hg38 of the human genome from the Alzheimer's Disease Sequencing Project (ADSP) are available through the NIA Genetics of Alzheimer's Disease Data Storage Site (NIAGADS), dataset NG00067.ADSP sequencing data aligned to human genome reference hg37 are available through dbGaP (Accession number: phs000572.v8.p4).

Figure 1 .
Figure 1.Overview of Alzheimer's Disease Sequencing Project's SV/indel calling and analysis pipeline.Two parallel pipelines, Scalpel + GenomeSTRiP (orange) and Parliament (green), were combined to perform SV/indel call merging, QC, genotyping, and reassembly for 584 samples from three sequencing centers.Nine replicated samples were used to measure individual SV/ indel caller sensitivity via variant spike-in studies.

Figure 2 .
Figure2.SV/indel caller sensitivity stratified by sequencing center and caller.Sensitivity rates were derived for all eight callers using the in silico variant spike-in on nine sample replicates.Sensitivity is provided for all three centers (Baylor, Broad, and WashU).Biological replicates are three individuals in one family that were sequenced at the three centers.Sensitivity rates are provided across a large range of event sizes (30 bp-10 kb).Sensitivity rates are largely consistent across centers.

Figure 3 .
Figure3.SV/indel caller specificity using the D-score stratified by sequencing center and caller.Specificity rates are provided for all eight callers from 30 bp to 10 kbp using our Dscore method.D-scores were calculated for each of the three sequencing centers (Baylor, Broad, and WashU).D-scores are quite consistent across centers.

Figure 4 .
Figure 4. Kinship coefficient by deletion size.(A)pre-and post-QC.Kinship coefficients for pre-(left) and post-QC (right) calls ranging from 20 to 400 bp were calculated for all sibling pairs.QC filtering to reduce excess heterozygosity resulted in coefficients that approximated the expected value of 0.25.(B) Kinship coefficients for GATK Haplotype Caller with VQSR using the best practices Mills indel training set (left) and GATK Haplotype Caller with an improved indel training set (right).Kiniship coefficient is 0.25 for SNVs called by the GATK Haplotype Caller but declines progressively to 0 with increasing deletion sizes.In contrast, the coefficient approximated 0.25 for the GATK Haplotype calls across all SV/ indel bin sizes when using an improved training set for VQSR step that includes deletions across the range of sizes.

Figure 5 .
Figure 5. Three performance metrics for seven SV/indel callers.toprow provides the total runtime in hours, the middle row provides peak central processing unit percentage, and the bottom row provides peak memory in gigabytes for 10 discovery phase (left column) and 10 discovery extension phase (right column) samples.

Figure 6 .
Figure 6.Histogram of Scalpel deletions by size.All Scalpel deletions (N = 123,581), ranging from 20 to 900 bp.The y-axis is truncated at 2,000 calls.The Alu peak is seen near 350 bp.

Figure 7 .
Figure 7. Histograms of Parliament deletion frequencies by size.(A) Histogram of Parliament deletions (N = 32,122) ranging from 20 to 1,000 bp.(B) Full histogram of all Parliament calls (N = 32,122) ranging from 1 to 10,000 bp.The Alu peak is seen at ~350 bp.

Table 2 .
Total calls by eight SV/indel callers.

Table 3 .
SnpEff functional annotation categories for Scalpel and Parliament calls.