Life Science Alliance
Methods
Transparent Process
Open Access

Refining the genetic risk of breast cancer with rare haplotypes and pattern mining

William Letsou, Fan Wang, Wonjong Moon, Cindy Im, Yadav Sapkota, Leslie L Robison, Yutaka Yasui
William Letsou
1Department of Biological & Chemical Sciences, New York Institute of Technology, Old Westbury, NY, USA
2Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
Roles: Conceptualization, Software, Formal analysis, Methodology, Writing—original draft, Writing—review and editing
  • For correspondence: wletsou@nyit.edu
Fan Wang
2Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
Roles: Software, Methodology, Writing—review and editing
Wonjong Moon
2Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
Roles: Software
Cindy Im
3Department of Pediatrics, University of Minnesota, Minneapolis, MN, USA
4School of Public Health, University of Alberta, Edmonton, Canada
Roles: Methodology, Writing—review and editing
Yadav Sapkota
2Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
Roles: Methodology, Writing—review and editing
Leslie L Robison
2Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
Roles: Funding acquisition, Methodology, Writing—review and editing
Yutaka Yasui
1Department of Biological & Chemical Sciences, New York Institute of Technology, Old Westbury, NY, USA
4School of Public Health, University of Alberta, Edmonton, Canada
Roles: Conceptualization, Formal analysis, Supervision, Funding acquisition, Methodology, Writing—original draft, Writing—review and editing
  • For correspondence: yutaka.yasui@stjude.org
Published 4 August 2023. DOI: 10.26508/lsa.202302183

Abstract

Hundreds of common variants have been found to confer small but significant differences in breast cancer risk, supporting the widely accepted polygenic model of inherited predisposition. Using a novel closed-pattern mining algorithm, we provide evidence that rare haplotypes may refine the association of breast cancer risk with common germline alleles. Our method, called Chromosome Overlap, consists in iteratively pairing chromosomes from affected individuals and looking for noncontiguous patterns of shared alleles. We applied Chromosome Overlap to haplotypes of genotyped SNPs from female breast cancer cases from the UK Biobank at four loci containing common breast cancer-risk SNPs. We found two rare (frequency <0.1%) haplotypes bearing a GWAS hit at 11q13 (hazard ratio = 4.21 and 16.7) which replicated in an independent, European ancestry population at P < 0.05, and another at 22q12 (frequency <0.2%, hazard ratio = 2.58) which expanded the risk pool to noncarriers of a GWAS hit. These results suggest that rare haplotypes (or mutations) may underlie the “synthetic association” of breast cancer risk with at least some common variants.

Introduction

Genome-wide association studies (GWAS) have identified hundreds to thousands of single-nucleotide polymorphisms (SNPs) robustly associated with breast cancer risk and other complex phenotypes (Michailidou et al, 2017; Zhang et al, 2020). Polygenic risk scores derived from common and genome-wide variants can differentiate women’s breast cancer risk by up to several fold (Ge et al, 2019; Mavaddat et al, 2019; Mars et al, 2020). The widely accepted polygenic model of genetic risk (Pharoah et al, 2002) contrasts with the hypothesis that rare variants are responsible for the elevated risk with which GWAS SNPs are but “synthetically” associated (Dickson et al, 2010; Anderson et al, 2011; Wray et al, 2011). According to the latter model, many complex diseases are characterized by genetic heterogeneity (McClellan & King, 2010), or by a small pool of individuals having private, rare mutations each associated with a large increase in risk. We have recently argued on the basis of high-quality data from several Nordic cancer registries (Möller et al, 2016) that the elevated breast cancer risk to twins of breast cancer cases is largely because of rare variants or haplotypes of large effect (Yasui et al, 2023). It is imperative to reconcile the polygenic and synthetic models to better understand the etiology of this and other diseases.

Haplotype analysis has long been used as a method to improve power in genetic association studies (Falk & Rubinstein, 1987; Schaid, 2004), but the idea that haplotypes may tag, or could themselves be, the causal variants has received less attention. We recently explored the possibility in a genome-wide analysis that rare haplotypes with copy numbers on the order of tens per 10,000 chromosomes confer additional high risk for breast cancer (Wang et al, 2023). That study identified five robustly replicated rare haplotypes defined by genotyped SNPs through a sliding-window analysis, but was necessarily constrained by the requirement that haplotypes be contiguous regions of chromosome of various sizes, each possibly bearing a rare causal variant. The hypothesis that rare risk haplotypes could be noncontiguous, perhaps representing the interaction of common variants at different locations in a gene regulatory region, has to our knowledge not been previously considered. As haplotypes are combinations of alleles on one and the same chromosome, they are difficult to study in a systematic manner: for one, without phased genome sequencing, large numbers of chromosomes need to be computationally phased. More importantly, combinatorics quickly engenders a multiple-testing problem, as there are 3^m − 1 possible haplotypes (both contiguous and noncontiguous) in every m-SNP window: each SNP can contribute its reference allele, contribute its alternate allele, or be left out of the pattern, and the empty pattern is excluded. Although most of these haplotypes will never appear in the population, identifying the ones to test for disease association requires new computational tools not previously used on a large scale in genetic association studies.
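The size of this search space can be checked by brute force for small windows. The following is a minimal sketch, assuming biallelic SNPs and a pattern encoding in which each position is 0 (reference), 1 (alternate), or None (SNP not included); the function names are hypothetical and are not part of the published software.

```python
from itertools import product

def enumerate_patterns(m):
    """Yield every non-empty allele pattern over m biallelic SNPs.

    Each position is 0 (reference allele), 1 (alternate allele),
    or None (the SNP is not part of the pattern).
    """
    for pattern in product((0, 1, None), repeat=m):
        if any(allele is not None for allele in pattern):
            yield pattern

def count_patterns(m):
    """Closed-form count: three choices per SNP, minus the empty pattern."""
    return 3 ** m - 1

# Brute-force enumeration agrees with the closed form for small windows.
for m in (1, 2, 3, 4):
    assert sum(1 for _ in enumerate_patterns(m)) == count_patterns(m)

# For a window of 57 SNPs the count already exceeds 10**27,
# so exhaustive testing is not feasible.
print(count_patterns(57) > 10 ** 27)
```

Enumeration is only practical for a handful of SNPs, which is why a method that visits only the patterns actually present in the data is needed.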

Closed-pattern mining is a well-known technique in computer science used to find items that frequently occur together in multivariate data, including, for example, grocery items purchased together in supermarket transactions (Pasquier et al, 1999). Applied to market data, a pattern of goods is said to be closed if no item can be added to it without diminution of the number of transactions in which the pattern appears, and the closure of any pattern is the shortest closed-pattern which contains it as a subset (Uno et al, 2004). Applied to haplotype analysis, a pattern of SNP-alleles is said to be closed if it is the longest pattern shared by a group of chromosomes, capturing one and the same set of chromosomes as certain shorter patterns and therefore obviating the need for exhaustive enumeration. Versions of closed-pattern mining (Pan et al, 2003; Terada et al, 2016; Relator et al, 2018; Yoshizoe et al, 2018) and a related approach called frequent pattern mining (Fang et al, 2012; Okazaki et al, 2021; Pounraja and Girirajan, 2022) have been applied in various genetic association studies to the learning of association rules between combinations of genes/genotypes and oligogenic disease; but to our knowledge none of these methods has been applied to haplotype analysis. As it is haplotypes—not genotypes—that are passed from parent to offspring, it is pertinent to look for the rare, possibly noncontiguous patterns of alleles that could underlie the genetic heterogeneity of inherited breast cancer risk.
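The definition of a closed pattern can be made concrete on toy data. The sketch below is illustrative only: it assumes chromosomes encoded as tuples of 0/1 alleles and patterns encoded as dictionaries mapping SNP position to allele, and the helper names `carriers` and `closure` are hypothetical.

```python
def carriers(pattern, chromosomes):
    """Indices of chromosomes carrying every (position, allele) in pattern."""
    return [i for i, chrom in enumerate(chromosomes)
            if all(chrom[pos] == allele for pos, allele in pattern.items())]

def closure(pattern, chromosomes):
    """Longest pattern shared by all carriers of `pattern` (None if no carriers)."""
    idx = carriers(pattern, chromosomes)
    if not idx:
        return None
    first = chromosomes[idx[0]]
    # Keep every position on which all carriers agree with the first carrier.
    return {pos: allele for pos, allele in enumerate(first)
            if all(chromosomes[i][pos] == allele for i in idx)}

def is_closed(pattern, chromosomes):
    """A pattern is closed when it equals its own closure."""
    return closure(pattern, chromosomes) == pattern

# Toy data: four chromosomes typed at five SNPs (made-up alleles).
toy_chromosomes = [
    (0, 1, 1, 0, 1),
    (0, 1, 0, 0, 1),
    (1, 1, 1, 0, 1),
    (0, 0, 1, 1, 0),
]

short_pattern = {1: 1}                       # alternate allele at SNP 1 only
closed = closure(short_pattern, toy_chromosomes)
print(closed)                                # {1: 1, 3: 0, 4: 1}
```

In this toy example the one-SNP pattern and its noncontiguous three-SNP closure are carried by the same three chromosomes, so only the closure needs to be tested for association.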

In this article, we find using a new closed-pattern mining algorithm called Chromosome Overlap that rare haplotypes can refine the risk associated with “GWAS hits,” common variants previously shown to be associated with breast cancer risk. Our method consists in iteratively overlapping pairs of chromosomes from affected individuals and looking for shared, noncontiguous haplotypes. We then compared the counts of (the closures of) these patterns in cases and controls on the hypothesis that the sharing of patterns by cases should be associated with their being cases. We apply Chromosome Overlap to computationally phased data from female breast cancer cases and controls in the UK Biobank (UKBB) (Bycroft et al, 2018) to look for rare haplotypes in the vicinity of three of the strongest breast cancer (EFO_0000305) hits by P-value in the NHGRI-EBI GWAS Catalog (Sollis et al, 2023), including: rs2981578 on chromosome 10q26 in an intron of FGFR2 (Meyer et al, 2008); rs554219 on chromosome 11q13 upstream of CCND1 (French et al, 2013); rs4784227 on 16q12 in an intron of CASC16 (Long et al, 2010; Ulder et al, 2010); and also a locus at 22q12 containing a rare haplotype identified by our previous genome-wide haplotype analysis (Wang et al, 2023). Subsequently, we replicated the 11q13 and 22q12 results in cases and controls from the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) study, lending support to the hypothesis that rare haplotypes, or the rare variants they tag, underlie the synthetic association of breast cancer risk with at least some GWAS hits.
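The iterative overlapping of chromosomes described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: chromosomes are tuples of 0/1 alleles, an overlap keeps only the positions at which two patterns agree, iteration continues until no new pattern appears, and the P-value filtering applied between iterations in the study is omitted. All function names are hypothetical.

```python
from itertools import combinations

def overlap(a, b):
    """Keep the alleles on which two (meta-)chromosomes agree; None elsewhere."""
    return tuple(x if x == y and x is not None else None for x, y in zip(a, b))

def pairwise_overlaps(patterns):
    """Unique non-empty overlaps of all pairs of patterns."""
    found = set()
    for a, b in combinations(patterns, 2):
        shared = overlap(a, b)
        if any(allele is not None for allele in shared):
            found.add(shared)
    return found

def iterate_overlaps(chromosomes, max_iterations=10):
    """Repeat pairwise overlap until an iteration yields no new pattern."""
    originals = set(chromosomes)
    seen = set(originals)
    for _ in range(max_iterations):
        new = pairwise_overlaps(seen) - seen
        if not new:
            break
        seen |= new
    return seen - originals

# Toy chromosomes from three hypothetical cases (made-up alleles).
case_chromosomes = [
    (0, 1, 1, 0, 1),
    (0, 1, 0, 0, 1),
    (1, 1, 1, 0, 1),
]

for meta in sorted(iterate_overlaps(case_chromosomes), key=str):
    print(meta)
```

Without any filtering, the number of overlaps grows roughly with the square of the number of patterns at each iteration, which is the combinatorial explosion that a filtering step is meant to prevent.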

Results

Discovery and replication of two rare haplotypes at the 11q13 locus

We applied Chromosome Overlap around three of the most strongly associated breast cancer risk SNP-alleles. For each of the three GWAS hits in Phase 1, we focused on a ∼200-kb region containing the GWAS hit and extracted haplotypes containing 50–60 UKBB-genotyped SNPs (see the Materials and Methods section). The first region selected was chr11:69,419,318–69,616,860, a range containing 57 genotyped SNPs about, but not including, the GWAS hit rs554219.

Table 1 shows that there were 2,520 unique contiguous haplotypes at the chr11 locus among the 18,022 chromosomes of UKBB breast cancer cases and 2,962,894 meta-chromosomes resulting from the initial pairwise overlap. We filtered the meta-chromosomes down to 121 with a Fisher’s exact test P-value threshold of P = 1.0 × 10−9 to prevent a combinatorial explosion at subsequent iterations (see the Materials and Methods section). The 121 filtered meta-chromosomes resulted in a total of 585,850 closed-patterns within five iterations, after which point, no more closed-patterns were found.
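The filtering step can be illustrated with a small Fisher's exact test written with the standard library. This is a minimal sketch under assumptions: this section does not state whether the test was one- or two-sided, so a one-sided test for enrichment among case chromosomes is assumed here, and the pattern names, carrier counts, and the looser toy threshold are all made up for illustration.

```python
from math import comb

def fisher_exact_greater(case_carriers, case_total, control_carriers, control_total):
    """One-sided Fisher's exact P-value for enrichment of carriers among cases."""
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    denom = comb(total, carriers)
    upper = min(carriers, case_total)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(case_total, k) * comb(control_total, carriers - k)
               for k in range(case_carriers, upper + 1))
    return tail / denom

def filter_patterns(counts, case_total, control_total, threshold=1.0e-9):
    """Keep patterns whose enrichment P-value falls below the threshold."""
    kept = {}
    for name, (case_carriers, control_carriers) in counts.items():
        p = fisher_exact_greater(case_carriers, case_total,
                                 control_carriers, control_total)
        if p < threshold:
            kept[name] = p
    return kept

# Hypothetical carrier counts: (case chromosomes, control chromosomes).
toy_counts = {"pattern_a": (40, 2), "pattern_b": (5, 4)}
kept = filter_patterns(toy_counts, case_total=1000, control_total=1000,
                       threshold=1.0e-6)
print(sorted(kept))  # only the strongly case-enriched pattern survives
```

Tightening the threshold reduces the number of meta-chromosomes carried into the next iteration, at the cost of possibly discarding weaker but genuine patterns.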

Table 1.

Numbers of closed-patterns discovered at four loci during Chromosome Overlap, phases 1 and 2.

Among the 585,850 noncontiguous closed-patterns and 2,520 original contiguous patterns, we evaluated the top 20,628 (with Fisher’s exact P < 1.0 × 10−15) in a Cox proportional hazards model for association with breast cancer incidence rates (see the Materials and Methods section), after verifying that none of the original contiguous patterns met the inclusion threshold (minimum P-value: 2.4 × 10−5). We found that each of the evaluated haplotypes had a Cox likelihood ratio test (LRT) P-value less than 1.0 × 10−5, but that only one pattern remained after stepwise forward selection. This 19-SNP haplotype was reduced by recursive partitioning to 17 SNP-alleles (Table S1) without alteration of its frequency or breast cancer incidence hazard ratio (HR). As shown in Table 2, this haplotype, designated h1, was highly statistically significant in UKBB (and also DRIVE), but its association strength was on par with that of the GWAS hit. Furthermore, h1 was in LD (r2 = 0.78, D′ = 1.00) with its corresponding GWAS hit, rs554219[G]. We thus concluded that the common haplotype h1 was tagging largely the same pool of at-risk subjects as was rs554219[G].
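The LD statistics quoted above can be computed from phased chromosome counts. The following is a minimal sketch, assuming the standard definitions of D, D′, and r2 for two binary markers (here, carriage of a haplotype and carriage of a SNP-allele); the function name and the toy counts are hypothetical and are not taken from the study.

```python
def ld_stats(n_both, n_hap, n_allele, n_total):
    """Return (r2, D') for two binary markers from chromosome counts.

    n_both   : chromosomes carrying both the haplotype and the allele
    n_hap    : chromosomes carrying the haplotype
    n_allele : chromosomes carrying the allele
    n_total  : all chromosomes
    """
    p_a = n_hap / n_total
    p_b = n_allele / n_total
    p_ab = n_both / n_total
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # D' rescales D by its maximum attainable magnitude given the frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r2, d_prime

# Toy counts: every haplotype carrier also carries the allele,
# but some allele carriers lack the haplotype.
r2, d_prime = ld_stats(n_both=200, n_hap=200, n_allele=250, n_total=1000)
print(round(r2, 2), round(d_prime, 2))  # 0.75 1.0
```

A D′ of 1 together with an r2 below 1, as in this toy example, is the pattern expected when the carriers of one marker form a subset of the carriers of the other.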

Table 2.

Common-haplotypes at 10q26, 11q13, and 16q12, discovery and replication.

Table S1. Alleles (highlighted) of common haplotypes h1 at three chromosomal loci.

Because we were interested in rare haplotypes that underlie GWAS hits, we hypothesized that there were rare subtypes of h1 in the immediate chromosomal vicinity which could be found by a conditional analysis of h1-bearing chromosomes. To test this hypothesis in Phase 2, we extracted SNPs in the topologically associating domains (TADs) containing h1, reasoning that chromosomal interactions might mediate the risk associated with the putative rare haplotypes. Using the 3D Genome Browser (Wang et al, 2018) to predict domain boundaries, we extracted haplotypes in the region chr11:69,083,946–69,414,699 (92 SNPs) coinciding with the 5′ boundary of the TAD and the 5′ boundary of h1 (Fig S1) from 2,100 h1-bearing chromosomes of UKBB breast cancer cases. This region contained 699 unique contiguous haplotypes and 241,811 unique noncontiguous pairwise overlaps (Table 1).

Figure S1. Location of common haplotype h1 in a topologically associating domain (TAD) at the 11q13 locus.

SNPs of h1 are indicated below, with blue denoting the reference and red the alternate allele. VC square root-normalized Hi-C data from MCF-7 breast carcinoma cells are plotted on the same intensity scale in all figures. The GWAS hit rs554219 is indicated in blue. This figure may be viewed online at https://tinyurl.com/2neu33z9.

Because of the greater number of SNPs, filtering had to be more stringent in Phase 2, with the additional requirement that a filtered meta-chromosome be the shortest member of a family with the same case/control frequency in UKBB (see the Materials and Methods section). At the chr11 locus, we kept 28 filtered haplotypes with h1-conditional Fisher’s exact test P < 1.0 × 10−4 (among the h1 carriers only) after the first overlap, resulting in a total of 167,998 closed-patterns from Iteration 1 onward (Table 1). As in Phase 1, all closed-patterns at the chr11 locus were discovered within five iterations.

Among the 167,998 noncontiguous closed-patterns and 699 original contiguous patterns, we evaluated the top 246 (with h1-conditional Fisher’s exact test P < 1.0 × 10−4 among h1 carriers only) in a Cox proportional hazards model for association with breast cancer incidence rates, after verifying that none of the original 699 contiguous haplotypes (minimum h1-conditional P-value 3.1 × 10−3) met the inclusion threshold. At the chr11 locus, we found 106 risk-increasing closed-patterns with Cox-LRT P < 1.0 × 10−5 in the UKBB discovery analysis, of which 14 had P < 0.05 in the DRIVE replication analysis (Table S2). In the permutation analysis, no risk-increasing patterns were found at the P < 1.0 × 10−5 level, suggesting that all 106 discovered risk haplotypes were genuine, although the replication rate was only 13% (see the Discussion section).

Table S2. Discovery and replication of 14 rare haplotypes (including h3 and h2, highlighted) at the 11q13 locus.

We next asked whether the 14 risk haplotypes were not variants of the same risk haplotype. Using stepwise forward selection, we found that only two of the 14 risk haplotypes at the chr11 locus were independently associated with breast cancer risk. These haplotypes, designated h2 and h3 in Tables 3 and S3, were both rare (fewer than 5 and 1 copies per 10,000 chromosomes, respectively, in controls) and, when adjoined to 17-SNP h1, were 80 and 86 SNPs long. h2 and h3 were highly risk-increasing (HRs of 4.21 and 16.7) in the discovery analysis and in DRIVE (odds ratios [ORs] of 2.10 and 11.7).

Table 3.

Rare-haplotypes at 11q13, discovery and replication.

Table S3. Alleles (highlighted) of chr11 haplotypes h2 and h3; h1 SNPs are in bold.

Features of the rare haplotypes

To investigate how h2 and h3 from the chr11 locus could be involved in breast cancer risk, we used the WashU Epigenome Browser (Li et al, 2022) to plot the locations and SNP-alleles of the haplotypes together with ENCODE Hi-C and ChIP-seq data (ENCODE Project Consortium, 2012; Davis et al, 2018) (Fig 1).

Figure 1. Rare haplotypes, three-dimensional chromosome organization, and transcription factor-binding at the 11q13 locus.

ENCODE Hi-C and ChIP-seq (negative log-10 P-value) data for CTCF, ESRRA, FOXA1, and MYC are from MCF-7 breast carcinoma cells. VC square root-normalized Hi-C data are plotted on the same intensity scale in all figures. The SNPs of h2 and h3 are indicated below, with blue denoting the reference and red the alternate allele. The GWAS hit rs554219 is indicated in blue. This figure may be viewed online at https://tinyurl.com/yc54zcav.

h2 and h3 were highly similar in terms of both their included SNPs and allele phases. Both haplotypes used most, but not all, of the 149 UKBB-typed SNPs in the region. Alternate alleles (red) appeared at discrete locations, most notably at the MYEOV locus (∼69.3 Mb) and at the 5′ end (position ∼69.13 Mb) of the TAD. According to the Hi-C data, these two loci participate in a chromosomal loop and appear to physically interact with another locus in the vicinity of CCND1, the gene encoding the cell cycle regulatory molecule cyclin D1. The similarities of and differences between h2 and h3 suggest (1) that the haplotypes are distinct signals which are also distinct from contiguous haplotypes in the region, and (2) that there may be additional undetected versions of these haplotypes containing a common “backbone” of SNP-alleles.

We also observed coincident binding throughout the TAD of transcription factors relevant to breast cancer, including CTCF, FOXA1, ESRRA (a relative of the estrogen receptor ER), and MYC. We focused on these particular factors because (1) the CTCF–FOXA1–ER triplet is known to mediate cells’ response to estrogen (Hurtado et al, 2011; Ross-Innes et al, 2011), and (2) according to the HACER database (Human ACTive Enhancer to interpret Regulatory variants) (Wang et al, 2019), MYC binds to predicted enhancers of CCND1. We observed colocalization of CTCF, FOXA1, and MYC at the 5′ end of the haplotype and again at a downstream region (∼69.45 Mb) devoid of SNP-alleles. ESRRA, FOXA1, and MYC bind at the MYEOV location (∼69.3 Mb and ∼69.27 Mb) displaced from a CTCF peak in a location adjacent to—but not on top of—several alternative alleles in h2 and h3. We verified that in both of these “empty” regions, the UKBB data are lacking any typed SNPs. Finally, the original GWAS hit rs554219 exhibits the MYEOV-binding pattern and also forms a loop with the locus at 69.45 Mb. These binding data support the hypothesis that the haplotypes identified by Chromosome Overlap are picking out biologically relevant features of the CCND1 locus.

Application of Chromosome Overlap to two other GWAS hits

Having discovered two rare haplotypes that appeared to be genuine predictors of breast cancer risk, we asked if we could not find similar haplotypes at two other GWAS hits. The selected regions for these Phase 1 analyses were chr10:121,481,608–121,680,765 (55 SNPs about rs2981578, Fig S2) and chr16:52,486,414–52,645,181 (56 SNPs about rs4784227, Fig S3), the latter of which contained the GWAS hit as a typed SNP in the phased data. As shown in Table 1, these regions contained 2,343 and 2,858 unique contiguous chromosomes from UKBB breast cancer cases and 2,511,040 and 3,458,020 unique pairwise overlaps.

Figure S2. Location of common haplotype h1 in a topologically associating domain (TAD) at the 10q26 locus.

SNPs of h1 are indicated below, with blue denoting the reference and red the alternate allele. VC square root-normalized Hi-C data from MCF-7 breast carcinoma cells are plotted on the same intensity scale in all figures. The GWAS hit rs2981578 is indicated in blue. This figure may be viewed online at https://tinyurl.com/59nxyscp.

Figure S3. Location of common haplotype h1 in a topologically associating domain (TAD) at the 16q12 locus.

SNPs of h1 are indicated below, with blue denoting the reference and red the alternate allele. VC square root-normalized Hi-C data from MCF-7 breast carcinoma cells are plotted on the same intensity scale in all figures. The GWAS hit rs4784227 is indicated in blue. This figure may be viewed online at https://tinyurl.com/33kte2tf.

After adjusting the Fisher’s exact test P-value thresholds, we found 585,591 closed-patterns at the chr10 locus starting from 91 meta-chromosomes with P < 1.0 × 10−20 and 269,993 at the chr16 locus starting from 246 with P < 1.0 × 10−16, each within five iterations. Surprisingly, even though we started at the chr16 locus with the greater number of filtered meta-chromosomes, we found there the lesser number of closed-patterns, suggesting a complex combinatorial scheme underlying the haplotype diversity at these GWAS-hit loci and making it difficult to formulate an a priori rule for knowing the P-value threshold which should prevent a combinatorial explosion.

Among the 585,591 + 2,343 and 269,993 + 2,858 noncontiguous closed and original contiguous patterns at the chr10 and chr16 loci, respectively, we evaluated the top 30,019 (with Fisher’s exact test P < 1.0 × 10−50) and 29,078 (P < 1.0 × 10−30) in a Cox proportional hazards model for association with breast cancer incidence rates. As at the chr11 locus, none of the original contiguous patterns at these two loci (minimum P-values 1.6 × 10−5 and 3.2 × 10−4) met the inclusion threshold. We discovered, in a similar manner to the chr11 analysis, 10- and 26-SNP haplotypes at the chr10 and chr16 loci, respectively, which were each reduced by recursive partitioning to nine and 20 SNP-alleles (Table S1) without alteration of their frequencies or HRs. As shown in Table 2, these haplotypes, each designated h1, were highly statistically significant in UKBB (and also in DRIVE). Also as previously, the associations were on par with those of the GWAS hits, and the LD of the h1 with their corresponding GWAS hits was strong (r2 = 0.70 and 0.87 and D′ = 0.99 and 1.00 at the chr10 and chr16 loci, respectively). We therefore again concluded that the common haplotypes h1 were tagging largely the same pool of at-risk subjects as were the GWAS hits.

For the Phase 2 analysis, we extracted SNPs in the TADs containing each h1 using the 3D Genome Browser (Wang et al, 2018) to predict domain boundaries. Although the h1s of chr10 and chr16 were both located in the center of their respective TADs (Figs S2 and S3), the chr10 TAD contained a well-defined Hi-C block in the region containing and immediately 5′ of h1, whereas the entire TAD of the chr16 locus was rather diffused. Thus, we extracted haplotypes in the regions chr10:120,928,564–121,477,406 (158 SNPs), coinciding with the 5′ boundary of the chr10 TAD and the 5′ boundary of the chr10 h1, and chr16:52,073,187–53,053,996 (195 SNPs), defined by the 5′ and 3′ boundaries of the chr16 TAD and the 5′ and 3′ boundaries of chr16 h1, from 7,976 and 4,595 h1-bearing chromosomes of UKBB breast cancer cases, respectively. These regions contained 5,543 and 3,452 unique contiguous haplotypes and 15,353,437 and 5,954,654 unique meta-chromosomes after the initial overlap, respectively (Table 1). Although there were many more SNPs and unique haplotypes at these loci than at the chr11 locus, the method of using the TAD boundaries (on either both sides or just the longer side of h1) to determine the included SNPs was consistently applied.

As at the chr11 locus, we adjusted the h1-conditional Fisher’s exact test P-value (among h1 carriers) threshold such that the number of filtered meta-chromosomes was no more than ∼30 to prevent a combinatorial explosion. From Iteration 1 onward, a total of 731,526 closed-patterns were found at the chr10 locus starting from 30 with h1-conditional Fisher’s exact test P < 2.0 × 10−6 (among h1 carriers) and 2,517,921 at the chr16 locus starting from 25 with P < 1.5 × 10−5 (Table 1). As in Phase 1, we observed that most of the patterns were discovered by Iteration 5. However, at the chr16 locus, an additional four iterations were carried out on a small number of meta-chromosomes that had not appeared until Iteration 5.

Among the 731,526 + 5,543 and 2,517,921 + 3,452 noncontiguous closed and original contiguous patterns at the chr10 and chr16 loci, respectively, we evaluated the top 256 and 876 (with h1-conditional Fisher’s exact test P < 1.0 × 10−4 among h1 carriers only) in a Cox proportional hazards model for association with breast cancer incidence rates, after verifying that none of the original 5,543 and 3,452 contiguous haplotypes (minimum h1-conditional P-values 2.2 × 10−4 and 1.4 × 10−3) met the inclusion threshold. We found 92 and 123 risk-increasing closed-patterns at the chr10 and chr16 loci, respectively, with Cox-LRT P < 1.0 × 10−5. However, only 1 of 92 and 1 of 123 had P < 0.05 in the DRIVE replication analysis. The false-positive replication rates from the permutation analysis were estimated to be 3% (1 of 29) and 10% (6 of 60), suggesting that the rare haplotypes at the chr10 and chr16 loci were likely false positives.

Validation of Chromosome Overlap in a region known to contain a rare risk haplotype

To improve the confidence in our chr11 discoveries, we next sought to apply Chromosome Overlap in a region known to contain a rare risk haplotype. Our recent genome-wide rare-haplotype analysis discovered by a sliding-window approach six loci containing rare haplotypes which were associated with increased breast cancer risk in both UKBB and DRIVE (Wang et al, 2023). The lead 50-SNP haplotype (by P-value) discovered at chromosome 22q12, called h38, had a frequency of 10.6 per 10,000 chromosomes and an HR of 2.81 (P = 1.2 × 10−14). We sought to validate the Chromosome Overlap procedure by testing if we could not detect and replicate a noncontiguous rare risk haplotype related to h38.

To achieve this goal, we performed a Phase 1 analysis in the region spanned by h38 at chr22:27,749,339–27,874,384 (Table 1 and Fig S4). Note that in the present analysis, h38 contained only 49 SNPs after the exclusion of one SNP that turned out to be triallelic (Table S4). In addition, this region is near (200–300 kb upstream of), but does not contain, rs62237573, a low-frequency (MAF <1%) SNP listed in the NHGRI-EBI GWAS Catalog (Sollis et al, 2023) for breast cancer risk (EFO_0000305) which displayed moderate but not genome-wide significant evidence for association in UKBB (Table 4).

Figure S4. Rare haplotypes, three-dimensional chromosome organization, and transcription factor-binding at the 22q12 locus.

ENCODE Hi-C and ChIP-seq (negative log-10 P-value) data for CTCF, ESRRA, FOXA1, and MYC are from MCF-7 breast carcinoma cells. VC square root-normalized Hi-C data are plotted on the same intensity scale in all figures. The SNPs of h1 and h38 are indicated below, with blue denoting the reference and red the alternate allele. The GWAS hit rs62237573 is indicated in blue. This figure may be viewed online at https://tinyurl.com/4ek6d6bv.

Table 4.

Rare-haplotypes at 22q12, discovery and replication.

Table S4. Alleles (highlighted) of chr22 haplotypes h1 and h38.

Table 1 shows that there were 4,221 unique contiguous haplotypes at the chr22 locus among the 18,022 chromosomes of UKBB breast cancer cases and 7,759,059 meta-chromosomes resulting from the initial pairwise overlap. After filtering, we found a total of 398,251 closed-patterns within five iterations starting from 127 with Fisher’s exact test P < 2.0 × 10−12. Among the 398,251 noncontiguous closed and 4,221 original contiguous patterns, we evaluated the top 31,858 (with Fisher’s exact test P < 1.0 × 10−9) in a Cox proportional hazards model for association with breast cancer incidence rates (see the Materials and Methods section), after verifying that the only original contiguous pattern which passed the inclusion threshold was h38. We found that all included patterns had Cox-LRT P < 1.0 × 10−5 in the UKBB discovery analysis, and that 29,081 replicated in DRIVE with P < 0.05; however, only a single 34-SNP haplotype remained independently associated after stepwise forward selection. This haplotype, subsequently reduced to 22 SNPs by recursive partitioning without alteration of its frequency or HR and designated h1 (Table S4), was both rare (18.5 copies per 10,000 chromosomes in controls) and highly statistically significant in UKBB (HR = 2.58, P = 8.6 × 10−15); in DRIVE, h1 was less strongly associated (OR = 1.39, P = 4.6 × 10−3), but still passed the replication threshold. In comparison, h38 was somewhat rarer (10.8 copies per 10,000 chromosomes) and had a larger but less statistically significant effect in UKBB (HR = 2.89, P = 4.8 × 10−12); in DRIVE, h38’s effect was similarly moderated (OR = 1.92, P = 1.3 × 10−5). As h1 was already quite rare and risk-increasing, we did not perform a Phase 2 analysis of this region.

These results suggested that Chromosome Overlap can pick up rare haplotypes at loci known to harbor validated haplotype associations, lending support to the chr11 results in the manner of a positive control. However, as we did not discover h1 in a region surrounding and in LD with a GWAS hit, we sought to explain the relationship between h1 and h38 instead. h1 and h38 were in moderate LD with each other (r² = 0.59, D′ = 1.0) and in distinct LD with the GWAS hit, with the D′ (r²) of rs62237573 with h1 and h38 being 0.58 (0.068) and 0.90 (0.097), respectively. These values indicate that most chromosomes with h38 also carry the GWAS hit, whereas only a subset of chromosomes with h1 do; thus, h38, but not h1, is like the rare haplotypes on chr11 in respect of its being a subtype of a haplotype bearing the GWAS hit. To find the proportion of the h1 risk attributable to its being a subtype of the GWAS hit, we adjoined rs62237573[T] to the haplotype and reevaluated it in the model. This procedure reduced the case frequency of h1 (from 49.9 to 34.4 per 10,000) to a value similar to that of h38 (32.7 per 10,000) and increased the LD between the two haplotypes to r² = 0.82 (D′ = 0.91). We drew two conclusions from this observation. First, because h1 + rs62237573[T] and h38 were tagging a highly overlapping set of chromosomes, the additional risk to carriers of h1 + rs62237573[T] over carriers of h1 (HR = 3.07 versus 2.58) was attributable to the contiguous haplotype h38, so that the noncontiguous h1 + rs62237573[T] was a tag for h38. But although it decreased the HR slightly, the removal of rs62237573[T] led to a 45% increase in the size of the h38 at-risk pool. Therefore, we also concluded that the noncontiguous h1 discovered a subset of high-risk individuals not having the GWAS hit.
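The LD statistics quoted in this paragraph can be related to haplotype frequencies with a short worked sketch. The Python function below (the name `ld_stats` is ours, not the authors') computes D′ and r² from two haplotype frequencies and their joint frequency. Purely as an illustration, it plugs in the control frequencies of h1 and h38 from the text and assumes, in line with D′ = 1.0, that every h38 chromosome also carries h1; the published LD values may have been computed on a different set of chromosomes.

```python
def ld_stats(p_a, p_b, p_ab):
    """Return (D', r^2) for two haplotypes with frequencies p_a and p_b
    and joint (co-occurrence) frequency p_ab on the same chromosome."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Control frequencies from the text: 18.5 and 10.8 copies per 10,000.
p_h1 = 18.5e-4
p_h38 = 10.8e-4
# ASSUMPTION for illustration: h38 chromosomes are a subset of h1 chromosomes.
p_both = p_h38

d_prime, r2 = ld_stats(p_h1, p_h38, p_both)
print(f"D' = {d_prime:.2f}, r^2 = {r2:.2f}")
```

Under that nesting assumption r² is essentially the ratio of the two frequencies (about 0.58 here), close to the reported 0.59, which is why a D′ of 1.0 can coexist with only moderate r². The same kind of arithmetic gives the 45% figure: 49.9 / 34.4 ≈ 1.45.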

Finally, we plotted the locations of the SNPs of h1 and h38 to see if they correlated with any ChIP and Hi-C features (Fig S4). The only evident peaks colocalizing with the haplotypes were those of CTCF, and the window did not contain an especially strong Hi-C block. FOXA1-binding occurred in a region devoid of any UKBB-typed SNPs, but without any coincident peaks belonging to other factors. A more intense Hi-C block occurred downstream of h1 and h38 in a region spanned by the original 250-SNP haplotype (Wang et al, 2023). The comparative lack of features about h1 and h38 suggested that the binding events at the chr11 locus may have been specific to regulation of CCND1 and that some other biological mechanism would have to be sought at the chr22 locus, as explored in our previous study (Wang et al, 2023).

Discussion

Our findings with Chromosome Overlap support a model wherein rare haplotypes or mutations underlie the germline genetic risk for breast cancer associated with at least some GWAS hits. Despite a genetically heterogeneous replication population, we replicated three noncontiguous rare haplotypes at 11q13 and 22q12 composed of common genotyped SNPs that elevated the risk for breast cancer by 2.6–17-fold. Effects of this size are typically not detected in association analysis. The discoveries at 11q13 and 22q12 represent two distinct mechanisms whereby rare haplotypes can refine the signal tagged by GWAS hits. At the former locus, we found that h2 and h3 were rare subtypes of chromosomes bearing rs554219[G], whereas at the latter, we found that h1 tagged additional risk not captured by carriers of the GWAS hit rs62237573[T]. These two examples illustrate that GWAS hits may label too many or too few individuals as high risk.

We could not replicate the haplotypes discovered at chr10 and chr16 where mechanisms have been proposed to explain the direct effect of alternate alleles of the GWAS hits (Meyer et al, 2008; Cowper-Sal lari et al, 2012). Although we found individual rare haplotypes at each locus, we could not reject the possibility that they replicated in DRIVE solely by chance. These regions therefore may not harbor any rare risk haplotypes. But it is important to keep in mind the potential for imputation and phasing errors when combining individuals across a range of genetic ancestries; to replicate rare haplotypes, not only must the imputation of non-genotyped UKBB variants in DRIVE be accurate, but so must the computational phasing. We attempted to mitigate this issue by phasing and imputing the UKBB and DRIVE genotype data using the TOPMed panel as a common reference, but without whole-genome sequencing, our replication framework had to be based on the strong assumption that both computational steps were accurate. This is a major limitation of our work.

Supporting our replication of h2 and h3 at chr11 is the report of a non-BRCA1/2 family from the Netherlands with six cases of breast cancer having a strong linkage peak at the 11q13 locus (Rosa-Rosa et al, 2009b). If h2 and h3 are genuine, they could represent a link between a signal identified by two complementary methods, viz., linkage analysis and association studies. The overlap of signals from linkage and association analyses has been demonstrated to be low in the context of type 2 diabetes (McCarthy et al, 2008; Prokopenko et al, 2009), for example, and the reason is thought to be the different ages of the causal mutations identified by the two approaches (Ott et al, 2015). According to the theory, when mutations are relatively young, they are in strong linkage disequilibrium with the haplotype on which they arose, and are hence detectable by an excess of haplotype-sharing among affected family members. Over time, recombination breaks the linkage, and the mutation becomes associated with common alleles that are segregating in the population. This mechanism could explain why the frequency spectrum of GWAS hits is so broad (Wray et al, 2011), in contrast to the low-frequency bias predicted under the synthetic-associations model (Dickson et al, 2010). One interpretation of our results, then, is that the rare haplotypes are corrupted versions of an ancestral haplotype on which a single causal variant arose relatively recently. That we could not successfully replicate the chr10 and chr16 haplotypes could simply be a reflection of the fact that the rare variants there arose further in the past. Future applications of Chromosome Overlap would then have to be focused on linkage peaks and not GWAS hits.

But this model fails to account for the lack of rare variants discovered so far in the vicinity of breast cancer-risk GWAS hits (Lindström et al, 2016; Li et al, 2018) and the comparatively few mutations identified by linkage analysis and subsequent positional cloning (Gonzalez-Neira et al, 2007; Rosa-Rosa et al, 2009a). An alternative hypothesis is that the haplotype itself is the rare causal variant and that risk arises from the interaction of alleles at discrete locations along the haplotype. Two lines of evidence support this hypothesis. The first is the observation that the most statistically significant haplotypes were all noncontiguous, that is, not merely tags of the original contiguous haplotypes. Although the distinction between contiguous and noncontiguous is somewhat arbitrary in our use of only genotyped SNPs, the evidence of a decreasing number of alleles required for the manifestation of the statistical effect suggests that only a few alleles may be involved in mediating risk. In future studies with a greater number of meta-chromosomes retained after filtering, it should, in principle, be possible to reduce the haplotype SNP-alleles down to only the most essential set.

The second line of evidence in support of the noncontiguous rare-haplotype hypothesis is the ChIP and Hi-C data in Fig 1 showing the potential for biologically relevant interactions between key SNPs of h2 and h3. For example, the chromosomal loops on chr11 are complemented by the binding of CTCF, FOXA1, and ESRRA, molecules with potential roles in breast cancer risk. Regarding ESRRA, it is known that the closely related estrogen receptor (ER) binds throughout the genome at enhancers distant from the start sites of genes it regulates (Carroll et al, 2005; Eeckhoute et al, 2006), including CCND1. Several proteins are involved in ER recruitment, including the pioneer factor FOXA1 (Eeckhoute et al, 2006; Hurtado et al, 2011) and CTCF, which acts upstream of FOXA1 to drive ER-mediated transcription via chromosome loops (Zhang et al, 2010) and partitions the genome into ER-responsive blocks (Chan & Song, 2008). A small fraction of binding events (e.g., less than 5% of those of CTCF in MCF-7 cells) involve all three factors, which then contribute to the down-regulation of estrogen target genes (Hurtado et al, 2011; Ross-Innes et al, 2011). A fourth TF with frequent binding in the CCND1 TAD is encoded by the proto-oncogene MYC. MYC-binding was identified by computational analysis in a CCND1 enhancer encompassing the original GWAS hit rs614367 (Wang et al, 2019), later replaced with rs554219 using fine-mapping (French et al, 2013), and it has long been known that MYC can repress CCND1 expression (Jansen-Dürr et al, 1993; Philipp et al, 1994). In our data, the haplotype SNP-alleles of h2 and h3, especially when in the alternate phase, often colocalized with ChIP peaks, suggesting that chromosome overlap is detecting haplotypes involved in CCND1 regulation. Without corroborating experimental evidence, however, our statistical results cannot be used to definitively distinguish between a rare haplotype and a rare-variant model of breast cancer risk.

Nevertheless, our use of closed-pattern mining has made it possible to find rare, risk haplotypes and generate hypotheses that bear directly on the biological consequences of GWAS hits, and it is important to assess the advantages and limitations of this method in future genetic association studies. The principal advantage of closed-pattern mining over other pattern-mining algorithms is its elimination of redundant patterns which occur on one and the same sets of chromosomes. In this regard, Chromosome Overlap is similar to another closed-pattern miner called LCM (Uno et al, 2004) which has been used in a number of recent studies to detect combinations of transcription factor-binding events (Terada et al, 2013), expressed mRNAs (Relator et al, 2018), and SNP genotypes (Terada et al, 2016; Yoshizoe et al, 2018). But where LCM is a “bottom-up” method that extends shorter closed-patterns to longer ones, Chromosome Overlap is a “top-down” method similar to an older algorithm called CARPENTER (Pan et al, 2003) which finds long patterns by overlap. And although use of LCM and Chromosome Overlap both require pruning of the number of SNPs in some fashion (Relator et al, 2018; Yoshizoe et al, 2018), Chromosome Overlap is suited to finding the long, rare patterns that LCM would find only at the end of its search. A potential limitation of our method, however, is that the closed patterns which Chromosome Overlap discovers among cases are not necessarily closed patterns among controls, or in an independent replication population; at best, we can compare the candidate closed-pattern in one population to its closure in another. This is not so much a problem as it is a refinement of the statistical hypothesis, namely, that candidate patterns discovered in cases should have increased frequency among cases by virtue of their being genuine risk-increasing patterns; risk-decreasing patterns should only be shared by controls, who were too numerous to analyze in our study.

Materials and Methods

Chromosome Overlap: overview

Chromosome Overlap consists in iteratively overlapping regions of chromosomes from affected individuals and looking for shared haplotypes that are enriched in cases. First, all pairwise overlaps of the unique chromosomes from N affected individuals are formed to find shared, chiefly noncontiguous, closed haplotype patterns called meta-chromosomes (in this article, “pattern,” “haplotype,” and “meta-chromosome” are used synonymously). Most of these patterns are filtered out, and the remainder are retained for additional rounds of pairwise overlap. The process of forming all pairwise overlaps of a set of meta-chromosomes is known as an iteration, numbered from 0 to count the total number of times the process has been performed, and a pattern is said to have been discovered at the first iteration in which it appears as the longest pattern shared by a pair of meta-chromosomes. Closed haplotype patterns which have been discovered at previous iterations are not included in the next round of overlaps, and the process ceases when no more new patterns are discovered. A mathematical justification that this procedure discovers all closed-patterns among—or what is the same thing, forms all possible groupings of—the filtered set of meta-chromosomes is given first, and a scheme that allows the overlaps to be computed efficiently in parallel is described second. Following is a description of how Chromosome Overlap was applied in this study, including data sources, computational considerations, and statistical analyses.

Chromosome Overlap: mathematical formulation

If the $i$th chromosome in a sample is defined by the vector $x_i = (x_{i,1}, \dots, x_{i,m})$ of alleles $x_{i,k} = s_k$ at $m$ biallelic SNPs, then a pattern is a list $h = (k_1, s_{k_1}), (k_2, s_{k_2}), \dots, (k_l, s_{k_l})$ of $l \le m$ alleles. A chromosome $x_i$ is said to contain a pattern $h$ if $x_{i,k_j} = s_{k_j}$ for all $j \in \{1, \dots, l\}$, and a group of chromosomes is said to share a pattern which is contained by each of the $x_i$. A pattern is said to be closed if it is the longest pattern shared by a group of chromosomes, or what is the same thing, if it is the intersection or overlap of a group of chromosomes.

Overlap of a number $\sigma$ of chromosomes produces a pattern which is shared by all $\sigma$, that is, the set $h = \{(k_j, s_{k_j}) \mid x_{i_1,k_j} = \cdots = x_{i_\sigma,k_j} = s_{k_j}\}$; such a pattern is called a meta-chromosome. A meta-chromosome is said to comprise not the chromosomes which contain it, but only those particular chromosomes overlapped to form it, so that $h = g$ may be two equivalent patterns which comprise different sets of chromosomes.
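The definitions of pattern, containment, and overlap can be made concrete with a minimal Python sketch. The function names (`overlap`, `contains`) and the toy chromosomes are illustrative assumptions, not part of the authors' code, and SNP positions are numbered from 0 here rather than from 1 as in the text.

```python
from itertools import combinations


def overlap(*chromosomes):
    """Pattern shared by all given chromosomes, as a frozenset of
    (position, allele) pairs; this is the meta-chromosome of the group."""
    first, *rest = chromosomes
    return frozenset(
        (k, allele)
        for k, allele in enumerate(first)
        if all(other[k] == allele for other in rest)
    )


def contains(chromosome, pattern):
    """True if the chromosome carries every allele of the pattern."""
    return all(chromosome[k] == allele for k, allele in pattern)


# Hypothetical toy data: four chromosomes typed at five biallelic SNPs
# (0 = reference allele, 1 = alternate allele).
chromosomes = [
    (0, 1, 1, 0, 1),
    (0, 1, 0, 0, 1),
    (1, 1, 1, 0, 0),
    (0, 0, 1, 1, 1),
]

# All pairwise overlaps (sigma = 2), keyed by the pair that formed them.
meta = {
    (i, j): overlap(chromosomes[i], chromosomes[j])
    for i, j in combinations(range(len(chromosomes)), 2)
}

for pair, pattern in meta.items():
    print(pair, sorted(pattern))
```

The overlap of the first two toy chromosomes skips position 2, so it is a noncontiguous pattern. Keying each pattern by the pair that produced it mirrors the distinction in the text between the chromosomes a meta-chromosome comprises and the possibly larger set of chromosomes that contain it.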

The overlap of a number $\sigma$ of meta-chromosomes produces a meta-meta-chromosome defined by the intersection $h' = h_{i_1} \cap \cdots \cap h_{i_\sigma}$. If each meta-chromosome $h_i$ should comprise exactly $\sigma > 1$ distinct chromosomes, then $h'$ may comprise as few as $\sigma + 1$ chromosomes; for example, if any $h_{i_j}$ should share all but one chromosome with any other $h_{i_{j'}}$, then the $\sigma$-tuples (with $\sigma = 3$, say) comprising 123, 124, and 134 will produce the $(\sigma+1)$-tuple 1234. Thus, it is always possible to add exactly one of each unique chromosome to any meta-chromosome. This feature is known as the add-one property and can be used to show that iteration of chromosome overlap forms all possible chromosome combinations, not just a sequence of $\sigma$-tuples, $\sigma^2$-tuples, etc. For if the process should be iterated such that all $\sigma$-overlaps of meta-chromosomes $h^{(r-1)}$ comprising $\sigma+r-1$ chromosomes are formed by the start of iteration $r \ge 1$, then all meta-meta-chromosomes comprising $\sigma+r$ chromosomes will be produced by the start of iteration $r+1$. Hence, iterated overlap generates all closed-patterns shared by at least $\sigma$ chromosomes. If the meta-chromosomes should be filtered at some iteration (e.g., $r = 1$ in the main text) such that only a subset are kept for further overlaps, then the closed-patterns found become only the closed-patterns of the filtered set, and the word “chromosome” should be replaced with “filtered meta-chromosome.”

Once all overlaps are computed at iteration $r$, the unique meta-chromosomes are found by pruning the list of duplicates. Now, if a meta-chromosome $h^{(r'-1)}$ appearing at iteration $r'$ should appear again at iteration $r > r'$, then $h^{(r'-1)}$ actually will be found to comprise (at least) $\sigma+r-1$ chromosomes. But as the previous iterations have already added every single chromosome one-by-one to $h^{(r'-1)}$ via the add-one property, it will not be necessary to overlap $h^{(r-1)} = h^{(r'-1)}$ in iteration $r$, in which $(\sigma+r)$-tuples are generated from $(\sigma+r-1)$-tuples which each share all but one chromosome with $h^{(r-1)}$. Thus, before the start of iteration $r+1$, all meta-meta-chromosomes which are found to have been generated in previous iterations are purged from the list. The process is complete at the first iteration in which no patterns are found to be novel.
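The claim that iterated pairwise overlap reaches every closed pattern shared by at least two chromosomes can be checked on toy data. The sketch below is a deliberately naive fixed-point version: at each round it intersects all patterns found so far and stops when nothing new appears. It does not implement the purging of previously discovered patterns described above, and the function names and data are hypothetical.

```python
from itertools import combinations


def closed_patterns(chromosomes):
    """Close the set of pairwise overlaps under pairwise intersection.
    Returns (set of patterns, number of extra rounds that found new ones)."""
    base = [frozenset(enumerate(c)) for c in chromosomes]
    found = {a & b for a, b in combinations(base, 2)}  # initial pairwise overlap
    rounds = 0
    while True:
        new = {a & b for a, b in combinations(found, 2)} - found
        if not new:  # no novel pattern: the process is complete
            return found, rounds
        found |= new
        rounds += 1


def brute_force(chromosomes):
    """Intersection of every group of two or more chromosomes."""
    base = [frozenset(enumerate(c)) for c in chromosomes]
    result = set()
    for size in range(2, len(base) + 1):
        for group in combinations(base, size):
            result.add(frozenset.intersection(*group))
    return result


# Hypothetical toy data: five chromosomes at six biallelic SNPs.
toy = [
    (0, 1, 1, 0, 1, 0),
    (0, 1, 0, 0, 1, 1),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 1, 1, 1, 0),
    (0, 1, 1, 0, 0, 1),
]

patterns, rounds = closed_patterns(toy)
print(len(patterns), "closed patterns;", rounds, "extra round(s)")
```

Comparing `patterns` with `brute_force(toy)` confirms on this small input that the iterated procedure and the exhaustive enumeration of groups agree. The exhaustive version grows exponentially with the number of chromosomes, which is the cost the iterative scheme with filtering is designed to avoid.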

Triangular-array algorithm

Here, we describe the algorithm for determining the chromosome combination corresponding to an arbitrary index. Let $N$ be the number of samples and $\sigma$ be the degree of overlap. Our problem is to find, for a given $N$ and $\sigma$, which chromosome combination is the $I$th. To answer this question without generating a list of all $\binom{2N}{\sigma}$ chromosome combinations, we first index each combination by $I \in \{0, 1, \dots, \binom{2N}{\sigma} - 1\}$ corresponding to a unique multiindex $\mathbf{I} = (i_1, i_2, \dots, i_\sigma)$ of $\sigma$ different chromosomes with $i_1 < i_2 < \cdots < i_\sigma$. The mapping from $I$ to $\mathbf{I}$ is accomplished iteratively using the following lemma. In what follows, we put $N \leftarrow 2N$ for notational convenience.

Lemma 1.

For any integers $N$ and $\sigma \le N$, $\binom{N}{\sigma} + \binom{N-1}{\sigma} + \cdots + \binom{N-(N-\sigma)}{\sigma} = \binom{N+1}{\sigma+1}$.

Proof. We will prove the lemma by induction. Clearly the result holds when $\sigma = 1$ as the usual integer-summation formula, and the result when $\sigma = N$ is trivial. Now, suppose that each term on the l.h.s. is the number of ways to arrange $N-j$ things ($0 \le j \le N-\sigma$) in a $\sigma$-dimensional triangular array (Fig S5); thus, $\binom{N}{\sigma}$ corresponds to the number of arrangements of the items $\{2, \dots, N+1\}$, $\binom{\sigma}{\sigma}$ to the arrangements of $\{N+1\}$, and, in general, $\binom{N-j}{\sigma}$ to arrangements of $\{j+2, \dots, N+1\}$. Now, append to each arrangement an element $i_1$ which is one less than the minimum item $j+2$ of each array. In this way, we form all $(\sigma+1)$-combinations of the elements $\{1, 2, \dots, N+1\}$, of which there are $\binom{N+1}{\sigma+1}$. Because the l.h.s. is equal to the r.h.s., the result follows for all $\sigma$ up to $N$.
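Lemma 1 is easy to spot-check numerically. The following Python sketch, which is not part of the authors' code, evaluates both sides of the identity for a range of small $N$ and $\sigma$ using the standard-library `math.comb`.

```python
from math import comb


def lemma_lhs(n, sigma):
    """Left-hand side of Lemma 1: C(n, s) + C(n-1, s) + ... + C(s, s)."""
    return sum(comb(n - j, sigma) for j in range(n - sigma + 1))


def lemma_rhs(n, sigma):
    """Right-hand side of Lemma 1: C(n+1, s+1)."""
    return comb(n + 1, sigma + 1)


# Check the identity for all 1 <= sigma <= n <= 12.
checked = 0
for n in range(1, 13):
    for sigma in range(1, n + 1):
        assert lemma_lhs(n, sigma) == lemma_rhs(n, sigma)
        checked += 1

print(f"Lemma 1 holds in all {checked} cases tested")
```

A finite check is of course not a proof, but it guards against sign or off-by-one slips when the lemma is used to locate layers of the triangular array.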

To use Lemma 1, arrange the values of $\mathbf{I}$ in a $\sigma$-dimensional triangular array such that $i_1$ takes values in $\{1, \dots, N-\sigma+1\}$, $i_2$ takes values in $\{i_1+1, \dots, N-\sigma+2\}$, and in general, $i_j$ takes values in $\{i_{j-1}+1, \dots, N-\sigma+j\}$ (Fig S5). Following the convention that the first dimension of a matrix is its rows, we have that $i_1$ corresponds to the $\sigma$th dimension, $i_\sigma$ to the first, and $i_j$ to the $(\sigma-j+1)$th. Starting from $i_1 = 1, i_2 = 2, \dots, i_\sigma = \sigma$ and incrementing from right to left (i.e., ranging through all values in the first dimension before changing the value in the second), the $I$th element will be the multiindex $\mathbf{I}$. Determining the indices $i_j$ is equivalent to determining the value or layer of the $j$th dimension of the array in which the $I = I^{(0)}$th element resides. To determine the first layer, recognize that, for fixed $i_1$, there are only $\binom{N-i_1}{\sigma-1}$ chromosome combinations available to the remaining indices, because $i_j > i_1$ for all $j > 1$. Then, $i_1$ is determined by subtracting layers in the $\sigma$th dimension until the smallest $k$ is found which satisfies $I^{(0)} - \sum_{l=1}^{k} \binom{N-l}{\sigma-1} < 0$, or equivalently (by Lemma 1) $\varepsilon_1 := \left[\binom{N}{\sigma} - \binom{N-k}{\sigma}\right] - I^{(0)} > 0$, and putting $i_1 \leftarrow k$. The error $\varepsilon_1$ is the remainder beyond $I^{(0)} - 1$ in the $(\sigma-1)$-dimensional array, which has $\binom{N-i_1}{\sigma-1}$ combinations. Thus, after determining $i_j$, put $I^{(j)} \leftarrow \binom{N-i_j}{\sigma-j} - \varepsilon_j$ and get the position of the multiindex $i_{j+1}, \dots, i_\sigma$ in the $(\sigma-j)$-dimensional array of $\binom{N-i_j}{\sigma-j}$ combinations. To do so, solve again for the smallest $k$ such that $\varepsilon_{j+1} := \left[\binom{N-i_j}{\sigma-j} - \binom{N-i_j-k}{\sigma-j}\right] - I^{(j)} > 0$ and put $i_{j+1} \leftarrow i_j + k$, because $i_{j+1}$ starts from $i_j + 1$. The algorithm terminates when $I^{(\sigma-1)} = \binom{N-i_{\sigma-1}}{1} - \varepsilon_{\sigma-1}$ is the position of the multiindex $i_\sigma$ in the one-dimensional array of $N - i_{\sigma-1}$ “combinations” of single chromosomes.

The foregoing algorithm is implemented in the R script index2combo2.R, which takes parameters I (the position), n (the total number of chromosomes), and sigma (the degree of overlap). We use zero-based indexing so that $I$ ranges from 0 to $\binom{n}{\sigma} - 1$. The output is a list $i_1, \dots, i_\sigma$ of strictly increasing numbers corresponding to the multiindex $\mathbf{I}$. If instead a non-decreasing sequence is desired, the parameter allow.repeats can be used, which increases the total number of chromosomes by $\sigma - 1$. With this modification, each $i_j$ beyond $i_1$ can begin at exactly $i_{j-1}$ if combinations are subject to the replacement $i_j \leftarrow i_j - j + 1$.
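As an illustration of the index-to-combination mapping, here is a minimal Python sketch written from the description in this section. It is not a port of the authors' R script index2combo2.R; the function name `index_to_combo` is hypothetical, and the `allow.repeats` option is not covered. Elements are numbered from 1 and the index `I` from 0, as in the text.

```python
from itertools import combinations
from math import comb


def index_to_combo(I, n, sigma):
    """Return the I-th (zero-based) sigma-combination of {1, ..., n} in
    lexicographic order, without enumerating the preceding combinations."""
    if not 0 <= I < comb(n, sigma):
        raise ValueError("index out of range")
    combo = []
    prev = 0        # last index chosen so far (i_0 = 0)
    position = I    # zero-based position within the current sub-array
    for j in range(sigma):
        size = sigma - j                  # indices still to be chosen
        total = comb(n - prev, size)      # combinations in the sub-array
        k = 1
        # Smallest k whose cumulative layer count exceeds the position;
        # by Lemma 1 that count is total - C(n - prev - k, size).
        while total - comb(n - prev - k, size) - position <= 0:
            k += 1
        eps = total - comb(n - prev - k, size) - position
        prev += k
        combo.append(prev)
        # Position of the remaining indices within the chosen layer.
        position = comb(n - prev, size - 1) - eps
    return tuple(combo)


# Example matching Fig S5: sigma = 4 chromosomes out of a total of 8.
n, sigma = 8, 4
unranked = [index_to_combo(I, n, sigma) for I in range(comb(n, sigma))]
reference = list(combinations(range(1, n + 1), sigma))
print(unranked[0], unranked[-1], unranked == reference)
```

Because each index is resolved independently of all others, a worker given any range of `I` values can reconstruct its own combinations, which is what makes the overlaps straightforward to distribute.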

Figure S5. A generalized triangular array for organizing σ-tuples of chromosomes from a fixed total.

The figure shows how all combinations of $\sigma = 4$ chromosomes (e.g., {1, 2, 3, 4}, {1, 2, 5, 8}, {5, 6, 7, 8}) from a fixed total of $2N = 8$ can be indexed. The lowest index $i_{\sigma-3} = i_1$ takes all values from 1 to $8 - \sigma + (\sigma - 3) = 5$, corresponding to the layers of the array, and in general, index $i_{\sigma-j}$ spans the integers $i_{\sigma-j-1} + 1$ to $8 - j$ in the $(j+1)$th dimension. The process may be generalized to any value of $\sigma$ by adding the appropriate number of dimensions.

Filtering

In the initial pairwise overlap of the unique chromosomes from $N$ individuals, there are a maximum of $\binom{2N}{2}$ distinct meta-chromosomes, a number which will quickly grow to an astronomical size upon iteration. To limit the growth rate, a strict P-value threshold was implemented to select only those haplotype patterns most associated with breast cancer risk in the discovery dataset (see below), and subsequent overlap iterations were performed only on this set. The threshold was set for each examined locus in such a way that the maximum number of meta-chromosomes generated in any iteration was between $10^5$ and $10^6$. The less stringent the threshold, the smaller was our chance of missing a risk haplotype appearing for the first time at a later iteration.

An additional filtering step was applied when the primary filtering was not enough to dampen the growth rate: when a family of patterns not related as subset–superset had the same frequency and crude breast cancer association OR, only the member with the fewest alleles was carried forward. Although not theoretically justified, we reasoned that this filtering removed long patterns that differed by at most a few alleles.

Computational considerations

Chromosome Overlap is implemented in the IBM high-performance computing environment LSF and is generalizable to any degree of overlap $\sigma$ using an algorithm that iteratively and recursively finds which tuple $i_1, i_2, \dots, i_\sigma$ in a $\sigma$-dimensional triangular array (Fig S5) corresponds to the $I$th overlap of $\sigma$ meta-chromosomes (see above), alleviating the need to generate the complete list of $\binom{N_r}{\sigma}$ overlaps of $N_r$ meta-chromosomes at each iteration $r \ge 0$ and allowing the operations to be computed in parallel. In this article, we only consider pairwise overlap, that is, $\sigma = 2$. This restriction means that we potentially filter out risk-associated patterns which appear for the first time when three or more chromosomes are overlapped, but ensures that we test all patterns shared by at least two.
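One conventional way to exploit index-based access for parallel work is to split the range of overlap indices into contiguous blocks, one per job. The Python sketch below shows only that partitioning step. How the authors actually divide work among LSF jobs is not described here, so the function name `job_ranges` and the choice of eight jobs are assumptions made for illustration.

```python
from math import comb


def job_ranges(n_items, sigma, n_jobs):
    """Split the zero-based indices 0 .. C(n_items, sigma) - 1 into n_jobs
    contiguous half-open ranges whose sizes differ by at most one."""
    total = comb(n_items, sigma)
    base, extra = divmod(total, n_jobs)
    ranges = []
    start = 0
    for job in range(n_jobs):
        stop = start + base + (1 if job < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# Example: pairwise overlaps (sigma = 2) of 127 retained meta-chromosomes,
# the starting count reported for the chr22 locus, over 8 hypothetical jobs.
ranges = job_ranges(127, 2, 8)
for job, (start, stop) in enumerate(ranges):
    print(f"job {job}: indices {start} to {stop - 1}")
```

Each job would then convert the indices in its range to tuples of meta-chromosomes on the fly, so no job needs the full list of overlaps in memory.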

Haplotype data

Publicly available haplotype data from the UKBB were used for the discovery analysis. As described previously (Bycroft et al, 2018), this SHAPEIT3-phased dataset consists of 487,409 samples phased at 658,720 autosomal SNPs on the UKBB Axiom Array. The discovery analysis included a total of 181,034 women from the UKBB “white British” ancestry subset after excluding those who (1) were identified to be outliers in heterozygosity or genotype missingness rates; (2) showed any sex chromosome aneuploidies; (3) were second-degree or closer relatives of any or third-degree relatives of more than 10 other genotyped individuals; or (4) withdrew from UKBB before this study began. Genotype principal components for the discovery analysis in the 181,034 women were obtained from the UKBB portal. The preceding steps were performed in PLINK (Purcell et al, 2007) and KING (Manichaikul et al, 2010), as described in our previous study (Wang et al, 2023). Breast cancer (UKBB data-field 40006, ICD10 code C50) was reported in 9,011 women with a mean (SD) age of onset of 56.2 (8.6) years; the remaining 172,023 women free of breast cancer had a mean (SD) age of 65.0 (7.9) years at the end of follow-up.

All of the UKBB genotype and phenotype data used in this study are available from the UKBB Portal (https://www.ukbiobank.ac.uk/).

Female subjects in the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) study were used for the replication analysis. Briefly, the DRIVE study was initiated in 2010 as part of the NCI’s Genetic Associations and Mechanisms in Oncology initiative (http://epi.grants.cancer.gov/gameon/) and includes data from 60,015 breast cancer cases and controls genotyped on the custom Illumina OncoArray (Amos et al, 2017). The 528,620 OncoArray SNPs were filtered to remove SNPs which (1) had genotype missingness rate >10%; (2) were monomorphic; or (3) were not in Hardy–Weinberg equilibrium (P < 1 × 10⁻¹⁰), leaving a total of 433,297. A total of 4,669 subjects were excluded who (1) showed any sex chromosome aneuploidies; (2) had genotyping rates <90%; (3) were identified as male by PLINK’s sex check; (4) were second-degree or closer relatives of any other subject; (5) were not of European ancestry according to principal components analysis including the 1000 Genomes Project Phase 3 data; or (6) had missing age data, all as described in our previous study (Wang et al, 2023). The preceding steps were performed in PLINK (Purcell et al, 2007), KING (Manichaikul et al, 2010), and FlashPCA2 (Abraham et al, 2017), leaving 30,064 cases (mean [SD] age at onset: 61.7 [10.7] years) and 25,282 controls (mean [SD] age at end of follow-up: 59.6 [10.7] years).

All of the DRIVE data used in this study are publicly available from dbGaP under accession number phs001265.v1.p1.

Although UKBB contained fewer breast cancer cases than did DRIVE, the former dataset was chosen for the discovery analysis because of its greater genetic homogeneity. To improve the coverage of UKBB Axiom SNPs on the DRIVE OncoArray and minimize systematic phasing errors between the datasets, we carried out genotype imputation as part of our previously published genome-wide rare-haplotype analysis (Wang et al, 2023). Genotype imputation and phasing were performed on the TOPMed Imputation Server (https://imputation.biodatacatalyst.nhlbi.nih.gov) running Minimac4 (Das et al, 2016) and Eagle2 (Loh et al, 2016), with the TOPMed panel including 194,512 haplotypes and 308 million sequenced variants being used as the reference (https://topmed.nhlbi.nih.gov) (Taliun et al, 2021). Only high-quality imputed SNPs (with Minimac Rsq ≥0.8) that were genotyped on the UKBB Axiom Array and biallelic in both cohorts (after eliminating alleles with MAF = 0) were used to form haplotypes in the selected regions. All genomic coordinates quoted in this article are with respect to the GRCh38 Homo sapiens assembly.

ENCODE data

ChIP-seq and HiC data referenced in this study are available from the ENCODE Portal (https://www.encodeproject.org/) with accession numbers ENCSR000DML (MCF-7 CTCF, P-value bigWig file: ENCFF877ZYR), ENCSR954WVZ (MCF-7 ESRRA, P-value bigWig file: ENCFF059ZSC), ENCSR126YEB (MCF-7 FOXA1, P-value bigWig file: ENCFF512UGW), ENCSR000DMQ (MCF-7 MYC, P-value bigWig file: ENCFF149FXH), and ENCSR660LPJ (MCF-7 Hi-C, contact matrix file: ENCFF420JTA).

Statistical analysis

All statistical analyses described in this article were carried out in R (R Core Team, 2022) version 3.6.1.

Two-phase analysis

Chromosome Overlap was applied in two phases on the case-only portion of the UKBB discovery dataset. In Phase 1, a common haplotype linked to the GWAS hit was discovered. In Phase 2, rare subtypes of the common haplotype were discovered via conditional analysis.

Phase 1: common-haplotype discovery

For Phase 1, haplotypes containing UKBB-genotyped SNPs in a region ±100 kb from a GWAS hit were extracted using BCFtools (Li, 2011) version 1.10.2; if the number of genotyped SNPs was greater than 60, the region was reduced symmetrically. After forming all pairwise overlaps of chromosomes from affected individuals, the association of each meta-chromosome with breast cancer was assessed using Fisher’s exact test in all UKBB breast cancer cases and controls. The top ∼100 patterns in Iteration 1 by P-value were retained for iterated overlap to obtain the complete set of closed-patterns, whose associations were assessed by Fisher’s exact test upon completion of the last iteration. The top ∼20,000–30,000 haplotypes by P-value were then further evaluated one-by-one for their association with breast cancer incidence rates under an additive genetic model using a Cox proportional hazards model having age as the time axis. All analyses were adjusted for the first 10 genotype principal components, and the number (zero, one, or two) of copies of each haplotype was modelled as a continuous variable. The statistical significance of the adjusted hazard ratio (HR) of each haplotype was evaluated using the one-degree-of-freedom LRT. The single most significant haplotype was designated h1 after verifying by stepwise forward-selection that no other haplotype had P < 1.0 × 10⁻⁵ relative to the model with h1.
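The screening step based on Fisher’s exact test can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the authors' implementation: the 2 × 2 table is assumed to count pattern-carrying and non-carrying chromosomes in cases and controls, the test is assumed to be two-sided, and the carrier counts in the example are rounded, hypothetical values. The function name `fisher_exact_two_sided` is ours.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]:
    the sum of hypergeometric probabilities no larger than that observed."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_obs = log_p(a)
    total = 0.0
    for x in range(lo, hi + 1):
        lp = log_p(x)
        if lp <= log_obs + 1e-7:  # small tolerance for floating-point ties
            total += exp(lp)
    return min(1.0, total)


# Hypothetical counts: carrier and non-carrier chromosomes among
# 18,022 case chromosomes and 344,046 control chromosomes.
case_carriers, control_carriers = 90, 636
p_value = fisher_exact_two_sided(
    case_carriers, 18022 - case_carriers,
    control_carriers, 344046 - control_carriers,
)
print(f"Fisher's exact P = {p_value:.3g}")
```

Working in log space with `lgamma` keeps the hypergeometric terms from overflowing at biobank-scale counts. Ranking patterns by this P-value and keeping the top ∼100 would correspond to the filtering described in the text.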

Phase 2: rare-haplotype discovery

In Phase 2, h1-bearing case chromosomes in an extended region covering ∼100–200 SNPs in a ∼1 Mb region were extracted using BCFtools (Li, 2011) version 1.10.2. The region was chosen based on the boundaries of the TAD defined by the 3D Genome Browser (Wang et al, 2018) (Figs S1–S3) and the location of h1 so as to include one or both sides of the TAD between h1 and the TAD boundary. After forming all pairwise overlaps of h1-bearing chromosomes from breast cancer cases, the association of each meta-chromosome with breast cancer incidence rates over and above the risk attributable to h1 was assessed using an “h1-conditional” Fisher’s exact test, that is, one restricted to h1-bearing chromosomes only. The top ∼30 patterns in Iteration 1 by P-value were retained for iterated overlap to obtain the complete set of closed-patterns, whose associations were assessed by Fisher’s exact test upon completion of the last iteration. The resulting haplotypes with Fisher’s exact test P < 1.0 × 10⁻⁴ were further evaluated one-by-one for their association with breast cancer incidence rates under an additive genetic model using a Cox proportional hazards model having age as the time axis. All analyses were adjusted for h1 and the first 10 genotype principal components, and the number (zero, one, or two) of copies of each haplotype was modelled as a continuous variable. The statistical significance of the adjusted HR was evaluated using the one-degree-of-freedom LRT relative to the model with h1 alone. Haplotypes which had P < 1.0 × 10⁻⁵ relative to the model with h1 alone were carried forward for replication.

Replication

The association of h1 and each of its rare subtypes with breast cancer risk was assessed in DRIVE under an additive genetic model using logistic regression, the method appropriate to case-control designs. All analyses were adjusted for age and the first 10 genotype principal components, and the number (zero, one, or two) of copies of each haplotype was modelled as a continuous variable. The statistical significance of the OR was evaluated for each discovered haplotype in Phase 2 individually using the one-degree-of-freedom LRT relative to the model with h1 alone; a similar confirmation was done in Phase 1 relative to the null model. Successful replication was declared at P < 0.05.

Final set of rare risk haplotypes

The haplotypes which replicated in DRIVE (excepting h1) were again assessed in UKBB using stepwise forward selection. Haplotypes which had LRT P < 1.0 × 10⁻⁵ relative to the previous model were sequentially added to the model, beginning with the model having h1 alone. Those haplotypes surviving selection were designated h2, h3, etc., in order of selection; all were rare subtypes of h1.

Permutation analysis

The empirical false-positive rate of rare-haplotype replication was determined by permuting the case/control labels of h1-carriers in such a way that the number of h1 hetero- and homozygotes was preserved, and then repeating Chromosome Overlap on chromosomes from permuted pseudo-cases. In this way, the h1 association with breast cancer incidence rates from Phase 1 was preserved, and the expected number of rare haplotypes that replicated due merely to chance (i.e., the false-positive discovery rate) was estimated by computing the fraction of discovered haplotypes with P < 1.0 × 10⁻⁵ and HR > 1 in the permuted UKBB dataset which replicated in DRIVE with P < 0.05 and OR > 1.
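The label permutation can be sketched as a shuffle within genotype strata. The Python example below reflects one reading of the procedure, stated here as an assumption: case/control labels are shuffled separately among h1 heterozygotes and among h1 homozygotes, so that the number of cases in each group is unchanged and non-carriers are left alone. The function name and the toy data are hypothetical.

```python
import random


def permute_labels_within_strata(copies, is_case, seed=0):
    """Shuffle case/control labels among carriers, separately for
    heterozygotes (1 copy) and homozygotes (2 copies) of h1."""
    rng = random.Random(seed)
    permuted = list(is_case)
    for stratum in (1, 2):
        members = [i for i, c in enumerate(copies) if c == stratum]
        labels = [is_case[i] for i in members]
        rng.shuffle(labels)
        for i, label in zip(members, labels):
            permuted[i] = label
    return permuted


# Hypothetical toy cohort: h1 copy number and case status per individual.
copies = [0, 1, 1, 2, 1, 0, 2, 1, 1, 0]
is_case = [True, True, False, True, False, False, False, True, False, True]

pseudo_case = permute_labels_within_strata(copies, is_case, seed=1)


def cases_in_stratum(labels, stratum):
    return sum(lab for lab, c in zip(labels, copies) if c == stratum)


print("heterozygous cases:", cases_in_stratum(is_case, 1),
      "->", cases_in_stratum(pseudo_case, 1))
print("homozygous cases:", cases_in_stratum(is_case, 2),
      "->", cases_in_stratum(pseudo_case, 2))
```

Because the number of cases among heterozygotes and among homozygotes is fixed, the crude association of h1 itself is unchanged by the shuffle, while any further structure among h1-bearing case chromosomes is broken.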

Haplotype reduction

The minimal set of alleles necessary to define each h1 in Phase 1 was determined using rpart (Therneau & Atkinson, 1997), the recursive partitioning procedure implemented in R, with parameters cp = −1 and minsplit = 1 to ensure that the full tree was grown within a maximum of 30 steps. At each step, the allele chosen for splitting was the one that maximized the reduction in Gini impurity. Recursive partitioning was not used for the rare haplotypes in Phase 2, where the small number of haplotype carriers made it difficult to define a minimal set of alleles.
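The idea behind this reduction can be illustrated with a small greedy tree grown on binary alleles. This is a simplified sketch, not rpart itself: it assumes each chromosome is a list of 0/1 alleles labelled 1 if it carries the haplotype and 0 otherwise, it stops when no split gives a positive Gini reduction (rpart with cp = −1 would continue splitting), and all function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Index of the SNP whose split maximizes the Gini impurity reduction."""
    parent, n = gini(labels), len(labels)
    best_gain, best_snp = 0.0, None
    for j in range(len(rows[0])):
        left = [y for x, y in zip(rows, labels) if x[j] == 0]
        right = [y for x, y in zip(rows, labels) if x[j] == 1]
        if not left or not right:
            continue  # SNP is constant in this node
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - children
        if gain > best_gain:
            best_gain, best_snp = gain, j
    return best_snp


def defining_snps(rows, labels, max_depth=30):
    """Grow the tree until nodes are pure; return the SNP indices used."""
    used = set()

    def grow(node_rows, node_labels, depth):
        if depth >= max_depth or len(set(node_labels)) <= 1:
            return
        j = best_split(node_rows, node_labels)
        if j is None:
            return
        used.add(j)
        for allele in (0, 1):
            subset = [(x, y) for x, y in zip(node_rows, node_labels)
                      if x[j] == allele]
            grow([x for x, _ in subset], [y for _, y in subset], depth + 1)

    grow(rows, labels, 0)
    return used


# Toy data (hypothetical): the haplotype is allele 1 at SNP 0 and allele 0 at SNP 2.
rows = [[1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1],
        [0, 1, 1, 1], [1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 0, 0]]
labels = [1, 1, 0, 0, 0, 0, 0, 1]
snps = defining_snps(rows, labels)
```

In the toy data the tree needs only SNPs 0 and 2 to separate carriers from non-carriers, so SNPs 1 and 3 are not part of the minimal defining set.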

Data Availability

The code generated during this study is available on GitHub at https://github.com/wletsou/ChromosomeOverlap.

Acknowledgements

The authors thank the high-performance computing facility at St. Jude Children’s Research Hospital for computational support. This work was supported by R01CA216354 from the US National Cancer Institute (W Letsou, W Moon, C Im, Y Sapkota, LL Robison, and Y Yasui), T32CA225590 from the US National Cancer Institute (W Letsou), the American Lebanese Syrian Associated Charities (W Letsou, F Wang, Y Sapkota, LL Robison, and Y Yasui), the Alberta Machine Intelligence Institute (C Im and Y Yasui), and the New York Institute of Technology College of Arts & Sciences (W Letsou).

Author Contributions

  • W Letsou: conceptualization, software, formal analysis, methodology, and writing—original draft, review, and editing.

  • F Wang: software, methodology, and writing—review and editing.

  • W Moon: software.

  • C Im: methodology and writing—review and editing.

  • Y Sapkota: methodology and writing—review and editing.

  • LL Robison: funding acquisition, methodology, and writing—review and editing.

  • Y Yasui: conceptualization, formal analysis, supervision, funding acquisition, methodology, and writing—original draft, review, and editing.

Conflict of Interest Statement

The authors declare that they have no conflict of interest.

  • Received May 24, 2023.
  • Revision received July 24, 2023.
  • Accepted July 26, 2023.
  • © 2023 Letsou et al.

This article is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by/4.0/).

References

  1. Abraham G, Qiu Y, Inouye M (2017) FlashPCA2: Principal component analysis of biobank-scale genotype datasets. Bioinformatics 33: 2776–2778. doi:10.1093/bioinformatics/btx299
  2. Amos CI, Dennis J, Wang Z, Byun J, Schumacher FR, Gayther SA, Casey G, Hunter DJ, Sellers TA, Gruber SB, et al. (2017) The OncoArray consortium: A network for understanding the genetic architecture of common cancers. Cancer Epidemiol Biomarkers Prev 26: 126–135. doi:10.1158/1055-9965.EPI-16-0106
  3. Anderson CA, Soranzo N, Zeggini E, Barrett JC (2011) Synthetic associations are unlikely to account for many common disease genome-wide association signals. PLoS Biol 9: e1000580. doi:10.1371/journal.pbio.1000580
  4. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature 562: 203–209. doi:10.1038/s41586-018-0579-z
  5. Carroll JS, Liu XS, Brodsky AS, Li W, Meyer CA, Szary AJ, Eeckhoute J, Shao W, Hestermann EV, Geistlinger TR, et al. (2005) Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell 122: 33–43. doi:10.1016/j.cell.2005.05.008
  6. Chan CS, Song JS (2008) CCCTC-binding factor confines the distal action of estrogen receptor. Cancer Res 68: 9041–9049. doi:10.1158/0008-5472.CAN-08-2632
  7. Cowper-Sal·lari R, Zhang X, Wright JB, Bailey SD, Cole MD, Eeckhoute J, Moore JH, Lupien M (2012) Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression. Nat Genet 44: 1191–1198. doi:10.1038/ng.2416
  8. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze SI, Chew EY, Levy S, McGue M, et al. (2016) Next-generation genotype imputation service and methods. Nat Genet 48: 1284–1287. doi:10.1038/ng.3656
  9. Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, Hilton JA, Jain K, Baymuradov UK, Narayanan AK, et al. (2018) The encyclopedia of DNA elements (ENCODE): Data portal update. Nucleic Acids Res 46: D794–D801. doi:10.1093/nar/gkx1081
  10. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB (2010) Rare variants create synthetic genome-wide associations. PLoS Biol 8: e1000294. doi:10.1371/journal.pbio.1000294
  11. Eeckhoute J, Carroll JS, Geistlinger TR, Torres-Arzayus MI, Brown M (2006) A cell-type-specific transcriptional network required for estrogen regulation of cyclin D1 and cell cycle progression in breast cancer. Genes Dev 20: 2513–2526. doi:10.1101/gad.1446006
  12. ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74. doi:10.1038/nature11247
  13. Falk CT, Rubinstein P (1987) Haplotype relative risks: An easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet 51: 227–233. doi:10.1111/j.1469-1809.1987.tb00875.x
  14. Fang G, Haznadar M, Wang W, Yu H, Steinbach M, Church TR, Oetting WS, Van Ness B, Kumar V (2012) High-order SNP combinations associated with complex diseases: Efficient discovery, statistical power and functional interactions. PLoS One 7: e33531. doi:10.1371/journal.pone.0033531
  15. French JD, Ghoussaini M, Edwards SL, Meyer KB, Michailidou K, Ahmed S, Khan S, Maranian MJ, O’Reilly M, Hillman KM, et al. (2013) Functional variants at the 11q13 risk locus for breast cancer regulate cyclin D1 expression through long-range enhancers. Am J Hum Genet 92: 489–503. doi:10.1016/j.ajhg.2013.01.002
  16. Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW (2019) Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10: 1776. doi:10.1038/s41467-019-09718-5
  17. Gonzalez-Neira A, Rosa-Rosa JM, Osorio A, Gonzalez E, Southey M, Sinilnikova O, Lynch H, Oldenburg RA, van Asperen CJ, Hoogerbrugge N, et al. (2007) Genomewide high-density SNP linkage analysis of non-BRCA1/2 breast cancer families identifies various candidate regions and has greater power than microsatellite studies. BMC Genomics 8: 299. doi:10.1186/1471-2164-8-299
  18. Hurtado A, Holmes KA, Ross-Innes CS, Schmidt D, Carroll JS (2011) FOXA1 is a key determinant of estrogen receptor function and endocrine response. Nat Genet 43: 27–33. doi:10.1038/ng.730
  19. Jansen-Dürr P, Meichle A, Steiner P, Pagano M, Finke K, Botz J, Wessbecher J, Draetta G, Eilers M (1993) Differential modulation of cyclin gene expression by MYC. Proc Natl Acad Sci U S A 90: 3685–3689. doi:10.1073/pnas.90.8.3685
  20. Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27: 2987–2993. doi:10.1093/bioinformatics/btr509
  21. Li N, Rowley SM, Thompson ER, McInerny S, Devereux L, Amarasinghe KC, Zethoven M, Lupat R, Goode D, Li J, et al. (2018) Evaluating the breast cancer predisposition role of rare variants in genes associated with low-penetrance breast cancer risk SNPs. Breast Cancer Res 20: 3. doi:10.1186/s13058-017-0929-z
  22. Li D, Harrison JK, Purushotham D, Wang T (2022) Exploring genomic data coupled with 3D chromatin structures using the WashU Epigenome Browser. Nat Methods 19: 909–910. doi:10.1038/s41592-022-01550-y
  23. Lindström S, Ablorh A, Chapman B, Gusev A, Chen G, Turman C, Eliassen AH, Price AL, Henderson BE, Le Marchand L, et al. (2016) Deep targeted sequencing of 12 breast cancer susceptibility regions in 4611 women across four different ethnicities. Breast Cancer Res 18: 109. doi:10.1186/s13058-016-0772-7
  24. Loh PR, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, Schoenherr S, Forer L, McCarthy S, Abecasis GR, et al. (2016) Reference-based phasing using the haplotype reference consortium panel. Nat Genet 48: 1443–1448. doi:10.1038/ng.3679
  25. Long J, Cai Q, Shu XO, Qu S, Li C, Zheng Y, Gu K, Wang W, Xiang YB, Cheng J, et al. (2010) Identification of a functional genetic variant at 16q12.1 for breast cancer risk: Results from the Asia Breast Cancer Consortium. PLoS Genet 6: e1001002. doi:10.1371/journal.pgen.1001002
  26. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873. doi:10.1093/bioinformatics/btq559
  27. Mars N, Koskela JT, Ripatti P, Kiiskinen TTJ, Havulinna AS, Lindbohm JV, Ahola-Olli A, Kurki M, Karjalainen J, Palta P, et al. (2020) Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat Med 26: 549–557. doi:10.1038/s41591-020-0800-0
  28. Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A, Tyrer JP, Chen TH, Wang Q, Bolla MK, et al. (2019) Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am J Hum Genet 104: 21–34. doi:10.1016/j.ajhg.2018.11.002
  29. McCarthy MI, Zeggini E, Jafar-Mohammad B, Timpson NJ, Frayling TM, Weedon MN, Elliot KS, Lindgren CM, Lango H, Perry JR, et al. (2008) Analysis of overlap between type 2 diabetes signals identified through genome-wide linkage and association approaches. Diabetes 57S1: A321. https://professional.diabetes.org/abstract/analysis-overlap-between-type-2-diabetes-signals-identified-through-genome-wide-linkage-and
  30. McClellan J, King MC (2010) Genetic heterogeneity in human disease. Cell 141: 210–217. doi:10.1016/j.cell.2010.03.032
  31. Meyer KB, Maia AT, O’Reilly M, Teschendorff AE, Chin SF, Caldas C, Ponder BAJ (2008) Allele-specific up-regulation of FGFR2 increases susceptibility to breast cancer. PLoS Biol 6: e108. doi:10.1371/journal.pbio.0060108
  32. Michailidou K, Lindström S, Dennis J, Beesley J, Hui S, Kar S, Lemaçon A, Soucy P, Glubb D, Rostamianfar A, et al. (2017) Association analysis identifies 65 new breast cancer risk loci. Nature 551: 92–94. doi:10.1038/nature24284
  33. Möller S, Mucci LA, Harris JR, Scheike T, Holst K, Halekoh U, Adami HO, Czene K, Christensen K, Holm NV, et al. (2016) The heritability of breast cancer among women in the Nordic Twin Study of Cancer. Cancer Epidemiol Biomarkers Prev 25: 145–150. doi:10.1158/1055-9965.EPI-15-0913
  34. Okazaki A, Horpaopan S, Zhang Q, Randesi M, Ott J (2021) Genotype pattern mining for pairs of interacting variants underlying digenic traits. Genes (Basel) 12: 1160. doi:10.3390/genes12081160
  35. Ott J, Wang J, Leal SM (2015) Genetic linkage analysis in the age of whole-genome sequencing. Nat Rev Genet 16: 275–284. doi:10.1038/nrg3908
  36. Pan F, Cong G, Tung AKH, Yang J, Zaki MJ (2003) CARPENTER: Finding closed patterns in long biological datasets. In KDD03: The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 637–642. New York, NY: Association for Computing Machinery.
  37. Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Efficient mining of association rules using closed itemset lattices. Inform Syst 24: 25–46. doi:10.1016/S0306-4379(99)00003-4
  38. Pharoah PDP, Antoniou A, Bobrow M, Zimmern RL, Easton DF, Ponder BAJ (2002) Polygenic susceptibility to breast cancer and implications for prevention. Nat Genet 31: 33–36. doi:10.1038/ng853
  39. Philipp A, Schneider A, Västrik I, Finke K, Xiong Y, Beach D, Alitalo K, Eilers M (1994) Repression of cyclin D1: A novel function of MYC. Mol Cell Biol 14: 4032–4043. doi:10.1128/mcb.14.6.4032-4043.1994
  40. Pounraja VK, Girirajan S (2022) A general framework for identifying oligogenic combinations of rare variants in complex disorders. Genome Res 32: 904–915. doi:10.1101/gr.276348.121
  41. Prokopenko I, Zeggini E, Hanson RL, Mitchell BD, Rayner NW, Akan P, Baier L, Das SK, Elliott KS, Fu M, et al. (2009) Linkage disequilibrium mapping of the replicated type 2 diabetes linkage signal on chromosome 1q. Diabetes 58: 1704–1709. doi:10.2337/db09-0081
  42. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al. (2007) PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. doi:10.1086/519795
  43. R Core Team (2022) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-Project.org/
  44. Relator RT, Terada A, Sese J (2018) Identifying statistically significant combinatorial markers for survival analysis. BMC Med Genomics 11: 31. doi:10.1186/s12920-018-0346-x
  45. Rosa-Rosa JM, Pita G, Urioste M, Llort G, Brunet J, Lázaro C, Blanco I, Ramón y Cajal T, Díez O, de la Hoya M, et al. (2009a) Genome-wide linkage scan reveals three putative breast-cancer-susceptibility loci. Am J Hum Genet 84: 115–122. doi:10.1016/j.ajhg.2008.12.013
  46. Rosa-Rosa JM, Pita G, González-Neira A, Milne RL, Fernandez V, Ruivenkamp C, van Asperen CJ, Devilee P, Benitez J (2009b) A 7 Mb region within 11q13 may contain a high penetrance gene for breast cancer. Breast Cancer Res Treat 118: 151–159. doi:10.1007/s10549-009-0317-1
  47. Ross-Innes CS, Brown GD, Carroll JS (2011) A co-ordinated interaction between CTCF and ER in breast cancer cells. BMC Genomics 12: 593. doi:10.1186/1471-2164-12-593
  48. Schaid DJ (2004) Evaluating associations of haplotypes with traits. Genet Epidemiol 27: 348–364. doi:10.1002/gepi.20037
  49. Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, Groza T, Güneş O, Hall P, Hayhurst J, et al. (2023) The NHGRI-EBI GWAS catalog: Knowledgebase and deposition resource. Nucleic Acids Res 51: D977–D985. doi:10.1093/nar/gkac1010
  50. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM, et al. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590: 290–299. doi:10.1038/s41586-021-03205-y
  51. Terada A, Okada-Hatakeyama M, Tsuda K, Sese J (2013) Statistical significance of combinatorial regulations. Proc Natl Acad Sci U S A 110: 12996–13001. doi:10.1073/pnas.1302233110
  52. Terada A, Yamada R, Tsuda K, Sese J (2016) LAMPLINK: Detection of statistically significant SNP combinations from GWAS data. Bioinformatics 32: 3513–3515. doi:10.1093/bioinformatics/btw418
  53. Therneau TM, Atkinson EJ (1997) An introduction to recursive partitioning using the RPART routines. Mayo Foundation Technical Report 61: 452. https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
  54. Udler MS, Ahmed S, Healey CS, Meyer K, Struewing J, Maranian M, Kwon EM, Zhang J, Tyrer J, Karlins E, et al. (2010) Fine scale mapping of the breast cancer 16q12 locus. Hum Mol Genet 19: 2507–2515. doi:10.1093/hmg/ddq122
  55. Uno T, Asai T, Uchida Y, Arimura H (2004) An efficient algorithm for enumerating closed patterns in transaction databases. In Lecture Notes in Computer Science. Suzuki E, Arikawa S (eds.). pp 16–31. Berlin: Springer.
  56. Wang Y, Song F, Zhang B, Zhang L, Xu J, Kuang D, Li D, Choudhary MNK, Li Y, Hu M, et al. (2018) The 3D genome browser: A web-based browser for visualizing 3D genome organization and long-range chromatin interactions. Genome Biol 19: 151. doi:10.1186/s13059-018-1519-9
  57. Wang J, Dai X, Berry LD, Cogan JD, Liu Q, Shyr Y (2019) HACER: An atlas of human active enhancers to interpret regulatory variants. Nucleic Acids Res 47: D106–D112. doi:10.1093/nar/gky864
  58. Wang F, Moon W, Letsou W, Sapkota Y, Wang Z, Im C, Baedke J, Robison L, Yasui Y (2023) Genome-wide analysis of rare haplotypes associated with breast cancer risk. Cancer Res 83: 332–345. doi:10.1158/0008-5472.CAN-22-1888
  59. Wray NR, Purcell SM, Visscher PM (2011) Synthetic associations created by rare variants do not explain most GWAS results. PLoS Biol 9: e1000579. doi:10.1371/journal.pbio.1000579
  60. Yasui Y, Letsou W, Wang F, Im C, Sapkota Y, Wang Z, Salehabadi SM, Baedke JL, Moon WJ, Liu Q, et al. (2023) Inference on the genetic architecture of breast cancer risk. Cancer Epidemiol Biomarkers Prev OF1-OF6. doi:10.1158/1055-9965.EPI-22-1073
  61. Yoshizoe K, Terada A, Tsuda K (2018) MP-LAMP: Parallel detection of statistically significant multi-loci markers on cloud platforms. Bioinformatics 34: 3047–3049. doi:10.1093/bioinformatics/bty219
  62. Zhang Y, Liang J, Li Y, Xuan C, Wang F, Wang D, Shi L, Zhang D, Shang Y (2010) CCCTC-binding factor acts upstream of FOXA1 and demarcates the genomic response to estrogen. J Biol Chem 285: 28604–28613. doi:10.1074/jbc.M110.149658
  63. Zhang H, Ahearn TU, Lecarpentier J, Barnes D, Beesley J, Qi G, Jiang X, O’Mara TA, Zhao N, Bolla MK, et al. (2020) Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses. Nat Genet 52: 572–581. doi:10.1038/s41588-020-0609-2
Refining the genetic risk of breast cancer with rare haplotypes and pattern mining
Refining breast cancer risk
William Letsou, Fan Wang, Wonjong Moon, Cindy Im, Yadav Sapkota, Leslie L Robison, Yutaka Yasui
Life Science Alliance Aug 2023, 6 (10) e202302183; DOI: 10.26508/lsa.202302183
