Single-cell polygenic risk scores dissect cellular and molecular heterogeneity of complex human diseases

Single-cell multiome dataset
Single-cell multiome (snRNA-seq + snATAC-seq) data of the human left ventricle and lung were processed and clustered on the basis of the RNA modality using Scanpy137. Cells with high-quality RNA information (>500 detected genes, <20,000 total unique molecular identifiers and mitochondrial read percentage < 10%) were selected for further analysis. Doublets were filtered using Scrublet138 with parameters min_counts = 1, min_cells = 10, min_gene_variability_pctl = 90 and n_prin_comps = 30. The thresholds for doublet removal were decided per sample on the basis of the distribution of doublet scores in real versus simulated cells. The top 3,000 highly variable genes were selected with the seurat_v3 mode by combining results computed for each sample separately. The cell-by-gene count matrices were normalized and scaled. ALLCools, which provides a Python implementation of Seurat integration, was used to correct batch effects between samples with 50 PCs and 30 canonical correlation dimensions136,139. Leiden clustering was performed on a k-nearest neighbor (kNN; k = 25) graph. Cell clusters were annotated and merged into cell types by comparing the expression levels of predefined marker genes across clusters. The marker genes in Litviňuková et al. (2020) and Tucker et al. (2020)140,141 were used to annotate the heart cell types.
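The per-cell RNA QC thresholds above can be sketched in plain Python. This is an illustrative stand-alone filter over hypothetical toy cell records, not the actual Scanpy workflow:

```python
# Minimal illustration of the per-cell RNA QC thresholds described above:
# >500 detected genes, <20,000 UMIs and <10% mitochondrial reads.
# Cell dicts are hypothetical toy records, not real data.

def passes_rna_qc(cell, min_genes=500, max_umis=20_000, max_mito_pct=10.0):
    """Return True if the cell passes all three RNA QC thresholds."""
    return (
        cell["n_genes"] > min_genes
        and cell["n_umis"] < max_umis
        and cell["pct_mito"] < max_mito_pct
    )

cells = [
    {"n_genes": 2100, "n_umis": 8500, "pct_mito": 3.2},   # good cell
    {"n_genes": 310, "n_umis": 900, "pct_mito": 2.0},     # too few genes
    {"n_genes": 4800, "n_umis": 35000, "pct_mito": 4.1},  # too many UMIs
    {"n_genes": 1500, "n_umis": 6000, "pct_mito": 22.0},  # high mito fraction
]

kept = [c for c in cells if passes_rna_qc(c)]
print(len(kept))  # 1: only the first toy cell survives QC
```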
We also examined the ATAC modality of these cells following the methods described below to ensure that these cells also have high-quality open chromatin information. The cells that did not pass ATAC quality controls (QCs) or constituted an ambiguous cluster in ATAC cell embedding were removed, resulting in 10,233 and 10,330 cells retained for downstream analysis for HCM and severe COVID-19, respectively.
scATAC-seq datasets
The cell type labels for the human pancreas and cortex in the original datasets33,35 were used. To generate cell embeddings, scATAC-seq data were processed and clustered using snapATAC2 (ref. 142) and ALLCools136,139. The fragment files were processed to generate cell-by-bin matrices at 5-kb resolution using snapATAC2 (ref. 142). Cells with 2,000–50,000 total reads and transcription start site (TSS) enrichment > 5 or > 7, depending on the distribution in each sample, were retained. The cell embeddings were computed with latent semantic indexing (LSI) and batch effects were corrected using the canonical correlation analysis (CCA) LSI mode in ALLCools. Cell-by-peak matrices at 500-bp resolution were generated by calling peaks per cell cluster using snapATAC2. For cortex data, superior and middle temporal gyri and middle frontal gyrus samples were used for AD analysis, resulting in 11,738 cells. For pancreas data, we randomly sampled 10,000 of 64,948 cells covering all annotated cell types for computational acceleration. The single-cell data121 we used in the replication experiments were processed and QCed similarly.
Cell–cell similarity network
Following a previous study46, we used the mutual kNN (M-kNN) to measure the similarity between two different cells. We first used LSI to extract low-dimensional embeddings for individual cells. For cortex and left-ventricle datasets encompassing multiple samples, batch effects were corrected using both CCA and Harmony143 and integrated latent embeddings were adopted. Next, we computed the Euclidean distance for pairs of cells using their embeddings and then constructed the kNN graph \(\hat{G}\in {{\mathfrak{R}}}^{M\times M}\) on the basis of this distance matrix, in which we defined \({\hat{G}}_{i,\;j}=1(i,j=1,\ldots ,M)\) if cell j is within the top k closest cells of cell i and \({\hat{G}}_{i,\;j}=0\) otherwise. The M-kNN graph G was then defined as the graph whose edges connect nodes (that is, cells) that are mutually kNNs of each other, which was calculated by \(G=\hat{G}\circ {\hat{G}}^{T}\), where \(\circ\) denotes the element-wise multiplication.
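A minimal sketch of this M-kNN construction, using toy 2-D points in place of the LSI/CCA embeddings:

```python
# Sketch of the mutual-kNN graph described above: build a kNN adjacency
# A_hat from Euclidean distances, then keep only mutual edges via
# G = A_hat ∘ A_hat^T. The 2-D points are toy values, not real embeddings.
import math

def knn_adjacency(points, k):
    """A_hat[i][j] = 1 if point j is among the k nearest neighbors of point i."""
    m = len(points)
    a = [[0] * m for _ in range(m)]
    for i in range(m):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(m) if j != i
        )
        for _, j in dists[:k]:
            a[i][j] = 1
    return a

def mutual_knn(a):
    """G = A_hat ∘ A_hat^T: keep an edge only if i and j are kNN of each other."""
    m = len(a)
    return [[a[i][j] * a[j][i] for j in range(m)] for i in range(m)]

# Three nearby "cells" and one distant outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
g = mutual_knn(knn_adjacency(points, k=2))
print(g[0][1], g[0][3])  # 1 0: cells 0 and 1 are mutual neighbors; 0 and 3 are not
```

Note that the element-wise product of a 0/1 adjacency with its transpose makes G symmetric by construction, which matches the undirected cell–cell similarity network used in the model.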
Target cohorts for T2D and AD
T2D and AD target cohorts were constructed on the basis of the UKBB. All disease cases were defined according to ICD-10 (tenth revision of the International Statistical Classification of Diseases and Related Health Problems) codes. In particular, all Caucasian individuals with a disease ICD-10 code in the inpatient record, death record or diagnosis summary record were defined as disease participants. We used E11.9 and G30.9 for T2D and AD, respectively. This resulted in 1,096 T2D and 932 AD cases. We randomly sampled an equal number of healthy controls by matching sex, age and ancestry information for each case group. In addition, individuals with a phenotype similar or related to the disease (T2D: E10, E11, E12, E13, E14, E23.2, N08.3, N25.1, O24, P70.2, Z13.1, Z83.3 and R73.9; AD: F00, G30, F01, F02, F03 and F05) were excluded from the control group. In this study, overweight individuals (body mass index (BMI) ≥ 25) were excluded from the T2D cohort. BMI for each individual was defined as the mean of four BMI measurements in UKBB Data Field 21001.
Target cohort for HCM
The recruitment of the HCM cohort was part of our California Institute for Regenerative Medicine (CIRM) cardiomyopathy project27. The targeted population comprised persons undergoing various cardiac procedures and noncardiac participants with genetic conditions seen in clinic, who were identified to us by their clinical providers. Noncardiac participants were recruited in person during onsite clinic days or over the phone with the providers' permission. Healthy volunteers (that is, persons with no diagnosis of heart disease) were recruited from our cardiovascular prevention clinic.
Library preparation and sequencing were performed by Macrogene (first ten samples) and Novogene on genomic DNA we extracted from iPS cells (Qiagen DNeasy kit). Paired-end 150-bp reads were acquired on the Illumina HiSeq X Ten for a minimum of 90 Gb of data. Reads were processed using Sentieon’s FASTQ-to-VCF pipeline (Sentieon version 201808.07)144. This pipeline is a drop-in replacement for a Burrows–Wheeler aligner (BWA)145 plus GATK best-practices146 pipeline for germline single-nucleotide variations (SNVs) and indels but has been highly tuned for optimal computational efficiency. BWA alignment to hg38 was followed by deduplication, realignment, base quality score recalibration and variant calling to generate g.vcf files for each sample. Coverage was assessed (GATK version 3.7)27. Individual sample g.vcf files were joined and variant quality score recalibration was performed.
Target cohort for severe COVID-19
The VA COVID-19 cohort was derived from the VA MVP. The VA MVP is an ongoing national voluntary research program that aims to better understand how genetic, lifestyle and environmental factors influence veteran health28. Briefly, individuals aged 18 to over 100 years have been recruited from over 60 VA medical centers nationwide since 2011, with current enrollment at >800,000. Informed consent is obtained from all participants to provide blood for genomic analysis and access to their full electronic health record data within the VA before and after enrollment. The study received ethical and study protocol approval from the VA central institutional review board (IRB) in accordance with the principles outlined in the Declaration of Helsinki. COVID-19 cases were identified using an algorithm developed by the VA COVID national surveillance tool based on reverse transcription (RT)–qPCR laboratory test results conducted at VA clinics, supplemented with natural language processing of clinical documents for SARS-CoV-2 tests conducted outside of the VA147. This resulted in the VA COVID-19 WGS cohort of 2,716 persons with COVID-19 spanning a wide range of ages and ancestries. We defined severe COVID-19 cases as persons who were hospitalized, received acute care, stayed in the intensive care unit or were deceased, and controls as those who did not meet these criteria. To minimize potential confounders, we restricted our analysis to nonelderly individuals (age < 65 years).
DNA isolated from peripheral blood samples was used for WGS. Libraries were prepared using PCR-free KAPA Hyper Prep kits according to the manufacturer’s recommendations. Sequencing was performed on an Illumina NovaSeq 6000 System (Illumina) with paired-end 2× 150-bp reads using Illumina’s proprietary reversible terminator-based method. Specimens were sequenced to a minimum depth of 25× per specimen and an average coverage of 30× per plate.
Independent target cohorts
The GoT2D cohort including 2,874 individuals was used as the independent target cohort for T2D. Samples were sequenced using three technologies: deep whole-exome sequencing, low-pass (4×) WGS and OMNI 2.5M genotyping. Genotypes (SNVs, indels and structural variants) were called separately for each technology and then integrated by genotype refinement into a single phased reference panel. More details can be found in a previous study39.
The HCM independent target cohort was constructed by extracting non-EUR HCM samples (ICD-10: I42.1/I42.2) and an equal number of randomly selected non-EUR controls matched for age and sex from the UKBB genotype dataset. This resulted in a total of 152 samples.
The WGS data of the independent target cohort for AD were obtained from the ADNI database. A total of 808 whole genomes were downloaded from ADNI, for which we defined individuals with a diagnosis of ‘dementia’ as cases and ‘cognitively normal’ as controls.
WGS data processing
The WGS data for HCM and COVID-19 were processed using the functional equivalence GATK variant-calling pipeline148, which was developed by the Broad Institute and plugged into our data and task management system, Trellis. The human reference genome build was GRCh38. We used BWA-MEM (version 0.7.15) to align reads, Picard 2.15.0 to mark PCR duplicates and GATK 4.1.0.0 for base quality score recalibration and variant calling using the ‘HaplotypeCaller’ function. We also used FastQC (version 0.11.4), SAMtools ‘flagstat’ (version 0.1.19) and RTG Tools ‘vcfstats’ (version 3.7.1) to assess the quality of the FASTQ, BAM and gVCF files, respectively. In addition, we used ‘VerifyBamID’ in GATK 4.1.0.0 to estimate DNA contamination rates for individual genomes and removed samples with 5% or more contaminated reads.
QCs of genotype data
We performed stringent QCs for the genotype data following the PRS tutorial (https://choishingwan.github.io/PRS-Tutorial/). For the GWAS summary statistics data (also referred to as the discovery or base data), genetic variants with low MAF and imputation information score (INFO) were removed. We used thresholds suggested in corresponding original papers: MAF < 0.0001, 0.001 and 0.0001 and INFO < 0.4, 0.6 and 0.6 for T2D, HCM and AD, respectively. We also excluded duplicated and ambiguous variants to guarantee the accuracy of PRS calculation.
For the individual-level genotype data (also referred to as the target data), we carried out both variant-level and individual-level QCs. For WGS data, we performed pre-QCs: we removed samples with kinship > 0.03, sample call rate < 0.97 or mean sample coverage ≤ 18×; genomic positions residing in low-complexity regions or ENCODE-blacklisted regions were removed; we filtered out genotypes in individual samples with too low or too high read coverage (read depth < 5 or read depth > 1,500); and we required all calls to have genotype quality ≥ 20 and, for nonreference calls, a sufficient proportion (>0.9) of reads covering the alternate alleles.
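The per-genotype pre-QC filters above can be illustrated with a small stand-alone function; the field names are hypothetical stand-ins for values parsed from a VCF, not part of any real pipeline:

```python
# Plain-Python sketch of the per-genotype WGS pre-QC filters described above:
# read depth in [5, 1500], genotype quality >= 20 and, for nonreference calls,
# >0.9 of reads covering the alternate alleles. Field names are hypothetical.

def passes_genotype_qc(call):
    """Apply the depth, GQ and alternate-allele read-support filters."""
    if not (5 <= call["depth"] <= 1500):  # drop too-low or too-high coverage
        return False
    if call["gq"] < 20:                   # require genotype quality >= 20
        return False
    if call["is_nonref"]:                 # nonreference calls need >0.9 of
        if call["alt_reads"] / call["depth"] <= 0.9:  # reads on the alt allele
            return False
    return True

call = {"depth": 30, "gq": 45, "is_nonref": True, "alt_reads": 29}
print(passes_genotype_qc(call))  # True: 29/30 of reads support the alt allele
```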
For all target cohorts, we removed variants with INFO < 0.8 (for UKBB-based cohorts), missing call rate > 0.01, MAF < 0.01 or Hardy–Weinberg equilibrium test P < 1 × 10−6. For variants with mismatching alleles between the discovery and target data, we strand-flipped these alleles to their complements. We further excluded individuals with missing genotype rate > 0.01 or with extreme heterozygosity rate (that is, beyond 3 s.d. from the mean). Individuals with an up-to-second-degree relative (π > 0.125) within the cohort were also removed to prevent bias in prediction evaluation. Lastly, there were 2,176 (n = 1,088 cases, n = 1,088 controls), 134 (n = 81 cases, n = 53 controls), 1,839 (n = 919 cases, n = 920 controls) and 581 (n = 120 cases, n = 461 controls) individuals passing the above QCs for the T2D, HCM, AD and severe COVID-19 cohorts, respectively.
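The strand-flipping step for mismatched alleles can be sketched as follows; this is a toy illustration of the allele-reconciliation logic, not the actual QC pipeline:

```python
# Toy illustration of strand-flipping mismatched alleles to their complements
# when reconciling discovery and target data, as described above.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flip_strand(allele):
    """Complement each base, e.g. 'AG' -> 'TC'."""
    return "".join(COMPLEMENT[b] for b in allele)

def alleles_match(discovery, target):
    """Match directly, or after flipping the target alleles to the other strand."""
    if set(discovery) == set(target):
        return target
    flipped = tuple(flip_strand(a) for a in target)
    if set(discovery) == set(flipped):
        return flipped
    return None  # unresolvable mismatch

print(alleles_match(("A", "G"), ("T", "C")))  # ('A', 'G') after strand flip
```

Ambiguous (A/T and C/G) variants cannot be resolved this way because they are their own strand complements, which is why they are excluded from the discovery data during QC.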
All independent target cohorts were processed and QCed using the same pipeline. After sample-level QCs, the final cohorts consisted of 2,749 samples (1,398 cases and 1,351 controls) for GoT2D, 62 samples (23 cases and 39 controls) for non-EUR UKBB and 469 samples (251 cases and 218 controls) for ADNI.
PC analysis for genotype data
To characterize the population structure of target cohorts, PC analysis was performed after pruning (window size = 200 variants, sliding step size = 50 variants, LD r2 threshold = 0.25). The first ten PCs were retained as covariates in the downstream analysis.
PLINK C+T PRS calculation
The cell-level C+T PRS was computed using PLINK, which is given by
$${\rm{PRS}}_{j}=\frac{{\sum }_{i\in {\rm{cCRE}}_{j}}{\beta }_{i}\times {G}_{i}}{P\times M},$$
where \({\rm{cCRE}}_{j}\) denotes cCREs within cell j, \({\beta }_{i}\) is the effect size of variant i, \({G}_{i}\) represents the number of effect alleles, P is the ploidy of the sample (2 for humans) and M is the number of nonmissing variants. In the clumping phase, all index variants were forced to be drawn from the variants located within scATAC-seq peaks of individual cells using the ‘--clump-index-first’ option. Variants within 250 kb of the index variant and three LD thresholds (r2 = 0.1, 0.3 and 0.5) were considered for clumping. After constructing the index variant set, we applied multiple P-value thresholds (P = 1 × 10−5, 1 × 10−4, 1 × 10−3, 0.01, 0.05, 0.1 and 0.5) to compute PRSs, resulting in 21 PRSs calculated for each cell and each individual. We used 1,000 Genomes Project samples to estimate LD (out-of-sample estimation) for the simulation, HCM and severe COVID-19 cohorts because of their limited sample sizes, while using the target data (in-sample estimation) for the other cohorts.
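The cell-level PRS formula and the 3 × 7 threshold grid can be illustrated with a small stand-alone sketch; the variants, peaks and effect sizes below are toy values, not real data:

```python
# Minimal sketch of the cell-level PRS formula above: sum beta_i * G_i over
# variants falling in a cell's cCREs, normalized by ploidy (P) times the
# number of nonmissing variants (M). Toy variants, not real GWAS data.
from itertools import product

def cell_prs(variants, cell_ccres, ploidy=2):
    """PRS_j = sum_{i in cCRE_j} beta_i * G_i / (P * M)."""
    in_ccre = [v for v in variants if v["peak"] in cell_ccres and v["g"] is not None]
    if not in_ccre:
        return 0.0
    m = len(in_ccre)  # number of nonmissing variants in this cell's cCREs
    return sum(v["beta"] * v["g"] for v in in_ccre) / (ploidy * m)

variants = [
    {"peak": "chr1:100-600", "beta": 0.12, "g": 2},
    {"peak": "chr1:100-600", "beta": -0.05, "g": 1},
    {"peak": "chr2:900-1400", "beta": 0.30, "g": 0},  # outside this cell's peaks
]
score = cell_prs(variants, cell_ccres={"chr1:100-600"})
print(round(score, 4))  # (0.12*2 - 0.05*1) / (2*2) = 0.0475

# The 3 LD-r2 x 7 P-value grid yields the 21 PRS features per cell:
grid = list(product([0.1, 0.3, 0.5], [1e-5, 1e-4, 1e-3, 0.01, 0.05, 0.1, 0.5]))
print(len(grid))  # 21
```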
The standard C+T PRS was calculated using the same set of parameters as that used in computing cell-level PRS, except that all variants were considered without conditioning. The P-value and LD r2 thresholds were regarded as hyperparameters to be optimized in model selection.
Model details of scPRS
The cell-level PRS matrix \({X}_{n}\in {{\mathfrak{R}}}^{M\times 21}(n=1,\ldots ,N)\) presents single-cell-resolved genetic risk features for each individual and is input into the scPRS model to predict disease risk. Here, N and M denote the numbers of individuals and cells, respectively.
scPRS consists of three modules (Fig. 1): the feature-embedding module, the graph convolutional network module and the readout module. The feature-embedding module takes normalized cell-level PRS \({X}_{n}\) as the input and uses a one-layer perceptron to reweight and integrate 21 PRS features per cell:
$${h}_{n}^{(0)}={X}_{n}\bullet {\rm{abs}}({W}_{0}),$$
where W0 denotes learnable model parameters, abs represents the absolute function and \({h}_{n}^{(0)}\in {{\mathfrak{R}}}^{M}\) represents the integrated features of M cells for individual n. According to the definition of PRS, larger values in Xn indicate higher disease risk. To maintain this interpretability throughout the modeling, we adopt the absolute function abs to enforce nonnegativity for W0.
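A minimal plain-Python sketch of this embedding step, with arbitrary toy values standing in for the learned parameters:

```python
# Sketch of the feature-embedding module above: each cell's 21 PRS features
# are combined by a nonnegative weighted sum, h_n^(0) = X_n . abs(W0).
# The weights below are arbitrary toy values, not learned parameters.

def embed(x_n, w0):
    """h_n^(0)[cell] = sum_f X_n[cell][f] * abs(W0[f])."""
    return [sum(xf * abs(wf) for xf, wf in zip(cell_feats, w0))
            for cell_feats in x_n]

x_n = [[0.2, 0.5], [0.1, -0.3]]  # 2 cells x 2 PRS features (21 in the paper)
w0 = [0.4, -1.0]                 # abs() makes every weight contribute positively
print(embed(x_n, w0))            # approx [0.58, -0.26]
```

Because the weights enter only through abs(W0), a cell with uniformly larger PRS values can never receive a smaller embedded score, preserving the risk interpretation described above.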
We next seek to integrate PRS features across different cells to generate a final risk score. Considering dropout events and the sparsity of scATAC-seq data, and assuming that cells with similar low-dimensional embeddings should have comparable epigenomes and hence similar genetic signals, we use a GNN149 to smooth and denoise single-cell-level PRS features. More specifically, on the basis of the pre-computed M-kNN graph G, the GNN module is defined as
$${g}_{v}^{(t+1)}=\frac{1}{{\mathrm{deg}} (v)}\sum _{u\in {\mathscr{N}}(v)}{\rm{abs}}\left({w}_{1}^{(t)}\right)\cdot {g}_{u}^{(t)},$$
where \({\mathscr{N}}(v)\) denotes the neighbors of cell v in the M-kNN graph G, \({\mathrm{deg}} (v)\) is the degree of cell v, \({w}_{1}^{(t)}\) denotes the learnable weights of the t-th graph convolutional layer and \({g}_{v}^{(0)}\) is initialized with the corresponding entry of \({h}_{n}^{(0)}\). As with \({W}_{0}\), the abs function enforces nonnegativity on the weights so that larger smoothed values continue to indicate higher disease risk.
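One smoothing step of this kind can be sketched in plain Python; the scalar weight w1 below is a stand-in for the layer's learnable parameters, and the graph and scores are toy values:

```python
# Toy sketch of one degree-normalized, nonnegatively weighted aggregation
# step over the M-kNN graph, in the spirit of the GNN module above.

def gnn_step(g, scores, w1):
    """g_v <- (1/deg(v)) * sum over neighbors u of abs(w1) * g_u."""
    m = len(scores)
    out = []
    for v in range(m):
        nbrs = [u for u in range(m) if g[v][u] == 1]
        deg = len(nbrs)
        # Isolated cells keep their score unchanged in this toy version.
        out.append(sum(abs(w1) * scores[u] for u in nbrs) / deg if deg else scores[v])
    return out

# 3 cells: 0 and 1 mutually connected, 2 isolated.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(gnn_step(adj, [1.0, 3.0, 5.0], w1=-1.0))  # [3.0, 1.0, 5.0]
```

The abs on the weight mirrors the nonnegativity constraint in the model: neighbor scores can only be averaged and rescaled, never sign-flipped, so the smoothed values retain their risk interpretation.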