Research Project Summary

Characterization

Project Title | Primary Investigators | Focus | Planned Approach | Biological Systems | Variants Investigated | Modeling Approaches
Systematic in vivo characterization of disease-associated regulatory variants | Hyejung Won, Michael Love, Karen Mohlke | The goal is to systematically characterize the impact of human genetic variation on gene regulation via massively parallel reporter assays (MPRA). | MPRA of 500,000 alleles (250,000 variants), each linked to ~500 barcodes and assayed in 20 contexts of multiple tissues, from both sexes, with and without perturbation by lipopolysaccharide (LPS). | GWAS variant-encoding constructs systematically delivered to mouse (C57BL/6J), males and females, across 5 organs (e.g. cortex, liver, lung, heart, muscle), acute low dose LPS treatment (10 mg/kg 3 hrs before tissue collection), 10 biological replicates per condition. | 250k non-coding GWAS variants divided into comprehensive sets at selected signals (hypothesis-free variant selection) and targeted sets (hypothesis-driven variant selection based on annotations). GWAS for brain disorders, respiratory diseases, cardiovascular diseases, muscular disease, and metabolic disorders will be used. | N/A
High-Throughput Functional Annotation of Gene Regulatory Elements and Variants Critical to Complex Cellular Phenotypes | Charles Gersbach, Greg Crawford, Tim Reddy | The goal of the Duke IGVF Characterization Center is to systematically and comprehensively characterize the function of gene regulatory elements and their variants in four key and diverse biological contexts and to understand how genomes and genomic variation function and orchestrate complex phenotypes. Aim 1 focuses on cell fitness regulatory elements. Aims 2 and 3 focus on cell-type specific regulatory elements that drive lineage specification and eQTLs. Aim 4 focuses on non-coding elements essential to tissue homeostasis in vivo. Collectively, we expect this project to comprise one of the largest characterization efforts for functional annotation of regulatory elements and their variants in their endogenous chromosomal context. | Assays: high-throughput CERES (CRISPR Epigenetic Regulatory Element Screens) to annotate regulatory function across large noncoding loci, single cell CERES to find target genes and to map causal SNPs in eQTLs, POP-STARR (STARR-seq high-throughput reporter assay generated from a population of 130 individuals of diverse ancestry) to determine the effects of genetic variants on regulatory element activity, single cell RNA-seq to identify the target gene(s) for a particular regulatory element, qRT-PCR/RNA-seq for validation of candidate regulatory elements and identification of off-target effects, ATAC-seq and ChIP-seq (H3K4me3, H3K4me1, and H3K27ac) on iPSCs differentiated into hepatocytes, myocytes, and neurons to identify regulatory elements, targeted genome editing to validate strongest effects of genetic variants, immunofluorescence staining for lineage-specific markers, and fluorescence-activated cell sorting based on cell surface markers to enrich for a homogeneous differentiated cell population (hepatocytes, myocytes, or excitatory neurons). Phenotypes assessed: cell viability, cell proliferation, cell lineage specification, tissue homeostasis in an in vivo animal model. Data types: Illumina sequencing, gRNA sequences, genomic coordinates, flow cytometry. | Human primary undifferentiated WTC11 iPSCs as well as iPSC-derived myocyte, neuronal, and hepatocyte lineages. dCas9-p300 and dCas9-KRAB mice. For genetic variants in human, 130 individuals of diverse ancestry. For genetic variants in mice, 47 strains of the Collaborative Cross mouse population. | Aim 1: 10,000 regulatory elements (RE) (non-promoter) shared between undifferentiated and differentiated iPSCs. Aim 2: 3 x 10,000 RE most differentially accessible between iPSCs and either iPSC-derived hepatocytes, myocytes, or neurons. Aim 3: 90 eQTLs that are specific to each differentiated iPSC cell type (hepatocytes, myocytes, or neurons). Aim 4: 3,000 orthologous RE in mouse with the largest effect size for essentiality (identified in Aim 1) and cell lineage specification (identified in Aim 2). Over the course of the project, Aim 1 will characterize 4 cell types x 100,000 gRNAs (10,000 regulatory elements) x 2 directions (activation and repression) x 3 replicates = 2.4 million perturbations. Aim 2 will characterize 2.4 million different perturbations. Aim 3 will characterize 10,000 gRNAs (for the same cell types, replicates, and directions) = 240,000 perturbations. Aim 4 will characterize 30,000 gRNAs (for 3 tissues, 10 replicates, 2 directions) = 1.8 million perturbations in an in vivo mouse model. Each gRNA represents a unique epigenetic perturbation to a candidate regulatory element that controls the expression of any of the thousands of genes that we will measure as output. | N/A
Characterization Center - Massively parallel characterization of variants and elements impacting transcriptional regulation in dynamic cellular systems | Jay Shendure, Nadav Ahituv, Martin Kircher | We propose to test over one million human regulatory elements or variants for their functional effects on transcriptional regulation, as well as to query over 100,000 distal regulatory elements for the gene(s) that they regulate. | Our experimental approaches include MPRA, crisprQTL, saturation genome editing (SGE), multiplex prime editing (MPE), and iGONAD. We will measure the activities of regulatory elements as well as variants thereof, primarily in in vitro models of development. | Human neuronal differentiation, human cardiomyocyte differentiation, human embryoid bodies, human gastruloids, human teratomas, human cerebral organoids, mouse embryos. | Gene regulatory elements - 550K by MPRA, 110K by crisprQTL; Gene regulatory variants - 500K by MPRA; 11K by SGE or MPE; 100 by iGONAD. | Predictive modeling including exploring, building and validating cell type-specific DNN architectures.
Comprehensive characterization of variants underlying heart and blood diseases with CRISPR base editing | Luca Pinello, Daniel Bauer, Guillaume Lettre, Richard Sherwood | We will focus on genome editing-based functional characterization of loci associated with cardiovascular diseases and their risk factors (blood pressure and dyslipidemia) or with hematological traits. | Our planned experimental approaches include three high-throughput CRISPR screen modes: CRISPRi, CRISPRa, and base editing. CRISPRi and CRISPRa will epigenetically repress and activate the gene regulatory function of the ~1-kb element surrounding the variant. Base editing will precisely install the target variant and at most several other adjacent bases (which we can predict a priori). In addition, we will perform Multiplexed Integrated Accessibility Assay (MIAA) and targeted Perturb-seq (TAP-seq) to identify variant effects on chromatin accessibility and gene expression. | CRISPR screens will be applied in human erythroid and neutrophil precursors, endothelial cells, and hepatic cells. TAP-seq will be applied in erythroid cells, neutrophils, endothelial cells, and hepatocytes. MIAA will be applied in human hepatic, endothelial, and hematopoietic cell lines. | 12,000 GWAS-associated variants/elements each for 8 cellular phenotypes. | Statistical inference of causal variants by integrating three data modalities: CRISPR screening, MIAA, and TAP-seq. Input will be the CRISPR screening, MIAA, and TAP-seq data; output will be tabular data showing the significance of the tested variants.
Multiscale functional characterization of genomic variation in human developmental disorders | Gary Hon, Nikhil Munshi, Lee Kraus | The goal is to functionally characterize how regulatory elements contribute to gene expression phenotypes and morphological phenotypes in early human developmental systems. Extra emphasis will be placed on pleiotropic regulatory elements, those with potential roles in non-cell autonomous activities, and mechanisms of enhancer RNAs. | Our planned experimental approaches include CRISPRi/a, with readouts of expression phenotypes by single-cell enhancer screens, morphological phenotypes by high-content imaging, and secreted molecules by mass spectrometry. | H9 human embryonic stem cells differentiated to trophoblasts, cardiomyocytes, and neurons. | 40,000 elements in cardiac, neuronal, and placental systems perturbed with a single-cell RNA-seq readout; 8,000 elements perturbed with morphological phenotypes; 60 elements perturbed with variant knock-in. | N/A
Functional Characterization Center - Connecting DNA Variants to Function and Phenotype | Jesse Engreitz, Thomas Quertermous | The goal is to conduct unbiased screens that reveal fundamental properties of cis-regulatory variant and element (CRV/CRE) functions and maximize our ability to train models to predict the cis-regulatory effects of any CRV or CRE in any cell type or disease, and to characterize properties and functions of high-confidence common disease risk variants for cardiovascular diseases to inform future studies to connect GWAS variants to functions. | Perturbations: CRISPRi KRAB-dCas9 for enhancers and promoters, and CRISPR prime editing and base editing for variants. Readouts: HyPR-seq, RNA FlowFISH, targeted single-cell RNA-seq (TAP-seq), and selected disease-relevant cellular phenotypes (such as phagocytosis). | Four human cardiovascular cell types - smooth muscle cells, endothelial cells, cardiomyocytes, and macrophages (iPSC derived or primary). Mouse aorta for in vivo screens. | We will conduct studies to perturb thousands of CREs and CRVs, irrespective of prior disease associations, and measure their effects on gene expression across multiple hPSC-derived cell types and states. We will conduct deeper studies of a set of 200 fine-mapped risk variants for one or more cardiovascular diseases or traits, including variants associated with quantitative traits (cardiac morphology, blood pressure) and diseases (coronary artery disease, atrial fibrillation). | We plan to use a number of models to predict the quantitative effects of variants and/or elements on the expression of a nearby gene in a specific cell type, using as input single-cell RNA+ATAC-seq data. Our current approaches include the Activity-by-Contact (ABC) model, which predicts the effects of distal cCREs on nearby genes, and BPNet, which predicts the effects of single-nucleotide variants on chromatin state.
The Center for Actionable Variant Analysis; measuring variant function at scale | Lea Starita, Douglas Fowler | The Center for Actionable Variant Analysis (CAVA) will harness multiplexed assays to contribute single nucleotide variant functional data for ~200,000 variants in ~32 of the most clinically impactful protein coding genes to the IGVF Variant/Element/Phenotype Catalog. | CAVA will employ two key assays: saturation genome editing (SGE) and variant abundance by massively parallel sequencing (VAMP-seq). In both cases, a variant library is generated and expressed in cultured human cell lines. SGE measures the effect of variants in their endogenous genomic context on cell growth. VAMP-seq measures the effect of transgenically expressed variants on protein abundance. | Both assays will be conducted in cultured human cell lines. Initially, SGE will be conducted in HAP1 cells, and VAMP-seq in HEK-293T cells. As these approaches mature, other cell lines may be used. | Our center will test ~200,000 single nucleotide variants in ~32 protein-coding genes. | Our center will develop statistical models for calculating variant effect scores and error estimates from SGE and VAMP-seq data, specifically FASTQ files.
Molecular phenotyping of ~100,000 coding variants across Mendelian disease genes | Marc Vidal, Mike Calderwood, Anne Carpenter, Fritz Roth, Mikko Taipale, David Hill | We propose to functionally characterize ~80,000 variants across most of the known Mendelian disease-associated genes by comparing wild-type or “reference” gene products and their corresponding variants for a rich array of fundamental protein properties and phenotypic impacts, including protein stability (expression), subcellular localization, cell viability, cell morphology, and the ability to mediate macromolecular interactions with protein partners. | Our planned experimental approaches are to generate clonal resources for ~80,000 coding variants; measure the impact of coding variation on cellular phenotypes, including protein stability, subcellular localization, cell viability and morphology; and measure the impact of coding variation on protein-protein interactions. | ~80,000 variants and their wild-type counterparts will be screened for protein localization and cellular phenotypes in HeLa/U2OS cells using high-content microscopy and computational analysis. | ~80,000 missense variants (Pathogenic, VUS and Benign) in protein-coding genes across most of the known Mendelian disease-associated genes. | N/A
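
The perturbation totals quoted in the Gersbach/Crawford/Reddy row and the MPRA design figures in the Won/Love/Mohlke row can be checked with simple arithmetic. The sketch below is illustrative only: the helper name `perturbations` is hypothetical, and the factoring of the 20 MPRA contexts as 5 organs x 2 sexes x 2 LPS conditions is an assumption inferred from the row text, not a breakdown stated by the investigators. Aim 2 is left out because the table gives its total (2.4 million) without listing the factors.

```python
from math import prod


def perturbations(grnas, *factors):
    """Total perturbations = number of gRNAs times every design factor."""
    return grnas * prod(factors)


# Factors as listed in the table: cell types (or tissues), directions, replicates.
aim1 = perturbations(100_000, 4, 2, 3)   # 4 cell types, 2 directions, 3 replicates
aim3 = perturbations(10_000, 4, 2, 3)    # same cell types, directions, replicates
aim4 = perturbations(30_000, 3, 2, 10)   # 3 tissues, 2 directions, 10 replicates

# MPRA design: two alleles per variant.
mpra_alleles = 250_000 * 2
# ASSUMPTION: 20 contexts factor as 5 organs x 2 sexes x 2 LPS conditions.
mpra_contexts = 5 * 2 * 2

for label, value in [
    ("Aim 1 perturbations", aim1),
    ("Aim 3 perturbations", aim3),
    ("Aim 4 perturbations", aim4),
    ("MPRA alleles", mpra_alleles),
    ("MPRA contexts", mpra_contexts),
]:
    print(f"{label}: {value:,}")
```

Under these assumptions the computed values agree with the totals quoted in the table (2.4 million, 240,000, 1.8 million, 500,000 alleles, and 20 contexts).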

Mapping

Project Title | Primary Investigators | Focus | Planned Approach | Biological Systems | Variants Investigated | Modeling Approaches
A Foundational Resource of Functional Elements, TF footprints and Gene Regulatory Interactions | Jason Buenrostro, Bradley Bernstein | We will apply an innovative suite of single-cell multi-omic assays (‘SHARE-seq’, Ma et al, Cell 2020) to profile RNA transcripts, DNA accessibility, TF footprints and histone modifications in diverse BioSamples. We will integrate the data to parse cell states, annotate and classify functional elements, and infer TF element-gene connections. | SHARE-seq - a highly scalable multi-omic assay for profiling RNA and DNA accessibility in the same single cells, and TF footprints and histone modifications in single cells (SHARE-Footprint, SHARE-ChIP). | We will accession ~3000 disease-relevant BioSamples that span a wide range of cell phenotypes and genotypes, including: (1) Blood and immune cohorts (bone marrow, and 600 donor biobank of PBMCs with and without lupus diagnosis). (2) CNS cohorts (fresh brain tissues from surgical resections, iPSC-derived 'villages' of NPCs, neurons and microglia from the CIRM iPSC collection). (3) A tissue biobank comprising 7 donors x 25 tissues selected for relevance to common diseases, consented for open access data sharing, and collected with the common coordinate framework for direct comparison to HCA, dGTEx and HuBMAP. (4) Disease tissues from surgical resections including CAD and IBD. (5) Organoid model timecourses (gut and brain). | N/A - We are a mapping center. | N/A - We are a mapping center.
Center for Mouse Genomic Variation at Single Cell Resolution | Ali Mortazavi, Barbara Wold | To map cell-type specific eQTLs, splicingQTLs, and caQTLs in mice as well as response QTLs to LPS stimulation. | Single-nucleus RNA-seq using LR-Split-seq and scATAC-seq. | Project focuses on CC (Collaborative Cross) Lines as well as founders. We may supplement them with DO (Diversity Outbred) mice. | The 8 founders and CC/DO mice have an aggregate of 20 million SNPs. | While we are a mapping center, we plan to provide tools to calculate our xQTLs from our single-cell data in kallisto format.
Single-cell Mapping Center for Human Regulatory Elements and Gene Activity | Ansuman Satpathy, Ryan Corces | We propose to develop an IGVF Mapping Center that utilizes our recent development of high-throughput genome-wide technologies to simultaneously map open chromatin sites, gene and protein expression, and clonal lineage tracing in single cells from human tissues during development, adult health, and several disease conditions, including cancer, neurodegeneration, infection, and autoimmunity. | We will primarily utilize single-cell ATAC-seq to characterize regulatory regions, which may be achieved via a single modality measurement, or in concert with mtDNA variants for clonal lineage tracing (mtscATAC), surface protein (ASAP-seq), and scRNA-seq (DOGMA-seq). These approaches will be applied to native human tissue samples, and large scale single-cell sequencing data will be generated. | All data generated will be from human. Biosamples will be native human tissues (~25 or more tissues) and we anticipate several hundred biosamples before the completion of the project. Diseases will include cancer, neurodegeneration, and autoimmunity. We do not have explicit stimulations planned. | We will be charting DNA regulatory elements. We do not have explicit plans to characterize nucleotide variants. Allelic bias in accessible chromatin may be derived from our data source, but we have no explicit plans to do so. | As this is a mapping center, we will focus on generating maps of regulatory elements. We will provide inputs of plain text counts of features per cell as well as all detected transposition events (fragments / bed file) that can be readily transferred to other investigators for modeling.

Networks

Project Title | Primary Investigators | Focus | Planned Approach | Biological Systems | Variants Investigated | Modeling Approaches
Leveraging genetic variation to dissect gene regulatory networks of reprogramming to pluripotency | Chongyuan Luo, Noah Zaitlen, Kathrin Plath | Our project will generate single-cell multi-omic datasets using snmCT-seq and scRNA&ATAC-seq for the reprogramming of 100 human fibroblast cell lines to induced pluripotent stem cells (iPSCs). Gene regulatory networks will be constructed from the single-cell multi-omics datasets using Dynamic Regulatory Events Miner (DREM). We will further identify the impact of genomic variation on the networks. | snmCT-seq (joint profiling of transcriptome and DNA methylome), sn-m3C-seq (joint profiling of chromatin conformation and DNA methylome), 10X Multi-Ome (joint scRNA-seq & scATAC-seq profiling), pooled ChIP-seq for four TFs and H3K27ac, MERFISH, and CRISPRi-FlowFISH to determine the regulatory activity of non-coding sequence. | Reprogramming of human fibroblasts to induced pluripotent stem cells (iPSCs). | We will apply a pooled CRISPRi-FlowFISH approach (Fulco 2019 Nature Genetics) for an unbiased quantification of the regulatory activity within the +/- 2 Mb region surrounding up to 100 TF genes. Up to 50 high-confidence QTL variants with larger effect size determined using CRISPRi-FlowFISH will be engineered into human fibroblast lines using CRISPR/Cas9. | We will define the gene-regulatory and cis-regulatory element networks in a reprogramming stage-specific manner and make predictions of TFs regulating these networks using Dynamic Regulatory Events Miner (DREM). FastGxC will be applied to identify context (reprogramming stage) specific QTLs with greater sensitivity and computational efficiency than the standard tissue-by-tissue approach.
Linking genome variation to transcriptional network dynamics in human B cells | Harinder Singh, Nidhi Sahni, Jishnu Das | The two major goals are to delineate and use the cis-regulomes of primary human B cells in their resting, activated and differentiated states to assemble a large scale, signaling-responsive gene regulatory network (GRN) that controls the temporal dynamics of activation and differentiation of B cells into antibody secreting plasma cells or germinal center B cell precursors, and to systematically analyze the consequences of genomic variation on the topology and dynamics of the B cell GRN in human health and disease. | Structural genomics: Bulk RNA-seq, ATAC-seq, and Hi-CAR analysis of 4 purified B cell states (resting, activated, plasmablast, and germinal center precursor) and 10X single cell multiome analysis (RNA-seq + ATAC-seq) at multiple time points during B cell activation and differentiation (Day 0, 1, 2, 4, and 6). Functional genomics: LentiMPRA assay of cis-regulatory elements within the B cell GRN including analyses of disease-associated and eQTL variants, Targeted Perturb-seq (TAP-seq) assay of CREs in native genomic context (100-200 state-specific genes), and Perturb-seq assay of prioritized TFs predicted to be dominant nodes within the B cell GRN. | Human B cell line GM12878 and primary naive human B cells isolated from healthy donor PBMCs and their activated as well as differentiated counterparts, the latter generated by defined in vitro stimulation conditions. | Approximately 5,000 variants that lie in immune disease-associated linkage disequilibrium blocks and map to the reference cis-regulome. | R-based pipeline for discovery of composite transcriptional regulatory elements in genomic regulatory sequences (CEseek), GRN inference using SCENIC, ARACNe, NetBID, predictive machine learning models (linear models with regularization - LASSO/Elastic Net, bootstrap aggregated decision trees - random forest) to predict the impact of non-coding variants on the underlying dynamic GRN, and interpretable ML approaches (e.g. Essential Regression) developed by the Das/Singh labs.
The impact of genomic variation on environment-induced changes in pancreatic beta-cell states | Maike Sander, Kyle Gaulton, Bing Ren, Hannah Carter | We will produce generalizable models of genomic and phenotypic response to environmental signals of broad value to the community. Specifically, we will initialize and refine a GRN to predict CREs with environment-induced effects on insulin secretion in pancreatic beta cells. | Genomic assays: 10X multiome scATAC/RNA-seq, sc-methyl-HiC, Perturb-seq. Phenotypes assessed: Glucose-stimulated insulin secretion (GSIS), insulin content. | hPSC islet organoids differentiated from the hESC H1 (WA01) cell line. The organoids will be exposed to glucose and other insulin secretory stimuli across multiple time points. The samples will be evaluated by GSIS assays and snap frozen. Each genomic assay will be performed on n=2 biological replicates. | 650 CREs for combinatorial Perturb-seq. 4 CREs for allele-specific editing. | Gene regulatory network (GRN) models initialized from snATAC/RNA-seq and sc-methyl-HiC inputs. Refinement/reduction to a task-specific GRN by regression-based modeling of CRE-gene interactions combined with machine learning to identify the subset of CREs most important for predicting molecular phenotype (insulin secretion) across conditions and time points. Application of the task-specific GRN to predict variant effects on phenotype.
Deciphering the Genomics of Gene Network Regulation of T Cell and Fibroblast States in Autoimmune Inflammation | Christina Leslie, Alexander Rudensky | In this IGVF project, we will use rheumatoid arthritis (RA), a human autoimmune inflammatory disease, as a case study to develop robust machine learning models of gene regulation to decipher the impact of genomic variation on multiple cellular drivers of pathology, namely inflammatory T cell and fibroblast subsets found in affected joint tissue. The choice of RA is motivated by its public health importance, specified target tissue, access to clinical samples, considerable knowledge of disease-associated gene loci, and our team’s complementary expertise in machine learning, RA pathophysiology, immunology and inflammation, and single-cell functional genomics. | Single-cell multiomics, Hi-C/HiChIP, Perturb-seq, and spatial transcriptomics. | T cells and fibroblasts derived from F1 hybrid mouse models and patient synovial tissue. | We will perform large-scale computational analyses on RA-associated variants and for now plan validation experiments for a small number. | Deep sequence models (input: DNA sequence, output: allele-specific accessibility) and graph attention based gene regulatory models using single-cell multiome data, potentially together with bulk histone mark data (there are different model variants, e.g. input: DNA sequence, accessibility and histone marks, 3D interaction data, output: gene expression); multiome data can be analyzed at the pseudobulk cluster, metacell, or single cell levels; methods for human-mouse transfer; Perturb-seq analysis approaches.
Genomic control of gene regulatory networks governing early human lineage decisions | Danwei Huangfu, Michael Beer, Kat Hadjantonakis | This project will construct models of gene regulatory networks controlling early human development by taking an integrative approach involving perturbation of core regulatory network elements, quantitative genomic and proteomic measurements with high temporal and single-cell resolution, and systems level analyses. Knowledge gained from the study will provide a conceptual framework for dissecting gene regulatory networks during cell state transitions, and reveal general features of genomic variants that contribute to cellular and organismal phenotypes. | We will conduct perturbation using the following methods: genetic knockout and CRISPRi repression. We will conduct the following phenotypic assays: genomic assays including (sc)RNA-seq, (sc)ATAC-seq, H3K27ac ChIP-seq and HiChIP, TF ChIP-seq and HiChIP, Hi-C and 4C-seq; proteomics assays using TF ChIP-MS; lineage trajectory analysis using lineage tracing followed by scRNA-seq; and image analysis of cultured organoids. | We will investigate trilineage (mesoderm, endoderm and ectoderm) differentiation using cultured human embryonic stem cells (hESCs). | We will focus on enhancer elements and, when relevant, prioritize enhancers enriched with disease-relevant variants. The number of enhancers to be tested is estimated to be 2,000-5,000, possibly higher when we integrate newer and more cost-effective technologies. | We will train gkm-SVM on regulatory regions identified from ATAC, predict TF binding sites and enhancer and promoter regions, and use the binding site predictions to position TFs within regulatory network models. We will build stochastic models of network transitions and make predictions for the impact of enhancer perturbations on the cell state transitions.
Defining causal roles of genomic variants on gene regulatory networks with spatiotemporally-resolved single-cell multiomics derived from single-cell epigenomic measurements (chromatin accessibility and DNA methylation) | Hao Wu, Sreeram Kannan, Hongjun Song | This proposal aims to leverage a panel of multi-ethnic, gender-balanced human induced pluripotent stem cell (hiPSC) lines (European, African American and African hunter gatherers) as well as recent advances in single cell time-resolved or multi-omics technologies, predictive modeling of regulatory networks by machine learning and high throughput single-cell perturbation methods to study the functional impact of genomic variations on regulatory network and cellular phenotypes. | Metabolic labeling single-cell RNA-sequencing (scNT-seq), single-cell multiome sequencing (measuring both gene expression and chromatin accessibility from the same cell), single-cell DNA methylome sequencing (snmC-seq2), spatially resolved single molecule FISH (smFISH for RNA) and multiplex immunostaining (for protein), Perturb-seq (cardiac or neural lineage gene expression programs). | A group of ancestrally diverse and gender-balanced human induced pluripotent stem cell lines (n=60 lines) will be differentiated into either cardiac (mono-layer culture) or neural (3D brain organoids) lineage cells. | We will apply high-throughput combinatorial genetic or epigenetic perturbation approaches to modulate activity of key genes (n=20 lineage or state specific transcription factors) or putative cis-regulatory elements (n=20 cis-regulatory elements) at single-cell levels to improve our understanding of network level relationships among genomic variants and phenotypes. | To develop causal gene regulatory network (GRN) models for identifying key genes (e.g. transcription factors) and predicting their impacts on GRN activity, we will extend Scribe, dynamo and deep learning based computational approaches to incorporate multi-scale time-resolved single-cell transcriptomics and multi-modal measurements (RNA, chromatin states, and DNA methylation). In parallel, we will use scRNA-seq to identify dynamic cis-eQTLs associated with cell differentiation and rank them by cross-referencing with putative cell-type/state-specific regulatory elements.

Predictive Modeling

Project Title | Primary Investigators | Focus | Planned Approach | Biological Systems | Variants Investigated | Modeling Approaches
Predictive Modeling of the Functional and Phenotypic Impacts of Genetic Variants | Zhiping Weng, Xihong Lin, Manuel Garber | Taking part in the IGVF Consortium, we aim to develop computational methods to predict the impact of genetic variants on genome function and phenotypes by integrating IGVF experimental data and large-scale whole-genome sequencing and genome-wide association data. | This is a modeling project and hence does not have an experimental component. | IGVF and other publicly available data on human biosamples, whole-genome sequencing data, phenotype data, and genome-wide trait-association data. | N/A | For generating maps of candidate regulatory elements, we will use statistical methods that are extensions of those that the Weng lab developed for the ENCODE consortium. For enhancer-target gene connections, we will develop new statistical methods that are based on linear regression and other machine learning and statistical techniques. For rare-variant association testing, we will extend our STAAR method by incorporating cell type specific and single-cell data.
Predictive Models of the Impact of Genomic Variation on Function | Raychaudhuri, Price, Sunyaev | Developing strategies to identify cell states relevant to disease and the cell-state-specific regulatory function of variants, to connect non-coding variants to target genes, and to define pathways implicated by rare variants. | Computational and statistical methods. Human genetic data. Single cell data, functional genomic data. | Human focused. We anticipate a major focus on complex and Mendelian diseases, and also single cell data. | All common and all rare variants. Tools are designed to be widely applicable to both disease-associated variants and variants associated with molecular tools. | Various models.
Predicting the Impact of Genomic Variation on Cellular States | Alan Boyle | We propose quantitative shifts in cellular state as a new paradigm for defining and predicting variant function. Single-cell transcriptomic and epigenomic data from healthy individuals provide a reference atlas of cell states. By comparing cell state distributions against this reference, we can identify quantitative shifts resulting from genetic variation and explore these deviations as potential disease states. We will then build models to predict shifts in cell state by combining single-cell data with background germline genetic variation, chromatin structure, and supporting functional data. | N/A | Primary focus is human but the tools used here can also be directly applied to the mouse. | Anything produced. | Inputs: Aim 1 - as many modalities of single cell data as possible across as broad a set of cell types as possible. Aim 2 - Need sequence-level variant calls of cells used, GTEx and ENCODE data (all already existing data), MeSH, PubMed. Aim 3 - Perturbations of enhancers with phenotype as well as existing public data like GWAS, linking enhancers to target genes, binding motif data, GO data. PerturbGAN - As many single cells as possible with scRNA-seq readout and a genotype (each cell tagged with perturbation).
Predicting context-specific molecular and phenotypic effects of genetic variation through the lens of the cis-regulatory codeAnshul Kundaje, Alexis Battle, Jonathan Pritchard, Stephen Montgomery, Livnat Jerby, Jesse EngreitzDevelop interpretable, base-resolution deep learning models to decode cis-regulatory control of chromatin and expression dynamics across cell states, space and time. Predict cell-type-specific effects of regulatory variants and de novo mutations on molecular and disease phenotypes. Collaborate with the IGVF Consortium to define a next-generation variant catalog.N/AWe had proposed working on adult and fetal cardiovascular tissue, adult and fetal brain tissue samples and in vitro differentiated cell types from these tissues. But our methods will be generalizable to a large number of biological systems. We are keen to collaborate with all consortium members on their respective systems.One of our aims is to enable model-driven experimental design of large-scale variant and element screens. We have proposed approaches to maximize diversity of designs given budget and/or material constraints. Our methods also enable conditional designs to optimize desired attributes of the sequences to be tested. We can also predict effects of regulatory variants across the allele frequency spectrum (common/rare/de novo) and SNPs/indels.Deep learning models to decipher cis-regulatory sequence syntax, long-range regulatory interactions and effects of variants on TF binding, chromatin accessibility, histone marks, splicing, gene expression. Models can be trained on bulk or single-cell and spatial omics data. Methods for static and dynamic bulk and single-cell QTL analysis and fine-tuning deep learning models on allelic effect sizes. Statistical approaches for fine-mapping causal common and rare disease variants that incorporate variant impact scores from predictive models. Polygenic scores for diseases and traits using variant effect scores from predictive models. 
Model-based design of large-scale perturbation screens (CRISPR/MPRA/STARR-seq).
Design, prediction, and prioritization of systematic perturbations of the human genomeAndrew Allen, William Majoros, David PageThe goals of the Duke Prediction Center are to establish best practices in the design and analysis of perturbation assays, to develop novel and biologically interpretable machine learning models to predict the effect of genomic variation on function by integrating information from multiple regulatory scales, and to map out the functionally constrained human genome and to use this to prioritize genomic variation for pathogenicity.Our project does not have an experimental component; however, we plan to utilize data from a large number of perturbation experiments in our simulation studies and predictive models. The list below highlights many of the types of assays we envision we would be working with, including a large number of assays that we have direct previous experience with (in red). Perturbations: CRISPRko: gene knockout via Cas9; CRISPRdel: long deletions via Cas3; CRISPRi: inhibition via dCas9+KRAB; CRISPRa: activation via dCas9+VP64/P300; MPRA / STARR-seq: reporter assay applied to captured or synthesized genetic variations. Perturbation readouts: Read counts for each guide in bulk or single cells; Read counts for each guide in FACS sorting bins; Genomic DNA reads from target site; MPRA / STARR-seq: sequence of tested element. Downstream effect readouts: RNA-seq: expression of endogenous genes; ChIP-seq: transcription factor binding; MPRA / STARR-seq: expression of reporter gene; Chromosome conformation sequencing; FACS sorting.N/AWe will be developing models that make predictions of the effect of genomic perturbations at various points in the regulatory cascade. For example, we will model the effect of genetic variation on regulatory activity within a DHS as well as the effect of perturbing the DHS (by opening or closing chromatin) on gene expression, etc. 
This series of models will then be composed, so that predictions can be made over large scales, e.g., from genetic variant to gene expression or other phenotype. The types and number of variants/elements we will consider are limited only by the types of perturbation experiments available to model.We will use structured induction of Bayesian networks and other probabilistic graphical models, where initial structures capture known biology, and experiments provide data for learning links and parameters, such as from variants to accessibilities, and from accessibilities to expression levels. We will attempt to construct as complete a network as possible, with nodes corresponding to vast numbers of variants, accessibilities across the genome, expression levels for all genes, and a large number of functional annotations. To accomplish this, for both computational efficiency and integrated learning across the full network, we will employ deep learning methods based on our innovative view of all deep neural networks themselves as probabilistic graphical models (Bayes nets or Markov nets).
Linking Variants to Multi-scale Phenotypes via a Synthesis of Subnetwork Inference and Deep LearningMark Craven, Audrey Gasch, Qiongshi Lu, Robert SteinerDevelop a trainable approach for predicting the phenotypic impact of genetic variants that is a synthesis of subnetwork-inference methods and deep neural networks, active-learning methods for defining and prioritizing genetic-perturbation and other genomic experiments aimed at uncovering the impact of genomic variation on phenotypes, and a statistical framework for identifying and prioritizing genetic modifiers with a particular emphasis on common, regulatory variants that modulate the phenotypic effects and penetrance of rare, coding variants.N/AN/AN/AOur first proposed modeling approach is subnetwork inference coupled with machine learning. For training, the inputs are a background network consisting of entities such as genes/proteins and intracellular interactions, and data sets consisting of specific perturbations and measured responses (molecular, cellular, and even larger scale phenotypes). The outputs are trained functions that modify inference in the network. For testing, the inputs are the background network and a genetic variant of interest. The outputs are predicted phenotypic impacts of the variant. Our second modeling approach is active-learning methods to identify a series of the most informative experiments. The inputs are a current predictive model of interest and possible next experiments. The outputs are (batches of) experiments to run next in order to best achieve the modeling goal. Our final modeling approach is the modifier gene prioritization method. The inputs are summary association statistics from GWAS and WES studies, and a pre-defined gene network. The output is a score for each gene indicating its predicted role as a modifier for a phenotype of interest.
Supporting IGVF by modeling genetics, function, and phenotype with machine learningPredrag RadivojacWe will first develop advanced machine learning approaches to predict variants that impact specific types of molecular function and then integrate them to learn the probability of altered phenotypes. These models will be developed in a disease-agnostic manner but will retain the possibility to include context-specific information, model organism data, researcher input as well as specific diseases as determined by the Consortium. We will additionally develop active learning approaches to support data collection within IGVF by identifying informative assays, variants, and genomic regions for experimentation with an emphasis on resource prioritization and support the development and evaluation of computational tools. The expected outcomes of the proposed work include algorithms and open-source software that can generalize to different types of variants, experimental data and model organisms. Finally, we are proposing to fill in the knowledge gap between computational predictions and high-throughput experiments with experimental validation of prioritized variants in a limited low-throughput format (~8 variants/year in Years 2-5 in the HEK293 or HeLa cell lines, or alternatively ~1-2 variants/year in Years 2-5 in the iPSC-derived models due to higher cost of experiments in these models) to define the mechanisms that are either not covered by IGVF functional centers, or those with the focus on a specific gene, a group of genes/proteins, or a group of diseases. Ultimately, such validation is necessary to demonstrate the overall success and utility of the IGVF Consortium.We will be validating variants using phenotypic assays that include, but are not limited to, protein stability, degradation, ubiquitination, intracellular localization and neurotransmitter internalization, when appropriate. 
These experiments will be able to show how missense and other mutations impact protein function at different levels of severity. We plan to implement such strategies for a limited number of prioritized genes or mutations (up to a dozen variants/year in Years 2-5 in the HEK293 or HeLa cell lines, or alternatively under five variants/year in Years 2-5 in the iPSC-derived models), and we will design assays that are not covered by IGVF functional characterization centers but are assessed to be of importance. Dr. Iakoucheva's lab has a copy of the human ORFeome collection (~17,000 cloned genes) along with numerous alternatively spliced isoforms for a large number of genes through previous collaboration with Marc Vidal. Targeted mutations will be introduced into the ORFeome clones by site-directed mutagenesis or overlapping PCR, with subsequent subcloning of the mutant constructs into corresponding vectors using the Gateway cloning system.To perform experimental validation of prioritized variants, we will utilize functional predictions from the models and will devise experimental strategies tailored to each specific variant or gene; e.g., if we target variants within the genes related to synaptic functions or plasticity, we will introduce prioritized loss-of-function (LOF) mutations into iPSC cell lines using genome editing, and then differentiate them into neurons to investigate neuronal morphology and synapse number. We have a collection of iPSC cell lines that were derived from healthy individuals and are routinely used in the laboratory as controls. Likewise, in the case of transcription factors (TFs), we will investigate the impact of mutations within the TF on the expression of known target genes by qPCR or using luciferase assays. Finally, in the case of variants predicted to impact mRNA stability or half-life, we will use Actinomycin D treatment at different time points to block transcription, and qPCR to quantify mRNA expression. 
From these experiments we will infer the rate of mRNA degradation impacted by synonymous or LOF variants. The experiments will be valuable for gaining mechanistic insights into disease. If we encounter unexpected technical difficulties with iPSC experiments, we will utilize immortalized cell lines (SH-SY5Y, N2A, or more conventional HEK293, HeLa lines) to generate stable cell lines overexpressing genes with prioritized variants, and we will perform functional experiments in these cell lines, which are less demanding than iPSCs.We will be focusing on exonic variation and all types of variants in an exome, including missense variants as well as relatively small insertions and deletions. Our experimental validation is low-throughput yet high-fidelity, and we envision being able to handle up to a dozen variants/year in Years 2-5 in the HEK293 or HeLa cell lines, or alternatively up to five variants/year in Years 2-5 in the iPSC-derived models due to higher experimental costs.We are predominantly a predictive modeling group, with a relatively small low-throughput experimental component. We will be developing methods that will take variants as inputs and provide the type of functional disruption as well as the probability of pathogenicity as an output. Part of our project performs active learning, dedicated here to helping select genes, variants, and assays that would maximize the impact of experimental techniques for different objectives. Our computational approaches will integrate modern principles of machine learning and statistical relational learning, including deep learning and transfer learning, but will also allow for human guidance when running our software.