NATURE

A foundation model of transcription across human cell types


ATAC-seq data processing

Pseudobulk

To determine the chromatin accessibility score for each region, we used the scATAC-seq count table and cell annotation table from each respective study. To aggregate single cells into ‘pseudobulk cell types’, we used the Louvain clustering results from each study. Log count per million (logCPM) was used for each pseudobulk. The annotations of each cluster were used to determine the biological cell type. Empirically, we set a threshold of more than 600 cell counts to ensure adequate sequencing depth for selected cell clusters. A comprehensive table of pseudobulk cell types used during the training process can be found in Supplementary Table 1.

In summary, we used ATAC-seq and expression data from refs. 1,2,22. In total, the dataset encompassed 1.3 million single nuclei. The data were only presented in pseudobulked format. All cell types were primary cell types from normal tissue. No disease states were included in the pretraining dataset. We incorporated further datasets in downstream tasks such as K562 and zero-shot analysis in tumour cells.

Cell-type-specific accessible region identification

For identification of cell-type-specific accessible regions, the peak calling results from the original studies of each dataset were used to obtain a union set of peaks. Subsequently, to compile a list of accessible regions specific to each cell type, we filtered out peaks with no counts.

In the context of the human fetal and adult chromatin accessibility atlas, we used the peak set produced by Zhang et al.2, incorporating the fetal chromatin accessibility atlas originally published by Domcke et al.1. We also trained a version of the fetal-only GET model using the original peak calling and cell type annotation from Domcke et al.1, resulting in comparable expression prediction and regulatory analysis performance. For the 10× multiome data, we used the provided peak fragment count matrix. For the K562 NEAT-seq and bulk chromatin accessibility data, a more permissive version of peaks was called using MACS2 (ref. 60), and different logCPM cutoffs were applied to the resulting peak set to select accessible regions. This accessibility-based data augmentation enhances the diversity of input data and fine-tunes the GET model for data from a single cell type.

Accessibility features

In this study, the chromatin accessibility score for a specific genomic region was defined by the count of fragments located within that region for a given cell type pseudobulk. To enhance the generalizability of the model, these counts were further normalized through the logCPM procedure. Specifically, let t be the total fragment count in a pseudobulk, and let ci be the fragment count in region i. Then, the accessibility score si can be computed as:

$${s}_{i}={\log }_{10}\left(\frac{{c}_{i}}{t}+1\right),t=\sum _{i}{c}_{i}.$$

For most of the regulatory analysis, the binary ATAC version of the GET model was used to comprehensively evaluate the regulatory influences exerted by TFs. In both the training and inference phases of this specific model version, the accessibility scores were uniformly set to 1 if the region was identified as a chromatin accessibility peak. This equates to assuming binary chromatin accessibility states in the studied scenario.

Motif features

To calculate the motif binding score within a specific genomic region, the corresponding sequence was scanned against the hg38 reference genome, using 2,179 TF motif position weight matrices previously compiled by Vierstra et al.13 (accessible at https://www.vierstra.org/resources/motif_clustering). For the scanning process, the MOODS tool was used with default threshold61.

More specifically, to represent sequence information while mitigating feature redundancy, a specialized motif scoring process was implemented. Building on Vierstra’s research, we categorized these 2,179 motifs into 282 motif clusters, a classification determined by position weight matrix similarity. Using this established clustering definition, we eliminated redundant nucleotide-level motif matches, retaining only the match with the highest score within overlapping matches belonging to the same motif cluster. Subsequently, the scores of all non-overlapping motif matches within each motif cluster were summed, yielding one cumulative score for each of the 282 clusters. As a final step, motif binding scores for all regions within a given cell type were determined and subjected to min–max normalization across regions. This normalization facilitates model generalization and the training process, ensuring that each motif cluster’s score is processed in a standardized manner.

The annotation of the 213 cell types used in pretraining followed the original cell type classification provided in the fetal and adult accessibility atlases1,2,22. This classification was achieved through clustering of ATAC-seq count profiles, with subsequent labelling on the basis of the expression of known marker genes. The comprehensive list of tissues and cell types, along with their annotations, can be found at the Human Cell Atlas (http://catlas.org/humanenhancer/#!/cellType). The original atlas included 222 fetal and adult cell types but was further filtered to remove cell types with low sequencing coverage (number of cells less than 600). This approach ensured that our model was trained across diverse cellular contexts while ensuring enough coverage in the chromatin accessibility pseudobulk tracks. The data were not further balanced or curated.

Input data

GET is designed to capture the interaction between different regions and regulators. To facilitate this, we needed each input sample to contain a certain number of consecutive accessible regions, mimicking the ‘receptive field’ of an RNA PolII. Through previous experiments, we found that the ideal equivalent genome coverage is around or more than 2 Mbp, a range in which most of the chromatin contact happens. As a result, on the basis of our current data prepossessing pipeline, we chose to use 200 input regions per training sample. We used non-overlapping sampling during the pretraining and a sliding window approach during fine-tuning. The stride of the sliding window was set to half the number of regions in one sample (that is, 100 peaks per step for samples with 200 peaks).

This number of regions per sample was selected to achieve a balance between computational efficiency and the need to encompass a representative sample of the regulatory landscape for each cell type. It is important to note that the actual genomic span covered by these 200 peaks could vary depending on several factors, including cell-type-specific variations in chromatin accessibility, the threshold applied during peak calling and the chromosomal distribution of the peaks. In the context of the uniformly processed datasets covering fetal and adult cell types, we have observed that 200 peaks typically correspond to a genomic range of approximately 2–4 Mbp. This estimation was derived from the understanding that the human genome, with its roughly 3 billion base pairs, yields about 150,000 accessible peaks when analysed comprehensively. Therefore, a subset of 200 peaks would, on average, represent a genomic span of 2–4 Mbp, given the distribution and density of peaks across different cell types. In the training of our main model, the borders were entirely dependent on the boundary of sampled peaks; no other priors were used. Sampling was performed independently in each chromosome starting from the beginning of the chromosomes.

RNA-seq data processing

Cell type matching

For experiments encompassing multiomics, the correspondence between accessibility and expression was inherently determined through cell barcodes. In pseudobulk cases, where accessibility and expression were assessed independently, cell type annotations were used to facilitate the mapping. Specifically, the fetal expression atlas from Cao et al.23 was used for fetal cell types, whereas adult data were extracted from Tabula Sapiens24. When several ATAC pseudobulk shared the same cell type annotation, identical expression labels were assigned. This compromise was necessitated by the current dearth of multiome sequencing data, a situation expected to change dramatically in the near future.

Expression values

To improve training stability, we log-transformed the expression values as log10(TPM + 1). To overcome the problem of most scRNA-seq quantification being at gene level, not transcript level, we mapped the gene expression to accessible regions using the following approach: if a region overlapped with a gene’s TSS, the gene’s expression value was assigned to that region as a label; if a region overlapped with multiple genes’ TSS, the expression values of the corresponding genes were summed, and the sum was used as the label of that region; if a region did not overlap with any TSS, the corresponding expression label was set to 0. Further, if a promoter had very low accessibility (for instance, an accessibility CPM (aCPM) less than 0.05), we also set the corresponding expression value to 0. Finally, each regulatory element was assigned to an expression target value.

Expression values were allocated to each region within our input. Owing to the limitations of poly(A) scRNA-seq data, only aggregated mRNA levels could be captured, resulting in values that were not reflective of the nascent transcription rate more closely tied to regulatory events. Nonetheless, these values provided valuable cell-type-specific information. The process begins by intersecting the input region list with GENCODE v.40 transcript annotation to pinpoint promoters, followed by assignment of logCPM values to regions corresponding to these promoters. All remaining regions are assigned a value of 0. Although this does not perfectly represent all transcription events happening in a cell, we believe the zero label on the non-promoter region helps to deliver informative negative labels to the model.

Input target

In alignment with the 200 × 283 input matrix, the target input is a 200 × 2 matrix, symbolizing the transcription levels of the corresponding 200 regions across both positive and negative strands.

Model architecture

The GET architecture consists of three parts: (1) a regulatory element embedding layer (RegionEmb); (2) regulatory-element-wise attention layers (Encoder); and (3) a linear output layer as the expression prediction head, or, alternatively, other output heads. GET takes 200 regulatory elements, each with 282 motif binding scores and optionally one accessibility score as an input sample. As a result, the input is a 200 × 283 matrix. When we choose to not use the quantitative accessibility score, we set the values in the 283-th column to 1.

We feed the sample into the RegionEmb layer to generate the regulatory element embedding with a dimension of 768 for each peak. As we do not want to lose information in the input of the original regulatory element, we apply a linear layer to capture the general information in the different classes of TF binding sites. To learn the cis– and trans-interactions between regulatory elements and TFs, we apply 12 Encoder layers with a multihead attention (MHA) mechanism on the regulatory element embeddings.

Suppose Nh, dv and dk denote the number of heads, the depth of values and the depth of keys, respectively. The output from each head h is computed as

$${O}_{h}=\text{softmax}\left(\frac{{X}^{{\prime} }{W}_{q}{({X}^{{\prime} }{W}_{k})}^{T}}{\sqrt{{d}_{{\rm{k}}}}}\right)({X}^{{\prime} }{W}_{v}),$$

where \({W}_{q},{W}_{k}\in {{\mathbb{R}}}^{(n\times D)\times {d}_{{\rm{k}}}},{W}_{v}\in {{\mathbb{R}}}^{(n\times D)\times {d}_{{\rm{v}}}}\) are learnable linear transformations.

Then, we concatenate the output from each head h for the regulatory-element-wise attention block. The layer normalization (LN), feed-forward network (FFN) and residual connections are finally used to generate the output for each layer. Thus, the mechanism behind the regulatory-element-wise attention block can be summarized as:

$${{\rm{z}}}_{l}^{{\prime} }=\text{MHA}(\text{LN}({{\rm{z}}}_{l-1}))+{{\rm{z}}}_{l-1}\,;{{\rm{z}}}_{l}=\text{FFN}(\text{LN}({{\rm{z}}}_{l}^{{\prime} }))+{{\rm{z}}}_{l}^{{\prime} },$$

where \({{\rm{z}}}_{l}^{{\prime} },{{\rm{z}}}_{l-1}\) denote the intermediate representation in the block \(l\), \({{\rm{z}}}_{l-1}\) denotes the output from the block \(l-1\), LN is the layer normalization and FFN is the feed-forward network. We apply two linear layers with a GELU activation layer in the feed-forward network layer.

The GET architecture is similar to the state-of-the-art model Enformer4. However, the following changes helped us to improve upon and exceed the performance of that model: GET uses the regulatory element embedding layer to capture the general information of regulatory elements in the different classes of TF binding site. Moreover, a masked regulatory element mechanism was used to learn the general cis– and trans-interactions between regulatory elements and TFs from different human cell types. Specifically, a random set of positions was uniformly selected to mask out with a mask ratio of 0.5.

Similar to the Vision-Transformer-based Masked Autoencoders62, we replaced the regions in the selected positions with a shared but learnable \([{\rm{MASK}}]\) token; the masked input regulatory element is denoted by \({X}^{\text{masked}}=(X,M,[{\rm{MASK}}])\), where \(X={\{{x}_{i}\}}_{i=1}^{n}\) is the input sample with \(n\) regulatory elements. The training goal is to predict the original values of the masked elements \(M\). Specifically, we take masked regulatory element embeddings \({X}^{\text{masked}}\) as input to GET, while a simple linear layer is appended as the prediction head. Therefore, the overall objective of self-supervised training can be formulated as:

$${\mathcal{L}}={\mathbb{E}}\left(\sum _{i\in M}-\log p({x}_{i}| {X}^{\text{masked}})\right),$$

where \({x}_{i}\) denotes the masked region to be predicted.

Training scheme

We conducted pretraining in the large-scale single-cell chromatin accessibility data. Then we fine-tuned the pretrained model on the paired chromatin accessibility–gene expression data with the same Poisson negative log-likelihood loss function as Enformer4.

The GET implementation is based on the PyTorch framework. For the first training stage, we applied AdamW as our optimizer with a weight decay of 0.05 and a batch size of 256. The model was trained for 800 epochs with 40 warmup epochs for linear learning rate scaling. We set the maximum learning rate to 1.5 × 10−4. The training usually takes around a week for a cluster with 16 V100 GPUs. For the second fine-tuning stage, we used AdamW63 as our optimizer with a weight decay of 0.05 and a batch size of 256. The model was trained for 100 epochs, which completes in around 8 h using eight A100 GPUs. Inference for all genes in a single cell type takes several minutes, making it possible to perform large-scale screening.

Training details

We include a more detailed description of the optimization hyperparameters, computation infrastructure and convergence criteria used in the development of the model in the section below.

Pretraining phase

  1. 1.

    Computation infrastructure: the pretraining of our model was conducted using 16 NVIDIA V100 GPUs or eight A100 GPUs, reflecting the computational demands of our training dataset and the complexity of the model architecture.

  2. 2.

    Epochs and duration: the model underwent 800 epochs of training, which spanned approximately 1 week. This extensive training period was essential for the model to learn the regulatory grammar from chromatin accessibility data across a wide array of human cell types.

  3. 3.

    Learning rate: a base learning rate of 1 × 10−3 together with cosine scheduler and linear annealing warmup in the first epoch.

  4. 4.

    Mask ratio: 0.5.

  5. 5.

    Optimizer: AdamW with weight decay of 0.05.

Fine-tuning phase

  1. 1.

    Computational infrastructure: similar to the pretraining phase, fine-tuning was performed on eight NVIDIA A100 GPUs, ensuring consistency in computational resources.

  2. 2.

    Epochs and duration: the fine-tuning process was shorter, consisting of 100 epochs, and completed in around one day. This phase was crucial for adapting the pretrained model to specific gene expression prediction tasks.

  3. 3.

    Learning rate: A base learning rate of 1 × 10−3 together with a cosine scheduler and linear annealing warmup in the first epoch.

  4. 4.

    Optimizer: AdamW with weight decay of 0.05.

  5. 5.

    Early stopping: to optimize performance and prevent overfitting, we used early stopping on the basis of validation loss, enabling us to select the best model checkpoints for subsequent evaluation.

Parameter-efficient fine-tuning

GET provides an option to perform parameter-efficient fine-tuning over any specific layer through low-rank adaptation (LoRA)64. This is commonly used to adapt to a new assay or platform; we apply LoRA to the region embedding and encoder layers, while doing full fine-tuning on the prediction head. This markedly reduces 99% of the parameters.

Model evaluation

Cross-cell-type prediction

We validated the cross-cell-type prediction performance beyond astrocytes to include a broader range of cell types. The benchmark was performed on fetal cell types with a variable length peak set defined in the original fetal accessibility atlas1. This comparison includes quantitative ATAC GET (n = 3), binarized ATAC GET, linear probing of the Enformer CAGE output tracks trained on the Basenji15 training set and inferred genes in the Basenji test set, and a training cell type mean expression baseline. We used Pearson correlation, Spearman correlation and R2 to evaluate prediction performance in all settings.

Benchmarking against supervised approaches

We have implemented comparisons with the following methods on the task of expression prediction when leaving out chromosome 11 and leaving out astrocytes, using the same input data as GET. The following parameters were used in our implementation.

  1. 1.

    MLP: three linear layers separated by ReLU (layer dimensions: 283 input, 512, 256, two output); SoftPlus was used for output activation.

  2. 2.

    CNN: three Conv1d layers (layer dimensions: 283 input, 128, 64, 32, 3 kernel size) followed by FC(32, 512) → ReLU → FC(512, 2); SoftPlus was used for output activation. We used the same optimizer and parameters as used in GET (base learning rate: 1 × 10−3, cosine scheduler, linear annealing warmup, AdamW optimizer with weight decay of 0.05).

  3. 3.

    CatBoost: we used CatBoostRegressor with loss function MultiRMSE for 1,000 iterations (learning rate: 1 × 10−3).

  4. 4.

    SVM: we used scikit-learn Support Vector Regression with epsilon 0.2, linear kernel and max iterations 1,000. MultiOutputRegressor was used to handle two-dimensional output.

  5. 5.

    Random forest: we used scikit-learn RandomForestRegressor with ten estimators and max depth 10. MultiOutputRegressor was used to handle two-dimensional output.

  6. 6.

    Linear regression: we opted to use linear regression instead of logistic regression because our setting aligns better with regression than classification. We used scikit-learn LinearRegression and MultiOutputRegressor with default parameters.

Leave-out-chromosome evaluation

We performed a leave-one-chromosome-out benchmark across all chromosomes and found that the performance remained consistent across chromosomes, conditioned on the same sequencing platform and data sources. We found an average Pearson correlation of 0.78 (minimum: 0.73, maximum: 0.84) on fetal astrocytes. We also extended our evaluation of leave-out chromosomes to tumour cells from patients with IDH1 wild-type GBM from the Human Tumor Atlas Network. We performed fine-tuning of the base GET model on tumour cells from a single patient (case ID: C3L-03405) and evaluated performance on each leave-out chromosome. This evaluation showed an average Pearson correlation of 0.75 (minimum: 0.68, maximum: 0.81) on leave-out chromosomes.

For K562 OmniATAC prediction, we performed leave-one-chromosome-out prediction for all 22 autosomes, finding an average Pearson correlation of 0.81 (minimum: 0.72, maximum: 0.84). For K562 CAGE prediction, we used GET to predict K562 CAGE (FANTOM5 sample ID: CNhs12336). We note that this comparison privileges Enformer, which was trained extensively on CAGE tracks, including K562 (track ID: 4828 and 5111), whereas GET needed to be transferred to the new assay. Here we evaluated fine-tuned GET against Enformer predictions summed across the two CAGE output tracks for a leave-out peak set across chromosome 14. We selected chromosome 14 because it did not appear in the public Enformer checkpoint’s training or validation set. Pretrained GET was fine-tuned in three ways.

  1. 1.

    BATAC LoRA from BATAC pretrain: in this setting, the base model was trained on the fetal and adult atlases with binarized ATAC signal. In the fine-tuning, the ATAC data were binarized.

  2. 2.

    QATAC LoRA from BATAC pretrain: in this setting, the base model was trained on the fetal and adult atlases with binarized ATAC signal. In the fine-tuning, we used the original aCPM for ATAC signal.

  3. 3.

    QATAC LoRA from QATAC fine-tuned: in this setting, the base model was the leave-out astrocyte RNA-seq prediction model trained on the fetal accessibility and expression atlas. In the fine-tuning, we used the original aCPM for ATAC signal.

These experiments leveraged LoRA parameter-efficient fine-tuning to achieve significant gains in time and storage complexity. On a single RTX 3090 GPU, all fine-tuning converged within 30 min, resulting in a 3 MB K562-CAGE-specific adaptor that could be merged into the base model.

Leave-out-motif evaluation

To explore the impact of omitting motifs in the input features, we used K562 scATAC-seq data from ENCODE (accession: ENCFF998SLH) and evaluated the ATAC prediction performance when holding out randomly selected motifs. We first called peaks with MACS2 with a threshold of q = 0.05. Then, we merged this peak set with the union peak set from the fetal pretraining data, keeping the peaks with at least ten counts in K562. For fine-tuning computational efficiency, we used LoRA parameter-efficient fine-tuning of the binary ATAC checkpoint pretrained on fetal and adult ATAC data with a 200-region receptive field (the pretrained checkpoint used for motif analysis in Fig. 4 and onward).

We explored holding out randomly selected 1, 2, 3, 4, 10 and 20 motifs. For each motif, we checked whether a peak’s binding score was larger than the top 20% scores in its score distribution across the genome. During the training stage, if a peak had any of the leave-out motifs passing this threshold, we set all input motif features of that peak as well as the observation aCPM to zero. In this approach, these ‘knockout’ peaks do not contribute to the loss. During the evaluation stage, we calculated Pearson and Spearman correlations of aCPM only on these knockout peaks with the original observed aCPM. For example, when there was only one leave-out motif CTCF, we were in effect be training with about the top 20% of peaks that had low CTCF binding scores on the training chromosomes, assuming evenly distributed binding sites across chromosomes. Similarly, we were evaluating with 20% of peaks with higher CTCF binding scores in the test chromosomes. In these experiments, we evaluated on held-out chromosome 14.

In general, GET showed robust performance when leaving out one to ten motifs. The performance was degraded heavily when using 20 motifs with a top 20% cutoff for each motif independently, owing to removal of most of the training data.

Platform transfer prediction

When transferring to a new sequencing platform, many domain shifts need to be addressed. These include but are not limited to the following.

  1. 1.

    Sequencing depth: lower depth will lead to fewer captured peaks; it will also affect the signal-to-noise ratio in the accessibility quantification.

  2. 2.

    Peak calling threshold and software.

  3. 3.

    Technical bias due to different library construction and sequencing methods.

  4. 4.

    Biological differences.

Owing to these biases, it is difficult to directly apply a model trained on one dataset to a new platform without fine-tuning. Thus, for a new dataset with multiple cell types available, we took a leave-out cell type approach to fine-tuning. For a dataset of sorted cell types where only one cell type was available, we used leave-out chromosome training.

Transfer to new datasets

The primary challenge in adapting our model to new data lies in ensuring compatibility between the input spaces of the training and new datasets. Variations in cell types, sequencing technologies and preprocessing pipelines can result in substantially different ATAC peak sets, potentially leading to incompatible input and embedding spaces. To address this, we developed a strategy to create a compatible peak set by combining new and training peak sets. When overlaps occur between training and new peaks, we assign priority to the training peak set coordinates. Unique peaks from the new data are incorporated as they are. We used a uniform peak calling pipeline to maintain consistent peak lengths (for instance, 400 bp in the fetal–adult atlas) across training and new datasets. The comprehensive coverage of our fetal-only/fetal–adult peak set (1.3 M peaks) typically results in new, unseen peaks contributing less than 10% of the total peaks. This approach has demonstrated promising transferability to various data types, including SHARE-seq data of perturbed human embryonic stem cells and 10× multiome GBM data.

For example, we tested a ‘one-shot’ fine-tuning procedure using a single patient sample from a new dataset of patients with GBM. We then assessed the performance of this fine-tuned model against the pretrained ‘zero-shot’ model on 16 held-out patient samples. To ensure robust evaluation, we excluded two patients from this analysis to serve as a separate test set for assessing fine-tuning stability. The results were promising: fine-tuning on a single tumour patient sample enabled GET to achieve a Pearson correlation exceeding 0.9 when predicting expression for held-out patients, whereas zero-shot performance reached a 0.67 Pearson correlation. This demonstrates the model’s strong generalization capabilities and its potential for rapid adaptation to new datasets with minimal further training. As the availability of ATAC-seq and multiome data continues to grow, more comprehensive reference peak sets, such as the ENCODE DHS index13 and cPeaks65, will further facilitate the adaptation of the GET model to an even broader range of cell types and experimental conditions.

Transfer to new assays

Here we show results for transfer of pretrained GET to different functional genomics assays. For K562 bulk ATAC prediction, we collected ENCODE OmniATAC-seq data for K562 (ENCSR483RKN). After calling peaks using MACS2 with default parameters, we computed the log(aCPM) by counting Tn5 insertions located inside the peak and filtered out the peaks with log(aCPM) less than 0.03. The remaining peaks and corresponding aCPM were used for motif scanning and prediction. We performed leave-one-chromosome-out fine-tuning using 200 peaks per input sample. The base checkpoint was trained on the fetal and adult atlas with the binarized ATAC setting and 200 peaks per input sample. LoRA was used for all layers. Each fine-tuning took around 160 s to complete eight epochs, after which the model started to overfit. Pearson correlation was collected at eight epochs for all fine-tuning. For CAGE prediction, we collected the K562 CAGE (CNhs12336) BAM file from FANTOM5 and used bedtools to extract alignment counts in peaks called from ENCODE K562 scATAC-seq data (ENCFF998SLH). Fine-tuning was performed using 200 peaks per input sample in three settings, depending on how ATAC information was used in conjunction with motif features, plus the base model used for fine-tuning.

  1. 1.

    BATAC from BATAC pretrain: in this setting, the base model was trained on the fetal–adult atlas with binarized ATAC signal; in the fine-tuning, we used binarized ATAC.

  2. 2.

    QATAC from BATAC pretrain: in this setting, the base model was trained on the fetal–adult atlas with binarized ATAC signal; in the fine-tuning, we used the original aCPM ATAC signal.

  3. 3.

    QATAC from QATAC fine-tuned: in this setting, the base model was the leave-out-astrocyte RNA-seq prediction model trained on the fetal accessibility and expression atlas. We further fine-tuned this model using quantitative ATAC signal.

Dataset considerations

Overall our results indicate that certain intrinsic cellular characteristics may contribute to the observed variations in model performance. We demonstrate that GET can be applied and extended to non-physiological cell types and states and captures cell-type-specific transcription information. Beyond the intrinsic biological differences between cell types, we believe the following factors could also affect performance when generalizing to new datasets.

  1. 1.

    Cell type rarity and library size: rare cell types often have smaller data libraries, which can limit the model’s learning potential and affect the accuracy of predictions.

  2. 2.

    Cell type purity and heterogeneity: the dynamic and heterogeneous nature of certain cell types, such as stem cells, and the precision of identification and classification of cell types can introduce variability in gene expression profiles, complicating the prediction task.

Model interpretation

In this study, we conducted thorough model interpretation analyses to ensure that GET learns useful regulatory information and offers valuable biological insights. Below, we outline the method used to interpret GET.

Model used for interpretation

We trained two GET models on data with or without quantitative accessibility:

  1. 1.

    Quantitative ATAC model: accessibility scores are set to the log10(CPM) of Tn5 insertions in the given accessible region.

  2. 2.

    Binary ATAC model: accessibility scores for all regions are set to \(1\); the model focuses solely on motifs in accessible regions.

In our analysis and regulatory interpretation, we primarily used the binary ATAC model. This approach offers improved attribution to sequence features, ensuring that the model does not overly depend on accessibility signal strength as a surrogate for sequence characteristics.

Feature attribution methods

We used multiple feature attribution methods in different analyses and provided all options to users in our packages. More specifically, the gradient of the model’s output with respect to the input features, represented by the vector \(\nabla f(x)\), measures how much the model output (expression) will change when we change a small amount of the input along a dimension (for instance, a certain motif in a cis-regulatory region). The generalization to multiple outputs in the context of neural network feature attribution extends to the Jacobian matrix \({J}_{i,j}=\frac{\partial {f}_{i}}{\partial {x}_{j}}\), where fi is the ith output, representing the transcription level on either the positive or negative strand, and xj is the jth input feature, comprising scanned and summarized binding scores for 282 TF motif clusters and an extra dimension for accessibility scores. This formulation enables computation of the Jacobian matrix, which is vital for understanding the influence of individual features on the transcription levels.

Enhancer–gene pair prediction

We restricted the benchmark dataset to either the fetal erythroblast peak set or K562 DNase peak set for a fair comparison. To obtain an enhancer importance score for each gene from GET, we used the 2s-norm of the region embedding layer Jacobian and weighted it with the aCPM of each region as the GET Jacobian score. This procedure could potentially be improved in the future, for example, by using a random genomic background as the baseline for the Jacobian calculation and other interpretation methods such as Integrated-Gradients66 or DeepLIFT67. However, we believe the current benchmark dataset size for this task is still limiting compared with the genome scale (104 measured pairs versus 106 to 107 required measurements of genome-wide enhancer–promoter interaction). Thus, we leave the systematic optimization of this task for future work. Other scores used in this study were as follows.

  1. 1.

    ABC: we computed ABC Powerlaw by multiplying the powerlaw function in the official ABC repo with γ = 1.024238616787792 and scale = 5.9594510043736655, values that were trained on K562 Hi-C data and provided in the same repo.

  2. 2.

    Enformer: we used Enformer’s contribution score (gradient × input) with background normalization, following the normalization procedure described by Gschwind et al.33.

  3. 3.

    HyenaDNA: we used the largest pretrained model available through Hugging Face (context length: 1 Mbp). To score enhancer–gene pairs, we performed in silico mutagenesis by knocking down the enhancer element (that is, setting each base pair in the enhancer region to the unknown nucleotide N in the vocabulary set) and comparing against the wild-type likelihood of observing the promoter sequence.

  4. 4.

    DeepSEA: nucleotide-level DeepSEA results were retrieved directly from the original publication and averaged over each peak.

All scores in this benchmark (ABC, Enformer, GET, HyenaDNA, DeepSEA and DNase/ATAC) were further normalized across each gene’s ±100 peaks to make them comparable across genes.

Recent studies have highlighted the dominant importance of one-dimensional genomic distance in governing CRISPRi enhancer knockout effects (for instance, Gschwind et al.33). In this benchmark, most methods include a component of genomic distance. For example, Enformer incorporates exponential decay in its positional encodings. HyenaDNA incorporates a sinusoidal positional encoding over the DNA sequence, and our benchmarking results follow an exponential decay from the TSS (Fig. 3c; NFIX). We have also extended GET to incorporate distance information. In particular, we designed a simple DistanceContactMap module for GET to convert the pairwise one-dimensional distance map between peaks to a pseudo-Hi-C contact map. DistanceContactMap is a simple three-layer two-dimensional convolutional neural network (kernel size: 3) with \({\log }_{10}(\text{pairwise distance}\,+\,1)\) as input and SCALE-normalized observed contact frequency as output. A Poisson negative log-likelihood loss was used to train the model. We trained DistanceContactMap with the same K562 Hi-C data (ENCFF621AIY) used for training ABC Powerlaw, resulting in a 0.855 Pearson correlation, which mostly captured the exponential decay in contact frequency. We termed the prediction of this model ‘GET Powerlaw’. The other two scores shown in Fig. 3d are defined as follows:

  1. 1.

    GET (Jacobian, DNase/ATAC, Powerlaw) = GET Jacobian + aCPM × GET Powerlaw;

  2. 2.

    GET (Jacobian, Powerlaw) = GET Jacobian × GET Powerlaw.

This model could be improved in future work by taking GET region embeddings as further input and learning to predict cell-type-specific three-dimensional contacts.

LentiMPRA zero-shot prediction

The experimental procedure involves designing a library of lentivirus vectors that contain both desired sequence elements and a mini promoter. The vector is randomly inserted into the genome through viral infection; the regulatory activity is then measured through sequencing and counting the log copy number of transcribed RNAs and integrated DNA copies.

To simulate this approach using GET, we first collected the sequence element library and constructed the vector sequence for insertion, including both the regulatory sequences and mini promoters. We then followed the same data preprocessing procedure to get the motif scores of the inserted elements. For each element, we performed in silico insertion by summing its motif score with an existing region on the genome. The ±100 regions centred around the insertion region were then used as an input sample for GET to make expression predictions. The mean predicted expression (log10(TPM)) was multiplied by the mean predicted accessibility as the predicted regulatory activity. For each region, we performed 600 insertions across the genome to match the experimental insertion count. We used the GET model fine-tuned on K562 NEAT-seq data to perform the inference. In total, in silico lentiMPRA of all 200,000 elements in K562 took around 5 days.

For Enformer, we performed the same analysis, with the only difference being that we integrated the vector sequence to a random position on the genome and collected a 196,608 bp sequence centred around the insertion site. Enformer is trained on 5,313 human epigenome tracks, with 486 experiments specifically for K562. To compute the regulatory activity, we selected the output from the K562 CAGE track, which is a quantitative and nucleotide-level map of the 5′ regions of transcripts. Following the practice of the original study, we used the average output of the three bins in the centre of the sequence as the predicted expression for a sample. Each element was also inserted into 600 random genome locations to compute the final averaged regulatory activity. We were only able to perform these experiments for 1,000 enhancers and 1,000 non-enhancer elements owing to the time complexity of Enformer inference. The comparison with GET was performed on the same set of elements.

We stratified the K562 lentiMPRA elements (approximately 200,000) by overlapping the annotated 15 ENCODE ChromHMM states computed from histone mark and other ChIP–seq data for K562. We selected the elements overlapping with states ‘12 EnhBiv’, ‘6 EnhG’ and ‘7 Enh’ as enhancers, and those overlapping with ‘13 ReprPC’, ‘14 ReprPCWk’ and ‘15 Quies’ as repressive and quiescent regions.

Identifying important regions and regulators

We first gathered inference samples across the genome by producing 200-region windows centred around each gene’s promoter. Given a specific gene g on strand \(s\in \{0,1\}\), the expression value can be inferred using the GET model f applied to an input matrix \({X}\in {{\mathbb{R}}}^{r\times m}\), where r denotes the number of regions, and m includes motifs and (optionally) accessibility features:

$$\begin{array}{c}E=f(X)\\ {{E}}_{g}=E[r//2,s],\end{array}$$

where \([\cdot ]\) is the index-selection operator and s is the strand of the gene.

The Jacobian matrix (tensor) JX \({{\mathbb{R}}}^{r\times 2\times r\times m}\) of \(f\) at the point (E, X) evaluates how each output dimension will change when each input dimension changes by a small quantity. We specifically pick the output dimension and strand that correspond to the given gene, represented by \({\rm{\nabla }}g\in {{\mathbb{R}}}^{r\times m}\):

$$\begin{array}{c}\nabla g={J}_{X}[r//2,s]\\ {J}_{X}=\frac{\partial E}{\partial X}.\end{array}$$

The feature (motif) importance vector \({{\rm{v}}}_{g}\in {{\mathbb{R}}}^{m}\) is obtained by multiplying the gradient element-wise with the original input and summarizing across regions:

$${{\rm{v}}}_{g}=\mathop{\sum }\limits_{i=1}^{r}(\nabla g\odot X)[i,:],$$

where \(\odot \) signifies the element-wise or Hadamard product. As the gene-by-motif matrix is mostly used for feature–feature interaction analysis, we use the \(X\) with quantitative ATAC signal even when we infer \({J}_{X}\) using a binary ATAC model. This facilitates study of the relationship between regulators and observed chromatin accessibility.

The cell-type-specific genome-wide gene-by-motif matrix for cell type c, Vc is acquired by concatenating the \({{\rm{v}}}_{g}\) across the genome. The same process can be applied to different cell types.

Similarly, the region importance vector \({l}_{g}\in {{\mathbb{R}}}^{r}\) is given by:

$${l}_{g}=\mathop{\sum }\limits_{j=1}^{m}(\nabla g\odot X)[:,j].$$

In practice, we use the l2-norm of the Jacobian of the region embedding with respect to output for calculating the region importance score as the embedding score distribution is less skewed than the input motif binding score, potentially making the Jacobian more comparable across regions. The per-region Jacobian score is normalized by the maximum score per gene to make scores comparable across genes with different expression levels.

Gene ontology enrichment of top target genes of a regulator

Using the gene-by-motif matrix \({V}_{c}\), we can choose a TF or motif (in our case, GATA) and ask which genes will be mostly affected by this TF by identifying the largest entries in the motif column. We chose the top 1,000 genes and performed gene ontology enrichment analysis using g:Profiler with the default g_SCS multiple hypothesis testing correction. We filtered the results using term size (gene number in a term definition) greater than 500 and less than 1,000. Terms with adjusted P value less than 0.05 were retained as significant terms. We further selected TFs in the ‘Hemopoiesis’ term with expression log10(TPM > 1) for visualization against the GATA motif score.

TF and target gene correlation

In this analysis, we sought to elucidate the relationships between TFs and expression of their target genes across different cell types. Gene-by-motif files were aggregated and organized into a unified structure comprising genes, motifs and corresponding cell features. We identified the target genes for each TF within predefined motif clusters and computed the mean expression levels of both the target genes and the corresponding TFs. To avoid potential artifacts in the expression measurement caused by experimental batch effects, we analysed both adult and fetal cell types, and fetal cell types only, and found similar results. The analysis was performed iteratively for all TFs within the motif clusters specific to fetal cell types.

Regulatory embedding

GET is configured using a cross-cell-type architecture to extract the regulatory context for genes spanning various cell types, embedding them within a shared high-dimensional space. We collected the embedding of each gene after each transformer block of GET. The embedding of a gene g is defined as the embedding vector of the promoter in the output of the ith block. The embedding contains both promoter information and information from surrounding regions owing to the attention mechanism. In general, the deeper the layer, the more its space is dominated by the expression output (Supplementary Fig. 4). tsne-cuda was used to visualize the embedding owing to data size. Louvain clustering was performed on the embedding space to colourize the visualization. Resolution was arbitrarily chosen to keep the cluster number around 10 and close to the UMAP density. For cell-type-based subsampling, UMAP68 was used instead for visualization for better visual separation between clusters.

We computed the embeddings in two different settings: the cell-type-specific setting, in which each dot is a gene embedding from a specific cell, and the cell-type-agnostic setting, in which each dot is a gene embedding randomly sampled from all cell types; 50,000 embeddings were sampled in the second case to make the UMAP computation feasible.

Causal discovery of regulator interaction

We performed pairwise Spearman correlation using the gene-by-motif matrix in both cell-type-specific and cell-type-agnostic settings. Input × gradient scores were used to construct the matrix for computational efficiency. For the cell-type-specific settings, all genes with promoter overlap with open chromatin peaks in each cell type were used in the correlation calculation. Causal discovery was performed on the gene-by-motif matrix using LiNGAM69. For the cell-type-agnostic settings, 50,000 genes were randomly sampled from all cell types, and the resulting matrix was subjected to the LiNGAM algorithm implemented in the Causal Discovery Toolbox Python package with default parameters.

To benchmark the predicted causal edges in the cell-type-agnostic setting, we downloaded the known physical interaction subnetwork from the STRING v.11 database39 and kept interactions with a combined score greater than 400 as the ground truth label. As the pairs predicted by GET were on the motif cluster level, we mapped the physical interactions between TFs on to the motif clusters on the basis of the motif cluster annotation. The resulting motif–motif physical interaction network was then compared with our prediction to calculate the precision. We also downloaded and compiled all significant interactions determined by mass spectroscopy40 and mapped them to motif–motif interactions for comparison. For comparison with ChIP–seq colocalization, we acquired colocalization results between ChIP–seq tracks for 677 TFs in HepG2 from TF Atlas. The method for calculating colocalization is documented in the ChIP-Atlas repo. Each ChIP–seq peak set was stratified into three tiers (high, mid and low). Then, for a pair of TFs P1 and P2, we checked colocalization between every tier pair and assigned scores with a preference for high–high colocalization (score = 9). If the strong binding peaks of P1 overlapped with strong binding peaks of P2, the P1–P2 interaction was considered to be more robust than in the case in which the P1 strong binding sites only overlapped with the P2 weak binding sites. We present the colocalizations stronger than mid–mid interactions (score ≥ 4) in Fig. 4b, as these represent the more reliable interactions. A stronger cutoff (score ≥ 9, keeping only high–high interactions) reduced the performance to a 0.097 macro F1 score at 2% recall.

For comparison with motif colocalization, we collected the GET input matrix (accessible-region-by-motif) for hepatocytes or concatenated the input matrix across all fetal and adult cell types. Pairwise Pearson correlation was computed across all collected regions, resulting in a score for every pair of motifs. For the cell-type-specific motif–motif interactions in the GET catalogue, we performed causal discovery using the gene-by-motif matrix for all cell types. Interactions with the top 5% absolute effect size were retained in the final database. For each interaction, we performed structural analysis between the two TFs with the highest expression in the corresponding cell types.

Structural analysis

AlphaFold benchmark on intrafamily binder prediction

We classified a TF as an intrafamily binder if any two members in its TF family had a known physical interaction annotated in the STRING v.11 database, on the basis of the hypothesis that if a TF can bind as a heterodimer, it should also have the potential to bind as a homodimer owing to sequence and structure similarity (although the dimerization affinity might be different). We thus used AlphaFold to predict the hypothetical homodimer structures of all known TFs and tried to predict whether a TF could be an intrafamily binder using AlphaFold-based metrics. We used several different AlphaFold-based metrics, including mean_plddt (average pLDDT score across all residues), pAE (predicted aligned error across all interchain interactions), pDockQ (predicted DockQ metric using interface pLDDT) and pDockQ × pAE. We found that pDockQ × pAE led to the best area under the receiver operating characteristic curve (0.69) and area under the precision–recall curve (0.41) when classifying intrafamily binder TFs.

Protein sequence segmentation

pLDDT from AlphaFold is a reliable protein domain caller owing to its accurate structure prediction performance. We segmented each TF protein sequence into low and high pLDDT regions. Empirically, we found that 80% (recall) of known DNA-binding domains could be easily identified using high pLDDT regions plus a high ratio of positively charged residues. More specifically, we first computed smoothed pLDDT using a ten-amino-acid moving-average kernel and then normalized the score by dividing by the maximum. After that, any region that had a smoothed pLDDT score less than 0.6 was defined as a low pLDDT region. If two low pLDDT regions were close (less than 30 amino acids), they were merged into one. Any region that was not a low pLDDT region was labelled as a high pLDDT region.

Multimer structure prediction

LocalColabFold and ColabFold were used to predict multimer structures with the AlphaFold Multimer v.2.3 model. For homodimer prediction, we used all five models with three recycles. For our large-scale interaction screening, we used model 3 with three recycles for each prediction. The pAE and pLDDT were stored for downstream analysis. pDockQ was calculated using code from FoldDock70.

If the multimer structure had a newly appearing peak, we treated it as evidence of potential interaction. pAE, pDockQ and ipTM were further checked to assess the confidence of the interaction. After AlphaFold3 was released, we reperformed structure prediction for full-length PAX5–NR2C2 sequences and identified the same PAX5 G183–NR domain interaction.

Molecular dynamics simulation

The initial configuration was prepared from the AlphaFold predicted PDB file. The Amber99SB-dispersion (a99SBdisp) force field was used for system parameterization. A cubic simulation box was defined with a box size of 1 nm. Subsequently, the system was solvated using the TIP4P water model through the solvate module. To neutralize the system and generate physiological ion concentrations, sodium (Na+) and chloride (Cl) ions were added using the genion module. The energy minimization terminated upon reaching a maximum force below 1,000 kJ mol−1 nm−1. Each minimization iteration used a step size of 0.01 and was configured to run for a maximum of 50,000 steps. The system was then equilibrated in two steps: first in the NVT (constant number, volume, temperature) ensemble and then in the NPT (constant number, pressure, temperature) ensemble for 100 ps of simulation time. A 100 ns production run was then performed, and trajectories and energy profiles were stored for subsequent analysis. All configs of these are available at the Proscope repo (https://github.com/fuxialexander/proscope). In our analysis, we found that the per-residue pLDDT scores for ZFX, IDR and TFAP2A in the multimer structure were correlated strongly with residue instability, as measured by root mean squared distance, consistent with the results of previous studies indicating that AlphaFold implicitly learns protein folding energy functions.

Structure visualization

ChimeraX was used to visualize the predicted structures.

Biological experiments

Cell lines

HeLa cells (CCL-2) and REH cells (CRL-8286) were purchased from ATCC. Cell lines purchased from a certificated cell line bank were not further authenticated. All cell lines tested negative for mycoplasma. No commonly misidentified cell lines were used in the study.

TFAP2A coimmunoprecipitation

HeLa cells were cultured in DMEM (Gibco, catalogue no. 11965) supplemented with 10% defined fetal bovine serum (HyClone, SH30070), at 37 °C and 5% CO2. HeLa cell protein lysates were generated with 0.5% NP-40 lysis buffer (50 mM Tris-HCl, 150 mM NaCl, 0.5% NP-40) with a phosphatase and protease inhibitor cocktail (Sigma-Aldrich, PPC1010). Samples were incubated with 5 µg agarose-conjugated TFAP2A primary antibody (Santa Cruz Biotechnology, sc-12726 AC) overnight at 4 °C before being run in Laemmli loading buffer (BioRad, 1610737). Proteins were separated on 10% Tris–glycine gels (ThermoFisher, XP00100), transferred to polyvinylidene fluoride membranes (Immobilon-P, IPVH00010) and probed with primary antibodies against TFAP2A (ABclonal, A2294, 1:750), ZFX (ThermoFisher, PA5-34376, 1:500) and β-actin (Santa Cruz Biotechnology, sc-47778, 1:10000), followed by chemiluminescence detection. A repeat experiment was performed for coimmunoprecipitation negative controls, which were probed with primary antibodies against SRF (Abclonal, A16718, 1:750) and β-actin (Cell Signaling Technology, 4967, 1:10000), followed by chemiluminescence detection.

Proximity labelling assay to detect PAX5–NR2C2 interactions

We initially cloned PAX5-WT and the PAX5 G183S mutant into the pCDNA3.1-MCS-13Xlinker-BioID2-HA (Addgene, catalogue no. 80899)71. After verification, we subcloned PAX5-WT-13Xlinker-BioID2-HA and PAX5-G183S-13Xlinker-BioID2-HA into the pCDH-GFP-puro vector (System Bioscience, CD513B-1). We transduced the REH B-ALL cell line (ATCC, CRL-8286) with pCDH-PAX5-WT-13Xlinker-BioID2-HA-GFP and with pCDH-PAX5-G183S-13Xlinker-BioID2-HA-GFP and selected transduced cells with puromycin (1 μg ml−1) to generate stable cell lines. The proximity labelling assay was performed following previously published methods71,72,73. Briefly, REH stable cell lines with control vector pCDH-13Xlinker-BioID2-HA-GFP, pCDH-PAX5-WT-13Xlinker-BioID2-HA-GFP and pCDH-PAX5-G183S-13Xlinker-BioID2-HA-GFP were incubated with 100 μM biotin (Sigma-Aldrich, B4501) for 24 h. We collected the cells, washed them twice in cold phosphate-buffered saline and incubated them for 50 min on ice with occasional vortexing in lysis buffer (150 mM NaCl, 10 mM KCl, 10 mM Tris-HCl pH 8.0, 1.5 mM MgCl, 0.5% IGEPAL) supplemented with protease and phosphatase inhibitors (Life Technologies, catalogue no. 78443) and 63 U of benzonase (Sigma-Aldrich, catalogue no. 70746-3). Proteins were clarified by centrifugation at 21,000g for 15 min at 4 °C. We performed total protein quantification using a Pierce BCA Protein Assay kit (ThermoFisher Scientific, catalogue no. 23225) and incubated 1 mg of total protein extract with 100 μl of magnetic streptavidin beads (Dynabeads MyOne Streptavidin C1, Life Technologies, catalogue no. 65002) on a rotator at 4 °C overnight to isolate biotinylated proteins. We washed the beads twice with lysis buffer, once with 1 M KCl, once with 0.1 M Na2CO3, once with 2 M urea 10 mM Tris-HCL pH 8.0, and twice again with lysis buffer. Biotinylated proteins were eluted by boiling in 4× protein loading buffer supplemented with 2 mM biotin and 50 mM dithiothreitol at 95 °C for 10 min. Biotinylated proteins in total protein extracts or immunoprecipitates were detected by western blotting using standard protocols and the following antibodies: streptavidin–HRP antibody (Life Technologies, catalogue no. S911, 1:1000), anti-PAX5 (Cell Signaling, catalogue no. 8970, 1:500), anti-HA (Cell Signaling, catalogue no. 3724, 1:1000), anti-NR2C2 (Cell Signaling, catalogue no. 31646, 1:500), anti-NCOR1 (Cell Signaling, catalogue no. 5948, 1:500), NRIP1–HRP (Santa Cruz Biotechnology, sc-518071, 1:200) and NR3C1 (Cell Signaling, catalogue no. 12041, 1:500). Proteins were detected using a Li-Cor Odyssey OFC instrument and quantified using the GelAnalyzer 23.1 software.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



Source link

PennsylvaniaDigitalNews.com