Single cell transcriptional regulatory network analysis tool (TF, motifs)-SCENIC

created at 07-14-2021 views: 55

Introduction

I believe that many topics should have already talked about the use of SCENIC. Here, I will give a detailed explanation of the problems I encountered in my own learning and use of SCENIC. I hope that it can help everyone understand and study SCENIC.

SCENIC (Single-Cell rEgulatory Network Inference and Clustering) is a tool to infer gene regulatory networks and their related cell states from single-cell RNA data.
The author applied SCENIC to tumor and mouse brain single-cell atlas data, proving that cis-regulatory network analysis can help to dig deeper into the biological significance behind cell heterogeneity, and is useful for the diagnosis, treatment, and developmental differentiation of diseases. Provide valuable clues.

SCENIC was first published in nature methods in 2017 and published in nature protocls after sorting out the process in 2020. There are R version and python version.

SCENIC: single-cell regulatory network inference and clustering https://www.nature.com/articles/nmeth.4463
A scalable SCENIC workflow for single-cell gene regulatory network analysis https://www.nature.com/articles/s41596-020-0336-2

The main topic here is the R version (no matter which language, you can get the code online, the important thing is to understand the meaning of doing it)

The author provides a detailed workflow of SCENIC: here

Construction of gene regulatory network (GRN):

  1. Determine the potential target genes of each TF based on co-expression: filter the expression matrix and run GENIE3/GRNBoost;
    Based on the co-expression, the co-expression module between the transcription factor and the candidate target gene is inferred.
  2. Use RcisTarget to analyze cis-regulatory motif for each co-expression module
  3. Use the AUCell algorithm to score the regulon activity in the cell (calculate AUC)
  4. Determine the stable cell state according to the activity of regulons and explore the results

In short, the gene regulatory network is inferred based on co-expression and DNA motif analysis, and then the network activity is analyzed in each cell to identify the cell state.

reference: here

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("AUCell", "RcisTarget"))
BiocManager::install(c("GENIE3")) # Optional. Can be replaced by GRNBoost
BiocManager::install(c("zoo", "mixtools", "rbokeh"))
BiocManager::install(c("DT", "NMF", "ComplexHeatmap", "R2HTML", "Rtsne"))
BiocManager::install(c("doMC", "doRNG"))
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("aertslab/SCopeLoomR", build_vignettes = TRUE)
devtools::install_github("aertslab/SCENIC") ##Import SCENIC

Here I installed SCENIC on the server. During the installation process, I also encountered many errors, mainly some package import errors (mainly caused by the network and environment). Reinstall the wrong package with conda. Be patient. .

Species specific database (human, mouse, fruit fly)

#RcisTarget Species Specific Database
dbFiles1 <- c("https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-500bp-upstream-7species.mc9nr.feather",
"https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-tss-centered-10kb-7species.mc9nr.feather")   
#human,SCENIC uses the database to compare gene promoters (up to 500bp upstream of TSS) and motifs in 20kb (+/-10kbp) around TSS.
dbFiles2 <- c("https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-500bp-upstream-7species.mc9nr.feather",
"https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-tss-centered-10kb-7species.mc9nr.feather")  # mouse
dbFiles3 <-c("https://resources.aertslab.org/cistarget/databases/drosophila_melanogaster/dm6/flybase_r6.02/mc8nr/gene_based/dm6-5kb-upstream-full-tx-11species.mc8nr.feather") # Fruit flies
dir.create("./SCENIC") #Build SCENIC analysis folder
setwd("SCENIC")
dir.create("./SCENIC/cisTarget_databases") # Build the database folder
setwd("cisTarget_databases")
for(featherURL in c(dbFiles1,dbFiles2,dbFiles3))
{
  download.file(featherURL, destfile=basename(featherURL)) # saved in current dir
}

Import the package and perform SEUNIC analysis

For the sake of unification and convenience, we still use the pbmc3k data set mentioned before (). After all, there are more people using 10x data. Of course, the importing authors of other data formats also provide code access. The author also provides an example data set as input, and I provide the corresponding code at the end of the article.

# import R packages
library(Seurat)
library(tidyverse)
library(patchwork)
library(SCENIC)

Import pbmc3k data and define cell types. (Before, only the pbmc3k data was clustered, but the cell type was not defined. Here is the definition of cell type).

pbmc <-readRDS("/home/wucheng/jianshu/function/data/pbmc.rds") 
setwd("/home/wucheng/SCENIC/SCENIC_pbmc3k") 
current.cluster.ids <- c(0:8)
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
pbmc@meta.data$celltype <- plyr::mapvalues(x = pbmc@meta.data[,"seurat_clusters"], from = current.cluster.ids, to = new.cluster.ids)
head(pbmc@meta.data)
pdf(file="celltype_umap.pdf",width=10,height=10)    
DimPlot(pbmc, reduction = "umap", group.by= "celltype",label = TRUE, pt.size = 0.5) + NoLegend()
dev.off()

SEUNIC analysis

Prepare cell meta information

cellInfo <- data.frame(pbmc@meta.data)
colnames(cellInfo)[which(colnames(cellInfo)=="orig.ident")] <- "sample"
colnames(cellInfo)[which(colnames(cellInfo)=="nFeature_RNA")] <- "nGene"
colnames(cellInfo)[which(colnames(cellInfo)=="nCount_RNA")] <- "nUMI"
colnames(cellInfo)[which(colnames(cellInfo)=="seurat_clusters")] <- "cluster"
colnames(cellInfo)[which(colnames(cellInfo)=="celltype")] <- "celltype"
cellInfo <- cellInfo[,c("sample","nGene","nUMI","cluster","celltype")]
saveRDS(scenicOptions, file="int/cellInfo.Rds")
# Set the celltype color, or not set it, use the default color
colVars <- list(celltype=c("Naive CD4 T"="forestgreen",  "CD14+ Mono"="darkorange",  "Memory CD4 T"="magenta4", 
"B"="hotpink",  "CD8 T"="red3",  "FCGR3A+ Mono"="skyblue",   "NK"="darkblue",  "DC"="yellow",  "Platelet"="grey"))
colVars$celltype <- colVars$celltype[intersect(names(colVars$celltype), cellInfo$celltype)]
saveRDS(colVars, file="int/colVars.Rds")

Initialize SCENIC settings, set up the analysis environment

org <- "hgnc" # mgi stands for mouse, hgnc stands for human, dmel stands for fly, our pbmc3k is human, choose hgnc
dbDir <- "./SCENIC/cisTarget_databases" # RcisTarget databases location
myDatasetTitle <- "SCENIC example on pbmc3k" # choose a name for your analysis
data(defaultDbNames)
dbs <- defaultDbNames[[org]]
scenicOptions <- initializeScenic(org=org, dbDir=dbDir, dbs=dbs, datasetTitle=myDatasetTitle, nCores=10)  # nCores=10, which means that 10 threads are enabled for calculation
scenicOptions@inputDatasetInfo$cellInfo <- "int/cellInfo.Rds"
scenicOptions@inputDatasetInfo$colVars <- "int/colVars.Rds"
saveRDS(scenicOptions, file="int/scenicOptions.Rds") 

Build a Co-expression network

exprMat <- as.matrix(pbmc@assays$RNA@counts) #Prepare the expression matrix. In order to save computing resources, only some cells are randomly selected to calculate the co-expression network
genesKept <- geneFiltering(exprMat, scenicOptions=scenicOptions,minCountsPerGene=3*.01*ncol(exprMat),minSamples=ncol(exprMat)*.01) #Gene filtering/selection, remove genes that are most likely to be noise
exprMat_filtered <- exprMat[genesKept, ]
dim(exprMat_filtered)
runCorrelation(exprMat_filtered, scenicOptions) ##Calculate the correlation matrix, 1.2_corrMat.Rds: the correlation matrix between genes

Infer potential transcription factor targets based on expression data. Using GENIE3 or GRNBoost, GENIE3 is very time-consuming and computationally intensive (it takes several hours or days on a data set of 3-5k units), and GRNboost can be used in a very short time. Provide results similar to GENIE3 within time. For R used here, select GENIC3

##GENIE3, the input of GENIE3 is usually an expression matrix and a list of candidate regulators.
exprMat_filtered <- log2(exprMat_filtered+1) #complete matrix
runGenie3(exprMat_filtered, scenicOptions, nParts = 10) #nParts parameter, is to divide the expression matrix into n parts and calculate separately
#1.4, GENIE3_linkList.Rds: The final result of GENIE3 is to merge the files starting with "1.3_"

Run here, then enter the key analysis steps of SCENIC
1. Get the co-expression module
2. Get the regulator (using RcisTarget): TF motif analysis)
3. Score the GRN in the cell (use AUCell)
4. Cluster cells based on GRN activity

exprMat_log <- log2(exprMat+1) #log standard words original matrix
#scenicOptions <- readRDS("int/scenicOptions.Rds")
scenicOptions@settings$verbose <- TRUE
scenicOptions@settings$nCores <- 1
scenicOptions@settings$seed <- 123
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions) #1. Get coexpression module
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions) #2. Get regulons
#You can use the parameter coexMethod=c("top5perTarget"), coexMethod is available. The author tried a variety of strategies to filter low-relevance TF-Target, and suggested that all six filtering criteria should be used. In fact, it does not take much time. Both are calculated by default.
#w001: Use each TF as the core to reserve genes with weight>0.001 to form a co-expression module; #w005, #top50
#top5perTarget: Each gene retains the weight value of top5 TF to get a simplified TF-Target association table, and then assign the gene to TF to construct a co-expression module; #top10perTarget; #top50perTarget
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log) #3. Scoring the GRN (regulator) in the cell
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
runSCENIC_4_aucell_binarize(scenicOptions, exprMat=exprMat_log)

After running runSCENIC_2_createRegulons:

  • output/Step2_MotifEnrichment.tsv
  • output/Step2_MotifEnrichment_preview.html
  • output/Step2_regulonTargetsInfo.tsv file

Step2_MotifEnrichment.tsv opens as follows:

Step2_MotifEnrichment.tsv

  • eneSet: the name of the gene set;
  • motif: motif ID,
  • NES: standardized enrichment score of motif in the gene set;
  • AUC: Area under the curve (used to calculate NES);
  • TFinDB: Indicate whether the highlighted TF is contained in a high-confidence note (two asterisks) or a low-confidence note (one asterisk);
  • TF_highConf: Transcription factor annotated to motif according to "motifAnnot_highConfCat";
  • TF_lowConf: Transcription factor annotated to motif according to "motifAnnot_lowConfCat";
  • richedGenes: genes ranked higher on a given motif; nErnGenes: the number of top-ranked genes;
  • rankAtMax: rank at the largest enrichment, used to determine the number of enriched genes.

Step2_MotifEnrichment_preview.html

MotifEnrichment_preview.html

Significantly enriched motif comment information, the header is the same as the previous file.

Step2_regulonTargetsInfo.tsv

Step2_regulonTargetsInfo.tsv

  • TF: transcription factor name;
  • gene: TF target gene name;
  • nMotif: the number of motifs of the target gene in the database; bestMotif: the most significantly enriched motif name
  • NES: standard enrichment score, the higher the score, the more significant;
  • highConfAnnot: is it a high-confidence annotation; Genie3Weight: the correlation weight between TF and the target gene

Visualization

After running runSCENIC_3_scoreCells, a regulonAUC matrix is generated, which can be displayed with a heat map, and the t-SNE map separately displays the activity distribution of each regulon.

heatmap

runSCENIC_4_aucell_binarize is to perform binary conversion and derivative analysis, this step is not necessary

 binary conversion

Here, the activity score of GRNs in each cell can also be used to cluster the cells, which may result in different cell state classifications.

nPcs <- c(5,15,50)
scenicOptions@settings$seed <- 123 # same seed for all of them
# Run t-SNE with different settings:
fileNames <- tsneAUC(scenicOptions, aucType="AUC", nPcs=nPcs, perpl=c(5,15,50), onlyHighConf=TRUE, filePrefix="int/tSNE_oHC")
# Plot as pdf (individual files in int/):
pdf("int/AUC_tsne.pdf",width=10,height=10)
plotTsne_compareSettings(fileNames, scenicOptions, showLegend=FALSE, varName="celltype", cex=.5)
dev.off()

 cluster the cells

Of course, in fact, the analysis here is still a bit awkward, regulons may need to be further selected, visualized, and combined with the analysis of the Seurat clustering results is relatively simple and practical.

pbmc <-readRDS("/home/wucheng/jianshu/function/data/pbmc.rds")  # Import pbmc3k data
current.cluster.ids <- c(0:8)
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
pbmc@meta.data$celltype <- plyr::mapvalues(x = pbmc@meta.data[,"seurat_clusters"], from = current.cluster.ids, to = new.cluster.ids)
head(pbmc@meta.data)

##Import the original regulonAUC matrix, which is the AUC matrix generated after running runSCENIC_3_scoreCells, you can view the AUC activity score of each GRN in each cell
setwd("/home/wucheng/SCENIC/SCENIC_pbmc3k")
AUCmatrix <- readRDS("int/3.4_regulonAUC.Rds")
AUCmatrix <- data.frame(t(AUCmatrix@assays@data@listData$AUC), check.names=F)
RegulonName_AUC <- colnames(AUCmatrix)
RegulonName_AUC <- gsub(' \\(','_',RegulonName_AUC) #replace "(" with "_"
RegulonName_AUC <- gsub('\\)','',RegulonName_AUC) # Remove ")" and finally replace KLF3_extended (79g) with KLF3_extended_79g
colnames(AUCmatrix) <- RegulonName_AUC
pbmcauc <- AddMetaData(pbmc, AUCmatrix) # Add the AUC matrix to the metadata information of pbmc
pbmcauc@assays$integrated <- NULL
saveRDS(pbmcauc,'pbmcauc.rds')

## Import binary regulonAUC matrix
BINmatrix <- readRDS("int/4.1_binaryRegulonActivity.Rds")
BINmatrix <- data.frame(t(BINmatrix), check.names=F)
RegulonName_BIN <- colnames(BINmatrix)
RegulonName_BIN <- gsub(' \\(','_',RegulonName_BIN)
RegulonName_BIN <- gsub('\\)','',RegulonName_BIN)
colnames(BINmatrix) <- RegulonName_BIN
pbmcbin <- AddMetaData(pbmc, BINmatrix)
pbmcbin@assays$integrated <- NULL
saveRDS(pbmcbin, 'pbmcbin.rds')

##Use Seurat to visualize AUC
dir.create('scenic_seurat')
#FeaturePlot
GRNs <-intersect(colnames(AUCmatrix),colnames(BINmatrix))
for(i in 1:length(GRNs)){
p1 = FeaturePlot(pbmcauc, features=GRNs[i], label=T, reduction = 'umap')
p2 = FeaturePlot(pbmcbin, features=GRNs[i], label=T, reduction = 'umap')
p3 = DimPlot(pbmc, reduction = 'umap', group.by = "celltype", label=T)
plotc = p1|p2|p3
ggsave(paste("scenic_seurat/",GRNs[i],".png",sep=""), plotc, width=14 ,height=4)
}

 Seurat clustering results

It can be found that there are regulatory networks formed by transcription factors that mainly regulate certain specific cell types, such as CEBPA here.

#RidgePlot&VlnPlot, like single cell display of gene expression, here you can use the violin chart to display the activity score of GRNs
for(i in 1:length(GRNs)){
p1 = RidgePlot(pbmcauc, features =GRNs[i], group.by="celltype") + theme(legend.position='none')
p2 = VlnPlot(pbmcauc, features =GRNs[i], pt.size = 0, group.by="celltype") + theme(legend.position='none')
plotc = p1 + p2
ggsave(paste("scenic_seurat/","Ridge-Vln_",GRNs[i],".png",sep=""),plotc, width=10, height=8)
}

CEBPA

Of course, you can also use heat maps to visualize SCENIC results

library(pheatmap)
cellInfo <- readRDS("int/cellInfo.Rds")
celltype = as.data.frame(subset(cellInfo,select = 'celltype'))
AUCmatrix <- t(AUCmatrix)
BINmatrix <- t(BINmatrix)
#Select some of the regulons of interest
my.regulons <- intersect(rownames(AUCmatrix),rownames(BINmatrix))
myAUCmatrix <- t(AUCmatrix[rownames(AUCmatrix)%in%my.regulons,])
myBINmatrix <- t(BINmatrix[rownames(BINmatrix)%in%my.regulons,])
#Use regulon original AUC value to draw heat map
pdf("scenic_seurat/AUC_heatmap.pdf",width=10,height=20)
pheatmap(myAUCmatrix, show_colnames=F, annotation_col=celltype)
dev.off()
#Use regulon binary AUC value to draw heat map
pdf("scenic_seurat/BIN_heatmap.pdf",width=10,height=20)
pheatmap(myBINmatrix, show_colnames=F, annotation_col=celltype,color = colorRampPalette(colors = c("white","black"))(100))
dev.off()

 heat maps

If you think that there are too many regulons, and some of them are meaningless, you can use RSS to identify and select representative regulons of each cell type for analysis

regulonAUC <-loadInt(scenicOptions, "aucell_regulonAUC")
rss <- calcRSS(AUC=getAUC(regulonAUC), cellAnnotation=cellInfo[colnames(regulonAUC), "celltype"])
rssPlot <-plotRSS(rss) #Size rss rating, color Z-score
ggsave('int/rss.png', rssPlot$plot, width=5 ,height=20)

 RSS

If you don’t want to use the pbmc3k data, you can also use the mouse brain single cell data sampled by the author for SCENIC analysis

library(Seurat)
library(tidyverse)
library(patchwork)
library(SCENIC)
library(SCopeLoomR)
dir.create("./SCENIC_MouseBrain") 
setwd("SCENIC_MouseBrain")
loomPath <- system.file(package="SCENIC", "examples/mouseBrain_toy.loom") ##Example data provided by the author
loom <- open_loom(loomPath)
exprMat <- get_dgem(loom)
cellInfo <- get_cell_annotation(loom)
close_loom(loom)
dim(exprMat)
cellInfo <- data.frame(cellInfo)
cbind(table(cellInfo$CellType))
##Initialize SCENIC settings
org <- "mgi" #mgi stands for mouse, hgnc stands for human, dmel stands for flies
dbDir <- "/home/wucheng/SCENIC/cisTarget_databases" # RcisTarget databases location
myDatasetTitle <- "SCENIC example on Mouse brain" # choose a name for your analysis
data(defaultDbNames)
dbs <- defaultDbNames[[org]]
scenicOptions <- initializeScenic(org=org, dbDir=dbDir, dbs=dbs, datasetTitle=myDatasetTitle, nCores=10) 
scenicOptions@inputDatasetInfo$cellInfo <- "int/cellInfo.Rds"
scenicOptions@inputDatasetInfo$colVars <- "int/colVars.Rds"
saveRDS(scenicOptions, file="int/scenicOptions.Rds") 
##Build Co-expression network
#1, infer potential transcription factor targets based on expression data, use GENIE3 or GRNBoost
#GENIE3 is very time-consuming and computationally intensive (it takes several hours or days on a data set of 3-5k units)
#GRNboost can provide results similar to GENIE3 in a very short time
#Gene filtering/selection, remove genes that are most likely to be noise
genesKept <- geneFiltering(exprMat, scenicOptions=scenicOptions,minCountsPerGene=3*.01*ncol(exprMat),minSamples=ncol(exprMat)*.01)
runCorrelation(exprMat_filtered, scenicOptions) #GENIE3, the input of GENIE3 is usually an expression matrix and a list of candidate regulators.
exprMat_filtered <- log2(exprMat_filtered+1) 
runGenie3(exprMat_filtered, scenicOptions, nParts = 10)
#Build and score GRN
#1. Get the co-expression module
#2. Get the regulator (using RcisTarget): TF motif analysis)
Identify cell status:
#3. Score the GRN (regulator) in the cell (using AUCell)
#4. Cluster cells based on GRN activity
exprMat_log <- log2(exprMat+1)
library(SCENIC)
scenicOptions <- readRDS("int/scenicOptions.Rds")
scenicOptions@settings$verbose <- TRUE
scenicOptions@settings$nCores <- 1
scenicOptions@settings$seed <- 123
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget")) 
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log)
saveRDS(scenicOptions, file="int/scenicOptions.Rds") 

heatmap

created at:07-14-2021
edited at: 07-14-2021: