python: Introduction to RNA velocity and installation of scVelo

created at 07-09-2021

1 Introduction

Measuring gene activity in individual cells requires destroying those cells to read their contents, which makes it challenging to study dynamic processes and to understand cell fate decisions. La Manno et al. (Nature, 2018) introduced the concept of RNA velocity. It exploits the fact that newly transcribed, unspliced precursor mRNA and mature, spliced mRNA can be distinguished in common single-cell RNA-seq protocols, the former being detectable by the presence of introns. This makes it possible to recover directed dynamic information.

This concept of measuring not only gene activity, but also its change in individual cells (RNA velocity), opens up new ways of studying cell differentiation. The originally proposed framework obtains velocity as the deviation of the observed ratio of spliced to unspliced mRNA from an inferred steady state. Velocity estimates become erroneous if the central assumptions are violated, namely a common splicing rate across genes and the observation of the full splicing dynamics with steady-state mRNA levels.

Bergen et al. (Nature Biotechnology, 2020) developed scVelo to address these limitations by solving the full transcriptional dynamics of splicing kinetics with a likelihood-based dynamical model. This generalizes RNA velocity to a wide range of systems, including transient cell states, which are common in development and in response to perturbations. In addition, scVelo infers gene-specific rates of transcription, splicing, and degradation, and recovers the latent time of the underlying cellular process. This latent time, based solely on transcriptional dynamics, represents the internal clock of the cell and approximates the real time that the cell experiences during differentiation. scVelo also identifies regimes of regulatory change, such as stages of cell fate commitment, and systematically detects putative driver genes within them.

1.1 RNA velocity model

RNA velocity supports directed trajectory inference by connecting measurements to the underlying mRNA splicing kinetics: transcriptional induction of a particular gene leads to an increase in (newly transcribed) unspliced precursor mRNA, whereas repression or absence of transcription leads to a decrease in unspliced mRNA. Therefore, by distinguishing unspliced from spliced mRNA, the change in mRNA abundance (RNA velocity) can be approximated. The combination of velocities across mRNAs can then be used to estimate the future state of an individual cell.
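The relationship described above is commonly written as a pair of rate equations. The block below is a sketch in the notation of Bergen et al. (2020). The symbols are not defined in this article and are assumed here: u and s are the unspliced and spliced abundances of one gene, α is the transcription rate, β the splicing rate, and γ the degradation rate. The last line is a first-order extrapolation of the future spliced state over a small, hypothetical time step Δt.

```latex
\begin{align*}
\frac{du(t)}{dt} &= \alpha(t) - \beta\,u(t), \\
\frac{ds(t)}{dt} &= \beta\,u(t) - \gamma\,s(t) \;=\; v(t) \quad \text{(RNA velocity)}, \\
s(t + \Delta t) &\approx s(t) + v(t)\,\Delta t .
\end{align*}
```

A positive v means the gene is being up-regulated in that cell (more unspliced mRNA than the degradation of spliced mRNA can balance); a negative v means it is being down-regulated.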

Three approaches currently exist for estimating RNA velocity:

  1. Steady-state/deterministic model (using steady-state residuals),
  2. Stochastic model (using second-order moments),
  3. Dynamical model (using a likelihood-based framework).

The steady-state/deterministic model, as used in velocyto, estimates velocities as follows: assuming that the transcriptional phases (induction and repression) last long enough to reach a steady-state equilibrium (active and inactive), velocity is quantified as the deviation of the observed ratio from its steady-state ratio. The equilibrium mRNA levels are approximated by a linear regression on the presumed steady states in the lower and upper quantiles. This simplification rests on two fundamental assumptions: a common splicing rate across genes, and steady-state mRNA levels being reflected in the data. Because these assumptions are often violated, especially when a population comprises multiple heterogeneous subpopulation dynamics, the model can produce errors in velocity estimates and inferred cell states.
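The steady-state idea can be illustrated for a single gene with a few lines of plain Python. This is a minimal sketch under stated assumptions, not the velocyto implementation: the function name `steady_state_velocity`, the `quantile` parameter, the ranking of cells by total abundance, and the regression through the origin are all choices made for illustration, and the splicing rate is assumed to be normalized to 1.

```python
def steady_state_velocity(unspliced, spliced, quantile=0.05):
    """Estimate a steady-state ratio and per-cell velocities for one gene.

    Assumption: cells in the lowest and highest abundance quantiles are
    close to the inactive and active steady states, respectively.
    """
    n = len(spliced)
    if n == 0 or n != len(unspliced):
        raise ValueError("unspliced and spliced must be non-empty and equally long")

    # Rank cells by total abundance and keep the extreme quantiles.
    order = sorted(range(n), key=lambda i: unspliced[i] + spliced[i])
    k = max(1, int(round(quantile * n)))
    extreme = sorted(set(order[:k] + order[-k:]))

    # Least-squares line through the origin: u ~ gamma * s.
    num = sum(unspliced[i] * spliced[i] for i in extreme)
    den = sum(spliced[i] ** 2 for i in extreme)
    gamma = num / den if den > 0 else 0.0

    # Velocity is the residual of each cell from the steady-state line.
    velocity = [u - gamma * s for u, s in zip(unspliced, spliced)]
    return gamma, velocity
```

Cells lying above the fitted line (positive residual) are interpreted as being in induction, cells below it as being in repression.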

The stochastic model aims to better capture the steady states. By treating transcription, splicing, and degradation as probabilistic events, the resulting Markov process is approximated by moment equations. By including second-order moments, it exploits not only the balance of unspliced to spliced mRNA levels, but also their covariation. It has been demonstrated on the endocrine pancreas that stochasticity adds valuable information and generally yields higher consistency than the deterministic model, while remaining just as efficient in computation time.
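To make "second-order moments" concrete, the following is a sketch of the moment equations for a linear splicing and degradation process. It is derived here under the assumption that splicing occurs at rate βu and degradation at rate γs as independent random events; the symbols β, γ, u, and s are assumptions not defined in this article, and the exact formulation used by scVelo may differ in detail.

```latex
\begin{align*}
\frac{d\langle s\rangle}{dt} &= \beta\langle u\rangle - \gamma\langle s\rangle, \\
\frac{d\langle s^2\rangle}{dt} &= \beta\langle u\rangle + 2\beta\langle us\rangle
                                  - 2\gamma\langle s^2\rangle + \gamma\langle s\rangle .
\end{align*}
% At steady state both derivatives vanish, giving two linear conditions on \gamma/\beta:
\begin{align*}
\beta\,\langle u\rangle &= \gamma\,\langle s\rangle, \\
\beta\,\bigl(\langle u\rangle + 2\langle us\rangle\bigr) &= \gamma\,\bigl(2\langle s^2\rangle - \langle s\rangle\bigr).
\end{align*}
```

The first condition is the one used by the deterministic model; the second adds the covariation of unspliced and spliced counts as an additional constraint on the same ratio.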

The dynamical model (the most powerful and the most computationally expensive) solves the full dynamics of splicing kinetics for each gene. It thereby adapts RNA velocity to widely varying settings, such as non-stationary populations, because it does not rely on a common splicing rate or on steady states being sampled.

Figure: Splicing dynamics

The splicing kinetics are solved in a likelihood-based expectation-maximization framework, which iteratively estimates the reaction rate parameters and the latent cell-specific variables, namely the transcriptional state k and the cell-internal latent time t.

In this way, it aims to learn the unspliced/spliced phase trajectory. Four transcriptional states are modeled to account for all possible configurations of gene activity: two dynamic transient states (induction and repression) and two steady states (active and inactive) that may be reached after each dynamic transition.

In the expectation step, for a given model estimate of the unspliced/spliced phase trajectory, a latent time is assigned to each observed mRNA value by minimizing its distance to the phase trajectory. The transcriptional states are then assigned by associating a likelihood with the respective segments of the phase trajectory (induction, repression, active and inactive steady state). In the maximization step, the overall likelihood is then optimized by updating the reaction rate parameters.
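The time-assignment part of the expectation step can be sketched as follows. This is an illustration only, not scVelo's implementation: it covers a single gene in the induction phase with fixed, known rates, uses a simple grid search, and requires beta != gamma. The function names, the grid parameters `t_max` and `n_grid`, and the rate symbols alpha, beta, gamma are assumptions introduced here.

```python
import math


def phase_trajectory(t, alpha, beta, gamma, u0=0.0, s0=0.0):
    """Analytic solution of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s.

    Requires beta != gamma.
    """
    eb, eg = math.exp(-beta * t), math.exp(-gamma * t)
    u = u0 * eb + alpha / beta * (1.0 - eb)
    s = (s0 * eg
         + alpha / gamma * (1.0 - eg)
         + (alpha - beta * u0) / (gamma - beta) * (eg - eb))
    return u, s


def assign_latent_time(u_obs, s_obs, alpha, beta, gamma, t_max=20.0, n_grid=2001):
    """Return the grid time whose trajectory point is closest to (u_obs, s_obs)."""
    best_t, best_dist = 0.0, float("inf")
    for i in range(n_grid):
        t = t_max * i / (n_grid - 1)
        u, s = phase_trajectory(t, alpha, beta, gamma)
        dist = (u - u_obs) ** 2 + (s - s_obs) ** 2
        if dist < best_dist:
            best_t, best_dist = t, dist
    return best_t
```

In a full expectation-maximization loop, this assignment would alternate with a maximization step that updates alpha, beta, and gamma to increase the likelihood of all observations given their assigned times and states.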

This model yields more consistent velocity estimates and better identification of transcriptional states. It can also systematically identify dynamics-driving genes in a likelihood-based way, thereby finding the key drivers that govern cell fate transitions. In addition, the dynamical model infers a universal cell-internal latent time shared across genes, which makes it possible to relate genes to one another and to identify regimes of transcriptional change.

For the best results and the additional insights described above, we recommend the dynamical model. If running time matters, the stochastic model is advised, because it approximates the dynamical model very efficiently and takes a few minutes on 30k cells. The dynamical model can take up to an hour.
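In scVelo, the choice between these models is made through the `mode` argument of `scv.tl.velocity`. The sketch below wraps a typical call sequence in a helper; the helper name `estimate_velocity` and the preprocessing parameter values are illustrative assumptions rather than recommendations from this article, and scVelo is imported lazily so that the mode check works without it.

```python
VALID_MODES = ("deterministic", "stochastic", "dynamical")


def estimate_velocity(adata, mode="stochastic"):
    """Run a basic scVelo velocity workflow on an AnnData object.

    `adata` is expected to contain spliced and unspliced count layers.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    import scvelo as scv  # imported here so the check above needs no install

    # Preprocessing: gene filtering, normalization, and neighborhood moments.
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
    scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

    # The dynamical model needs the full splicing kinetics to be fitted first.
    if mode == "dynamical":
        scv.tl.recover_dynamics(adata)

    scv.tl.velocity(adata, mode=mode)
    scv.tl.velocity_graph(adata)
    return adata
```

A call such as `estimate_velocity(adata, mode="stochastic")` would suit a quick first pass, with `mode="dynamical"` reserved for the final analysis when the longer running time is acceptable.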

To learn more about the underlying principles, see Bergen et al. (2020), which describes these methods in detail.

2 Installation

scVelo requires Python 3.6 or higher. It is recommended to use Miniconda.

2.1 Install scVelo from PyPI

Install scVelo from PyPI using the following command:

pip install -U scvelo
  • -U is short for --upgrade. If you get a Permission denied error, use pip install -U scvelo --user instead.

2.2 Development version

To use the latest development version, install from GitHub using the following command:

pip install git+

Alternatively, clone the repository and install it in editable mode:


git clone
pip install -e scvelo
  • -e is short for --editable. It links the package to the original clone location, so that pulled changes are also reflected in the environment.

2.3 Dependencies

  • anndata: annotated data object.
  • scanpy: toolkit for single-cell analysis.
  • numpy, scipy, pandas, scikit-learn, matplotlib.

Some parts of scVelo (directed PAGA and Louvain modularity) require the following additional packages (optional):

pip install python-igraph louvain

Fast neighbor search via hnswlib additionally requires (optional):

pip install pybind11 hnswlib
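After installation, a short script can confirm that the interpreter and packages are in place. This is a minimal sketch using only the standard library; the function names and the two module lists are assumptions made for illustration (note that scikit-learn is imported as `sklearn` and python-igraph as `igraph`).

```python
import importlib.util
import sys

# Import names of the packages listed above (assumed mapping from pip names).
REQUIRED = ["scvelo", "anndata", "scanpy", "numpy", "scipy",
            "pandas", "sklearn", "matplotlib"]
OPTIONAL = ["igraph", "louvain", "pybind11", "hnswlib"]


def python_is_supported(min_version=(3, 6)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= tuple(min_version)


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python >= 3.6:", python_is_supported())
    print("Missing required:", missing_modules(REQUIRED) or "none")
    print("Missing optional:", missing_modules(OPTIONAL) or "none")
```

An empty "Missing required" list indicates that scVelo and its core dependencies can be imported; missing optional modules only affect the features named above.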

2.4 Jupyter Notebook

To run the tutorial in local Jupyter, please install:

conda install notebook

Then run jupyter notebook in the terminal. If you get the error Not a directory: 'xdg-settings', use jupyter notebook --no-browser instead and open the URL manually.

edited at: 07-10-2021