Motif visualization: from PFM matrix to sequence logo

created at 07-28-2021

How is the Sequence Logo drawn?

Calculate PFM

Using the sequences that contain the motif, count the number of occurrences of the four nucleotides at each site. These per-site counts form the Position Frequency Matrix (PFM).

[Figure: Calculate PFM]

Take the first site as an example: A appears 2 times and T appears 12 times. Therefore, in the first column of the matrix, the A count is 2, the T count is 12, and the C and G counts are 0. The other sites are counted in the same way.

[Figure: example]
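The counting step can be sketched in base R as follows. The sequences here are hypothetical, not the data behind the figures: two start with A and twelve start with T, which mirrors the first site described above, and the other two sites are made up.

```r
# Hypothetical motif instances of equal length:
# 12 start with T and 2 start with A, as in the first site of the text.
seqs <- c(rep("TAC", 12), rep("AAC", 2))

# One row per sequence, one column per site
chars <- do.call(rbind, strsplit(seqs, ""))

# Count A/C/G/T at every site; factor() keeps bases that never occur (count 0)
bases <- c("A", "C", "G", "T")
pfm <- vapply(
  seq_len(ncol(chars)),
  function(j) as.integer(table(factor(chars[, j], levels = bases))),
  integer(4)
)

# Rows are bases, columns are sites
rownames(pfm) <- bases
print(pfm)
```

Every column of `pfm` sums to the number of sequences, which is a quick sanity check on the counts.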

Calculate the frequency of the four nucleotides at each site to obtain the position probability matrix (PPM). The file is generally a matrix with N rows and 4 columns, which describes the frequency of each base at each position. Taking the first row as an example, the frequency of A is 2/(2+12)=0.142857 and the frequency of T is 12/(2+12)=0.857143. Assuming that the sites of the PPM are independent of each other, the probability of a sequence is the product of the probabilities of its base at each site. For example, for the motif sequence TACTGTATATAHAHMCAG the probability is 0.86 × 0.86 × 1 × 1 × 1 × 0.93 × ... and so on.

[Figure: Calculate the frequency of four nucleotides]
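As a minimal sketch of the two calculations above (counts to frequencies, then the probability of a sequence), assuming a small count matrix with bases as rows and sites as columns. Only the first site uses the counts from the text. The other two sites and the helper name `seq_prob` are made up for illustration.

```r
# Count matrix (PFM): rows are bases, columns are sites.
# Site 1 uses the counts from the text (A = 2, T = 12); sites 2 and 3 are made up.
pfm <- matrix(c(2, 0, 0, 12,
                12, 0, 0, 2,
                0, 14, 0, 0),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Divide each column by its total to turn counts into frequencies (PPM)
ppm <- sweep(pfm, 2, colSums(pfm), "/")

# Probability of a sequence under the independence assumption:
# multiply the frequency of the observed base at each site.
seq_prob <- function(ppm, s) {
  b <- strsplit(s, "")[[1]]
  stopifnot(length(b) == ncol(ppm))
  prod(ppm[cbind(match(b, rownames(ppm)), seq_along(b))])
}

print(round(ppm, 6))
print(seq_prob(ppm, "TAC"))  # (12/14) * (12/14) * 1
```

Note that this sketch stores bases as rows, which is the transpose of the N rows by 4 columns file layout described above.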

The PPM file can generally be downloaded directly from the results page of the web version: click Submit/Download and select Probability Matrix as the data format.

[Figure: Probability Matrix]

Get PWM

Next, we need to get the position weight matrix (PWM). Use the formula below, where k represents the nucleotide (A/C/G/T), j represents the site, b represents the background base frequency, and M represents the base frequency in the PPM.

$$\mathrm{PWM}_{k,j} = \log_2\left(\frac{M_{k,j}}{b_k}\right)$$

The background base frequencies are reported in the last part of the web results page, as shown in the figure below.

Therefore, the PPM can be converted into a PWM according to the formula, as shown in the figure below. Take the first row and first column (the first site, base A) as an example: log2(0.142857/0.29)=-1.02148117. With the PWM, the score of a sequence is the sum of the PWM values of its base at each position, and this score is used to judge whether the sequence is a random sequence or a functional site. For example, the score of the sequence TACTGTATATAHAHMCAG above is 1.56+1.56+2.25+1.79+2.25+1.68+... If the score is greater than 0, the sequence is considered a potential functional site and the motif is predicted; if the score is less than 0, the sequence is considered a random sequence; if the score equals 0, the two cases are considered equally likely (50% each).

[Figure: example]
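The conversion and the scoring rule can be sketched in base R as follows. The first site of `ppm` and the background value 0.29 for A come from the text. The remaining frequencies, the other three background values, and the helper name `seq_score` are assumptions made only for illustration.

```r
# PPM with bases as rows and sites as columns.
# Site 1 is the one from the text; sites 2 and 3 are made-up values.
ppm <- matrix(c(0.142857, 0, 0, 0.857143,
                0.7, 0.1, 0.1, 0.1,
                0.05, 0.85, 0.05, 0.05),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Background base frequencies. 0.29 for A is the value used in the text;
# the other three are assumptions chosen so that the four sum to 1.
bg <- c(A = 0.29, C = 0.21, G = 0.21, T = 0.29)

# PWM entry = log2(frequency / background). bg is recycled down each column,
# so row k is divided by bg[k]. A frequency of 0 gives -Inf.
pwm <- log2(ppm / bg)

# Score of a sequence: sum of the PWM entries of its base at each site
seq_score <- function(pwm, s) {
  b <- strsplit(s, "")[[1]]
  stopifnot(length(b) == ncol(pwm))
  sum(pwm[cbind(match(b, rownames(pwm)), seq_along(b))])
}

print(pwm["A", 1])            # log2(0.142857 / 0.29)
print(seq_score(pwm, "TAC"))  # greater than 0: resembles the motif
print(seq_score(pwm, "GGG"))  # -Inf, because G never occurs at site 1
```

Because a zero frequency produces `-Inf`, any sequence with an unobserved base at some site gets a score of `-Inf` in this sketch. Some tools add a small pseudocount to the counts to avoid this, which is not covered by the text above.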

Plot the sequence logo

The R package ggseqlogo can draw the motif as a sequence logo from the PPM.

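library(ggseqlogo)
library(ggplot2)
# Read the PPM (N rows x 4 columns) and transpose it: rows become bases, columns become sites
Motif <- t(read.table("ppm.txt"))
# Assumes the four columns of ppm.txt are ordered A, C, G, T
rownames(Motif) <- c("A","C","G","T")
# list_col_schemes(v = T) lists the available color schemes
# Logo with letter heights as probabilities
p1 <- ggseqlogo(Motif,method="prob",col_scheme="base_pairing")
# Logo with letter heights as information content in bits
p2 <- ggseqlogo(Motif,method="bits",col_scheme="nucleotide")
# Stack the two logos in one figure
gridExtra::grid.arrange(p1,p2)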

[Figure: example sequence logo]

edited at: 07-28-2021