Principal component analysis (PCA) in R

Created at: 11-15-2021

Code and data download link

https://github.com/plemey/SARSCoV2origins

In today's post, we reproduce Figure 5b from the paper.

[Figure: Figure 5b as it appears in the paper]

In this figure, the relative synonymous codon usage (RSCU) values of the codons are used as input for principal component analysis, and the result is shown as a scatter plot.
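For readers who have not met RSCU before: a codon's RSCU is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The snippet below is only a minimal sketch of that definition. The counts are made up for illustration and are not taken from the paper's data.

```r
# Hypothetical codon counts for one amino acid (alanine) in a single genome
ala_counts <- c(GCT = 30, GCC = 10, GCA = 40, GCG = 20)

# Expected count under equal usage is the mean count across synonymous codons
rscu <- ala_counts / mean(ala_counts)

print(rscu)
# GCT GCC GCA GCG
# 1.2 0.4 1.6 0.8
```

Values above 1 mean the codon is used more often than expected, and values below 1 mean it is used less often. The RSCU values for one amino acid always sum to its number of synonymous codons (4 here).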

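Load the required R packages

library(tidyverse)  # provides read_csv(); also attaches ggplot2
library(ggplot2)    # redundant after tidyverse, kept from the original code
library(ggrepel)    # provides geom_label_repel() for non-overlapping labels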

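Read the data

# One row per codon, one column per genome
all_rscu <- read_csv('all_rscu_codonBiasReanalysis.csv')

# Keep the RSCU columns (2 to 26) and use the codons as row names
all_df <- as.data.frame(all_rscu[,2:26])
row.names(all_df) <- all_rscu$codon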

Principal component analysis

# prcomp() treats rows as observations, so transpose to get one row per genome
df_pca <- prcomp(t(all_df))
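It can be useful to check how much variance each component captures before plotting PC1 against PC2. The following is a minimal sketch on a random toy matrix, so that it runs without the CSV file. The names toy, toy_pca, and var_explained are illustrative only. Note that prcomp() centers the columns by default and does not scale them unless scale. = TRUE is set.

```r
set.seed(1)

# Toy stand-in for t(all_df): 5 genomes (rows) by 8 codons (columns)
toy <- matrix(runif(5 * 8), nrow = 5,
              dimnames = list(paste0("sp", 1:5), paste0("codon", 1:8)))

toy_pca <- prcomp(toy)

# Proportion of variance explained by each principal component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(100 * var_explained, 1))

# The same information is reported in the "Proportion of Variance" row here
print(summary(toy_pca))
```

With the real data, the same two lines applied to df_pca give the percentages that are often written into the axis labels of a PCA plot.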

Add grouping information to the results of principal component analysis

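# Accession numbers of the virus genomes; every other column is a vertebrate
subset_viruses <- c('MG772934','MG772933','GU190215','KP886808','KP886809','MN908947')

# df_pca$x holds the PC scores, one row per genome
df_out <- as.data.frame(df_pca$x)
df_out$group <- ifelse(row.names(df_out) %in% subset_viruses,'virus','vertebrate')
df_out$species <- row.names(df_out)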
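Because the grouping relies on exact string matches between the accession numbers and the row names, a quick sanity check can catch typos. The sketch below uses a small made-up data frame (toy_scores, with a hypothetical row name 'Homo_sapiens') rather than the real df_out.

```r
subset_viruses <- c('MG772934', 'MN908947')

# Toy PC scores; the row names play the role of the genome identifiers
toy_scores <- data.frame(PC1 = c(0.5, -1.2, 2.0),
                         row.names = c('MG772934', 'Homo_sapiens', 'MN908947'))

toy_scores$group <- ifelse(row.names(toy_scores) %in% subset_viruses,
                           'virus', 'vertebrate')

# Count the members of each group
print(table(toy_scores$group))

# Accessions that were listed but not found in the data (should be empty)
missing_ids <- setdiff(subset_viruses, row.names(toy_scores))
print(missing_ids)
```

If missing_ids is not empty, the corresponding genomes would silently fall into the 'vertebrate' group.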

Plotting code provided with the paper

# Colors are assigned alphabetically: vertebrate = dodgerblue4, virus = firebrick
ggplot(df_out,aes(x=PC1,y=PC2,color=group)) +
  geom_point(show.legend = FALSE) +
  scale_color_manual(values = c('dodgerblue4','firebrick')) +
  geom_label_repel(aes(label = species),show.legend = FALSE) +
  theme_bw() + theme(aspect.ratio = 1, panel.grid = element_blank())

[Figure: PCA scatter plot produced by the code above]

However, some details of this plot still differ from the final figure in the paper. We modify the code to reproduce the original figure as closely as possible.

Add arrows

# Points stay black; only the labels are colored by group
ggplot(df_out,aes(x=PC1,y=PC2)) +
  geom_point(show.legend = FALSE)+ 
  geom_label_repel(aes(label = species,
                       color=group),
                   show.legend = FALSE) +
  scale_color_manual(values = c('black','#6a9a97')) +
  theme_bw() + 
  theme(aspect.ratio = 1, 
        panel.grid = element_blank())+
  # Two dashed arrows starting at (1.5, 0); the coordinates are chosen by eye
  geom_segment(aes(x=1.5,y=0,xend=0,yend=1),
               arrow = arrow(type="closed"),
               linetype="dashed",
               color="grey")+
  geom_segment(aes(x=1.5,y=0,xend=3,yend=-1),
               arrow = arrow(type="closed"),
               linetype="dashed",
               color="grey")

[Figure: PCA plot with the two arrows added]
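One detail worth knowing: constants placed inside aes() are recycled for every row of the plot data, so each arrow above is drawn once per point, on top of itself. A possible alternative, sketched here under the assumption that this overplotting is not wanted, is to give geom_segment() its own small data frame. The names arrow_df and toy are illustrative, and the toy scores are made up.

```r
library(ggplot2)

# One row per arrow; coordinates copied from the code above
arrow_df <- data.frame(x = c(1.5, 1.5), y = c(0, 0),
                       xend = c(0, 3), yend = c(1, -1))

# Toy scores standing in for df_out (values are made up)
toy <- data.frame(PC1 = c(-2, 0.5, 2.5), PC2 = c(-0.5, 0.8, -0.9))

p <- ggplot(toy, aes(x = PC1, y = PC2)) +
  geom_point() +
  # inherit.aes = FALSE keeps the PC1/PC2 mapping out of this layer
  geom_segment(data = arrow_df,
               aes(x = x, y = y, xend = xend, yend = yend),
               inherit.aes = FALSE,
               arrow = arrow(type = "closed"),
               linetype = "dashed", colour = "grey")

print(p)
```

Each arrow is now drawn exactly once, which also keeps the dashed pattern crisp in vector output.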

Add annotation text

# Same plot as before, plus four text annotations
ggplot(df_out,aes(x=PC1,y=PC2)) +
  geom_point(show.legend = FALSE)+ 
  geom_label_repel(aes(label = species,
                       color=group),
                   show.legend = FALSE) +
  scale_color_manual(values = c('black','#6a9a97')) +
  theme_bw() + 
  theme(aspect.ratio = 1, 
        panel.grid = element_blank())+
  geom_segment(aes(x=1.5,y=0,xend=0,yend=1),
               arrow = arrow(type="closed"),
               linetype="dashed",
               color="grey")+
  geom_segment(aes(x=1.5,y=0,xend=3,yend=-1),
               arrow = arrow(type="closed"),
               linetype="dashed",
               color="grey")+
  # Group names, colored to match the labels
  annotate(geom = "text",
           x=-2,y=-1,
           label="Eukaryotes",
           fontface="bold")+
  annotate(geom = "text",
           x=1.6,y=0.5,
           label="Coronaviruses",
           fontface="bold",
           color='#6a9a97')+
  # Captions placed just beyond the two arrowheads
  annotate(geom = "text",
           x=0,y=1.2,
           label="High GC")+
  annotate(geom = "text",
           x=3,y=-1.2,
           label="Low GC")

[Figure: final plot with arrows and annotation text]

The final figure in the paper also changes the text label of one of the points. That change was made with other software.
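A label can also be changed in R before plotting, which avoids the extra editing step. The sketch below is an assumption-laden illustration: which point the paper relabeled and what the new text was are not stated here, so the accession 'MN908947' and the replacement 'SARS-CoV-2' are chosen only as an example, and toy stands in for df_out.

```r
# Toy stand-in for df_out$species ('Homo_sapiens' is a made-up entry)
toy <- data.frame(species = c('MN908947', 'Homo_sapiens', 'MG772933'),
                  stringsAsFactors = FALSE)

# Hypothetical relabeling: replace one accession with a readable name
toy$label <- ifelse(toy$species == 'MN908947', 'SARS-CoV-2', toy$species)

print(toy)
```

In the plotting code, geom_label_repel(aes(label = label, color = group)) would then pick up the edited text while the grouping still uses the original identifiers.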

Edited at: 11-15-2021