Written in front: This process takes the rye genome with long chromosomes as an example, each chromosome reaches about 1Gb.
Step1: Perform quality control on the off-machine data and remove the connectors
fastp -i 20P13_CENH3_ChIp_R1.fq.gz -I 20P13_CENH3_ChIp_R2.fq.gz -o 20P13_CENH3_ChIp_1.clean.fq.gz -O 20P13_CENH3_ChIp_2.clean.fq.gz -z 4 -q 20 -u 30 -n 10 -h 20P13_CENH3_ChIp.clean.html
Step2: Use bwa
to compare the sequencing data to the reference genome
Note: Since the chromosome ID
of the rye genome is GWHASIY00000001-8
, in order to facilitate the subsequent sorting of the bam
file, the chromosome ID
can be converted to Chr1-Chr8
.
Change the ID before rye genome chromosome:
for i in {1..8}; do sed -i "s/GWHASIY0000000$i/Chr$i/g" GWHASIY00000000.genome.fa;done
Create an index file for the reference genome
bwa index -a bwtsw GWHASIY00000000.genome.fa
(Suitable for indexing of reference genomes larger than 2G)
Use bwa mem for comparison
bwa mem -M -t 20 GWHASIY00000000.genome.fa 20P13_CENH3_ChIp_1.clean.fq.gz 20P13_CENH3_ChIp_2.clean.fq.gz >20P13_CENH3_ChIp.sam 2> 20P13_CENH3_ChIp_bwa.log
Tips: -t
is followed by the number of threads, set the number of threads according to the server situation, GWHASIY00000000.genome.fa
is the genome file, and the index file for the genome file needs to be placed in the genome directory.
Step3: Sort the sam file
Note: Under normal circumstances, you need to convert sam
to bam
before sorting, but samtools
has a problem that the sorting of long chromosome genomes cannot be sorted normally. Below is the bam file sorted by samtools
. Check the third column (chromosome ID column) of the bam
file sorted by samtools
and found that it is not sorted correctly;
samtools view PI428373_CENH3_ChIp.sorted.bam |cut -f 3 | uniq
If the sorting is normal, the output should be Chr1-8
, but after checking with the above script, it is found that the sorting is not correct, which is the reason why the bam
file cannot be indexed.
So we need to use the perl
script to sort the sam
files (sort first and then transfer to bam
)
perl sort_sam.pl 20P13_CENH3_ChIp.sam
$cat sort_sam.pl
#!/usr/bin/perl
use warnings;
use strict;
open IN,"$ARGV[0]";
my ($name) = $ARGV[0] =~ /(.*).sam/;
open OUT,">$name.perl.sorted.sam";
my @data = ();
while (<IN>) {
my @a = split/\s+/;
next unless $a[2] =~ /^Chr/;
push @data,[$a[2],$a[3],$_];
}
close IN;
foreach (sort{$a->[0] cmp $b->[0] || $a->[1] <=> $b->[1]}@data) {
print OUT "$_->[2]";
}
close OUT;
Generate 20P13_CENH3_ChIp.perl.sorted.sam
file
Step4: Convert the sorted sam
file into a bam
file
samtools view -Sb -o 20P13_CENH3_ChIp.perl.sort.bam 20P13_CENH3_ChIp.perl.sort.sam
Step5: Remove PCR repeat and build index
samtools rmdup 20P13_CENH3_ChIp.perl.sort.bam 20P13_CENH3_ChIp.perl.sort_rmdup.bam
samtools index -c 20P13_CENH3_ChIp.perl.sort_rmdup.bam
Step6: Convert bam file to bw (bigwig format) file
bamCoverage --normalizeUsing RPGC --effectiveGenomeSize 7730188902 --binSize 5 --bam 20P13_CENH3_ChIp.perl.sort_rmdup.bam -o 20P13_CENH3_ChIp.perl.sort_rmdup.bw
Among them, effectiveGenomeSize
is followed by the genome size, which can be calculated with the following python script.
python GenomeSize.py GWHASIY00000000.genome.fa
$cat GenomeSize.py
#coding=utf-8
import sys
aList=[]
fa_file = sys.argv[1]
with open(fa_file,'r') as f:
for line in f:
line = line.strip()
line = line.upper()
if not line.startswith(">"):
baseA = line.count("A")
baseT = line.count("T")
baseC = line.count("C")
baseG = line.count("G")
aList.extend([baseA, baseT, baseC, baseG])
# print(aList)
print("effective_genome_size =", sum(aList))
Step7: IGV visualization (Windows local visualization)
Put the genome file, index file, and bw file in the same folder, open IGV
, Genomes-Load Genomes from Files
to import the genome file, and File-Load form file
to import the bw
file for visualization.
Tip: IGV download link Downloads | Integrative Genomics Viewer. If java is not installed on the computer (IGV must rely on JAVA), select Java include to download and install according to the computer system.
Note: The previous Sam
sorting to bam
files can be completed in one step through the pipeline, but it is recommended to complete it step by step if the server performance is not strong, to avoid errors in the intermediate files. All of the above software can be installed with conda
.