ChIP-seq data analysis example (for long chromosomes, the index cannot be established and the bw file generation fails)

created at 12-29-2021 views: 9

Written in front: This process takes the rye genome with long chromosomes as an example, each chromosome reaches about 1Gb.

Step1: Perform quality control on the off-machine data and remove the connectors

fastp -i 20P13_CENH3_ChIp_R1.fq.gz -I 20P13_CENH3_ChIp_R2.fq.gz -o 20P13_CENH3_ChIp_1.clean.fq.gz -O 20P13_CENH3_ChIp_2.clean.fq.gz -z 4 -q 20 -u 30 -n 10 -h 20P13_CENH3_ChIp.clean.html

Step2: Use bwa to compare the sequencing data to the reference genome

Note: Since the chromosome ID of the rye genome is GWHASIY00000001-8, in order to facilitate the subsequent sorting of the bam file, the chromosome ID can be converted to Chr1-Chr8.

Change the ID before rye genome chromosome:

change the ID

for i in {1..8}; do sed -i "s/GWHASIY0000000$i/Chr$i/g" GWHASIY00000000.genome.fa;done


Create an index file for the reference genome

bwa index -a bwtsw  GWHASIY00000000.genome.fa

(Suitable for indexing of reference genomes larger than 2G)

Use bwa mem for comparison

bwa mem -M -t 20 GWHASIY00000000.genome.fa 20P13_CENH3_ChIp_1.clean.fq.gz 20P13_CENH3_ChIp_2.clean.fq.gz >20P13_CENH3_ChIp.sam 2> 20P13_CENH3_ChIp_bwa.log

Tips: -t is followed by the number of threads, set the number of threads according to the server situation, GWHASIY00000000.genome.fa is the genome file, and the index file for the genome file needs to be placed in the genome directory.

Step3: Sort the sam file

Note: Under normal circumstances, you need to convert sam to bam before sorting, but samtools has a problem that the sorting of long chromosome genomes cannot be sorted normally. Below is the bam file sorted by samtools. Check the third column (chromosome ID column) of the bam file sorted by samtools and found that it is not sorted correctly;

samtools view PI428373_CENH3_ChIp.sorted.bam |cut -f 3 | uniq

If the sorting is normal, the output should be Chr1-8, but after checking with the above script, it is found that the sorting is not correct, which is the reason why the bam file cannot be indexed.

So we need to use the perl script to sort the sam files (sort first and then transfer to bam)

perl 20P13_CENH3_ChIp.sam

use warnings;
use strict;
open IN,"$ARGV[0]";
my ($name) = $ARGV[0] =~ /(.*).sam/;
open OUT,">$name.perl.sorted.sam";
my @data = ();
while (<IN>) {
    my @a = split/\s+/;
    next unless $a[2] =~ /^Chr/;
    push @data,[$a[2],$a[3],$_];
close IN;
foreach  (sort{$a->[0] cmp $b->[0] || $a->[1] <=> $b->[1]}@data) {
    print OUT "$_->[2]";
close OUT;

Generate 20P13_CENH3_ChIp.perl.sorted.sam file

Step4: Convert the sorted sam file into a bam file

samtools view -Sb -o 20P13_CENH3_ChIp.perl.sort.bam 20P13_CENH3_ChIp.perl.sort.sam

Step5: Remove PCR repeat and build index

samtools rmdup 20P13_CENH3_ChIp.perl.sort.bam 20P13_CENH3_ChIp.perl.sort_rmdup.bam

samtools index -c 20P13_CENH3_ChIp.perl.sort_rmdup.bam

Step6: Convert bam file to bw (bigwig format) file

bamCoverage --normalizeUsing RPGC --effectiveGenomeSize 7730188902 --binSize 5 --bam  20P13_CENH3_ChIp.perl.sort_rmdup.bam -o

Among them, effectiveGenomeSize is followed by the genome size, which can be calculated with the following python script.

python GWHASIY00000000.genome.fa

import sys
fa_file = sys.argv[1]
with open(fa_file,'r') as f:
    for line in f:
        line = line.strip()
        line = line.upper()
        if not line.startswith(">"):
            baseA = line.count("A")
            baseT = line.count("T")
            baseC = line.count("C")
            baseG = line.count("G")
            aList.extend([baseA, baseT, baseC, baseG])
#            print(aList)
print("effective_genome_size =", sum(aList))  

Step7: IGV visualization (Windows local visualization)

Put the genome file, index file, and bw file in the same folder, open IGV, Genomes-Load Genomes from Files to import the genome file, and File-Load form file to import the bw file for visualization.

Tip: IGV download link Downloads | Integrative Genomics Viewer. If java is not installed on the computer (IGV must rely on JAVA), select Java include to download and install according to the computer system.

Note: The previous Sam sorting to bam files can be completed in one step through the pipeline, but it is recommended to complete it step by step if the server performance is not strong, to avoid errors in the intermediate files. All of the above software can be installed with conda.

created at:12-29-2021
edited at: 12-29-2021: