pybedtools is an encapsulation and extension of BEDTools software using Python. Why use pybedtools instead of BEDTools directly? Of course we want to use BEDTools in Python (😀)
Why pybedtools? The author also gave an example to find out the SNPs in the intergenic region, and then get the gene names within <5kb from the SNPs. If you use pybedtools:
import pybedtools
snps = pybedtools.example_bedtool('snps.bed.gz')
genes = pybedtools.example_bedtool('hg19.gff')
intergenic_snps = snps.subtract(genes)
nearby = genes.closest(intergenic_snps, d=True, stream=True)
for gene in nearby:
if int(gene[-1]) < 5000:
print(gene.name)
If you use a shell script, you may need these codes:
snps=snps.bed.gz
genes=hg19.gff
intergenic_snps=/tmp/intergenic_snps
snp_fields=`zcat $snps | awk '(NR == 2){print NF; exit;}'`
gene_fields=9
distance_field=$(($gene_fields + $snp_fields + 1))
intersectBed -a $snps -b $genes -v > $intergenic_snps
closestBed -a $genes -b $intergenic_snps -d \
awk '($'$distance_field' < 5000){print $9;}' \
perl -ne 'm/[ID|Name|gene_id]=(.*?);/; print "$1\n"'
rm $intergenic_snps
By comparison, if you use pybedtools, you only need to be familiar with Python and pybedtools. To achieve the same function with shell, you may need Perl
, bash
, awk
and bedtools
.
For me, it is mainly because of using pybedtools, which allows me to use Python code throughout the process to get the same results as bedtools.
Via conda
$ conda install --channel conda-forge --channel bioconda pybedtools
Install via pip
$ pip install pybedtools
To use pybedtools, the first step is to import the pybedtools package and create the BedTool
object. The BedTool
object encapsulates all the available program functions of the BedTools
program, so that they can be better used in Python. Therefore, in most cases, we use pybedtools, that is, to manipulate the BedTool
object. The creation of the BedTool object can use the interval file (BED, GFF, GTF, VCF, SAM, or BAM format)
. Create a BedTool
object from a file:
# 'test.bed' is the file path.
test = pybedtools.BedTool('test.bed')
In addition, it can also be created from a string:
s ='''
chrX 1 100
chrX 25 800
'''
## Controlled by the parameter from_string
a = pybedtools.BedTool(s, from_string=True)
pybedtools also provides some test files, which will be used as demonstrations below.
# list_example_files will list example files
>>> pybedtools.list_example_files()
['164.gtf',
...
'a.bed',
'a.bed.gz',
'b.bed',
'bedpe.bed',
'bedpe2.bed',
'c.gff',
'd.gff',
...
'venn.b.bed',
'venn.c.bed',
'x.bam',
'x.bed',
'y.bam']
Use sample files to create BedTool objects
## Just pass in the file name'a.bed'
a = pybedtools.example_bedtool('a.bed')
View the first few rows of data
>>> a.head()
chr1 1 100 feature1 0 +
chr1 100 200 feature2 0 +
chr1 150 500 feature3 0-
chr1 900 950 feature4 0 +
BEDTools is often used to take intersection, let's see how to operate in pybedtools
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
a_and_b = a.intersect(b)
>>> a.head()
chr1 1 100 feature1 0 +
chr1 100 200 feature2 0 +
chr1 150 500 feature3 0 -
chr1 900 950 feature4 0 +
>>> b.head()
chr1 155 200 feature5 0 -
chr1 800 901 feature6 0 +
>>> a_and_b.head()
chr1 155 200 feature2 0 +
chr1 155 200 feature3 0 -
chr1 900 901 feature4 0 +
If you want to check whether there is overlap, the same as intersectBed usage, add the parameter u:
a_with_b = a.intersect(b, u=True)
>>> a_with_b.head()
chr1 100 200 feature2 0 +
chr1 150 500 feature3 0 -
chr1 900 950 feature4 0 +
After running the intersect
method, what is returned is a new BedTool
example, which temporarily saves the content on a hard disk temporary address. Temporary address acquisition:
>>> a_and_b.fn
'/tmp/pybedtools.kfnxvuuz.tmp'
>>> a_with_b.fn
'/tmp/pybedtools.qre024y9.tmp'
The BedTool.saveas()
method can copy the temporary file pointed to by the BedTool instance to a new address. At the same time, you can add a trackline
to upload the UCSC Genome Browser
directly without opening the file again to add the trackline
.
c = a_with_b.saveas('intersection-of-a-and-b.bed', trackline='track name="a and b"')
print(c.fn)
# intersection-of-a-and-b.bed
The BedTool.saveas()
method returns a new BedTool
instance pointing to a new address.
And BedTool.moveto()
is used to directly move the file to a new address. If you don't need to add the trackline
and the file is very large, this method will be very fast.
d = a_with_b.moveto('another_location.bed')
print(d.fn)
# 'another_location.bed'
Since it has been moved, that is, the old file does not exist anymore, an error will be reported if you view its content again.
>>> a_with_b.head()
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_2075/544676037.py in <module>
.........
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pybedtools.4uqthcml.tmp'
When using BEDTools, we will specify the -a
and -b
parameters, but in the previous operation, we did not specify them. When this is because pybedtools gives the default settings, the instance of the method call is used as -a
, and the other instance passed in is used as -b
, that is, exons
corresponds to -a ("a.bed")
, snps
corresponds to -b("b.bed")
import pybedtools
exons = pybedtools.example_bedtool('a.bed')
snps = pybedtools.example_bedtool('b.bed')
exons_with_snps = exons.intersect(snps, u=True)
$ intersectBed -a a.bed -b b.bed -u > tmpfile
The above two are the same. But you can also specify a
, b
# these all have identical results
x1 = exons.intersect(snps)
x2 = exons.intersect(a=exons.fn, b=snps.fn)
x3 = exons.intersect(b=snps.fn)
x4 = exons.intersect(snps, a=exons.fn)
x1 == x2 == x3 == x4
# True
The chain call of the method is similar to the pipeline operation under the linux shell. For example: a
and b
check whether the intersection or not, then merge the coordinate intervals, and run separately.
x1 = a.intersect(b, u=True)
x2 = x1.merge()
Chaining calls can be done like this:
x3 = a.intersect(b, u=True).merge()
another example
x4 = a.intersect(b, u=True).saveas('a-with-b.bed').merge().saveas('a-with-b-merged.bed')
Operator overloading, an amazing method!
x5 = a.intersect(b, u=True)
x6 = a + b
x5 == x6
# True
x7 = a.intersect(b, v=True)
x8 = a - b
x7 == x8
# True
That is, the +
operator means intersectBed
uses the -u
parameter, and the -
operator means intersectBed
uses the -v
parameter.
It is still possible to continue chaining calls using operator overloading
x9 = (a + b).merge()
In pybedtools, an Interval
object represents a line of data in a BED
, GFF
, GTF
or VCF
file. Note that the above and the following are demonstrations in the Python3 version (there will be no one using Python2)
Or go ahead and create a BedTool
object as an example,
a = pybedtools.example_bedtool('a.bed')
feature = a[0]
features = a[1:3]
BedTool
supports slicing operations, getting a single-line element is an Interval
object
>>> type(feature)
pybedtools.cbedtools.Interval
Printing the Interval
object will return the row data it represents.
>>> print(feature)
chr1 1 100 feature1 0 +
All feature
(referring to Interval
), regardless of the file type from which BedTool
was created, have chrom
, start
, stop
, name
, score
, strand
properties. where start
, stop
are integers and everything else (including score
) are strings. However, we should often have only chrom
, start
, stop
data, let's see how pybedtools
handles the remaining missing attributes.
>>> feature2 = pybedtools.BedTool('chrX 500 1000', from_string=True)[0]
>>> print(feature2)
chrX 500 1000
It seems that print(feature2)
prints the raw line data.
>>> feature2.chrom
'chrX'
>>> feature2.start
500
>>> feature2.stop
1000
>>> feature2.name
'.'
>>>feature2.score
'.'
>>>feature2.strand
'.'
The missing default value is .
.
Interval
can be indexed by list
and dict
.
>>> feature2[:]
['chrX', '500', '1000']
>>> feature2[0]
'chrX'
>>> feature2["chrom"]
'chrX'
However, indexing in a dictionary has an advantage. For missing values, the default value can be automatically returned, and an error will be reported through an integer index.
>>> feature2["name"]
'.'
>>> feature2[3]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/tmp/ipykernel_2075/566460392.py in <module>
----> 1 feature2[3]
/opt/conda/lib/python3.9/site-packages/pybedtools/cbedtools.pyx in pybedtools.cbedtools.Interval.__getitem__()
IndexError: field index out of range
Interval
has an Interval.fields
property that splits the original line into a list
. When using integer indexing, this fields list
is indexed.
>>> f = pybedtools.BedTool('chr1 1 100 asdf 0 + a b c d', from_string=True)[0]
>>> f.fields
['chr1', '1', '100', 'asdf', '0', '+', 'a', 'b', 'c', 'd']
>>> len(f.fields)
10
A troublesome thing when using multiple formats, the coordinate system of BED files is different from GFF/GTF/VCF files.
GFF, GTF, and VCF files are 1-based (i.e. the first position on the chromosome is 1) features include stop positions For convenience, pybedtools makes these conventions:
Interval.start
is 0-based start position, no matter what file. Even GFF
, or other 1-based features
.
len()
to get the length of Interval
, always return Interval.stop - Interval.start
. This way, no matter what the file format, the length is guaranteed to be correct. Also simplifies the underlying code. We can consider all Intervals
to be the sameInterval.fields
are all strings, which are the division of the original line.
So for GFF feature
, Interval.fields[3]
and Interval[3]
are not the same as Interval.start
. That is Interval.fields[3] = Interval.start + 1
. Because Interval.fields[3]
is the original 1-based
data, and Interval.start
uses 0-basedHere are two examples to create a GFF Interval
gff = ["chr1",
"fake",
"mRNA",
"51", # <- start is 1 greater than start for the BED feature below
"300",
".",
"+",
".",
"ID=mRNA1;Parent=gene1;"]
gff = pybedtools.create_interval_from_list(gff)
Create a BED Interval with the same gff coordinates as before. But their coordinate systems are different.
bed = ["chr1",
"50",
"300",
"mRNA1",
".",
"+"]
bed = pybedtools.create_interval_from_list(bed)
Confirm the respective file formats,
>>> gff.file_type
'gff'
>>> bed.file_type
'bed'
the respective original representation
>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> print(bed)
chr1 50 300 mRNA1 . +
>>> bed.start == gff.start == 50
True
>>> bed.end == gff.end == 300
300
>>> len(bed) == len(gff) == 250
True
GFF and GTF files have lots of useful information in their attributes field (the last field in each line). These attributes
can be accessed via Interval.attrs
.
>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> gff.attrs
{'ID': 'mRNA1', 'Parent': 'gene1'}
>>> gff.attrs['Awesomeness'] = "99"
>>> gff.attrs['ID'] = 'transcript1'
You can directly modify attrs
gff.attrs['Awesomeness'] = "99"
gff.attrs['ID'] = 'transcript1'
print(gff.attrs)
# {'ID': 'transcript1', 'Parent': 'gene1', 'Awesomeness': '99'}
print(gff)
# chr1 fake mRNA 51 300 . + . ID=transcript1;Parent=gene1;Awesomeness=99;
BedTool.filter()
can filter BedTool
objects. Pass a function whose first argument is an Interval
. Returns True/False
for filtering.
Pick >100bp
features
a = pybedtools.example_bedtool('a.bed')
b = a.filter(lambda x: len(x) > 100)
print(b)
# chr1 150 500 feature3 0 -
It can also be defined more generically. The selection length is determined by the incoming parameter
def len_filter(feature, L):
"Returns True if feature is longer than L"
return len(feature) > L
a = pybedtools.example_bedtool('a.bed')
print(a.filter(len_filter, L=200))
# chr1 150 500 feature3 0 -
Also supports chained calls
a.intersect(b).filter(lambda x: len(x) < 100).merge()
The featurefuncs
module contains some functions implemented in Cython, which are about 70% faster than pure Python.
from pybedtools.featurefuncs import greater_than
def L(x,width=100):
return len(x) > 100
### Cython's length comparison is greater than the implementation
test.filter(greater_than, 100)
### Pure Python implementation of greater than
test.filter(L, 100)
Similar to BedTool.filter()
, the action function is applied to each Interval
, based on the returned boolean value to determine whether to retain or not. BedTool.each()
also applies the function to each Interval
, but mainly modifies the Interval
.
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
## intersect" with c=True, count of duplicate features
with_counts = a.intersect(b, c=True)
Then define a function to normalize the count,
def normalize_count(feature, scalar=0.001):
"""
assume feature's last field is the count
"""
counts = float(feature[-1])
normalized = round(counts / (len(feature) * scalar), 2)
# need to convert back to string to insert into feature
feature.score = str(normalized)
return feature
>>> with_counts.head()
chr1 1 100 feature1 0 + 0
chr1 100 200 feature2 0 + 1
chr1 150 500 feature3 0 - 1
chr1 900 950 feature4 0 + 1
>>> normalized = with_counts.each(normalize_count)
>>> print(normalized)
chr1 1 100 feature1 0.0 + 0
chr1 100 200 feature2 10.0 + 1
chr1 150 500 feature3 2.86 - 1
chr1 900 950 feature4 20.0 + 1
Chained calls:
a.intersect(b, c=True).each(normalize_count).filter(lambda x: float(x[4]) < 1e-5)