How to use BEDTools in Python (pybedtools)

created at 01-07-2022 views: 26

Preface

pybedtools is an encapsulation and extension of BEDTools software using Python. Why use pybedtools instead of BEDTools directly? Of course we want to use BEDTools in Python (😀)

Why pybedtools? The author also gave an example to find out the SNPs in the intergenic region, and then get the gene names within <5kb from the SNPs. If you use pybedtools:

import pybedtools

snps = pybedtools.example_bedtool('snps.bed.gz')
genes = pybedtools.example_bedtool('hg19.gff')

intergenic_snps = snps.subtract(genes)                  
nearby = genes.closest(intergenic_snps, d=True, stream=True) 

for gene in nearby:             
    if int(gene[-1]) < 5000:    
        print(gene.name) 

If you use a shell script, you may need these codes:

snps=snps.bed.gz
genes=hg19.gff
intergenic_snps=/tmp/intergenic_snps

snp_fields=`zcat $snps | awk '(NR == 2){print NF; exit;}'`
gene_fields=9
distance_field=$(($gene_fields + $snp_fields + 1))

intersectBed -a $snps -b $genes -v > $intergenic_snps

closestBed -a $genes -b $intergenic_snps -d \
  awk '($'$distance_field' < 5000){print $9;}' \
  perl -ne 'm/[ID|Name|gene_id]=(.*?);/; print "$1\n"'

rm $intergenic_snps

By comparison, if you use pybedtools, you only need to be familiar with Python and pybedtools. To achieve the same function with shell, you may need Perl, bash, awk and bedtools.

For me, it is mainly because of using pybedtools, which allows me to use Python code throughout the process to get the same results as bedtools.

use

Install

Via conda

$ conda install --channel conda-forge --channel bioconda pybedtools

Install via pip

$ pip install pybedtools 

Create a BedTool

To use pybedtools, the first step is to import the pybedtools package and create the BedTool object. The BedTool object encapsulates all the available program functions of the BedTools program, so that they can be better used in Python. Therefore, in most cases, we use pybedtools, that is, to manipulate the BedTool object. The creation of the BedTool object can use the interval file (BED, GFF, GTF, VCF, SAM, or BAM format). Create a BedTool object from a file:

# 'test.bed' is the file path.
test = pybedtools.BedTool('test.bed')

In addition, it can also be created from a string:

s ='''
chrX 1 100
chrX 25 800
'''
## Controlled by the parameter from_string
a = pybedtools.BedTool(s, from_string=True)

pybedtools also provides some test files, which will be used as demonstrations below.

# list_example_files will list example files
>>> pybedtools.list_example_files()
['164.gtf',
...
 'a.bed',
 'a.bed.gz',
 'b.bed',
 'bedpe.bed',
 'bedpe2.bed',
 'c.gff',
 'd.gff',
...
 'venn.b.bed',
 'venn.c.bed',
 'x.bam',
 'x.bed',
 'y.bam']

Use sample files to create BedTool objects

## Just pass in the file name'a.bed'
a = pybedtools.example_bedtool('a.bed')

View the first few rows of data

>>> a.head()
chr1 1 100 feature1 0 +
  chr1 100 200 feature2 0 +
  chr1 150 500 feature3 0-
  chr1 900 950 feature4 0 +

Intersections

BEDTools is often used to take intersection, let's see how to operate in pybedtools

a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
a_and_b = a.intersect(b)
>>> a.head()
chr1    1   100 feature1    0   +
chr1    100 200 feature2    0   +
chr1    150 500 feature3    0   -
chr1    900 950 feature4    0   +

>>> b.head()
chr1    155 200 feature5    0   -
chr1    800 901 feature6    0   +

>>> a_and_b.head()
chr1    155 200 feature2    0   +
chr1    155 200 feature3    0   -
chr1    900 901 feature4    0   +

If you want to check whether there is overlap, the same as intersectBed usage, add the parameter u:

a_with_b = a.intersect(b, u=True)
>>> a_with_b.head()
chr1 100 200 feature2 0 +
 chr1 150 500 feature3 0 -
 chr1 900 950 feature4 0 +

After running the intersect method, what is returned is a new BedTool example, which temporarily saves the content on a hard disk temporary address. Temporary address acquisition:

>>> a_and_b.fn
'/tmp/pybedtools.kfnxvuuz.tmp'
>>> a_with_b.fn
'/tmp/pybedtools.qre024y9.tmp'

Saving the results

The BedTool.saveas() method can copy the temporary file pointed to by the BedTool instance to a new address. At the same time, you can add a trackline to upload the UCSC Genome Browser directly without opening the file again to add the trackline.

c = a_with_b.saveas('intersection-of-a-and-b.bed', trackline='track name="a and b"')
print(c.fn)
# intersection-of-a-and-b.bed

The BedTool.saveas() method returns a new BedTool instance pointing to a new address.

And BedTool.moveto() is used to directly move the file to a new address. If you don't need to add the trackline and the file is very large, this method will be very fast.

d = a_with_b.moveto('another_location.bed')
print(d.fn)
# 'another_location.bed'

Since it has been moved, that is, the old file does not exist anymore, an error will be reported if you view its content again.

>>> a_with_b.head()
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_2075/544676037.py in <module>
.........

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pybedtools.4uqthcml.tmp'

Default arguments

When using BEDTools, we will specify the -a and -b parameters, but in the previous operation, we did not specify them. When this is because pybedtools gives the default settings, the instance of the method call is used as -a, and the other instance passed in is used as -b, that is, exons corresponds to -a ("a.bed") , snps corresponds to -b("b.bed")

import pybedtools
exons = pybedtools.example_bedtool('a.bed')
snps = pybedtools.example_bedtool('b.bed')
exons_with_snps = exons.intersect(snps, u=True)
$ intersectBed -a a.bed -b b.bed -u > tmpfile

The above two are the same. But you can also specify a, b

# these all have identical results
x1 = exons.intersect(snps)
x2 = exons.intersect(a=exons.fn, b=snps.fn)
x3 = exons.intersect(b=snps.fn)
x4 = exons.intersect(snps, a=exons.fn)
x1 == x2 == x3 == x4
# True

Chaining methods together (pipe)

The chain call of the method is similar to the pipeline operation under the linux shell. For example: a and b check whether the intersection or not, then merge the coordinate intervals, and run separately.

x1 = a.intersect(b, u=True)
x2 = x1.merge()

Chaining calls can be done like this:

x3 = a.intersect(b, u=True).merge()

another example

x4 = a.intersect(b, u=True).saveas('a-with-b.bed').merge().saveas('a-with-b-merged.bed')

Operator overloading

Operator overloading, an amazing method!

x5 = a.intersect(b, u=True)
x6 = a + b

x5 == x6
# True
x7 = a.intersect(b, v=True)
x8 = a - b

x7 == x8
# True

That is, the + operator means intersectBed uses the -u parameter, and the - operator means intersectBed uses the -v parameter.

It is still possible to continue chaining calls using operator overloading

x9 = (a + b).merge()

Intervals

In pybedtools, an Interval object represents a line of data in a BED, GFF, GTF or VCF file. Note that the above and the following are demonstrations in the Python3 version (there will be no one using Python2)

Or go ahead and create a BedTool object as an example,

a = pybedtools.example_bedtool('a.bed')
feature = a[0]
features = a[1:3]

BedTool supports slicing operations, getting a single-line element is an Interval object

>>> type(feature)
pybedtools.cbedtools.Interval

Common Interval attributes

Printing the Interval object will return the row data it represents.

>>> print(feature)
chr1 1 100 feature1 0 +

All feature (referring to Interval), regardless of the file type from which BedTool was created, have chrom, start, stop, name, score, strand properties. where start, stop are integers and everything else (including score) are strings. However, we should often have only chrom, start, stop data, let's see how pybedtools handles the remaining missing attributes.

>>> feature2 = pybedtools.BedTool('chrX 500 1000', from_string=True)[0]
>>> print(feature2)
chrX 500 1000

It seems that print(feature2) prints the raw line data.

>>> feature2.chrom
'chrX'
>>> feature2.start
500
>>> feature2.stop
1000 
>>> feature2.name
'.'
>>>feature2.score
'.'
>>>feature2.strand
'.'

The missing default value is ..

Indexing into Interval objects

Interval can be indexed by list and dict.

>>> feature2[:]
['chrX', '500', '1000']
>>> feature2[0]
'chrX'
>>> feature2["chrom"]
'chrX'

However, indexing in a dictionary has an advantage. For missing values, the default value can be automatically returned, and an error will be reported through an integer index.

>>> feature2["name"]
'.'
>>> feature2[3]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_2075/566460392.py in <module>
----> 1 feature2[3]

/opt/conda/lib/python3.9/site-packages/pybedtools/cbedtools.pyx in pybedtools.cbedtools.Interval.__getitem__()

IndexError: field index out of range

Fields

Interval has an Interval.fields property that splits the original line into a list. When using integer indexing, this fields list is indexed.

>>> f = pybedtools.BedTool('chr1 1 100 asdf 0 + a b c d', from_string=True)[0]
>>> f.fields
['chr1', '1', '100', 'asdf', '0', '+', 'a', 'b', 'c', 'd']
>>> len(f.fields)
10

BED is 0-based, others are 1-based

A troublesome thing when using multiple formats, the coordinate system of BED files is different from GFF/GTF/VCF files.

  • BED files are 0-based, (the first position on the chromosome is 0), features do not contain stop positions
  • GFF, GTF, and VCF files are 1-based (i.e. the first position on the chromosome is 1) features include stop positions For convenience, pybedtools makes these conventions:

  • Interval.start is 0-based start position, no matter what file. Even GFF, or other 1-based features.

  • Use len() to get the length of Interval, always return Interval.stop - Interval.start. This way, no matter what the file format, the length is guaranteed to be correct. Also simplifies the underlying code. We can consider all Intervals to be the same
  • The contents of Interval.fields are all strings, which are the division of the original line. So for GFF feature, Interval.fields[3] and Interval[3] are not the same as Interval.start. That is Interval.fields[3] = Interval.start + 1. Because Interval.fields[3] is the original 1-based data, and Interval.start uses 0-based

Worked example

Here are two examples to create a GFF Interval

gff = ["chr1",
       "fake",
       "mRNA",
       "51",   # <- start is 1 greater than start for the BED feature below
       "300",
       ".",
       "+",
       ".",
       "ID=mRNA1;Parent=gene1;"]
gff = pybedtools.create_interval_from_list(gff)

Create a BED Interval with the same gff coordinates as before. But their coordinate systems are different.

bed = ["chr1",
       "50",
       "300",
       "mRNA1",
       ".",
       "+"]
bed = pybedtools.create_interval_from_list(bed)

Confirm the respective file formats,

>>> gff.file_type
'gff'
>>> bed.file_type
'bed'

the respective original representation

>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> print(bed)
chr1 50 300 mRNA1 . +
>>> bed.start == gff.start == 50
True
>>> bed.end == gff.end == 300
300
>>> len(bed) == len(gff) == 250
True

GFF features have access to attributes

GFF and GTF files have lots of useful information in their attributes field (the last field in each line). These attributes can be accessed via Interval.attrs.

>>> print(gff)
chr1 fake mRNA 51 300 . + . ID=mRNA1;Parent=gene1;
>>> gff.attrs
{'ID': 'mRNA1', 'Parent': 'gene1'}
>>> gff.attrs['Awesomeness'] = "99"
>>> gff.attrs['ID'] = 'transcript1'

You can directly modify attrs

gff.attrs['Awesomeness'] = "99"
gff.attrs['ID'] = 'transcript1'
print(gff.attrs)
# {'ID': 'transcript1', 'Parent': 'gene1', 'Awesomeness': '99'}
print(gff)
# chr1 fake mRNA 51 300 . + . ID=transcript1;Parent=gene1;Awesomeness=99;

Filtering

BedTool.filter() can filter BedTool objects. Pass a function whose first argument is an Interval. Returns True/False for filtering.

Pick >100bp features

a = pybedtools.example_bedtool('a.bed')
b = a.filter(lambda x: len(x) > 100)
print(b)
# chr1 150 500 feature3 0 -

It can also be defined more generically. The selection length is determined by the incoming parameter

def len_filter(feature, L):
    "Returns True if feature is longer than L"
    return len(feature) > L

a = pybedtools.example_bedtool('a.bed')
print(a.filter(len_filter, L=200))
# chr1        150     500     feature3        0       -

Also supports chained calls

a.intersect(b).filter(lambda x: len(x) < 100).merge()

Fast filtering functions in Cython

The featurefuncs module contains some functions implemented in Cython, which are about 70% faster than pure Python.

from pybedtools.featurefuncs import greater_than

def L(x,width=100):
     return len(x) > 100

### Cython's length comparison is greater than the implementation
test.filter(greater_than, 100)
### Pure Python implementation of greater than
test.filter(L, 100)

Each

Similar to BedTool.filter(), the action function is applied to each Interval, based on the returned boolean value to determine whether to retain or not. BedTool.each() also applies the function to each Interval, but mainly modifies the Interval.

a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')

## intersect" with c=True, count of duplicate features
with_counts = a.intersect(b, c=True)

Then define a function to normalize the count,

def normalize_count(feature, scalar=0.001):
    """
    assume feature's last field is the count
    """
    counts = float(feature[-1])
    normalized = round(counts / (len(feature) * scalar), 2)

    # need to convert back to string to insert into feature
    feature.score = str(normalized)
    return feature
>>> with_counts.head()
chr1 1 100 feature1 0 + 0
 chr1 100 200 feature2 0 + 1
 chr1 150 500 feature3 0 - 1
 chr1 900 950 feature4 0 + 1

>>> normalized = with_counts.each(normalize_count)
>>> print(normalized)
chr1        1       100     feature1        0.0     +       0
chr1        100     200     feature2        10.0    +       1
chr1        150     500     feature3        2.86    -       1
chr1        900     950     feature4        20.0    +       1

Chained calls:

a.intersect(b, c=True).each(normalize_count).filter(lambda x: float(x[4]) < 1e-5)
created at:01-07-2022
edited at: 01-07-2022: