Easy Sequence Alignment with Biopython

Whether you want to do an alignment of protein or nucleotide sequences, Biopython offers a handy tool for a “quick and dirty” job: the pairwise2 module. It allows for global/local alignment, using custom-built matrices, predefined ones, or none at all, and an array of other options that truly make this a very very flexible tool.

To those concerned with speed issue, pairwise2 has some parts coded in C which will be used unless they weren’t compiled properly. So, on to the actual code. To compute an alignment between two sequences, we need only 10 lines of code.

from Bio import pairwise2
from Bio.SubsMat import MatrixInfo as matlist

matrix = matlist.blosum62
gap_open = -10
gap_extend = -0.5

# Only using the first 60 aas for visualization reasons..
p53_human = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP"
p53_mouse = "MEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRV"

alns = pairwise2.align.globalds(p53_human, p53_mouse, matrix, gap_open, gap_extend)

top_aln = alns[0]
aln_human, aln_mouse, score, begin, end = top_aln

print aln_human+'\n'+aln_mouse
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP----
MEESQSDISLELPLSQETFSGLWKLLPPEDIL-PSP-HCMDDLLL-PQDVEEFF-EGPSEALRV

# Comparing with EBI NEEDLE output
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP------
MEESQSDISLELPLSQETFSGLWKLLPPEDIL-PSP-HCMDDLLL-PQDVEEFF---EGPSEALRV

A list of the predefined matrices available for this is under another module, SubsMat.MatrixInfo:

benner6, benner22, benner74, blosum100, blosum30, blosum35, blosum40, blosum45, blosum50, blosum55, blosum60, blosum62, blosum65, blosum70, blosum75, blosum80, blosum85, blosum90, blosum95, feng, fitch, genetic, gonnet, grant, ident, johnson, levin, mclach, miyata, nwsgappep, pam120, pam180, pam250, pam30, pam300, pam60, pam90, rao, risler, structure

The function name, globalds, is a custom built name that already indicates 1) which kind of alignment we are performing and 2) which scoring scheme we are using. Quoting the documentation of the function:

The names of the alignment functions in this module follow the
 convention
 XX
 where  is either "global" or "local" and XX is a 2
 character code indicating the parameters it takes. The first
 character indicates the parameters for matches (and mismatches), and
 the second indicates the parameters for gap penalties.
The match parameters are
 CODE DESCRIPTION
 x No parameters. Identical characters have score of 1, otherwise 0.
 m A match score is the score of identical chars, otherwise mismatch score.
 d A dictionary returns the score of any pair of characters.
 c A callback function returns scores.
The gap penalty parameters are
 CODE DESCRIPTION
 x No gap penalties.
 s Same open and extend gap penalties for both sequences.
 d The sequences have different open and extend gap penalties.
 c A callback function returns the gap penalties.

This tool is probably not suitable for very accurate alignments, as shown by comparison with NEEDLE. Yet, it serves as a very practical tool that can easily be included in a script without further dependencies (like Clustal, or Needle) and few hassle.

About these ads

6 thoughts on “Easy Sequence Alignment with Biopython

  1. Hi I received an error (KeyError: (‘M’, ‘U’)) when doing this alignment of the two seqences.
    seq_A= “SAARLSAVAQSTVYAFSARPLAGGEPVSLGSLRGKVLLIENVASLGTTTRDYTEMNDLQKRLGPRGLVVLGFPCNQFGHQENGKNEEILNSLKYVRPGGGFEPNFTLFEKCEVNGEKAHPLFTFLRNALPAPSDDPTALMTDPKYIIWSPVCRNDISWNFEKFLVGPDGVPVRRYSRRFRTIDIEPDIEALLSKQPSNP”
    seq_B= “MCAARLAAAAAAAQSVYAFSARPLAGGEPVSLGSLRGKVLLIENVASLUGTTVRDYTQMNELQRRLGPRGLVVLGFPCNQFGHQENAKNEEILNSLKYVRPGGGFEPNFMLFEKCEVNGAGAHPLFAFLREALPAPSDDATALMTDPKLITWSPVCRNDVAWNFEKFLVGPDGVPLRRYSRRFQTIDIEPDIEALLSQGPSCA”
    And I ran this:
    aln = pairwise2.align.globalds(seq_A,seq_B,Matrix,gap_open, gap_extend)

  2. Quote “This tool is probably not suitable for very accurate alignments, as shown by comparison with NEEDLE. ”

    What makes you think that NEEDLE is more accurate? E.g. a gap extension of -0.3 instead of -0.5 would give you exactly the NEEDLE result.

  3. Hey das,

    Surely it is possible. It’s a matter of how to code your loop, not Biopython per se. You want to have something like this:

    sequences = ['aaa', 'bbb', 'ccc', ... ]
    nsequences = len(sequences)
    
    for i in xrange(nsequences):
      for j in xrange(i+1, nsequences):
        aln = alignment_function(sequences[i], sequences[j])
        print sequences[i], sequences[j], aln
    

    This should then output something like:
    seq1 seq2 aln12
    seq1 seq3 aln13
    seq2 seq3 aln23

  4. Thanks a lot

    My data set contains 200 sequences. I would like to do pair wise sequence alignment using my data set. I know that I can use clustalw. But I would like to get the output like this.

    firstsequence- second sequence
    first sequence-3rd seq ————etc
    first seq – 200 seq
    second sequence-third seq
    2nd seq–4th seq etc
    2nd seq- 200 seq.

    Is this possible in Biopython? Please help me

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s