Whether you want to do an alignment of protein or nucleotide sequences, Biopython offers a handy tool for a “quick and dirty” job: the pairwise2 module. It allows for global/local alignment, using custom-built matrices, predefined ones, or none at all, and an array of other options that truly make this a very very flexible tool.
To those concerned with speed issue, pairwise2 has some parts coded in C which will be used unless they weren’t compiled properly. So, on to the actual code. To compute an alignment between two sequences, we need only 10 lines of code.
from Bio import pairwise2 from Bio.SubsMat import MatrixInfo as matlist matrix = matlist.blosum62 gap_open = -10 gap_extend = -0.5 # Only using the first 60 aas for visualization reasons.. p53_human = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP" p53_mouse = "MEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRV" alns = pairwise2.align.globalds(p53_human, p53_mouse, matrix, gap_open, gap_extend) top_aln = alns[0] aln_human, aln_mouse, score, begin, end = top_aln print aln_human+'\n'+aln_mouse MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP---- MEESQSDISLELPLSQETFSGLWKLLPPEDIL-PSP-HCMDDLLL-PQDVEEFF-EGPSEALRV # Comparing with EBI NEEDLE output MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP------ MEESQSDISLELPLSQETFSGLWKLLPPEDIL-PSP-HCMDDLLL-PQDVEEFF---EGPSEALRV
A list of the predefined matrices available for this is under another module, SubsMat.MatrixInfo:
benner6, benner22, benner74, blosum100, blosum30, blosum35, blosum40, blosum45, blosum50, blosum55, blosum60, blosum62, blosum65, blosum70, blosum75, blosum80, blosum85, blosum90, blosum95, feng, fitch, genetic, gonnet, grant, ident, johnson, levin, mclach, miyata, nwsgappep, pam120, pam180, pam250, pam30, pam300, pam60, pam90, rao, risler, structure
The function name, globalds, is a custom built name that already indicates 1) which kind of alignment we are performing and 2) which scoring scheme we are using. Quoting the documentation of the function:
The names of the alignment functions in this module follow the convention XX where is either "global" or "local" and XX is a 2 character code indicating the parameters it takes. The first character indicates the parameters for matches (and mismatches), and the second indicates the parameters for gap penalties. The match parameters are CODE DESCRIPTION x No parameters. Identical characters have score of 1, otherwise 0. m A match score is the score of identical chars, otherwise mismatch score. d A dictionary returns the score of any pair of characters. c A callback function returns scores. The gap penalty parameters are CODE DESCRIPTION x No gap penalties. s Same open and extend gap penalties for both sequences. d The sequences have different open and extend gap penalties. c A callback function returns the gap penalties.
This tool is probably not suitable for very accurate alignments, as shown by comparison with NEEDLE. Yet, it serves as a very practical tool that can easily be included in a script without further dependencies (like Clustal, or Needle) and few hassle.
Hey das,
Surely it is possible. It’s a matter of how to code your loop, not Biopython per se. You want to have something like this:
sequences = ['aaa', 'bbb', 'ccc', ... ] nsequences = len(sequences) for i in xrange(nsequences): for j in xrange(i+1, nsequences): aln = alignment_function(sequences[i], sequences[j]) print sequences[i], sequences[j], alnThis should then output something like:
seq1 seq2 aln12
seq1 seq3 aln13
seq2 seq3 aln23
Thanks a lot
My data set contains 200 sequences. I would like to do pair wise sequence alignment using my data set. I know that I can use clustalw. But I would like to get the output like this.
firstsequence- second sequence
first sequence-3rd seq ————etc
first seq – 200 seq
second sequence-third seq
2nd seq–4th seq etc
2nd seq- 200 seq.
Is this possible in Biopython? Please help me