In this practical you will have to write some simple Perl scripts for sequence manipulation using various well-known file formats. Starting from a file containing proteins in SwissProt format, you will have to write a program for converting them to Fasta format. Similarly, starting from a file in PDB format you have to convert it to Fasta format. Two simple programs for performing these tasks are given below swiss2fasta.pl $/="\/\/\n"; while (<>) { if ($_=~/^ID\s{3}([\w\d\-\_]+)?\s+.*/m) { print ">$1\n"; } while ($_=~/^ (.*)/mg) { $line=$1; $line=~s/\s//g; print $line."\n"; } } $/="\n"; pdb2fasta.pl #!/usr/bin/perl use diagnostics; use strict; my %aa=( 'ALA' => 'A', 'ARG' => 'R', 'ASN' => 'N', 'ASP' => 'D', 'CYS' => 'C', 'GLN' => 'Q', 'GLU' => 'E', 'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I', 'LEU' => 'L', 'LYS' => 'K', 'MET' => 'M', 'PHE' => 'F', 'PRO' => 'P', 'SER' => 'S', 'THR' => 'T', 'TRP' => 'W', 'TYR' => 'Y', 'VAL' => 'V' ); foreach my $file ( @ARGV ) { my $seq=''; print "\n$file\n"; open TMP, $file or die "not found\n"; my @c=<TMP>; close TMP; foreach my $i (0.. $#c) { if( substr( $c[$i], 0, 6) ne'SEQRES' ) { next;} my $len=substr( $c[$i], 13, 4); my $chain=substr( $c[$i], 11, 1 ); if( substr( $c[$i], 7, 3) >1 ) { next; } $seq.=">$chain $len\n"; my $lc=0; foreach my $j( 0..$len/13 ) { foreach my $k( 0..12 ) { if( $j*13 + $k <$len ) { if( exists $aa{ substr( $c[$i+$j],19+$k*4,3 )}) { $seq.=$aa{ substr( $c[$i+$j], 19+$k*4,3 )}; } else { $seq.='X'; } $lc++; if( $lc==60 or $j*13 + $k==$len-1 ) { $lc=0; $seq.="\n"; } } } } if( substr( $c[$i], 0, 6) ne'SEQRES' ) { last;} } print $seq; } In the final part, you will have to write a custom script that reads a file in SwissProt format and produces a "three-line" fasta output. That is, a file with a header, a single line for the sequence and a line with the sequence labels. For labels we will use the transmembrane segments of the proteins, as they appear in the FT TRANSMEM field. ID 140U_DROME Reviewed; 261 AA. AC P81928; Q9VFM8; FT CHAIN 1 261 RPII140-upstream gene protein. FT /FTId=PRO_0000064352. FT TRANSMEM 67 87 Potential. FT TRANSMEM 131 151 Potential. FT TRANSMEM 183 203 Potential. FT CONFLICT 64 64 S -> F (in Ref. 1). SQ SEQUENCE 261 AA; 29182 MW; 5DB78CF6CFC4435A CRC64; MNFLWKGRRF LIAGILPTFE GAADEIVDKE NKTYKAFLAS KPPEETGLER LKQMFTIDEF GSISSELNSV YQAGFLGFLI GAIYGGVTQS RVAYMNFMEN NQATAFKSHF DAKKKLQDQF TVNFAKGGFK WGWRVGLFTT SYFGIITCMS VYRGKSSIYE YLAAGSITGS LYKVSLGLRG MAAGGIIGGF LGGVAGVTSL LLMKASGTSM EEVRYWQYKW RLDRDENIQQ AFKKLTEDEN PELFKAHDEK TSEHVSLDTI K The final output should look like this: >P81928 MNFLWKGRRFLIAGILPTFEGAADEIVDKENKTYKAFLASKPPEETGLERLKQMFTIDEF GSISSELNSVYQAGFLGFLIGAIYGGVT -------------------------------------------------------------------MMMMMMMMMMMMMMMMMMMM- For input, you can use a file containing transmembrane proteins in SwissProt format. |