3. Sequence manipulation

In this practical you will have to write some simple Perl scripts for sequence manipulation using various well-known file formats. Starting from a file containing proteins in SwissProt format, you will have to write a program for converting them to Fasta format. Similarly, starting from a file in PDB format you have to convert it to Fasta format.

Two simple programs for performing these tasks are given below


swiss2fasta.pl
$/="\/\/\n";

while (<>)
{
    if ($_=~/^ID\s{3}([\w\d\-\_]+)?\s+.*/m)   
    {
        print ">$1\n";
    }

    while ($_=~/^     (.*)/mg)
    {
        $line=$1;       
        $line=~s/\s//g;   
        print $line."\n";
    }
}

$/="\n";

pdb2fasta.pl
#!/usr/bin/perl
use diagnostics;
use strict;


my %aa=(
        'ALA' => 'A', 'ARG' => 'R', 'ASN' => 'N', 'ASP' => 'D', 'CYS' => 'C',
        'GLN' => 'Q', 'GLU' => 'E', 'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I',
        'LEU' => 'L', 'LYS' => 'K', 'MET' => 'M', 'PHE' => 'F', 'PRO' => 'P',
        'SER' => 'S', 'THR' => 'T', 'TRP' => 'W', 'TYR' => 'Y', 'VAL' => 'V'
);

foreach my $file ( @ARGV )
{
my $seq='';

print "\n$file\n";

open TMP, $file or die "not found\n";
       
my @c=<TMP>;
close TMP;
       
foreach my $i (0.. $#c)
{
    if( substr( $c[$i], 0, 6) ne'SEQRES' )
    { next;}
    my $len=substr( $c[$i], 13, 4);
    my $chain=substr( $c[$i], 11, 1 );
    if( substr( $c[$i], 7, 3) >1 )
     { next; }

    $seq.=">$chain $len\n";
    my $lc=0;
    foreach my $j( 0..$len/13 )
    {
        foreach my $k( 0..12 )
        {
            if( $j*13 + $k <$len )
            {
                if( exists $aa{ substr( $c[$i+$j],19+$k*4,3 )})
                { $seq.=$aa{ substr( $c[$i+$j], 19+$k*4,3 )}; }
                else
                { $seq.='X'; }
                $lc++;
                if( $lc==60 or $j*13 + $k==$len-1 )
                { $lc=0; $seq.="\n"; }
            }
        }
    }
    if( substr( $c[$i], 0, 6) ne'SEQRES' )
    { last;}

}

print $seq;

}

In the final part, you will have to write a custom script that reads a file in SwissProt format and produces a "three-line" fasta output. That is, a file with a header, a single line for the sequence and a line with the sequence labels. For labels we will use the transmembrane segments of the proteins, as they appear in the FT TRANSMEM field.

ID   140U_DROME              Reviewed;         261 AA.
AC   P81928; Q9VFM8;
FT   CHAIN         1    261       RPII140-upstream gene protein.
FT                                /FTId=PRO_0000064352.
FT   TRANSMEM     67     87       Potential.
FT   TRANSMEM    131    151       Potential.
FT   TRANSMEM    183    203       Potential.
FT   CONFLICT     64     64       S -> F (in Ref. 1).
SQ   SEQUENCE   261 AA;  29182 MW;  5DB78CF6CFC4435A CRC64;
     MNFLWKGRRF LIAGILPTFE GAADEIVDKE NKTYKAFLAS KPPEETGLER LKQMFTIDEF
     GSISSELNSV YQAGFLGFLI GAIYGGVTQS RVAYMNFMEN NQATAFKSHF DAKKKLQDQF
     TVNFAKGGFK WGWRVGLFTT SYFGIITCMS VYRGKSSIYE YLAAGSITGS LYKVSLGLRG
     MAAGGIIGGF LGGVAGVTSL LLMKASGTSM EEVRYWQYKW RLDRDENIQQ AFKKLTEDEN
     PELFKAHDEK TSEHVSLDTI K

The final output should look like this:
>P81928
MNFLWKGRRFLIAGILPTFEGAADEIVDKENKTYKAFLASKPPEETGLERLKQMFTIDEFGSISSELNSVYQAGFLGFLIGAIYGGVT
-------------------------------------------------------------------MMMMMMMMMMMMMMMMMMMM-

For input, you can use a file containing transmembrane proteins in SwissProt format.