Sequence editor that can delete columns?

mathog at seqaxp.bio.caltech.edu
Thu Feb 27 12:01:12 EST 1997


In article <3314A3F7.1C33 at freenet.carleton.ca>, ac562 at freenet.carleton.ca (Robert J. Forster) writes:
>I have quite a few alignments of 16S rRNA sequences from which I would
>like to delete a highly variable region before analysis.  I have used
>ClustalX and Seqpup on a mac to produce the alignments. When I select a
>block of sequences in seqpup and then hit the Edit clear or cut
>functions nothing happens.  I can delete the region one sequence at a
>time, but with hundreds of sequences I am searching for an easier way. 
>I have put GDE on an HP-UX machine and DCSE on Linux, but both of these
>installations are not exactly stable, and the documentation for these
>programs does not indicate whether the proposed task would be very easy.
>If anyone knows of a program that could help me out I would appreciate
>some pointers.

If you have access to EGCG you will find a program CREFORMAT, which is a 
variant of GCG's REFORMAT that I wrote specifically to address the needs of
a user here who has zillions of aligned tRNAs - a situation fairly similar
to yours.  CREFORMAT adds these switches to the standard ones: 

                            file
/BEGin           beginning of range, defaults to 1
/END             end of range, defaults to maximum sequence length
   Use these to extract a subsequence from a sequence or MSF file.
/DELete          delete the subsequence in the range, leave the rest
/REVerse         return the reverse strand
/LOOKup="U.,TZ"  convert characters in first string to matching character
                 in second string.
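
For anyone without CREFORMAT, the character mapping that /LOOKup describes
can be approximated in a few lines of Python.  This is only a sketch of the
idea, not CREFORMAT's code: the function name lookup_table is made up for
the illustration, and it assumes the spec is two equal-length strings
separated by a comma and that matching is case sensitive.

```python
def lookup_table(spec):
    """Build a translation table from a 'FROM,TO' spec such as 'U.,TZ'.

    Hypothetical helper: each character in FROM maps to the character
    at the same position in TO.
    """
    source, target = spec.split(",", 1)
    if len(source) != len(target):
        raise ValueError("both halves of the spec must be the same length")
    return str.maketrans(source, target)


# Turn U into T and '.' into Z; every other character is left alone.
converted = "ACGU..ACGU".translate(lookup_table("U.,TZ"))
print(converted)
```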

You can use CREFORMAT to automate column deletions in batch files and so
forth, or run it interactively from the command line.  It will operate on 
any sequence file that REFORMAT can (*.seq, whatever.msf{*}, @file.list).

So in your case, you could do:

$ creformat/infile=whatever.msf{*}/msf/begin=90/end=100/delete

and that would remove columns 90 through 100 inclusive.  (Note that when 
deleting several regions you should specify them back to front, highest
column numbers first, so the numbering of the regions still to be removed
doesn't change as you go along.)
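
To see why the back to front order matters, here is a small Python sketch
of the same column deletion on a toy alignment.  It is not how CREFORMAT is
implemented; the function names and the dictionary of sequences are
assumptions made up for the example, and columns are numbered from 1 and
inclusive, as with /BEGin and /END.

```python
def delete_columns(seq, begin, end):
    """Remove 1-based, inclusive columns begin..end from one aligned sequence."""
    return seq[:begin - 1] + seq[end:]


def delete_regions(alignment, regions):
    """Delete several column ranges from every sequence in the alignment.

    Regions are processed highest-numbered first, so deleting one region
    never shifts the coordinates of a region that is still to be removed.
    """
    for begin, end in sorted(regions, reverse=True):
        alignment = {name: delete_columns(seq, begin, end)
                     for name, seq in alignment.items()}
    return alignment


# Toy alignment, 10 columns wide (hypothetical data).
alignment = {
    "seq1": "ACGUACGUAC",
    "seq2": "AC-UACGU-C",
}

# Remove columns 2-3 and 7-8, given in the original numbering.
trimmed = delete_regions(alignment, [(2, 3), (7, 8)])
for name, seq in trimmed.items():
    print(name, seq)
```

Had columns 2-3 been removed first, the old columns 7-8 would have moved to
5-6 and the second deletion would have hit the wrong place.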

It's also handy for picking out a range of columns, like this:

$ creformat/infile=whatever.msf{*}/msf/begin=90/end=100/outfile=thin.msf

or for just yanking a subregion out of a database entry when you know
a priori where the region of interest is.  Here it pulls the CDS for the
glucose transporter gene out of the 338234 bp entry for the Bithorax
Complex:

$ creformat/infile=GB_IN:DMU31961/begin=193566/end=195089 -
  /reverse/outf=glucose.seq
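
For comparison, the same kind of extraction can be sketched in Python.
This is an illustration only: the function name extract is made up, the
coordinates are 1-based and inclusive as with /BEGin and /END, and it
assumes that "reverse strand" means the reverse complement of a DNA
sequence.

```python
# Complement table for unambiguous DNA bases; other characters pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract(seq, begin, end, reverse=False):
    """Return 1-based, inclusive positions begin..end of seq.

    With reverse=True the piece is returned as its reverse complement
    (an assumption about what the reverse strand means here).
    """
    piece = seq[begin - 1:end]
    if reverse:
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


forward = extract("AACCGGTT", 2, 5)
backward = extract("AACCGGTT", 2, 5, reverse=True)
print(forward, backward)

# Length of the range used in the command above.
cds_length = 195089 - 193566 + 1
print(cds_length)
```

The range in the command covers 1524 bases, which is a whole number of
codons (508).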



If you don't have EGCG you could use NEDIT, which will also do column cuts
and pastes, and is free. 


Regards,


David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 


