Standard Multiple Sequence Alignment File Format

Don Gilbert gilbertd at sunflower.bio.indiana.edu
Thu Jan 26 18:02:41 EST 1995


I don't know of a file format that does what you want as-is.
If you are looking for wider compatibility, however, then
you might use the Pearson/FastA format to do what you want,
and many more programs would read such files.  The main difference
between the GDE format and the FastA format is the trivial change
of the name-line key from "#" to ">".

GDE
#name1(offset)
sequence1
#name2(offset)
sequence2

Pearson/FastA
>name1 any added tags you want here
sequence1
>name2 any added tags you want here
sequence2
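
As a rough illustration of how small that change is, here is a minimal
Python sketch that rewrites the name-line key.  The function name
gde_to_fasta is made up for this example, and it assumes every line
starting with "#" is a name line; the "(offset)" part is left attached
to the name as-is.

```python
def gde_to_fasta(text: str) -> str:
    """Swap the GDE name-line key '#' for the FastA key '>'.

    Assumption: any line beginning with '#' is a name line.
    Sequence lines are passed through unchanged.
    """
    out = []
    for line in text.splitlines():
        if line.startswith("#"):
            line = ">" + line[1:]
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    gde = "#name1(offset)\nsequence1\n#name2(offset)\nsequence2\n"
    print(gde_to_fasta(gde), end="")
```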


There are a lot of programs out there that read FastA format.  If you
are going to define a new standard, it would be a flexible choice
to start from.  GDE should be modifiable to the extent of using ">"
instead of "#" for the name line key.

The readseq reformatter sticks the sequence length and a 32-bit checksum
on each FastA name line, as in:

>acarr58sst 183 bases, CEC03C8E checksum.
a-------actcctaacaacGgAtatCTtGgtT-CtcgcgagGatGAaGa
acGcAGcg--AaatGcGatacgtagtgtgaatcgc-agggatcagtgaat
catcgaatctttgaacgcaagttgcgctctcgtg--gtttaaccccccgg
gagc-acgttcgcttgagtgcc--gctt-----
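
A reader for entries like the one above only has to split the name
line at the first whitespace and join the sequence lines.  The sketch
below is one way to do it in Python, not how readseq itself works;
read_fasta and the (name, tags, sequence) tuple layout are names
invented for this example.

```python
def read_fasta(text):
    """Return a list of (name, tags, sequence) tuples.

    'name' is the first word after '>'; 'tags' is whatever follows it
    on the name line (possibly empty).  Gap characters such as '-' are
    kept, since the entries are aligned sequences.
    """
    records = []
    name, tags, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        if line.startswith(">"):
            if name is not None:
                records.append((name, tags, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            tags = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, tags, "".join(chunks)))
    return records
```

Because everything after the first word lands in 'tags', a reader
written this way is unaffected by whatever gets added there.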

If you limit your enhancements to things after the ">name " part
of the FastA line, you shouldn't break any software that reads FastA 
format.  Your suggested tags might look like:

>name 123 offset, 456 bases, 789 checksum, [consensus]
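
A name line in that style could be generated along these lines.  This
is only a sketch: tagged_header is a made-up name, and zlib.crc32 is
used as a stand-in 32-bit checksum, which is an assumption and will
not necessarily match the values readseq prints.

```python
import zlib


def tagged_header(name, seq, offset=0, consensus=False):
    """Build a FastA name line with offset, length and checksum tags.

    Assumption: CRC-32 from zlib stands in for the checksum; it is not
    claimed to be the algorithm readseq uses.
    """
    checksum = zlib.crc32(seq.encode("ascii")) & 0xFFFFFFFF
    line = f">{name} {offset} offset, {len(seq)} bases, {checksum:08X} checksum"
    if consensus:
        line += ", [consensus]"
    return line


if __name__ == "__main__":
    print(tagged_header("name", "a---actcct", offset=123, consensus=True))
```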
     
If you want, add base composition counts, but the total base
count plus the checksum value will tell you, for practical purposes,
whether two sequences are identical or not.
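
The comparison amounts to checking a (length, checksum) pair.  A short
Python sketch, again using zlib.crc32 as an assumed stand-in for the
checksum and with function names invented here:

```python
import zlib


def fingerprint(seq):
    """Return (length, 32-bit checksum) for a sequence string.

    Assumption: CRC-32 is the checksum.  The comparison is
    case-sensitive, so 'ACGT' and 'acgt' get different fingerprints.
    """
    return len(seq), zlib.crc32(seq.encode("ascii")) & 0xFFFFFFFF


def probably_identical(a, b):
    """True when both length and checksum agree."""
    return fingerprint(a) == fingerprint(b)
```

A mismatch proves the sequences differ.  A match means they are very
likely identical, since two different sequences of the same length can
in principle share a 32-bit checksum; compare the strings directly if
certainty matters.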

-- Don
-- 
-- d.gilbert--biocomputing--indiana u--bloomington--gilbertd at bio.indiana.edu



