[DI] Computer readable FASTA output

"BillPearson" wrp at avery.med.virginia.edu
Thu Oct 12 09:55:25 EST 1995


I have been asked by several people to modify FASTA to produce
alignment output that is easier to parse and to use with other
programs, e.g. for HTML output or subsequent analysis.

Since FASTA already supports a variety of alignment output formats, I
am happy to add another one that is more machine readable.  At the moment,
my inclination is to provide the following:


(1) put a delimiter, e.g. ">>", at the beginning of each alignment record
    with some hope that the ">>" would not be used again until the next
    alignment record.

(2) put out a list of scores in some tagged format, e.g.:
 
	n1:
	initn:
	init1:  
	opt:
	smith_waterman:
	identity:
	length:
	expect:
	start_seq1:
	end_seq1:
	start_seq2:
	end_seq2:

(3) put out the alignments without any extra information (just letters).

Thus, a -m 10 alignment might look like:

>>LCBO - Prolactin precursor - Bovine
; n1: 229
; initn:  442
; init1:  314
; opt:  501
; Smith-Waterman: 501
; z-score: 600.7
; expect: 1.5e-27
; identity:   0.365
; overlap: 222
; start_seq1: 1
; end_seq1: 224
; start_seq2: 1
; end_seq2: 229
>musplf ..
 MLPSLIQPCSWILLLLLVNSSLLWKNVASFPMCAMRNGRCFMSFEDTFE
LAGSLSHNISIEVSELFTEFEKHYSNVSGLRDKSPMRCNTSFLPTPENKE
QARLTHYSALLKSGAMILDAWESPLDDLVSELSTIKNVPDIIISKATDIK
KKINAVRNGVNALMSTMLQNGDEEKKNPAWF....LQSDNEDARIHSLYG
MISCLDNDFKKVDIYLNVLKCYMLKIDNC
>LCBO ..
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFD
RAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKE
QAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIE
EENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYN
LLHCLRRDSSKIDTYLKLLNCRIIYNNNC

SSEARCH would use the same format, but without
the initn, init1, and opt values.
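
To give a feel for how another program might read the proposed format,
here is a minimal sketch in Python.  It is only an illustration of the
layout shown above, not part of FASTA: the function name parse_m10 and
the dictionary keys are hypothetical.  It assumes that ">>" opens each
record, that score lines have the form "; tag: value", that ">name .."
starts each aligned sequence, and that leading or trailing blanks on
sequence lines carry no meaning.  Score values are kept as strings so
the caller can decide how to convert them.

```python
def parse_m10(text):
    """Parse records in the proposed tagged layout into a list of dicts."""
    records = []
    record = None
    current_seq = None  # name of the sequence whose letters are being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):
            # ">>" opens a new alignment record; the rest is the title.
            record = {"title": line[2:].strip(), "scores": {}, "seqs": {}}
            records.append(record)
            current_seq = None
        elif record is None:
            continue  # ignore anything before the first record
        elif line.startswith(";"):
            # "; tag: value" score line; split at the first colon only.
            tag, _, value = line[1:].partition(":")
            record["scores"][tag.strip()] = value.strip()
        elif line.startswith(">"):
            # ">name .." starts one aligned sequence.
            parts = line[1:].split()
            current_seq = parts[0] if parts else ""
            record["seqs"][current_seq] = ""
        elif current_seq is not None:
            # Plain letters (and gap characters) of the current sequence.
            record["seqs"][current_seq] += line
    return records
```

Because tags are looked up by name, the same sketch would read SSEARCH
records that omit initn, init1, and opt; a caller would just find those
keys missing from the "scores" dictionary.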


I am willing to entertain comments on this proposed output,
or some other, until Nov. 1, at which point I will try 
to produce a new output format.  Note that this will
not replace any existing format, so it should not break
anything that exists today.

An alternative might be to use the BLAST ASN.1 format
for output.  If you would prefer that format, send a vote.

Bill Pearson
-- 
wrp at virginia.EDU
Dept. of Biochemistry #440
U. of Virginia
Charlottesville, VA 22908


