Computer readable FASTA output

Ewan Birney birney at molbiol.ox.ac.uk
Fri Oct 20 04:02:54 EST 1995


>>>LCBO prolactin precursor - bovine
>; n1: 229
>; initn:  442
>; init1:  314
>; opt: 501
>; z-score: 600.7
>; expect: 1.5e-27
>; smith-waterman: 501
>; ident: 0.365 
>; overlap: 222
>; start_seq1: 1
>; stop_seq1: 224
>; start_seq2: 1
>; stop_seq2: 229
>>musplf ..
> MLPSLIQPCSWILLLLLVNSSLLWKNVASFPMCAMRNGRCFMSFEDTFE
>LAGSLSHNISIEVSELFTEFEKHYSNVSGLRDKSPMRCNTSFLPTPENKE
>QARLTHYSALLKSGAMILDAWESPLDDLVSELSTIKNVPDIIISKATDIK
>KKINAVRNGVNALMSTMLQNGDEEKKNPAWF....LQSDNEDARIHSLYG
>MISCLDNDFKKVDIYLNVLKCYMLKIDNC
>>LCBO ..
>MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFD
>RAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKE
>QAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIE
>EENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYN
>LLHCLRRDSSKIDTYLKLLNCRIIYNNNC
>>>LCPG prolactin precursor - pig                     (229 aa)

This looks great - I agree with Keith's point about adding a // delimiter
after each entry.

What are the .. points at the end of the sequence name lines (e.g.
">>musplf .." above)? Are they there for better GCG parsing (???) or do they
indicate more text for each entry?

Can I also suggest that the tags be
	start_query:
	stop_query:
and	start_hit:
	stop_hit:

which would make seq1 and seq2 mean more to people reading it.

Naturally - this suggests making a "standardised" format for database searches
which would be

>>Hit_name
; tag: item
; tag: item
;
>query_name
Query_sequence alignment
>hit_name
Hit_sequence alignment
//
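
To make the layout above concrete, here is a minimal sketch of a reader for
it, in Python. It is only one possible reading of the proposal: the function
name parse_records and the dictionary keys are hypothetical, the sample
sequences are truncated fragments of the entry quoted above, and it assumes
the query and hit names differ.

```python
def parse_records(text):
    """Yield one dict per '//'-terminated record in the proposed layout."""
    record = None
    current = None  # name of the sequence block currently being filled
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith(">>"):
            # '>>Hit_name' opens a new record
            record = {"hit": line[2:].strip(), "tags": {}, "seqs": {}}
            current = None
        elif record is None:
            continue  # ignore anything outside a record
        elif line == "//":
            # '//' closes the record
            yield record
            record, current = None, None
        elif line.startswith(";"):
            body = line[1:].strip()
            if body:  # a bare ';' only ends the tag block
                tag, _, item = body.partition(":")
                record["tags"][tag.strip()] = item.strip()
        elif line.startswith(">"):
            # '>name' starts the query or hit alignment lines
            current = line[1:].strip()
            record["seqs"][current] = ""
        elif current is not None:
            record["seqs"][current] += line.strip()


sample = """\
>>LCBO
; opt: 501
; start_query: 1
;
>musplf
MLPSLIQPCS
WILLLLLVNS
>LCBO
MDSKGSSQKG
//
"""

records = list(parse_records(sample))
```

Values are kept as strings here; converting them to numbers would need the
agreed tag set to say which tags are numeric.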


It would be trivial for me to get SearchWise to chuck this sort of thing out
(except.... what do you do with TFASTA/Protein query vs DNA sequence hits?)

Can I suggest one thing in the parsing: that items are either
one word or " "-delimited strings. Do you want to build in a line-overrun
system (some sort of backslash)?
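
As a rough sketch of what that rule could look like in Python (nothing here
is agreed: the trailing-backslash continuation is just one way to do line
overrun, and join_continuations, parse_item and the "desc" tag are
hypothetical names used for illustration):

```python
def join_continuations(lines):
    """Merge each line ending in a backslash with the line that follows."""
    joined, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash, keep collecting
        else:
            joined.append(pending + line)
            pending = ""
    if pending:  # a dangling continuation at the end of input
        joined.append(pending)
    return joined


def parse_item(value):
    """Return an item that is either one word or a " "-delimited string."""
    value = value.strip()
    if value.startswith('"'):
        if len(value) < 2 or not value.endswith('"'):
            raise ValueError("unterminated quoted item: %r" % value)
        return value[1:-1]
    if len(value.split()) != 1:
        raise ValueError("unquoted item must be one word: %r" % value)
    return value


lines = join_continuations([
    '; desc: "prolactin \\',
    'precursor - bovine"',
    '; opt: 501',
])
items = [parse_item(line.partition(":")[2]) for line in lines]
```

An unquoted multi-word item raises ValueError rather than being silently
truncated, which seems the safer default for a machine-readable format.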


And should we have an agreed set of tags (e.g. start_query, start_hit)?


This is a good suggestion though


ewan

birney at molbiol.ox.ac.uk




