Parseable FASTA alpha-test

William R. Pearson wrp at alpha0.bioch.virginia.edu
Fri Jan 12 15:20:59 EST 1996


An alpha-test version of the parseable FASTA/SSEARCH programs is now
available from ftp.virginia.edu in pub/fasta/alpha/fasta20x4.shar.Z.
This is a unix version only, and has not been tested extensively.  It
is a first attempt to provide a version of FASTA with easily parseable
output for further processing of FASTA output.  It follows closely a
proposal that was posted last fall.

Here is a description of the new output format (-m 10). In most
other respects, the program has not changed.

Bill Pearson

================================================================

Changes with 2.0x4  (January, 1996)

This is a preliminary "alpha-test" release for comment and bug
reports.  Please send me suggestions about problems with the
design or implementation of this new format by Feb. 1, 1996

The major change in with 2.0x4 is the ability to get a parseable
output from FASTA/TFASTA/SSEARCH.  This can be done using output
option -m 10.  With -m 10, the initial histogram and list of best
scores is unchanges, but the alignments are now in a parseable form:

>>>mgstm1.aa, 217 aa vs s library
; pg_name: FASTA
; pg_ver: version 2.0x4 Jan., 1996
; pg_matrix: BLOSUM50
; pg_gap-pen: -12 -2
; pg_ktup: 1
; pg_optcut: 30
; pg_cgap: 42
>>GTB1_MOUSE GLUTATHIONE S-TRANSFERASE GT8.7 (EC 2.5.1.18
; fa_initn: 1490
; fa_init1: 1490
; fa_opt: 1490
; fa_z-score: 1916.0
; fa_expect:      0
; sw_score: 1490
; sw_ident: 1.000
; sw_overlap: 217
>GT8.7  ..
; sq_len: 217
; sq_type: p
; al_start: 1
; al_stop: 217
; al_display_start: 1
PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKF
KLGLDFPNLPYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVE
NQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKRPWFAGD
KVTYVDFLAYDILDQYRMFEPKCLDAFPNLRDFLARFEGLKKISAYMKSS
RYIATPIFSKMAHWSNK
>GTB1_MOUSE ..
; sq_len: 217
; sq_type: p
; al_start: 1
; al_stop: 217
; al_display_start: 1
PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKF
KLGLDFPNLPYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVE
NQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKRPWFAGD
KVTYVDFLAYDILDQYRMFEPKCLDAFPNLRDFLARFEGLKKISAYMKSS
RYIATPIFSKMAHWSNK
>>GT28_SCHJA GLUTATHIONE S-TRANSFERASE 28 KD (EC 2.5.1.18
; fa_initn: 190
; fa_init1: 97
; fa_opt: 169
; fa_z-score: 217.9
; fa_expect: 1.1e-05
; sw_score: 169
; sw_ident: 0.277
; sw_overlap: 228
>GT8.7  ..
; sq_len: 217
; sq_type: p
; al_start: 4
; al_stop: 180
; al_display_start: 1
PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKF
KLGLDFPNLPY--LID--GSHK-ITQSNAILRYLARKHHLDGETEEERIR
ADIVENQVMDTRMQLIMLCYNPDFEKQK--PEFLK-TIPEKMKLYSEFLG
KRP--WFAGDKVTYVDFLAYDILDQYRMFEPKCLDA-FPNLRDFLARFEG
LKKISAYMKSSRYIATPIFSKMAHWSNK
>GT28_SCHJA ..
; sq_len: 206
; sq_type: p
; al_start: 3
; al_stop: 180
; al_display_start: 1
-VKLIYFNGRGRAEPIRMILVAAGVEFEDERIEFQDWP----------KI
KPTIPGGRLPIVKITDKRGDVKTMSESLAIARFIARKHNMMGDTDDEYYI
IEKMIGQVEDVESEYHKTLIKPPEEKEKISKEILNGKVPILLQAICETLK
ESTGNLTVGDKVTLADVVLIASIDHITDLDKEFLTGKYPEIHKHRKHLLA
TSPKLAKYLSERHATAF
>>GT2_DROME GLUTATHIONE S-TRANSFERASE 2 (EC 2.5.1.18).
; fa_initn: 124
; fa_init1: 124
; fa_opt: 164
; fa_z-score: 210.1
; fa_expect: 2.9e-05
; sw_score: 164
; sw_ident: 0.248
; sw_overlap: 251
>GT8.7  ..
; sq_len: 217
; sq_type: p
; al_start: 4
; al_stop: 198
; al_display_start: 1
---------------------------PMILGYWNVRGLTHPIRMLLEYT
DSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNLPYL-IDGSHKITQS
NAILRYLARKHHLDGETEEERIRADIVENQVMDTRMQLIMLCYNPDFEKQ
KPEFLKTIPEKMKLYSEFLGKR-----PWFAGDKVTYVDFLAYDILDQYR
-MFEPKCLDAFPNLRDFLARFEGLKKISAYMKSSRYIATPIFSKMAHWSN
K
>GT2_DROME ..
; sq_len: 247
; sq_type: p
; al_start: 52
; al_stop: 240
; al_display_start: 22
PPAEGAEGAVEGGEAAPPAEPAEPIKHSYTLFYFNVKALPSPC------A
TCSDGNQEYE--DVAHPRRVPALKPTMPMG----QMPVLEVDGK-RVHQS
ISMARFLAKTVGLCGATPWEDLQIDIVVDTINDFRLKIAVVSYEPEDEIK
EKKLVTLNAEVIPFYLEKLEQTVKDNDGHLALGKLTWADVYFAGITDYMN
YMVKRDLLEPYPAVRGVVDAVNALEPIKAWIEKRPVTEV


Note that the parseable output starts with ">>>" and that each
alignment record starts with ">>" while each aligned sequence record
starts with ">"

All parameters produced by the fasta package will be of the form:

	; xx_yyyyy

In this version, we have xx:

	pg - program parameters (name, version, matrix)
	fa - fasta scores, expect values, etc.
	sw - Smith-Waterman scores, expect values.
	sq - sequence length, type
	al - alignment start, stop, display_offset
	
Other FASTA distributors may choose to add additional fields.  If they
do, they should use a tag with more than two characters, e.g.:

	ebi_access: 
or
	gcg_?????

The FASTA tags will be limited to two characters followed by a "_".

All of the output parameters correspond to values that are presented
in other FASTA output formats, with the exception of the "al_"
parameters.

al_start gives the location of the alignment start in the
	original sequence

al_stop gives the location of the end of the alignment in the
	original sequence

al_display_start 
	gives the location of the first displayed amino acid residue
	in the original sequence.  The -m 10 alignments are the same
	as those produced in the other modes. In particular,
	FASTA/SSEARCH provide some context for the alignment; if the
	"-a" option is not used, FASTA/SSEARCH will try to provide
	about 30 residues on either side of the actual local
	alignment, if alignment is in the middle of one or the other
	sequence.  If the begining of the query sequence aligns with
	the 10'th residue of the library sequence, then the query
	sequence will be padded with ten leading "-" to produce the
	alignment.  The leading '-' are a formatting convenience only;
	they are not considered in the numbering system for
	al_display_start, al_start, or al_stop.
	
	Thus:
	    
	>GT8.7  ..
	; sq_len: 217
	; sq_type: p
	; al_start: 3
	; al_stop: 180
	; al_display_start: 1
	---PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLN
	EKFKLGLDFPNLPYLIDGSHKITQSNAILRYLARKHH---LDGETEEERI
	RADIVENQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKR
	PWFAGDKVTYVDFLAYDILDQYRMFEPKCLDA------FPNLRDFLARFE
	GLKKISAYMKSSRYIATPIFSKMAHWSNK
	>ARP2_TOBAC ..
	; sq_len: 223
	; sq_type: p
	; al_start: 6
	; al_stop: 181
	; al_display_start: 1
	MAEVKLLGFW-YSPFSHRVEWALKIKGVKYE---YIEEDRD--NKSSLLL
	QSNPV---YKKVPVLIHNGKPIVESMIILEYIDETFEGPSILPKDPYDRA
	LARFWAKFLDDKVAAVVNTFFRKGEEQEKGK--EEVYEMLKVLDNELKDK
	KFFAGDKFGFADIAANLVGFWLGVFEEGYGDVLVKSEKFPNFSKWRDEYI
	NCSQVNESLPPRDELLAFFRARFQAVVASRSAPK

 	Says that to align the two sequences, the first 'P' of GT8.7 must
	line up with the first 'V' (residue 4) in ARP2_TOBAC but that
	the actual best local alignment starts with the first 'I' in
	GT8.7 and the first 'L' in ARP2_TOBAC.

================================================================




More information about the Bio-soft mailing list