FASTA format - proposed max line limit

mathog at seqaxp.bio.caltech.edu
Fri Nov 13 12:18:07 EST 1998


In article <72e3hu$p86$1 at news.fas.harvard.edu>, "tendo" <tendo at nucleus.harvard.edu> writes:
>Nice proposal!
>But why don't you allow comment lines other than identifier line? I mean,
>semicolon can be another character to indicate a comment line.  In fact, the
>FASTA programs seem to accept semicolon-starting line as a comment line.
>This way is nice because 1) you can put as many comments as you want in the
>same place as the sequence data,  2) you can separate the identifier line from other
>comments, and 3) comments can be very easily removed without removing
>identifier using grep command.   The third feature is actually important
>because those lines should be easily removed in case they are problems -
>actually, BLAST and CLUSTAL W don't seem to treat them as comments.
>A separate reference file is a good idea in this sense, but it is often messy
>to handle reference separately.

The primary reason:  it is incompatible with many current programs - you
even cited some yourself. The primary goal of the proposed standardization
is to make sure that all fasta datasets will work with all programs which
claim to read FASTA format.

I'm so used to having the reference information in a separate file (GCG
databases), that I don't view it as an inconvenience.  Rather, I want
to separate the messy problem of parsing the reference information,
which can be in ASN.1, Genbank, etc, or even just free comments, from the
much simpler problem of parsing the DNA/Protein sequences.  The proposed
(new) reference FASTA format contains just enough information (the unique
identifier, common between sequence and reference) to allow programs
to match up the two pieces of information - so that database generators
can safely put the (often) unformatted text that has been going onto
the ">" lines into a safer place, and the end users can still find it, even
if they are only using "more" or Microsoft Word; i.e., they need only search
for "{NEWLINE}>identifier{ENDofLINE}".  If the people writing the
database show a bit of restraint in their reference files, 
">identifier" alone would suffice. 
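
A minimal sketch in C of that search, assuming the reference file has
already been read into memory as one string. The name find_entry is
hypothetical and not part of the proposal. Accepting a space after the
identifier, as well as an end of line, is an assumption borrowed from the
">" line pattern used for the sequence file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the start of the line
 * ">id" in text, where the identifier is followed by a space, an
 * end of line, or the end of the buffer. Returns NULL if not found. */
const char *find_entry(const char *text, const char *id)
{
    assert(text != NULL && id != NULL);
    size_t idlen = strlen(id);
    const char *line = text;

    while (*line != '\0') {
        if (line[0] == '>' && strncmp(line + 1, id, idlen) == 0) {
            char next = line[1 + idlen];
            if (next == ' ' || next == '\n' || next == '\r' || next == '\0')
                return line;
        }
        /* advance to the start of the next line */
        const char *nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return NULL;
}
```

Checking the character after the identifier is what keeps a search for
"seq1" from stopping at ">seq10".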

>
>
>I have one more proposal - there should be a BLANK LINE AFTER EACH SEQUENCE.
>This makes it easy to search a database by keyword with Perl code -

In the given standard, this sequence occurs once per entry:

{Beginning of line} {'>'} {identifier string} { ' ' OR EOL}

Adding the blank line doesn't get you very much; it just changes this
sequence to:

(first line, as before)
{Beginning of line} {'>'} {identifier string} {' ' OR EOL}

(all subsequent lines)
{blank line} {Beginning of line} {'>'} {identifier string} {' ' OR EOL}
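
Either way, pulling the identifier out of a ">" line needs only the first
pattern. A rough sketch in C, where fasta_identifier is a hypothetical
name and rejecting an empty identifier is my assumption rather than
something the proposal states:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the identifier from a ">" line into out.
 * Returns the identifier length, or -1 if the line is not a header,
 * the identifier is empty, or it does not fit in out. */
int fasta_identifier(const char *line, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL);
    if (line[0] != '>')
        return -1;
    /* the identifier runs up to the first space or end of line */
    size_t len = strcspn(line + 1, " \r\n");
    if (len == 0 || len >= outsz)
        return -1;
    memcpy(out, line + 1, len);
    out[len] = '\0';
    return (int)len;
}
```

No blank line is consulted anywhere; the '>' in column one is enough to
recognize the start of an entry.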

<SNIP>

>Only 5 lines of Perl code are enough for a search!  It also makes coding
>easier in C, I think.

Blank lines and extra spaces should be ignored (that would fall under the
alphabet business, which I didn't specify.)  They should not be required to
convey any particular meaning. For sure they don't make reading a FASTA
file in C any easier: 

#include <stdio.h>

int main(void){
  char buffer[80];           /* conformant FASTA files only */
  while(fgets(buffer, sizeof buffer, stdin)){ /* bounded read; gets() blows up on overlong lines */
     if(buffer[0] == '>'){  /* uniquely identifies start of an entry */
     }
     else {                 /* remainder of an entry */
     }
  }
  return 0;
}
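
Ignoring blank lines and extra spaces is also cheap. A minimal sketch,
assuming that "ignore" means dropping every whitespace character from a
sequence line before it is used (strip_whitespace is a hypothetical name):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: remove all whitespace from s in place and
 * return the new length. A blank line comes back with length 0. */
size_t strip_whitespace(char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (!isspace((unsigned char)s[i]))
            s[n++] = s[i];   /* keep only non-whitespace characters */
    }
    s[n] = '\0';
    return n;
}
```

A caller would apply this to each non-'>' line and skip the line when the
returned length is 0.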

> (I don't know about fortran...but who cares?)

Let's not go there.

Thanks for the feedback,

David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 




More information about the Bio-soft mailing list