FASTA format - proposed max line limit

mathog at seqaxp.bio.caltech.edu
Wed Nov 4 20:03:17 EST 1998


We all know that the FASTA format is a bit restrictive in that there is
only the one line for comments, but can the software/database community
*please* agree on some reasonable maximum line length for both the comments
and the sequence?  The basic FASTA format is defined by example in the
FASTA2 distribution file "format.doc".  The format is not defined there
with very many (any?) limitations, so many FASTA files these days have
comment lines in excess of 500 characters (UniGene, for instance).  Others
put huge sequences on a single line.

The original FASTA examples, with shorter lines, could be easily viewed
with text editors and other similar tools, but the overgrown variants
cause all sorts of problems.  This is very similar to the problem one sees
with PostScript, where the standard clearly states that lines can only be
so long - and a lot of software merrily ignores that standard, and so
breaks on various printers. 

Maybe WRP has already written something like this, but here is a more
restrictive standard I propose, let's call it FASTA-1998, version 0.1, so
that it has a specific name.  It isn't very long - and it would not be at
_all_ difficult for the lot of us to stay within it.  (Hint, hint). 

**********************************************************************

FASTA-1998 0.1 defines two formats:

  Sequence:  for DNA and Protein sequences
  Reference: for any other information  (optional)

Common characteristics of both file types:

C1.  The FASTA-1998 format is a TEXT format, not a BINARY one.
C1.a   Programs which write FASTA-1998 files may do so in the native
        text mode of the operating system where they operate.
C1.b   Programs which read FASTA-1998 files will consider a cluster of
        any of the following ASCII symbols to be a SINGLE end of line:
        null ('\0'), LineFeed, CarriageReturn.
C1.c   No carriage control or "special characters" may exist in the file,
        other than those included in the end of line indicator. Tabs
        are specifically excluded.
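
As a rough illustration of C1.b and C1.c, here is a minimal Python sketch of
a reader-side line splitter.  The function names are made up for this
example, and reading "special characters" as "ASCII control characters" is
an assumption, since the draft does not spell the set out.

```python
import re

# C1.b: any run of NUL, LF, or CR counts as a single end of line.
_EOL_CLUSTER = re.compile(r"[\x00\n\r]+")


def split_lines(text):
    """Split text into lines, treating each NUL/LF/CR cluster as one break.

    Because a whole cluster is one end of line, empty lines cannot occur,
    so empty pieces from a leading or trailing cluster are dropped.
    """
    return [piece for piece in _EOL_CLUSTER.split(text) if piece]


def has_forbidden_chars(line):
    """True if the line holds a tab or another ASCII control character.

    Assumption: C1.c's "special characters" means ASCII codes below 32
    plus DEL (127).
    """
    return any(ord(ch) < 32 or ord(ch) == 127 for ch in line)
```

One consequence of treating a cluster as a single end of line is that Unix,
DOS, and old Mac text files all split the same way, and blank lines simply
vanish.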

The FASTA-1998 SEQUENCE format is:

S1.  Each sequence entry in a FASTA-1998 file will begin with a comment line.
S1.a   This line will begin with the ">" character.
S1.b   This line will not exceed 80 characters in length, not counting any
        end of line characters.
S1.c   Immediately following the ">" there will be an identifying name
        consisting of a combination of the alphanumeric symbols, plus the
        dash and underscore, but no other punctuation characters.
        All identifying names in a FASTA-1998 file must be unique within
         that file.
        The case of any letters in the name may be either upper, lower,
         or a combination of the two, but two identifying names in a
         FASTA-1998 file may not differ solely by case.
S1.d   Immediately following the identifier there may be either
        an end of line or a " " (space) character.  The latter may be followed
        by any text, up to the end of the comment line. 
S1.e   DNA and Protein alphabet specified here.
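
Rules S1.a through S1.d could be checked along these lines.  This is only a
sketch: the function names and the regular expression are my own reading of
the rules, and the alphabet check for S1.e is left out because the draft
does not specify it yet.

```python
import re

MAX_LINE = 80  # S1.b: the limit excludes end of line characters

# S1.a, S1.c, S1.d: ">" then a name made of letters, digits, "-" or "_",
# then either nothing or one space followed by free text.
_HEADER = re.compile(r">([A-Za-z0-9_-]+)(?: (.*))?")


def parse_header(line):
    """Return (name, comment) for a valid comment line, else raise ValueError."""
    if len(line) > MAX_LINE:
        raise ValueError("comment line longer than 80 characters")
    match = _HEADER.fullmatch(line)
    if match is None:
        raise ValueError("malformed comment line: %r" % line)
    return match.group(1), match.group(2) or ""


def find_duplicate_names(names):
    """Return names that repeat an earlier name when case is ignored (S1.c)."""
    seen = set()
    duplicates = []
    for name in names:
        key = name.lower()
        if key in seen:
            duplicates.append(name)
        seen.add(key)
    return duplicates
```

With this pattern a line such as ">seq_1 human actin" parses into the name
"seq_1" and the comment "human actin", while a name containing "|" is
rejected.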

S2.  Each entry will have one or more sequence lines.
S2.a   Sequence lines may not exceed 80 characters in length, excluding the
        end of line characters. 
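
For programs that write files, staying inside S2.a is mostly a matter of
slicing the sequence before printing it.  A small sketch, with a made-up
function name:

```python
def wrap_sequence(sequence, width=80):
    """Break one long sequence string into lines of at most `width` characters."""
    if not sequence:
        # S2 asks for one or more sequence lines, so an empty sequence
        # cannot be written.
        raise ValueError("an entry needs at least one sequence line")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]
```

Joining the returned lines back together gives the original sequence, so
nothing is lost by wrapping.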

The FASTA-1998 REFERENCE format is very similar to the SEQUENCE format.

R1. The reference file will hold information that didn't fit
      inside the Sequence file.

R1.a  The comment line for each entry in the reference file must
        contain the ">" followed by the identifier, but no other information.
R1.b  If the reference file exists, it must contain the same number
        of entries, with the same identifiers, and in the same order,
        as appeared in the sequence file.
R1.c  The data lines in the reference file may be up to 80 characters long,
        not including the end of line characters.  They may not
        include special characters or carriage control information.
R1.d  Each entry may contain zero or more data lines.
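
A consistency check between the two files might look roughly like the
sketch below.  It works on lists of lines that have already been split, the
function names are invented for the example, and it assumes that every line
starting with ">" is a comment line, which the draft does not state for
data lines in the reference file.

```python
def entry_names(lines):
    """Identifiers taken from the '>' lines, in file order."""
    return [line[1:].split(" ", 1)[0] for line in lines if line.startswith(">")]


def check_reference(sequence_lines, reference_lines):
    """Return a list of problems found in the reference file (empty if none)."""
    problems = []
    for line in reference_lines:
        if line.startswith(">"):
            # R1.a: only ">" and the identifier are allowed on this line.
            if " " in line:
                problems.append("extra text on comment line: " + line)
        elif len(line) > 80:
            # R1.c: data lines may be up to 80 characters long.
            problems.append("data line longer than 80 characters")
    # R1.b: same number of entries, same identifiers, same order.
    if entry_names(sequence_lines) != entry_names(reference_lines):
        problems.append("identifiers differ in number, name, or order")
    return problems
```

An entry with no data lines at all passes, as R1.d allows.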


**********************************************************************

That isn't so terrible, is it?

S1.c bears special mention.  Entry names tend to turn into file names,
and if they contain odd combinations of characters they can be either
hard to deal with, or just plain illegal, depending on the OS.  For
instance: "gi|R34734987" can get you into trouble on Unix, and it's
just plain illegal on VMS.

Regards,

David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 



