FASTA format - proposed max line limit

Peter Rice pmr at sanger.ac.uk
Fri Nov 6 05:20:43 EST 1998


mathog at seqaxp.bio.caltech.edu writes:

> We all know that the FASTA format is a bit restrictive in that there is
> only the one line for comments, but can the software/database community
> *please* agree on some reasonable maximum line length for both the comments
> and the sequence?

I would welcome a standard "unique identifier" format after the ">".

We use FASTA format extensively at the Sanger Centre, but we need to
hold both an identifier and an accession number. In specific cases
we also need a database name. Often other information (typically
numeric) is used to generate several unique forms from one original
name.

Curiously, one reason for the expansion of FASTA format is BLAST, as
it takes as its database a file of many FASTA format sequences which
need to have unique identifiers.

One option to get extra identifier information is to use the NCBI
style with "|" characters to split the fields. Sometimes this style
also seems to carry special information in the first word(s) of the
description, for example in dbEST.weekly.FASTA:

   >gi|1622446|dbj|C21336|C21336 HUMGS0003372, Human Gene Signature, \
	3'-directed cDNA sequence

(actually this is followed by "ctrl-A" and more description - see
"other horrors" below)
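
A minimal sketch of pulling such a header apart, in Python. The
function name is hypothetical, and it assumes the "|" fields alternate
tag, value, tag, value, with any unpaired final field (the second
C21336 above) kept separately. That is a reading of the example, not a
statement of NCBI's rules.

```python
def parse_ncbi_defline(line):
    """Split an NCBI-style header into (tag, value) pairs, a trailing
    unpaired field (or None), and the free-text description."""
    header = line.lstrip(">").rstrip("\r\n")
    # The identifier is everything up to the first space.
    ident, _, description = header.partition(" ")
    fields = ident.split("|")
    # Assumption: fields come in tag/value pairs.
    pairs = list(zip(fields[0::2], fields[1::2]))
    # An odd number of fields leaves one value with no tag of its own.
    trailing = fields[-1] if len(fields) % 2 else None
    return pairs, trailing, description


pairs, trailing, description = parse_ncbi_defline(
    ">gi|1622446|dbj|C21336|C21336 HUMGS0003372, Human Gene Signature"
)
print(pairs, trailing, description)
```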

Another, since we have some FASTA files generated by GCG, is the GCG
syntax of:

   >DB:entryname accnum yet...more...description
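
A comparable sketch for the GCG-style header. The function name and
the treatment of a missing "DB:" prefix are assumptions; only the
layout shown above is taken as given.

```python
def parse_gcg_header(line):
    """Return (db, entryname, accnum, description) from a header of the
    form '>DB:entryname accnum description'."""
    header = line[1:] if line.startswith(">") else line
    parts = header.strip().split(None, 2)
    if not parts:
        raise ValueError("empty FASTA header")
    db, sep, entry = parts[0].partition(":")
    if not sep:
        # Assumption: no ':' means no database name was given.
        db, entry = None, parts[0]
    accnum = parts[1] if len(parts) > 1 else None
    description = parts[2] if len(parts) > 2 else ""
    return db, entry, accnum, description
```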

We generate FASTA files from our unfinished sequence data where the
unique name is built from the clone and contig, using "." as a
delimiter, for example:

   >bK109G6.05061 Unfinished sequence: bK109G6  Contig_ID: 05061  \
	acc=AL023879  Length: 25298 bp 
   >bK109G6.05234 Unfinished sequence: bK109G6  Contig_ID: 05234  \
	acc=AL023879  Length: 129756 bp 
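
Headers like these can be unpacked with a few lines of Python. This is
only a sketch under assumptions: the function name is made up, the
header is taken to be on one line (the "\" above is just a display
continuation), and the name is split on its last ".".

```python
import re


def parse_unfinished_header(line):
    """Extract clone, contig, accession and length from a header such as
    '>bK109G6.05061 ... acc=AL023879  Length: 25298 bp'."""
    header = line.lstrip(">").strip()
    ident, _, rest = header.partition(" ")
    # Split on the last '.', so a clone name containing '.' survives.
    clone, sep, contig = ident.rpartition(".")
    if not sep:
        clone, contig = ident, None
    acc = re.search(r"\bacc=(\S+)", rest)
    length = re.search(r"\bLength:\s*(\d+)\s*bp", rest)
    return {
        "id": ident,
        "clone": clone,
        "contig": contig,  # kept as text to preserve the leading zero
        "acc": acc.group(1) if acc else None,
        "length": int(length.group(1)) if length else None,
    }
```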

I have seen various other styles of identifier to represent
subsequences with a unique name, typically needed in protein
domain databases, for example:

   >entryname-start-end  (e.g. SBASE)
   >entryname/start-end  (horrible for generating filenames)
   >entryname\start-end  (the "\" still causes confusion with filenames)
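
A single regular expression can accept all three styles, and the
awkward delimiters can be swapped out before a name is used as a
filename. A sketch in Python; the function names and the choice of "_"
as a replacement are assumptions, not an existing convention.

```python
import re

# entry name, then one of '-', '/' or '\', then start-end
_SUBSEQ = re.compile(r"^(?P<entry>.+?)[-/\\](?P<start>\d+)-(?P<end>\d+)$")


def parse_subsequence_id(ident):
    """Return (entryname, start, end) for a subsequence identifier."""
    m = _SUBSEQ.match(ident)
    if m is None:
        raise ValueError("not a subsequence identifier: %r" % ident)
    return m.group("entry"), int(m.group("start")), int(m.group("end"))


def filename_safe(ident):
    """Replace the path separators '/' and '\\' with '_'."""
    return re.sub(r"[/\\]", "_", ident)
```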

Other horrors:

Using control characters to fake extra lines in the description,
for example ctrl-A, which appears in NCBI's dbEST.weekly.FASTA files.
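
A reader that wants the pieces back can split on the control
character. A small Python sketch; the function name is hypothetical,
and it assumes ctrl-A means the single byte 0x01.

```python
CTRL_A = "\x01"  # ASCII SOH, the character ctrl-A produces


def split_description_parts(header):
    """Return the ctrl-A separated pieces of one header line."""
    body = header.lstrip(">").rstrip("\r\n")
    # Drop empty pieces left by a leading or trailing ctrl-A.
    return [part.strip() for part in body.split(CTRL_A) if part.strip()]
```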

UniGene's "seq.all" file has clusters of FASTA format sequences
headed by comment lines starting with "#".

BLAST1.4 pressdb fails if the sequence lines are not all the same
length.
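
Both of the last two problems can be worked around by normalising a
file before use. The Python sketch below skips "#" lines on reading
and rewraps every sequence to one fixed width on writing. The function
names and the width of 60 are assumptions, and so is the idea that a
shorter final line per sequence is acceptable to pressdb.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, ignoring blank and '#' lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(records, width=60):
    """Return FASTA text with every sequence wrapped to `width` columns."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

For a file on disk, something like
format_fasta(read_fasta(open("seq.all"))) would produce the cleaned
text.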

An additional need is to have parseable information in the description
which can be used to mark up BLAST search results efficiently for a Web
service.

>The Fasta-1998 REFERENCE format is very similar to the SEQUENCE format.
>
>R1. The reference file will hold information that didn't fit
>      inside the Sequence file.
>
>R1.a  The comment line for each entry in the reference file must
>        contain the ">" followed by the identifier, but no other information.

A nice idea. I would certainly support this kind of format for EMBOSS.

It is of course closely related to NBRF format, and its derivative
GCG database format(s).

File naming could be a problem - the right FASTA REFERENCE file has to
be associated with a FASTA SEQUENCE file. A ".ref" extension would help,
but the sequence file may itself have various (or no) file extensions.
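
One way around the extension problem would be to append ".ref" to the
whole sequence filename rather than replace its extension. A Python
sketch of that naming rule; both the rule and the function name are
only a suggestion, not part of the proposal.

```python
from pathlib import Path


def reference_path(sequence_path):
    """Return the REFERENCE file path for a SEQUENCE file by appending
    '.ref' to the full name, whatever extension (if any) it has."""
    p = Path(sequence_path)
    return p.with_name(p.name + ".ref")


print(reference_path("db/est.fasta"))
print(reference_path("db/seqs"))
```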


-- 
----------------------------------------------------------------------
Peter Rice                | Informatics Division, The Sanger Centre,
E-mail: pmr at sanger.ac.uk  | Wellcome Trust Genome Campus,
Tel: (44) 1223 494967     | Hinxton, Cambridge, CB10 1SA, England
Fax: (44) 1223 494919     | URL: http://www.sanger.ac.uk/Users/pmr/



