FASTA format - proposed max line limit
dalke at bioreason.com
Thu Dec 10 09:18:41 EST 1998
Simon R Tomlinson <plxsrt at pln1.nott.ac.uk> said:
> Programs that stuff a lot of details into the header are bound to
> make the format more complex. I would suggest that if the format
> becomes more complex then it'll become less popular.
I agree, and I don't think anyone else here argues against that
viewpoint. While there's been no consensus, I've been arguing
that XML should be used as the basis for a more generalized format.
At the very least, it allows Unicode so we can write names properly
with umlauts, tildes, accents, etc.
> In an extreme case you could take the header from any sequence record
> and stuff it onto a single line. But surely this is just reinventing
> an old format without the end-of-line breaks?!
> A lot of programs that I use will truncate the long header line anyway
> (eg clustalw) so you lose the long header details.
The problem comes with intermediate uses of the comment field,
especially when software creates those intermediate forms
automatically. "A lot of programs" will use long header lines.
Consider NCBI's merged FASTA records, which combine duplicate records
into one record. The headers are joined into one line, separated by
control-A (ASCII 1) characters. For example, on one line,
| >gi|1469284 (U05042) afuC gene product [Actinobacillus
| pleuropneumoniae]^Agi|1477453 (U04954) afuC gene product
| [Actinobacillus pleuropneumoniae]
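
As a rough illustration of what software has to do with such a record, here is a minimal Python sketch that splits a merged header back into its component headers on the control-A character. The function name split_merged_header is made up for this example; it is not part of any NCBI tool.

```python
def split_merged_header(line):
    """Split a merged FASTA header line into its component headers.

    Assumes the components are joined by control-A (ASCII 1), as in the
    NCBI example above. The function name is hypothetical.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Drop the leading ">" and any trailing newline, then split on ASCII 1.
    return line[1:].rstrip("\r\n").split("\x01")


merged = (
    ">gi|1469284 (U05042) afuC gene product [Actinobacillus pleuropneumoniae]"
    "\x01gi|1477453 (U04954) afuC gene product [Actinobacillus pleuropneumoniae]"
)
parts = split_merged_header(merged)
for part in parts:
    print(part)
```

Each element of parts is one of the original, unmerged header texts.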
> To avoid this I usually truncate the header myself to give the
> record a unique identifier. I usually use the accession number
> or sequence name.
The problem is that *you* know which information should be truncated,
since you know where the data will be used, but it's quite hard for
software to do that. Even in this example, should it be truncated to
the first record identifier, to all record identifiers (note: if there
are more than about 8 records, this will also exceed 80-character
lines, and there are many cases, esp. with records from PDB entries,
where this can happen):
| >gi|1469284 gi|1477453
or to the first 75 characters (an arbitrary truncation of the text)?
Some people just need a single identifier, some need all of them
(eg, to answer the question "how many duplicates of this sequence
exist?") and some want to see them all (eg, to look at the results of
NCBI BLAST searches).
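
To make the three choices concrete, here is a small Python sketch of each truncation policy applied to a control-A merged header. The function names are invented for this example, and it assumes the identifier is simply the first whitespace-delimited token of each merged part.

```python
SEP = "\x01"  # control-A, the separator used in merged headers


def identifiers(header):
    """Return the first token of every merged part (assumed to be the id)."""
    body = header.lstrip(">").rstrip("\r\n")
    return [part.split(None, 1)[0] for part in body.split(SEP) if part.strip()]


def first_identifier(header):
    """Policy 1: keep only the first record identifier."""
    return ">" + identifiers(header)[0]


def all_identifiers(header):
    """Policy 2: keep every record identifier, dropping the descriptions."""
    return ">" + " ".join(identifiers(header))


def first_chars(header, width=75):
    """Policy 3: cut the text at an arbitrary column."""
    return header.rstrip("\r\n")[:width]


merged = (
    ">gi|1469284 (U05042) afuC gene product [Actinobacillus pleuropneumoniae]"
    "\x01gi|1477453 (U04954) afuC gene product [Actinobacillus pleuropneumoniae]"
)
print(first_identifier(merged))
print(all_identifiers(merged))
print(first_chars(merged))
```

Each policy throws away something that one of the users above needs, and the code has no way to tell which one is appropriate for a given file.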
Since I cannot think of a good way to truncate comment lines generated
by programs that produce more than 80 characters, I am convinced that
truncation is not the solution: it will render some existing programs
useless and some data files uninterpretable. My proposal of using
multiple lines will also render some programs useless, but all existing
files can be easily translated to that format with no loss of content,
the file format is only slightly more complicated than the existing
FASTA format, and it is possible to convert any record to a form where
all lines are <80 columns, if so desired.
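
The no-loss claim can be sketched in Python. This message does not spell out how continuation lines are marked, so the code below assumes a purely hypothetical convention in which every header line, including continuations, starts with ">". Treat it as an illustration of the round trip, not as the proposed format itself.

```python
def wrap_header(header, width=79):
    """Break one long header line into lines of at most `width` columns.

    ASSUMPTION: continuation lines repeat the leading ">" marker. This
    convention is invented for the example.
    """
    body = header[1:]
    step = width - 1  # leave room for the ">" on every line
    chunks = [body[i:i + step] for i in range(0, len(body), step)] or [""]
    return [">" + chunk for chunk in chunks]


def unwrap_header(lines):
    """Rebuild the original single-line header from the wrapped lines."""
    return ">" + "".join(line[1:] for line in lines)


merged = (
    ">gi|1469284 (U05042) afuC gene product [Actinobacillus pleuropneumoniae]"
    "\x01gi|1477453 (U04954) afuC gene product [Actinobacillus pleuropneumoniae]"
)
lines = wrap_header(merged)
restored = unwrap_header(lines)
```

Because the text is cut at fixed offsets and rejoined by plain concatenation, nothing is dropped or added in either direction, which is the property truncation cannot offer.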
dalke at bioreason.com