FASTA format - proposed max line limit

tendo tendo at fas.harvard.edu
Fri Nov 27 11:27:37 EST 1998


mathog at seqaxp.bio.caltech.edu wrote in message
<73eok6$2oo at gap.cco.caltech.edu>...

>Sitting back and taking notes is not losing interest - there just wasn't
>any need to reply to every reply.

I see.  I apologize for the misunderstanding.


>The sense that it is readable with any text editor or word processor
>on any system, and will look approximately the same on all of them.


Exactly.  That means no special programs are required.



>>dealing with two separate file make it more complex to code for those who
>>need reference information.
>
>Agree and disagree.  It depends upon the amount of reference information
>that is involved.  If there is very little information, then 80 characters
>is enough, if there is a lot of information, then it should be in ASN.1,


This is taking the argument to an extreme, isn't it?
A limit of 80 characters accomplishes nothing except making the line easier
to handle in a program, and that is not much of a benefit for a new
standard.  I will come back to this below.
The ASN.1 format might be nice, because it can hold a lot of data, but it is
supported only by NCBI tools, and it is a little too complicated.

The point is that people just want to keep information with the sequence.
Placing the information in a separate file forces users to use, or to write,
another program just to view the information attached to a sequence.  If only
one or two entries are needed, no one would care, but in many cases more
than a couple of sequences and their information are needed.  This is my
concern.


>Genbank, or some other standard (and machine parseable) format.  There may
>be some application around that routinely accesses Genbank in its raw
>distribution, but I've never used it.  Instead these sorts of databases are
>always (?) processed into some local database format, and accessed from
>there.   It's only in the grey area, roughly 80-1000 characters of
>reference information, where there is a lot of disagreement, and a lack of
>standardization.  It's also in this grey area where the transition from
>using the data raw (as a fasta file) to preprocessing (genbank) occurs.

There is already a very good standard: only one comment line, starting with
the '>' character, is allowed for each sequence.
If the lengths really are between 80 and 1000 characters, all you need is a
1002-byte buffer for reading, or 2000 bytes to be safe.
That is no problem at all for any recent computer, is it?
So simply standardizing on a 1000-character limit (or whatever is
appropriate) makes more sense, since any currently available FASTA format
database can satisfy such a standard.
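
As a rough illustration of the buffer sizing above, here is a minimal
sketch in C.  The names MAX_HEADER, HEADER_BUF and read_header() are
made up for this example, and the 1000-character limit (counted here
including the leading '>') is only the figure under discussion, not an
agreed standard.

```c
#include <stdio.h>
#include <string.h>

#define MAX_HEADER 1000              /* hypothetical limit from this thread */
#define HEADER_BUF (MAX_HEADER + 2)  /* + '\n' + '\0' = 1002 bytes */

/* Read one line into buf, which must hold HEADER_BUF bytes.
 * Returns  1 if a complete '>' comment line was read (newline stripped),
 *          0 if the line is not a comment line or is over the limit,
 *         -1 at end of file or on a read error. */
int read_header(FILE *fp, char *buf)
{
    size_t len;

    if (fgets(buf, HEADER_BUF, fp) == NULL)
        return -1;
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        buf[--len] = '\0';
    else if (!feof(fp))
        return 0;   /* buffer filled before a newline: line is too long */
    return buf[0] == '>';
}
```

Because fgets() never writes more than HEADER_BUF bytes, an overlong
line is reported instead of overrunning the buffer, which gets() cannot
guarantee.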


As I understand it, David's proposal focuses mainly on easy handling of
FASTA format data in programs and on compatibility with available programs.
It also keeps the information that corresponds to each sequence accessible.

Just by raising the line length limit, his proposal would meet all the
requirements in your statement except for the separation of comment lines,
which does not seem to mean much when you can put enough information with
the sequence itself - again, as I understand it.  Separating the comment
lines is one way to make the database neater, but no program currently
supports that format, and it puts more of a burden on those who have to work
with non-sequence information as well as sequences, whether they are
programming or not.

NCBI's use of '^A' to separate entries within a single comment line makes
the line ugly, although the technique is usable for now.
Introducing a comment character in the new sequence format is a simple
solution that keeps the file readable for both humans and machines.
If the sequence file does not need to be human readable, there should be
only two lines per entry - the comment and the sequence, on one line each.
Then the comment length causes no problem, because the end of the sequence
is easy to detect, so you can use getchar() instead of gets().  But I don't
think this is a good solution, for a couple of reasons.
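
To make the getchar() point concrete, here is a minimal sketch of
reading a line of any length one character at a time.  It uses getc()
on a FILE pointer, which is the same idea as getchar() on standard
input.  The name read_line() and the starting size of 128 bytes are
arbitrary choices for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one line of any length, one character at a time.
 * Returns a malloc'd string without the newline (the caller frees it),
 * or NULL at end of file or if memory runs out. */
char *read_line(FILE *fp)
{
    size_t cap = 128, len = 0;
    char *line, *tmp;
    int c;

    if ((line = malloc(cap)) == NULL)
        return NULL;
    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 >= cap) {        /* keep room for the final '\0' */
            cap *= 2;
            if ((tmp = realloc(line, cap)) == NULL) {
                free(line);
                return NULL;
            }
            line = tmp;
        }
        line[len++] = (char)c;
    }
    if (c == EOF && len == 0) {      /* nothing left to read */
        free(line);
        return NULL;
    }
    line[len] = '\0';
    return line;
}
```

In the two-line layout, calling read_line() twice per entry would give
the comment and then the sequence, with no fixed limit on either.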

Andrew's proposal was nice ... but the biggest problem is that he could not
provide a library to handle the sequence data - which means that no one else
is likely to offer a free and well documented library either.

We should move on to the next step, one that is achievable.


Toshinori Endo
Harvard University Biological Laboratories





