FASTA format - proposed max line limit

Andrew Dalke dalke at bioreason.com
Wed Nov 18 23:07:13 EST 1998


Tendo <tendo at fas.harvard.edu>:
> What I really couldn't get is whether you meant we don't need
> new standard, or just need 80 chars limit for a single comment
> line.  Or something else?

What I mean is several things, and I'm afraid I didn't express
myself too well.  My final conclusion is that the different
uses of the FASTA format are too entrenched to be easily
supplanted, and that the proposed format doesn't address many
of the existing uses.  In other words, pretty much what you've
been saying.

Problem number 1:
 People use "the FASTA format" in many different ways.

   Some want to store all their information in the FASTA file.
Often it's because they don't want the huge hassle of
synchronizing multiple files.  For example, if you want to
do a lot of sequence analysis using unix pipes, you've only
got one input and one output stream to use.

In other cases it's a deliberate decision to make a "portable"
list of sequences.  For the latter case I point to NCBI which
uses the database|id cross references and ^A as a newline-like
separator.  This allows those data files to be used by FASTA
and by "smarter" tools.
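
A minimal sketch of how a "smarter" tool might take such a comment
line apart, assuming entries are separated by ^A (character code 1)
and each entry begins with a '|'-delimited identifier followed by
free text.  The function name and the sample line are made up for
illustration, and real NCBI identifiers have more variations than
this handles.

```python
def parse_defline(line):
    """Split one FASTA comment line into a list of entries.

    Assumes an NCBI-style layout: entries separated by ^A (chr(1)),
    each one starting with a '|'-delimited identifier token.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA comment line")
    entries = []
    for chunk in line[1:].rstrip("\n").split("\x01"):
        parts = chunk.strip().split(None, 1)
        if not parts:
            continue  # skip an empty entry
        # The identifier runs up to the first whitespace; the rest is free text.
        ident = parts[0]
        description = parts[1] if len(parts) > 1 else ""
        entries.append({"fields": ident.split("|"), "description": description})
    return entries


# Made-up example line with two entries joined by ^A.
example = ">gi|1000|sp|P00001|EXAMPLE first description\x01gi|2000|pir||S00001 second description\n"
for entry in parse_defline(example):
    print(entry["fields"], entry["description"])
```

A tool that only wants the first identifier can stop at the first
^A, which is what lets the same file serve both kinds of reader.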

   Some use it only for sequence analysis and store the
non-sequence data in a database or another file.  These people
only need the database id in the comment fields.

  Still others take advantage of being able to stick anything
(with suitable encoding) in the comment region.  I've not
seen it, but I can imagine someone putting HTML in the comment
to make it easy to turn the results into a web page.

Problem number 2:
  The format is easy to parse and read.  Why's this a problem?
Consider the BLAST2 file format, which is binary in nature.  No
one is arguing how that format should be changed.  It's designed
to be readable by machines but not people, so if you want to
read it you'll have to write a tool to convert it to a legible
form.
  Since it is so easy to parse it's hard to convince people to
use a more complex/regimented format, like Genbank, SwissProt or
PIR.  Parsers for those are a lot harder to write than "is the first
character a '>'?" but the formats let you store additional data about
the sequence.

  There are a couple of ways to fix this.  One is to write tools
to fold/unfold long comment lines.  For example, allow
multiple successive ">" lines to indicate continuation of the
same comment, with line wrap at 70-some-odd columns.  Some "FASTA"
file parsers allow this, some don't, so for those that don't there
needs to be a way to unfold these back to one line.  This is
a few lines of perl.
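
The unfolding step can be sketched as follows, in Python rather
than perl.  This is only an illustration under stated assumptions:
the name unfold_comments and the choice of a single space as the
joiner are not from any existing tool, and a record with an empty
sequence would be merged into the record that follows it.

```python
def unfold_comments(lines, joiner=" "):
    """Yield lines with each run of consecutive '>' lines merged into one."""
    pending = None  # the comment being accumulated, without its newline
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if pending is None:
                pending = line
            else:
                # Continuation line: drop its leading '>' and append the text.
                pending += joiner + line[1:].strip()
        else:
            if pending is not None:
                yield pending
                pending = None
            yield line
    if pending is not None:
        yield pending


folded = [">seq1 first part\n", ">of a long comment\n", "ACGT\n", ">seq2\n", "GGCC\n"]
for out in unfold_comments(folded):
    print(out)
```

Passing sys.stdin instead of the sample list turns this into a
filter that can sit in front of a parser that only accepts one
comment line per record.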

  Or, you can fix the viewing tools to understand long lines.
For example, configure "less" to pass .fasta files through "fold"
or some other filter before viewing.  Or fix EMACS to better
understand how to wrap lines.  In fact, I think that's a minor
mode already.

> This is a solution, but only for those who can write a filter
> or emacs-lisp.

Right, but it only has to be done once, documented, and
disseminated.  This is all part of the general problem that some
things don't scale well from small data sets to large ones.

Problem number 3:  (related to #2)
  There's a huge existing toolbase that understands FASTA files
but fewer of those understand other formats.  It's hard to rewrite
all those tools, though you can write input filter scripts for
some of them.  

Problem number 4:
  Writing code to read long lines is somewhat cumbersome in C.
You need to remember to call fgets repeatedly into a buffer,
and you need to realloc that buffer as the input grows.  However,
this is effectively a solved problem: the answer can be found in
many books and probably FAQs, or sidestepped by using a language
that allows for unlimited input line lengths, like C++, Python
or Perl.
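
For comparison, here is a minimal sketch of a FASTA reader in
Python, where iterating over a file returns whole lines whatever
their length.  The name read_fasta is hypothetical, and the sketch
assumes one ">" comment line per record and ignores anything before
the first ">".

```python
import io


def read_fasta(handle):
    """Yield (comment, sequence) pairs from an open text file."""
    comment = None
    chunks = []
    for line in handle:  # each line arrives whole, however long it is
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if comment is not None:
                yield comment, "".join(chunks)
            comment = line[1:]
            chunks = []
        elif comment is not None:
            chunks.append(line.strip())
    if comment is not None:
        yield comment, "".join(chunks)


# A comment line far longer than any fixed-size C buffer would expect.
data = io.StringIO(">id1 " + "x" * 100000 + "\nACGT\nTT\n>id2\nGG\n")
for comment, sequence in read_fasta(data):
    print(len(comment), sequence)
```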

> Good for professional or well trained programmers.
> Beginner programmers who want to write a program to handle
> sequence will find it hard to write the reading code, though.

Perhaps I should put this as Problem number 5.  There's no
reason for beginning programmers to write parsers.  There really
should be libraries to read the different formats.  The Bioperl
people have made some headway in making it easy for the "beginner
programmer" to get access to the data, but I've my own beliefs
that perl isn't the best language for a beginning programmer.
(I'm a big Python fan.)

C/C++ libraries are not really useful to the beginner programmer,
and are somewhat hard to do well in the first place.  Now with
C++ STL and string classes this should be easier, but still not
as easy as I would like.

Really you would like something like:

  from SeqIO import PIR   # SeqIO is a wished-for library, not an existing one
  infile = PIR.open("inputfile.pir")
  while True:
    rec = infile.read_record()
    if rec is None:   # no more records
      break
    seq = rec.sequence()
    print("The sequence is", seq)

or even autodetection of the sequence format, but that's hard.
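
To give a feel for why autodetection is hard, here is a naive
sketch that guesses the format from the first line alone.  The
function name and the returned labels are made up, and the rules
are deliberately simplistic: EMBL entries also begin with an "ID"
line, and a FASTA identifier that happens to have ';' as its third
character would be mistaken for PIR.

```python
def guess_format(first_line):
    """Guess a sequence file format from its first non-blank line.

    Returns a short format name, or None if nothing matches.
    """
    if first_line.startswith(">"):
        # PIR/NBRF entries start with '>', a two-character type code, then ';'
        if len(first_line) > 3 and first_line[3] == ";":
            return "pir"
        return "fasta"
    if first_line.startswith("LOCUS"):
        return "genbank"
    if first_line.startswith("ID "):
        return "swissprot"  # ambiguous: EMBL uses the same line tag
    return None


for line in [">P1;EXAMPLE", ">seq1 some comment", "LOCUS       EXAMPLE", "ID   EXAMPLE_ID"]:
    print(repr(line), "->", guess_format(line))
```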

The problem here comes down to:
  writing a good set of libraries
  documenting them
  making them available
and the hard part, getting other people to know they exist. 
At present this is VERY HARD.  Many people in this field, esp.
beginning programmers, not only don't know where to go to find
libraries but don't even know they exist.
  I've also my own theory about one of the problems with computer
classes.  They teach you to do too much on your own and not to seek
existing solutions.  Using someone else's code is often called
cheating, and that induces a tendency to reimplement instead of reuse.

> Does [Glimpse] use hashed index or something?

  Yes.  You run it once over the data set to index the words.
These are saved in a repository which is used when you want
to search.  Things get harder when you want record-based searches
on compressed files (I spent almost a week getting search times
down from 50 seconds to a second) but for line-oriented (grep-like)
searches on uncompressed files it is very fast.
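
The general idea of indexing once and searching many times can be
sketched with a toy inverted index that maps each word to the lines
containing it.  This illustrates the principle only and is not
Glimpse's actual data structure; the names build_index and search
are made up.

```python
import re
from collections import defaultdict


def build_index(lines):
    """Map each lowercased word to the sorted line numbers containing it."""
    index = defaultdict(set)
    for number, line in enumerate(lines, start=1):
        for word in re.findall(r"\w+", line.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, word):
    """Return the line numbers for a word without rescanning the data."""
    return index.get(word.lower(), [])


lines = [">seq1 human insulin precursor", "ACGT", ">seq2 bovine insulin"]
index = build_index(lines)   # done once
print(search(index, "insulin"))  # each later search is a dictionary lookup
```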

> >  Or, we can do everything in XML  :)
> This is an option, but not really practical, is it?

Alas, no.  I consider myself a good developer in the general field
of computational chemistry/biology and I've been interested in using
XML for about a year, but while it is easy to use, there's still
a learning barrier (finding the existing tools, understanding the
nomenclature, why unicode is important, etc.).  This means it
will still be a while (>3 years?) before XML is popular for
storing sequence or structure data.  But, once Netscape/Mozilla/IE
start supporting XML you should see some interesting things arising.


						Andrew Dalke
						dalke at bioreason.com




More information about the Bio-soft mailing list