FASTA format - proposed max line limit

Andrew Dalke dalke at bioreason.com
Sun Nov 22 08:16:57 EST 1998


Tendo <tendo at fas.harvard.edu> summarized my post as:
>     1) Sequence format need no change
>     2) Instead, we should prepare library as well as documents to
> allow easy coding

 Actually, #1 should probably be "there are so many formats that
another variation of an existing format won't be useful."
 Yes about #2.

> The problem of current format is
>     1) too long comment lines have to be fit into a single line.
>     2) it makes coding hard
>     3) and it's not very human-readable.

My belief is that only some people believe these are problems, and
I gave examples where some people don't have these problems:
   1) is "fixed" with ^A (by NCBI)
   2) some people find long line input simple (either good C
     programmers or using a language that supports reading a line
     directly into a string)
   3) the FASTA format isn't meant for direct viewing by humans
     (eg, it serves as a format for data transfer between two
     programs) or can be transformed into something viewable,
     either by a filter or by modifying the viewing tools.
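
As an illustration of the "filter" idea in 3), here is a minimal sketch
in Python that re-wraps the sequence lines of a FASTA stream to a fixed
width for viewing.  The name wrap_fasta and the default width of 60 are
my own assumptions, not part of any existing tool, and header lines are
passed through unwrapped.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with the sequence data re-wrapped to `width` columns."""
    pending = []                       # sequence pieces not yet written out

    def flush():
        # Join the buffered sequence and cut it into fixed-width pieces.
        seq = "".join(pending)
        del pending[:]
        return [seq[i:i + width] for i in range(0, len(seq), width)]

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):       # header line: emit buffered sequence first
            for piece in flush():
                yield piece
            yield line                 # headers are left as they are
        else:
            pending.append(line.strip())
    for piece in flush():              # whatever sequence is left at the end
        yield piece


for out in wrap_fasta([">example record\n", "ACGTACGTAC\n"], width=4):
    print(out)
```

A viewer could read through such a filter while the file on disk keeps
its long lines for the programs that exchange it.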

> About your proposal, it is nice to share the library.
> Can you provide it?

  I wish.  However, all of the code I did along these lines was
while I was at my previous company, and they aren't going to
release the code (unless you buy it).  I'm currently doing small
molecule software so don't have much time for anything sequence
related. :(

> I remember somebody provided "readseq" library somewhere, but
> I can't remember who and where...

As another thread in this discussion showed, it's by d.g.gilbert
and available at http://iubio.bio.indiana.edu:81/soft/molbio/readseq/
among other places.

At my previous company we had used the library.  If you need to
read the sequence information it is great.  However, if you want
to pretty up a record for HTML, it isn't what you need.  (For
example, suppose you want to show the record in the same form but
add cross reference links to other databases, or want to add
colors to the sequences (eg, for showing different secondary
structure propensities)).

I've not seen any free software (depending on what you mean
by free) that lets you do this.  Here's what I did which I hope
someone can take up:

  Write a parser for the format which drives a callback.  The
parser converts the data into a data structure appropriate to
the format and the callback gets record elements and can add
its own parsing and create whatever representation it wants
of the element.  This would be similar to, eg, the SAX
interface to XML.

  Since that probably makes little sense, here's an example of
how a FASTA parser might be written.  This implements a variation
of FASTA which allows multiple sequential comment lines but
doesn't allow spaces after the sequence.  It only parses a
single record.  It is written in Python and doesn't implement
full error checking.

def parse_fasta_record(lines, callback):
  comments = ""
  sequence = ""
  callback.parse("__begin")  # Lets the callback initialize, if needed

  callback.parse("__begin_comment")  # Start of comment section
  while 1:
    if not lines:                    # must be something
        callback.parse("__error")    # tell the callback about problem
        return
    line, lines = lines[0], lines[1:]
    if line[:1] != ">":              # first non-comment line ends the section
        break
    callback.parse("comment", line)
    comments = comments + line[1:]
  callback.parse("__end_comment", comments)

  callback.parse("__begin_sequence")  # Sequence section
  while 1:
    # already have a line stored in "line"
    callback.parse("sequence", line)
    sequence = sequence + line.strip()
    if not lines:
        break
    line, lines = lines[0], lines[1:]
  callback.parse("__end_sequence", sequence)

  callback.parse("__end")
  return (comments, sequence)

And here's a callback to turn FASTA records into HTML:

class Fasta2HTML:
    # "info" is optional because events like "__begin" carry no data
    def parse(self, keyword, info=None):
      if keyword == "__begin":
        self.text = ""
        return
      if keyword == "__end_comment":
        self.text = self.text + "<i>\n" + info + "</i>\n"
        return
      if keyword == "__begin_sequence":
        self.text = self.text + "<pre>\n"
        return
      if keyword == "sequence":
        self.text = self.text + info
        return
      if keyword == "__end_sequence":
        self.text = self.text + "</pre>\n"
        return

and it all can be used as:

tohtml = Fasta2HTML()
(comment, sequence) = parse_fasta_record(record_lines, tohtml)
print("<html>\n" + tohtml.text + "</html>")
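
Since the parser only handles a single record, a file holding several
records would first have to be cut into per-record line lists.  Here is
a rough sketch of one way to do that, assuming the same variation of
FASTA (one or more ">" comment lines followed by sequence lines).  The
name split_fasta_records is hypothetical.

```python
def split_fasta_records(lines):
    """Group lines into one list per record.

    A new record starts at the first ">" line that follows sequence data.
    """
    records = []
    current = []
    seen_sequence = False
    for line in lines:
        if line.startswith(">") and seen_sequence:
            records.append(current)    # previous record is complete
            current = []
            seen_sequence = False
        elif not line.startswith(">"):
            seen_sequence = True
        current.append(line)
    if current:                        # last record in the input
        records.append(current)
    return records


example = [">a\n", ">more about a\n", "AC\n", "GT\n", ">b\n", "TT\n"]
for record_lines in split_fasta_records(example):
    print(record_lines)
```

Each list could then be passed to parse_fasta_record with a fresh
callback object.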

This probably isn't the cleanest solution to the problem since
sometimes you don't want the parsed comment and sequence
information -- you might want everything handled via callbacks.
The mechanism in general is able to parse and display all the
sequence file formats we handled (about 7), and the outputs
of BLAST, FASTA and DSC.  It should work in general for anything
that's hierarchical-like in its format, which is most everything.
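
To make the "everything handled via callbacks" variant concrete, here is
a small self-contained sketch.  It is not the code from my previous
company; emit_fasta_events and CollectRecord are made-up names, and the
event emitter is deliberately simpler than the parser above (no error
events, single record only).  The parser returns nothing, and a callback
builds the comment and sequence strings itself.

```python
def emit_fasta_events(lines, callback):
    """Send the events for one record to callback.parse(); return nothing."""
    callback.parse("__begin")
    in_comments = True
    for line in lines:
        if in_comments and line.startswith(">"):
            callback.parse("comment", line[1:])
        else:
            in_comments = False        # first sequence line ends the comments
            callback.parse("sequence", line.strip())
    callback.parse("__end")


class CollectRecord:
    """Callback that accumulates the comment text and the sequence."""
    def parse(self, keyword, info=None):
        if keyword == "__begin":
            self.comments = ""
            self.sequence = ""
        elif keyword == "comment":
            self.comments = self.comments + info
        elif keyword == "sequence":
            self.sequence = self.sequence + info
        # other events ("__end", ...) are ignored by this callback


collector = CollectRecord()
emit_fasta_events([">gi|1 test\n", "ACGT\n", "TTGA\n"], collector)
print(collector.comments, collector.sequence)
```

With this split the parser knows nothing about what representation is
wanted; an HTML writer and a plain collector are just two callbacks
listening to the same events.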

Again, I don't have much time for this type of work these days
but I hope this idea helps someone work on such a library.

						Andrew
						dalke at bioreason.com



