Parsing framework (BlastXML), was Re: FASTA format - proposed max line limit

Wayne Parrott wayne at workingobjects.com
Mon Nov 23 21:16:29 EST 1998


See below:

Andrew Dalke wrote:
> 

... readseq dialog deleted ...

> 
> At my previous company we had used the library.  If you need to
> read the sequence information it is great.  However, if you want
> to pretty up a record for HTML, it isn't what you need.  (For
> example, suppose you want to show the record in the same form but
> add cross reference links to other databases, or want to add
> colors to the sequences (eg, for showing different secondary
> structure propensities)).
> 
> I've not seen any free software (depending on what you mean
> by free) that lets you do this.  Here's what I did which I hope
> someone can take up:
> 
>   Write a parser for the format which drives a callback.  The
> parser converts the data into a data structure appropriate to
> the format and the callback gets record elements and can add
> its own parsing and create whatever representation it wants
> of the element.  This would be similar to, eg, the SAX
> interface to XML.
> 
>   Since that probably makes little sense, here's an example of
> how a FASTA parser might be written.  This implements a variation
> of FASTA which allows multiple sequential comment lines but
> doesn't allow spaces after the sequence.  It only parses a
> single record.  It is written in Python and doesn't implement
> full error checking.
> 
> def parse_fasta_record(lines, callback):
>   comments = ""
>   sequence = ""
>   callback.parse("__begin")  # Lets the callback initialize, if needed
> 
>   callback.parse("__begin_comment")  # Start of comment section
>   while 1:
>     if not lines:                    # must be something
>         callback.parse("__error")    # tell the callback about problem
>         return
>     line, lines = lines[0], lines[1:]
>     if line[:1] != ">":              # first non-comment line ends section
>         break
>     callback.parse("comment", line)
>     comments = comments + line[1:]
>   callback.parse("__end_comment", comments)
> 
>   callback.parse("__begin_sequence")  # Sequence section
>   while 1:
>     # already have a line stored in "line"
>     callback.parse("sequence", line)
>     sequence = sequence + line.strip()
>     if not lines:
>         break
>     line, lines = lines[0], lines[1:]
>   callback.parse("__end_sequence", sequence)
> 
>   callback.parse("__end")
>   return (comments, sequence)
> 
> And here's a callback to turn FASTA records into HTML
> 
> class Fasta2HTML:
>     # info defaults to None because some events carry no data
>     def parse(self, keyword, info=None):
>       if keyword == "__begin":
>         self.text = ""
>         return
>       if keyword == "__end_comment":
>         self.text = self.text + "<i>\n" + info + "</i>\n"
>         return
>       if keyword == "__begin_sequence":
>         self.text = self.text + "<pre>\n"
>         return
>       if keyword == "sequence":
>         self.text = self.text + info
>         return
>       if keyword == "__end_sequence":
>         self.text = self.text + "</pre>\n"   # close the <pre> block
>         return
> 
> and it all can be used as:
> 
> tohtml = Fasta2HTML()
> (comment, sequence) = parse_fasta_record(record_lines, tohtml)
> print("<html>\n" + tohtml.text + "</html>")
> 
> This probably isn't the cleanest solution to the problem since
> sometimes you don't want the parsed comment and sequence
> information -- you might want everything handled via callbacks.
> The mechanism in general is able to parse and display all the
> sequence file formats we handled (about 7), and the outputs
> of BLAST, FASTA and DSC.  It should work in general for anything
> that's hierarchical-like in its format, which is most everything.
> 
> Again, I don't have much time for this type of work these days
> but I hope this idea helps someone work on such a library.
> 
>                                                 Andrew
>                                                 dalke at bioreason.com

I agree with the event-oriented parser design Andrew describes because
it "separates the concerns" of parsing (recognizing an input stream) from
those of translating input data to an output form, e.g., html,
object-model, data-structure. In fact I propose that anyone wishing to
develop a robust, reusable parser framework consider this approach,
i.e., an event-oriented parser.

For those interested, Andrew describes a type of parser which upon
recognition of a significant input element outputs an event in the form
of a callback. Typically, an application-specific event handler is
registered with the parser to be the receiver of its callbacks. This
event handler responds to the events in the manner deemed appropriate by
its application-specific logic, e.g., output html, add data to
object-model, etc.
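
As a rough illustration of the paragraph above, here is a minimal Python
sketch of a parser that reports events to a registered handler. It is
not Andrew's code and not part of BlastXML; the names parse_fasta,
IdCollector and LengthCounter, and the event keywords, are made up for
this example. The point is that the same recognition code serves two
unrelated handlers.

```python
def parse_fasta(lines, handler):
    """Recognize FASTA records and report each element to the handler."""
    in_record = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if in_record:
                handler.event("end_record", None)
            handler.event("begin_record", None)
            handler.event("header", line[1:])
            in_record = True
        elif line and in_record:
            handler.event("sequence", line)
    if in_record:
        handler.event("end_record", None)


class IdCollector:
    """Hypothetical handler: keeps only the first word of each header."""
    def __init__(self):
        self.ids = []

    def event(self, name, data):
        if name == "header":
            words = data.split()
            if words:
                self.ids.append(words[0])


class LengthCounter:
    """A second hypothetical handler: residue count per record."""
    def __init__(self):
        self.lengths = []

    def event(self, name, data):
        if name == "begin_record":
            self.lengths.append(0)
        elif name == "sequence":
            self.lengths[-1] += len(data.strip())


records = [">seq1 first test record\n", "ACGT\n", "AC\n", ">seq2\n", "GGG\n"]
ids = IdCollector()
parse_fasta(records, ids)          # same parser, first handler
lengths = LengthCounter()
parse_fasta(records, lengths)      # same parser, second handler
```

Neither handler needs to know how records are recognized, and the parser
never needs to know what is done with them.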

Another type of parser is the document-oriented parser. This type of
parser translates input elements directly into an object-model or
hierarchical data-structure. An application programmer must write code
to traverse the resulting data-structure to access data of interest.
This can be very inefficient when only a small number of attributes are
desired in a large input stream. In the case of very large input streams
it may be impossible to parse due to a lack of memory to hold the
resulting data-structure.
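
The memory argument can be sketched in Python as follows. This is only
an illustration under assumed names (parse_to_document,
parse_with_events, fake_stream are all hypothetical) and a toy
FASTA-like input: the document-style function must build every record
before the caller can pick out the few it wants, while the event-style
function keeps nothing except what the callback chooses to store.

```python
def parse_to_document(lines):
    """Document-oriented: return every record as a dict in one list."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            records.append({"header": line[1:], "sequence": ""})
        elif line and records:
            records[-1]["sequence"] += line
    return records


def parse_with_events(lines, on_header):
    """Event-oriented: call on_header once per record, store nothing."""
    for line in lines:
        if line.startswith(">"):
            on_header(line[1:].strip())


def fake_stream(n):
    """Hypothetical input: n small records produced lazily."""
    for i in range(n):
        yield ">rec%d\n" % i
        yield "ACGTACGT\n"


# Document style: the whole structure exists before we can query it.
doc = parse_to_document(fake_stream(1000))
wanted_doc = [r["header"] for r in doc if r["header"].endswith("7")]

# Event style: only the headers of interest are ever kept.
wanted_events = []

def keep_sevens(header):
    if header.endswith("7"):
        wanted_events.append(header)

parse_with_events(fake_stream(1000), keep_sevens)
```

Both approaches select the same headers; the difference is how much of
the input has to be held in memory along the way.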

It is my experience that document-oriented parsers are the most common
type of parser in the bioinformatics industry. As a result we/I (I
confess) keep reinventing parsers for our application-specific cases.
For example, this past year in my consulting practice I've been tasked
multiple times to develop unique application-specific parsers for Blast.
The primary difference between the parsers was not the recognition logic
but rather the output logic.

As part of PharmTools(TM) (under development) I'm developing a set of
tools for transforming and manipulating data in common file formats.
I'll be putting the finishing touches on BlastXML, a PharmTools utility,
in the next week or two. BlastXML is a set of Java-based components for
creating XML formatted Blast output streams, event-based parsers for the
native and XML Blast streams, a Blast DTD, a parser event interface, and
a Blast object-model for holding parsed results. While XML is the
underlying stream protocol, the upper layers deal only with Blast result
semantics, e.g., database, score, HSP, ... 

The relevance of BlastXML to Andrew's post is that it defines an event
handler known as the BlastElementHandler (i.e., a callback interface).
BlastElementHandler defines the event notification interface used by a
Blast parser when it recognizes a significant element. By subclassing
from BlastElementHandler developers will be able to map event/callback
data to their own object-model(s). 
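
To make the subclassing idea concrete, here is a loose Python sketch.
Only the name BlastElementHandler comes from the description above; the
real interface is Java and its methods are not shown in this post, so
every method name here (start_database, start_hit, hsp, end_hit), the
BestScoreHandler subclass and the replay driver are assumptions invented
for illustration.

```python
class BlastElementHandler:
    """Stand-in base class (hypothetical): every callback is a no-op."""
    def start_database(self, name):
        pass

    def start_hit(self, hit_id):
        pass

    def hsp(self, score, expect):
        pass

    def end_hit(self):
        pass


class BestScoreHandler(BlastElementHandler):
    """Maps events to a simple model: best HSP score per hit."""
    def __init__(self):
        self.best = {}
        self._current = None

    def start_hit(self, hit_id):
        self._current = hit_id

    def hsp(self, score, expect):
        previous = self.best.get(self._current)
        if previous is None or score > previous:
            self.best[self._current] = score


def replay(events, handler):
    """Hypothetical driver standing in for a parser: fires canned events."""
    for name, args in events:
        getattr(handler, name)(*args)


# Made-up event data, only to exercise the handler.
events = [
    ("start_database", ("nr",)),
    ("start_hit", ("hitA",)),
    ("hsp", (120, 1e-30)),
    ("hsp", (85, 1e-12)),
    ("end_hit", ()),
    ("start_hit", ("hitB",)),
    ("hsp", (60, 1e-5)),
    ("end_hit", ()),
]
handler = BestScoreHandler()
replay(events, handler)
```

A subclass overrides only the callbacks it cares about and inherits
no-ops for the rest, which keeps application code small.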

I know Perl reigns in bioinformatics but since the majority of my work
is in Java I naturally chose it for implementation of the initial
version. A Perl implementation of BlastXML is being planned - pending a
better understanding of bioperl components. 

FYI: a Fasta search result equivalent of BlastXML, as well as several
other parsers which will employ an event-oriented parser model, are
under development. I'll post an announcement to this newsgroup when
BlastXML is released, or you may watch my website for more info.

W
-- 
-----------------------------------------------------------------------
 Wayne Parrott                   email: wayne at workingobjects.com      |
 WorkingObjects.com              voice: (972)491-3704                 |
 "Distributed Object Technology    fax: (972)491-7284                 |
   for Life Sciences"              web: http://www.workingobjects.com |
----------------------------------------------------------------------- 
 "The main thing, is to keep the main thing, the main thing" 
   lyrics by Scott Krippayne




More information about the Bio-soft mailing list