open-source software for bioinformatics (was Re: Unix vs Linux - the movie.)

Tim Cutts timc at chiark.greenend.org.uk
Tue Aug 8 07:05:14 EST 2000


In article <87em4abii3.fsf at genehack.org>,
John S. J. Anderson <jacobs+usenet at genehack.org> wrote:
>
>I _was_ seriously suggesting that you should just be able to read the
>source, however. If the source is so obfuscated and poorly commented
>that someone qualified to review the rest of the paper can't figure it
>out, then the paper shouldn't be published. 
>

Now *that* I agree with 100%.  Much, perhaps most, of the bioinformatics
software I have the source for is very badly written: no comments and
very poor programming style (large monolithic routines of unclear
function).  Before I'm shouted at, I'm aware of the performance cost of
cache misses caused by subroutine calls, but that doesn't excuse the
majority of cases.  Much of the bioinformatics software I use is
predominantly I/O bound anyway.

If software is not clearly written enough to allow a newcomer to that
code to grasp what's going on, then the chances are higher that there
are implementation errors or other bugs; that's the nature of software
development.

As a (non-bioinformatics) example, I was having trouble last month with
a Linux NFS server.  I'm not a hard-core C programmer.  Neither do I
know much about operating system design in general, or the Linux kernel
in particular.  But the code is so beautifully organised and clearly
written that I had no difficulty in identifying the problem within an
hour or so, and participating in discussions which resulted in a fix.

I have often had much less success fixing bugs in (simpler)
bioinformatics codes, largely because the code's a mess.  Subjecting
code to peer review would do much to rectify this situation.

>I'm sorry -- do people still write their own RNGs? (Why?) 

It's worse than that.  Large numbers of programmers still seem to think
that their C library's rand() function is sufficient...
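
One commonly cited reason for distrusting a bare rand() is that many
implementations have been linear congruential generators with a
power-of-two modulus, and in such a generator the lowest k bits repeat
with a period of at most 2^k.  The following is a minimal Python sketch
of that effect, not any particular library's rand().  The names lcg and
low_bits are hypothetical.  The multiplier and increment are the ones
from the sample rand() in the C standard, which itself discards the low
16 bits; returning the raw state here is an assumption made purely to
expose the problem.

```python
# Illustrative linear congruential generator (LCG) with modulus 2**31.
# Assumption: the raw state is returned, unlike the C standard's sample
# rand(), which throws away the low 16 bits before returning.
M = 2**31
A = 1103515245
C = 12345


def lcg(seed):
    """Yield successive raw states of x -> (A*x + C) mod M."""
    x = seed
    while True:
        x = (A * x + C) % M
        yield x


def low_bits(seed, k, n):
    """Return the lowest k bits of the first n raw outputs."""
    gen = lcg(seed)
    mask = (1 << k) - 1
    return [next(gen) & mask for _ in range(n)]


if __name__ == "__main__":
    # The lowest bit simply alternates; the lowest two bits cycle
    # with period 4.
    print(low_bits(42, 1, 16))
    print(low_bits(42, 2, 16))
```

Code that does something like "rand() % 2" on such a generator gets a
strictly alternating sequence, which is plainly useless for simulation
work.  Better implementations avoid this, but the C standard does not
promise that yours does.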

>This would seem to me to fall into the second class I talked about --
>problems that are going to show up when people try to build off the
>code. Basically, if the codebase is that fragile, you should run into
>problems the first time you try to build it on a new OS/libs
>combination. 

This is a major argument in favour of open source software.  I like
PowerPC hardware, and Linux.  If I could get source code for all my
software, I'd have no qualms about running LinuxPPC on PowerMacs.

>(And if the person doesn't know what a debugger is, they probably
>shouldn't be reviewing the paper in the first place...)

Quite.  Debuggers can be enormously useful purely for tracing program
flow in a large program, most of which is probably uninteresting from
the point of view of evaluating the implementation of the algorithm.

Tim.






