open-source software for bioinformatics (was Re: Unix vs Linux - the movie.)

alrichards at my-deja.com alrichards at my-deja.com
Sun Jul 30 16:01:44 EST 2000


In article <8766pnlh1u.fsf at genehack.org>,
  "John S. J. Anderson" <jacobs+usenet at genehack.org> wrote:
> Okay, I was initially taken aback by your argument, but I've spent a
> couple of days thinking about it, and I can now firmly say that I
> think you're wrong. (He said, modestly.)

Fair enough.

>
> The issue here, to clarify, is the correctness of methods being
> published, or the correctness of methods being used to generate data
> that's being published. The process of peer review is supposed to do
> that, and part of that process does involve people who are experts
> in the particular sub-field (your (a)). In most cases (I believe), it
> may not involve programmers (your (b)), but I'm trying to argue that
> in those cases involving code, it should. Otherwise, the results can't
> truly be said to have been reviewed.

A few counter comments/questions.

How much bioinformatics research does not involve program code? Answer -
probably not much. Even papers which make use of standard programs such
as BLAST probably make use of some home-grown programs to sort/collate/
process the results.

What fraction of the editorial boards of the major journals would have
the time or ability to take a program apart line-by-line? Answer - very
few.

Don't get me wrong - I think you are absolutely right in describing
how the peer review process _should_ work - but my comments were
made on the assumption that the reality is somewhat different.

>
> But here you veer off. Note that in most journals, publishing
> something is regarded as an implicit offer to share reagents. Note
> also that, while uncommon, it's not unheard of for people to travel to
> other people's labs _because_ of an inability to replicate results
> elsewhere.

Comparing the two source codes would be fine - so if the other party
finds they cannot replicate the results he/she can compare their
program with the original to sort out what's wrong. This would only
happen after the other party has written a new implementation to
compare of course - which is _exactly_ how I was suggesting the
validation process would work.
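
The comparison described above can be automated once a second
implementation exists. A minimal sketch in Python, using edit distance
between short DNA strings as a stand-in for whatever a paper computes.
The function names, the choice of problem, and the deliberately faulty
variant are all illustrative assumptions, not anything taken from the
programs discussed in this thread:

```python
import random
from functools import lru_cache


def edit_distance_original(a, b):
    """'Original' implementation: row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def edit_distance_rewrite(a, b):
    """Independent re-implementation: memoised recursion."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(a):
            return len(b) - j
        if j == len(b):
            return len(a) - i
        if a[i] == b[j]:
            return go(i + 1, j + 1)
        return 1 + min(go(i + 1, j), go(i, j + 1), go(i + 1, j + 1))
    return go(0, 0)


def edit_distance_faulty(a, b):
    """Hypothetical faulty version: never considers shifting the alignment."""
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))


def disagreements(impl_a, impl_b, trials=200, seed=0):
    """Return the random inputs on which the two implementations differ."""
    rng = random.Random(seed)
    found = []
    for _ in range(trials):
        s = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 12)))
        t = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 12)))
        if impl_a(s, t) != impl_b(s, t):
            found.append((s, t))
    return found
```

Calling disagreements(edit_distance_original, edit_distance_rewrite)
should return an empty list, since both versions compute the same
quantity. The faulty version differs from the original on inputs like
("ACG", "CGA"), where it reports 3 and the original reports 2, and each
such input gives the two parties a concrete place to start comparing
source.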

>
>
> If I'm reviewing a computational paper, it's not practical for me to
> re-implement the algorithm described in the paper. I can, however, read
> the source code, look at the inputs and outputs described in the
> paper, read the conclusions, and decide whether or not everything
> seems reasonable and correct.

Err, not to be rude here, and I'm sure you are a much better programmer
than I ever was - but are you seriously suggesting that you can
just "read the source" and examine the inputs and outputs of a piece of
software as complex as BLAST? That's hundreds of thousands of lines of
code. Not to mention the problem of getting the same input data
(same databank releases etc. etc.). I used to have trouble doing this
with my own code - I couldn't imagine doing it to someone else's
code without having them around to take me through it.

Suppose the code was fine, but the results simply came from a bias
in the random number generator? In my programming days, bad random
number generators were all over the place and errors only came to
light years after a paper was published (if at all!). How are you going
to find errors like that just by looking at the code? Reimplementing
software with a different random number generator would very quickly
diagnose a fault like this - along with problems with maths libraries
etc. etc. (another bane of my previous existence).
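
The random number generator point can be illustrated with a small
Python sketch. It computes the same simulated statistic (how often two
consecutive coin flips agree, which should be about 0.5) from three
sources: the lowest bit of a power-of-two-modulus linear congruential
generator, which is a deliberately poor source chosen for illustration,
Python's built-in Mersenne Twister, and the operating system's entropy
source. The generator constants, seeds, and function names are
assumptions made for the example:

```python
import random


def repeat_fraction(next_bit, n=10_000):
    """Fraction of consecutive coin flips that agree (about 0.5 if fair)."""
    prev = next_bit()
    same = 0
    for _ in range(n):
        cur = next_bit()
        same += (cur == prev)
        prev = cur
    return same / n


def make_lcg_low_bit(seed=1):
    """Deliberately poor source: lowest bit of an LCG with modulus 2**31."""
    state = seed

    def next_bit():
        nonlocal state
        # Odd multiplier and odd increment: the lowest bit just alternates.
        state = (1103515245 * state + 12345) % 2**31
        return state & 1

    return next_bit


mt = random.Random(2000)         # Mersenne Twister, fixed seed
os_rng = random.SystemRandom()   # operating system entropy source

results = {
    "lcg_low_bit": repeat_fraction(make_lcg_low_bit()),
    "mersenne_twister": repeat_fraction(lambda: mt.getrandbits(1)),
    "os_entropy": repeat_fraction(lambda: os_rng.getrandbits(1)),
}

for name, value in results.items():
    print(f"{name:18s} {value:.3f}")
```

The LCG's lowest bit simply alternates, so its flips never repeat and
the statistic comes out as 0.0, while the other two sources land near
0.5. Nothing in repeat_fraction itself is wrong, so reading that
function would not reveal the problem; it only shows up when the
generator is swapped.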

>
> There are (basically) two questions facing someone reviewing a
> computational paper -- (1) Is the algorithm effective. That is, does
> it appear that it will do what the authors of the paper claim? (2) Is
> the algorithm implemented properly? That is, are the algorithm and the
> code used to execute the algorithm equivalent or isomorphic?
>
> You _cannot_ answer (2) without seeing the source code.

Or, as I say above, even with seeing it!

>
> If you're one of the people reviewing the paper, it's your _job_ to
> (a) note the omissions of key information, (b) review whatever source
> code was presented to you, and (c) reject the paper, because it's
> incomplete.

I still wonder how many referees are going to be in a position to
do this kind of analysis. Bioinformatics is moving too rapidly to
have a review process like those used in mathematics journals - there,
a paper can take _years_ to be published due to the detailed refereeing
procedure! Given that the referees of top journals are busy people with
heavy teaching/admin loads, I just don't see them spending weeks running
software through a code debugger (assuming they know what a debugger
actually is!).

>
> The thing is, before those methods get distilled down and 'kit-ified',
> they're (mostly) documented in papers. Agreed, most of these aren't

I know I'm sticking with the one example here, but I believe the
paper describing the recent version of BLAST is the most cited paper
in molecular biology ever. I dug the paper out to see if I could
make some more sense of the options available on the BLAST server. From
what I could see, the paper covered less than 50% of the information I
was looking for! No way did that paper go through the refereeing
process you are describing - I rather wish it had. OK, the source code
is available, but without any knowledge of C, this wasn't very
enlightening to me. The maths in the paper I could follow (my original
degree was in applied stats) - but the code was just unintelligible.

>
> Most software in biology (especially 'kit' level commercial software),
> on the other hand, is a total black box. You might be able to get the
> underlying algorithm, but that's no guarantee that it's been
> implemented properly in the binary you have. Most suppliers (that I'm
> familiar with) don't supply regression tests, or any tests at all.

No arguments there! But I just don't see this changing without some
radical ideas - hence my rather tongue-in-cheek suggestion to keep
the code secret for a while!

>
> No! If the authors don't release code, or (especially) don't provide
> enough details of the algorithm to allow a 'clean room'
> re-implementation, the paper shouldn't ever be published in the first
> place.

Again, no arguments there - but I don't see how we can reach your ideal
scenario without hampering rapid progress, or causing a strike amongst
the bioinformatics journal referees out there!

** Alan **


More information about the Bio-soft mailing list