open-source software for bioinformatics (was Re: Unix vs Linux - the movie.)

John S. J. Anderson jacobs+usenet at genehack.org
Mon Jul 31 07:18:28 EST 2000


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>>>> "Al" == alrichards  <alrichards at my-deja.com> writes:

Al> How much bioinformatics research does not involve program code? 
Al> Answer probably not much. Even papers which make use of standard
Al> programs such as BLAST probably make use of some home grown
Al> programs to sort/collate/ process the results.

I honestly have no idea how many papers feature original code. I guess
I'm not as concerned with code that just re-arranges the data set (the
'sorting/collating' part of your comment), but code that 'processes'
should probably be reviewed. On the plus side, it should be pretty
simple code.

Al> What fraction of the editorial boards of the major journals would
Al> have the time or ability to take a program apart line-by-line? 
Al> Answer - very few.

I don't think the editorial boards should be doing this,
necessarily. I think the anonymous peer reviewers _should_ be, by all
means. 

Al> Don't get me wrong - I think you are absolutely right in
Al> describing how the peer review process _should_ work - but my
Al> comments were made on the assumption that the reality is somewhat
Al> different.

Boy, I hope not. 

[code access ~ reagent sharing]
Al> Comparing the two source codes would be fine - so if the other
Al> party finds they cannot replicate the results he/she can compare
Al> their program with the original to sort out what's wrong. This
Al> would only happen after the other party has written a new
Al> implementation to compare of course - which is _exactly_ how I was
Al> suggesting the validation process would work.

But how are they going to know what to re-implement? By looking at the
algorithm? That just seems like a terrible waste of energy to me, and
it doesn't really apply to the primary problem of peer review prior to
publication. 

[effectiveness of code review in peer review]
Al> Err, not to be rude here, and I'm sure you are a much better
Al> programmer than I ever was - but are you seriously suggesting that
Al> you can just "read the source" and examine the inputs and outputs
Al> of a piece of software as complex as BLAST? That's 100s of
Al> thousands of lines of code. Not to mention the problem of getting
Al> the same input data (same databank releases etc. etc.) I used to
Al> have trouble doing this with my own codes - I couldn't imagine
Al> doing it to someone else's code without having them around to take
Al> me through it.

No offense taken -- I wouldn't actually call myself a programmer yet;
I'm just someone who programs occasionally. I'm working on becoming a
programmer, but I'm not really there yet.

I _was_ seriously suggesting that you should just be able to read the
source, however. If the source is so obfuscated and poorly commented
that someone qualified to review the rest of the paper can't figure it
out, then the paper shouldn't be published. 

I've never actually seen the BLAST source -- I suppose I could try to
have a look at it. I wouldn't have thought it that complex/long (100s of
kLOC, that is). I would have guessed that most of the complexity was
iterative in nature, rather than explicit in the code.

Reading someone else's code is difficult -- but it's like reading
scientific papers in general. You keep at it, and eventually you come
up with some strategies that allow you to extract information.

Al> Suppose the code was fine, but the results simply came from a bias
Al> in the random number generator? In my programming days, bad random
Al> number generators were all over the place and errors only came to
Al> light years after a paper was published (if at all!). How are you
Al> going to find errors like that just by looking at the code? 

I'm sorry -- do people still write their own RNGs? (Why?) 

This would seem to me to fall into the second class I talked about --
problems that are going to show up when people try to build off the
code. Basically, if the codebase is that fragile, you should run into
problems the first time you try to build it on a new OS/libs
combination. 

So you're right, here -- I don't think this would be seen by just code
gazing.
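
As a rough illustration of why this kind of bug hides from code gazing
(a sketch only, not code from either poster or from any published
tool): the hypothetical generator below takes the low bit of a
linear congruential generator with a power-of-two modulus. The
function names and the LCG constants are assumptions chosen for the
example. The code reads as perfectly reasonable, and it produces
exactly as many ones as zeros, yet the bits simply alternate. A simple
chi-square check on consecutive pairs of bits exposes the problem,
whereas reading the source would not.

```python
import random
from collections import Counter


def lcg_low_bits(seed, n):
    """Hypothetical 'home-grown' bit source: low bit of a textbook LCG.

    The multiplier and increment are illustrative constants. With a
    power-of-two modulus and odd multiplier and increment, the lowest
    bit just alternates 0, 1, 0, 1, ...
    """
    x = seed
    bits = []
    for _ in range(n):
        x = (1103515245 * x + 12345) % 2**31
        bits.append(x & 1)
    return bits


def library_bits(seed, n):
    """Reference bit source: Python's standard-library generator."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]


def pair_chi_square(bits):
    """Chi-square statistic for non-overlapping bit pairs.

    Under independence and uniformity, each of the four pairs
    (0,0), (0,1), (1,0), (1,1) is expected equally often (3 degrees
    of freedom).
    """
    pairs = Counter(zip(bits[0::2], bits[1::2]))
    n_pairs = len(bits) // 2
    expected = n_pairs / 4
    all_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return sum((pairs.get(p, 0) - expected) ** 2 / expected for p in all_pairs)


if __name__ == "__main__":
    n = 10_000
    bad = lcg_low_bits(seed=1, n=n)
    good = library_bits(seed=1, n=n)
    # Both sources look fine if you only count ones.
    print("ones (home-grown):", sum(bad), " ones (library):", sum(good))
    # The pair statistic separates them.
    print("chi-square (home-grown):", pair_chi_square(bad))
    print("chi-square (library):   ", pair_chi_square(good))
```

With 3 degrees of freedom, a statistic above roughly 16 would be
suspicious at the 0.1% level; the alternating-bit source lands far
beyond that because only one of the four pairs ever occurs. This is one
narrow test, of course, and a generator can pass it and still be
biased in other ways.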

Al> I still wonder how many referees are going to be in a position to
Al> do this kind of analysis. Bioinformatics is too rapidly moving to
Al> have a review process like those used in mathematics journals -
Al> there a paper can take _years_ to be published due to the detailed
Al> refereeing procedure! Given that the referees of top journals are
Al> busy people with heavy teaching/admin loads, I just don't see them
Al> spending weeks running software through a code debugger (assuming
Al> they know what a debugger actually is!).

In the review situations I've been involved in, each paper was subject
to probably 10 person-hours of effort, split across reading (and
re-reading) the manuscript under review, tracking down and reading
relevant existing lit, thinking about the results and claims in the
paper, and actually writing the review. (These were molecular biology
papers, by the way.) I don't think reviewing source code would bloat
that time factor too much, as you're not going to be reviewing results
as much in a bioinformatics paper.

(And if the person doesn't know what a debugger is, they probably
shouldn't be reviewing the paper in the first place...)

Al> I know I'm sticking with the one example here - but I believe the
Al> paper describing the recent version of BLAST is the most cited
Al> paper in molecular biology ever - I dug the paper out to see if I
Al> could make some more sense of the options available on the BLAST
Al> server. From what I could see the paper covered less than 50% of
Al> the information I was looking for! No way did that paper go
Al> through the refereeing process you are describing - I rather wish
Al> it had. OK, the source code is available, but without any
Al> knowledge of C, this wasn't very enlightening to me. The maths in
Al> the paper I could follow (my original degree was in applied stats)
Al> - but the code was just unintelligible.

I haven't carefully read the paper or seen the code, so I can't
comment on this example. If I get some time today or tomorrow, I will
try to do that. Anybody else want to jump in on this one?

>> No! If the authors don't release code, or (especially) don't
>> provide enough details of the algorithm to allow a 'clean room'
>> re-implementation, the paper shouldn't ever be published in the
>> first place.

Al> Again, no arguments there - but I don't see how we can reach your
Al> ideal scenario without hampering rapid progress, or causing a
Al> strike amongst the bioinformatics journal referees out there!

Well, I don't know either, but conversations like this are probably
going to be an important part of the process...

john.

- -- 
- ----------------------------------------------------------------------------
           [ John S Jacobs Anderson ]------><URL:mailto:jacobs at genehack.org>
[ Genehack: Not your daddy's weblog ]------><URL:http://genehack.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.2 (GNU/Linux)
Comment: Mailcrypt 3.5.5 and Gnu Privacy Guard

iD8DBQE5hW6UWRJRdOm3KFARAq2vAJ9t2XerEIizfLhSWCDjo/0od0ijBwCfbayw
N+MT/Q9JlmxlAudHFngl/68=
=okpO
-----END PGP SIGNATURE-----
