open-source software for bioinformatics (was Re: Unix vs Linux - the movie.)

John S. J. Anderson jacobs+usenet at genehack.org
Sun Jul 30 11:27:25 EST 2000


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>>>> "alrichards" == alrichards  <alrichards at my-deja.com> writes:

alrichards> Here it is: programs should not be made available at all -
alrichards> or at least should not be released for at least 2 years
alrichards> after the associated paper is published. 

Okay, I was initially taken aback by your argument, but I've spent a
couple of days thinking about it, and I can now firmly say that I
think you're wrong. (He said, modestly.)

I agree with your comments about the laziness of scientists (although
I'm sure most of us would prefer to think of it as being efficient and
not repeating the work of others 8^). 

alrichards> Finding bugs in code is one thing - the program crashes
alrichards> when you input a negative number for example. Any half
alrichards> competent programmer could track down a bug like
alrichards> that. However, finding a subtle mistake in a complex set
alrichards> of statistical codes, for example, is beyond the ability
alrichards> of anyone who is not a) an expert in the scientific field
alrichards> and b) an expert in programming. This combination is so
alrichards> rare that it cannot be relied upon to keep checks on the
alrichards> quality of bioinformatics methods.

The issue here, to clarify, is the correctness of methods being
published, or the correctness of methods being used to generate data
that's being published. Peer review is supposed to check exactly
that, and part of that process does involve people who are experts
in the particular sub-field (your (a)). In most cases (I believe), it
doesn't involve programmers (your (b)), but I'm arguing that in
those cases involving code, it should. Otherwise, the results can't
truly be said to have been reviewed.

alrichards> Of course having the source code is a convenience for
alrichards> tracking down annoying "user interface" bugs and the like

Let us not get off into the topic of how badly the UI stinks in much of the
software used in the biology world -- we might never get back to the
topic at hand. 8^)

alrichards> - but claiming that having the source code is the best way
alrichards> to ensure scientific accuracy is I believe not valid. The
alrichards> analogy to experimental papers is a useful one.  If I
alrichards> publish a paper describing a wet-lab experiment then I try
alrichards> to describe the steps in as much detail as I think is
alrichards> necessary for someone with suitable skills and a standard
alrichards> lab to replicate the whole experiment. 

I agree with you up to this point.

alrichards> The onus is on people who make use of my work to try to
alrichards> replicate the results and the onus is then on me to help
alrichards> explain what's wrong when people find they cannot
alrichards> replicate the experiment. Note that I don't expect these
alrichards> people to turn up on my doorstep wearing a lab coat ready
alrichards> to use my lab and my reagents.

But here you veer off. Note that in most journals, publishing
something is regarded as an implicit offer to share reagents. Note
also that, while uncommon, it's not unheard of for people to travel to
other people's labs _because_ of an inability to replicate results
elsewhere. 

alrichards> The point of replication is that someone should easily be
alrichards> able to replicate the experiment elsewhere. Who knows, the
alrichards> reason my results were so good might be that my buffer
alrichards> solutions are contaminated with silver salts? Or my lab is
alrichards> close to electric power lines. The only way to discover
alrichards> that fact would be to replicate the same experiment with
alrichards> similar but non-identical reagents and apparatus.

Yes, but computational experiments are different. If I'm reviewing a
wet-lab paper, it's not practical for me to get samples of all the
reagents used in the paper and repeat the experiments. I can, however,
read the methods, and look at the results, and look at the conclusions
being reached, and decide whether or not everything seems reasonable
and correct.

If I'm reviewing a computational paper, it's not practical for me to
re-implement the algorithm described in the paper. I can, however, read
the source code, look at the inputs and outputs described in the
paper, read the conclusions, and decide whether or not everything
seems reasonable and correct.

There are (basically) two questions facing someone reviewing a
computational paper -- (1) Is the algorithm effective? That is, does
it appear that it will do what the authors of the paper claim? (2) Is
the algorithm implemented properly? That is, are the algorithm and the
code used to execute the algorithm equivalent or isomorphic?

You _cannot_ answer (2) without seeing the source code.

alrichards> Now lets take this analogy to scientific codes. Somebody
alrichards> describes a new algorithm in a paper which produces
alrichards> excellent results. Assuming this method is complex
alrichards> (e.g. the BLAST program or molecular mechanics software)
alrichards> then the chances are that the paper omits some key facts
alrichards> that turn out to be critical to the success of the
alrichards> program. How is this going to be discovered? Not by
alrichards> releasing the source code that's for sure. It's all very
alrichards> well saying that you are _able_ to look at the source code
alrichards> - but do you?

If you're one of the people reviewing the paper, it's your _job_ to
(a) note the omissions of key information, (b) review whatever source
code was presented to you, and (c) reject the paper, because it's
incomplete. 

alrichards> How many times has BLAST been reimplemented to
alrichards> validate the method? Answer: probably never. How many
alrichards> people have picked apart the BLAST code and compared it
alrichards> line by line with the algorithm described in the paper? I
alrichards> bet the answer to this is close to if not equal to zero as
alrichards> well. Why would you need to?  The code is public domain
alrichards> and the software seems to work properly. It's the computer
alrichards> equivalent of buying a molecular biology "kit".

The thing is, before those methods get distilled down and 'kit-ified',
they're (mostly) documented in papers. Agreed, most of these aren't
things that are that interesting to read, and they're not published in
top-flight journals, but the data is out there. Also, most of the kit
suppliers I'm familiar with are willing to supply you with data about
the performance of their kit under various conditions. Not the most
reliable source, but again, the data is available.

Most software in biology (especially 'kit' level commercial software),
on the other hand, is a total black box. You might be able to get the
underlying algorithm, but that's no guarantee that it's been
implemented properly in the binary you have. Most suppliers (that I'm
familiar with) don't supply regression tests, or any tests at all.
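
To make "regression tests" concrete, here is a minimal sketch in
Python of what a supplier could ship alongside a program. Everything
in it is an assumption for illustration: gc_content is a made-up toy
routine, not taken from any real package, and the expected values
were worked out by hand.

```python
# Hypothetical example: a tiny analysis routine plus the regression
# tests a supplier could ship alongside it. Nothing here comes from a
# real bioinformatics package.

def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(1 for base in seq if base in "GC") / len(seq)


# Known inputs paired with outputs worked out by hand.
REGRESSION_CASES = [
    ("GGCC", 1.0),
    ("ATAT", 0.0),
    ("ATGC", 0.5),
    ("atgcgc", 4 / 6),
]


def run_regression_tests():
    """Return (input, expected, actual) for every failing case."""
    failures = []
    for seq, expected in REGRESSION_CASES:
        actual = gc_content(seq)
        if abs(actual - expected) > 1e-12:
            failures.append((seq, expected, actual))
    return failures


if __name__ == "__main__":
    failed = run_regression_tests()
    print("all passed" if not failed else failed)
```

Tests like these only show that the binary reproduces a handful of
known answers. They don't show that the code matches the published
algorithm, which is why they complement the source rather than
replace it.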

alrichards> So how could BLAST be properly validated? The authors
alrichards> should not release the code - or at least keep the code
alrichards> secret for a period of at least 2 years. What would happen
alrichards> then? People would try to replicate the method with new
alrichards> software to check they get the same results. This is how
alrichards> science is done in other areas. Of course, if the authors
alrichards> of a bioinformatics paper do not provide enough
alrichards> information to allow the algorithm to be reimplemented
alrichards> then that's another problem entirely - and again the only
alrichards> way to properly identify that problem is for people to try
alrichards> to replicate the software.

No! If the authors don't release code, or (especially) don't provide
enough details of the algorithm to allow a 'clean room'
re-implementation, the paper shouldn't ever be published in the first
place. 

john.

- -- 
- ----------------------------------------------------------------------------
           [ John S Jacobs Anderson ]------><URL:mailto:jacobs at genehack.org>
[ Genehack: Not your daddy's weblog ]------><URL:http://genehack.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.2 (GNU/Linux)
Comment: Mailcrypt 3.5.5 and Gnu Privacy Guard

iD8DBQE5hFdsWRJRdOm3KFARAuT2AKCPdN5KhJymVgV+Uqv2ofsRtXwSrwCdGyxL
9y3cNFjy4x83eyGlglbaR+I=
=PUXR
-----END PGP SIGNATURE-----






