Unsolved or Poorly Solved Computational Problems

Kevin Karplus karplus at bray.cse.ucsc.edu
Mon Mar 3 18:01:17 EST 2003

In article <3e60203a.28367625 at agate.berkeley.edu>, Bob wrote:
> On 28 Feb 2003 09:57:45 -0000, "Richard Scott"
><rtscott at forgetspamPacbell.net> wrote:
>>This leads to two other issues which I have encountered in both the academic
>>and commercial aspects of computer science, and which I suspect are big
>>problems in biotechnology. First, the lack of the right sort of experimental
>>data in key areas (e.g., actual disk accessing patterns on request by
>>request basis). Second, an unwillingness to design and construct tediously
>>accurate and complete simulation models (e.g., timing accurate disk

The lack of experimental data is always a big problem in biology.
The huge quantities of data are also a problem.
Sound contradictory?  Not really---there is a lot of data in the form
of genome sequences, DNA microarray data, yeast 2-hybrid experiments,
SNPs, ... , but the actual data relevant to a particular biological question
may be almost non-existent.  Also, many of the data sources are very
noisy, so picking out the small signal from the masses of junk (junk
only for the particular question being addressed---highly useful data
for other questions) is often quite difficult.  We use a lot of
Bayesian statistics, since we rarely have enough data for frequentist
methods to be useful.
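
A minimal sketch of why tiny samples push one toward Bayesian estimates,
assuming a single yes/no event and a uniform Beta(1, 1) prior. The prior,
the function names, and the counts are illustrative assumptions, not the
methods or data of any particular group.

```python
from fractions import Fraction

def ml_estimate(successes, trials):
    """Maximum-likelihood (frequentist) estimate of an event probability."""
    return Fraction(successes, trials)

def posterior_mean(successes, trials, prior_a=1, prior_b=1):
    """Posterior mean of the probability under a Beta(prior_a, prior_b) prior.

    The default Beta(1, 1) prior is uniform; it is an assumption made here
    purely for illustration.
    """
    return Fraction(successes + prior_a, trials + prior_a + prior_b)

# Hypothetical counts: the event was seen 0 times in 3 observations.
print(ml_estimate(0, 3))     # 0
print(posterior_mean(0, 3))  # 1/5
```

With only three observations the maximum-likelihood estimate declares the
event impossible, while the posterior mean keeps a non-zero probability
that shrinks as more data arrive.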

>>The computer industry has no excuses for not properly instrumenting key
>>behavioral components in hardware but I imagine that a similar effort in
>>molecular chemistry or biology is virtually impossible with today's
>>technology. If that assumption is correct, there must be a crucial need for
>>accurate simulation capabilities to test various theories. In that regard, I
>>have looked at UCSD Professors Nathan Baker's and Michael J. Holst's work on
>>modeling the "MC", a simulation of the electrostatics of chained biological
>>molecular ( http://www.sciencenews.org/20010901/fob8ref.asp and
>>http://www.scicomp.ucsd.edu/~mholst/ ) and several of the gene sequencing
>>programs that are publicly available (BLAST and so on).

Simulation requires a thorough knowledge of the system being simulated.
For most of biochemistry and molecular biology, this knowledge is not
quite available.  For systems biology, it is probably at least 50
years off. (The biophysics people have a touching faith in
simulation---they seem to feel that with the entire research budget of
the US spent on computers for them, they'd be able to simulate any
physical process.  The fact remains that the details of protein
folding are still well out of reach of biophysical simulation, and
more complicated processes (like protein-protein docking and the
actions of the ribosome) are even further off.)

>> Still there seems
>>to be little or nothing available in terms of dynamically simulating actual
>>interactions. Also, the sequencing algorithms seem to be all statistical in
>>nature rather than trying to find exact or near exact matches. Is that the
>>result of the huge size of the problem, or performance considerations. What
>>happens if you might find more than one exact match? Some of these questions
>>are a result of sheer ignorance on my part for which I apologize but I
>>suspect there also issues of modeling inaccuracy and computational

Exact match algorithms are easy---they came first in the field.  But
we're not interested in exact matches---we want biologically relevant
matches, which means dealing with the changes introduced during evolution.
There ARE still algorithms in common use that use exact matches as a
quick prefilter before doing more rigorous inexact matching.  (BLAST
and BLAT come to mind.)
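
The prefilter idea can be sketched roughly as follows. This is a toy
illustration only, not how BLAST or BLAT are actually implemented: the
function names, the word length k, and the match/mismatch scores are all
assumptions, and the inexact step here is a plain ungapped comparison
rather than a real alignment.

```python
from collections import defaultdict

def build_kmer_index(target, k):
    """Map every length-k substring of target to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index

def seed_diagonals(query, index, k):
    """Exact-match prefilter: collect the diagonals (target_pos - query_pos)
    on which at least one k-mer of the query matches the target exactly."""
    diagonals = set()
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], ()):
            diagonals.add(tpos - qpos)
    return diagonals

def ungapped_score(query, target, diagonal, match=1, mismatch=-1):
    """Inexact step: score the whole query along one diagonal, without gaps.
    Query positions that fall outside the target count as mismatches."""
    score = 0
    for qpos, qchar in enumerate(query):
        tpos = qpos + diagonal
        if 0 <= tpos < len(target) and target[tpos] == qchar:
            score += match
        else:
            score += mismatch
    return score

def search(query, target, k=4):
    """Score only the diagonals that pass the exact k-mer prefilter.
    Returns (score, diagonal) pairs, best score first."""
    index = build_kmer_index(target, k)
    hits = [(ungapped_score(query, target, d), d)
            for d in seed_diagonals(query, index, k)]
    return sorted(hits, reverse=True)

target = "GGGGACGTACGAGGGG"
query = "ACGTTCGA"            # differs from target[4:12] at one position
print(query in target)        # False: no exact match for the whole query
print(search(query, target))  # [(6, 4)]: 7 matches, 1 mismatch at offset 4
```

The whole query has no exact match, but one shared 4-mer is enough to
point the slower inexact comparison at the right place.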

>>I have started to follow your suggestions regarding CASP experiments and
>>would appreciate any specifics you might have regarding the "thousands of
>>other problems in bioinformatics" particularly where the issues involve long
>>sequential chains of entities such as atoms, molecules or representations of
> I agree that Kevin's reply was good. Your posts suggest that you are a
> serious programmer who wants to make some useful contribution.
> Collaborate with people in the field. The odds of you writing anything
> useful while on the outside are nil. You need the type of
> understanding that comes from being there, and seeing what people are
> doing. To some extent, they don't know what they want, so asking them
> does not really yield good info. Ongoing, give and take communication
> between user and programmer is critical.

I agree with Bob on this---you can get some stuff from looking at the
same problems everyone else is looking at (and a lot of what I do is
like that), but many of the big contributions come from recognizing a
question that hasn't been asked before, and solving it before someone
else even thinks to ask the question.

Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
Professor of Computer Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.
