IMPORTANT!! - Proposal for a different format for bionet.molbio.genbank.updates
roy at mchip00.med.nyu.edu
Mon Mar 21 16:10:48 EST 1994
Dave Kristofferson wrote:
> So what to make of this exercise in public opinion polling?!?!?
I haven't been active on the bionet groups for a couple of years,
since my activities have taken me rather out of the day-to-day world of
molecular biology. I'm still alive, I just don't do DNA any more. But, as
one of the driving forces behind getting bionet.molbio.genbank.updates
created a few years ago, I guess I really should say something about this.
My original vision of b.m.g.u went something like the following. Picture a
network that looks like this (the letters are nodes, the numbers inter-node
links):

            A
          1/ \2
          D   B
            3/ \5
            C   E
           4|   |6
            G   F
Let's assume that node A is the source of some data (the genbank
updates) that nodes B, C, D, E, F, and G all want. For simplicity's sake, I
have represented the network as a rooted tree. It's really a mesh (i.e. there
are many possible paths between any pair of nodes), but for the purposes of
this discussion, assuming it's a tree won't hurt. Don't think of the nodes
as the host sites, but rather as the Internet backbone routers to which
those hosts are connected. If that last sentence confuses you (i.e. you're
not a network guru), pretend I never said that and you won't loose anything
important. Examine what happens if B-G all get the data using FTP direct
Site    Links data must traverse
B       2
C       2, 3
D       1
E       2, 5
F       2, 5, 6
G       2, 3, 4
This adds up to 12 link traversals to distribute the data to each
site. In particular, link 2 had the same data travel over it 5 times, and
it traveled over links 3 and 5 twice each.
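The per-link counts can be checked with a short script. This is only a
sketch: the parent/link map below is inferred from the paths listed in the
table, and the names are just the example's node letters and link numbers.

```python
from collections import Counter

# Per-link traffic when every site FTPs the data directly from A.
# Each entry maps a node to (its parent, the link number to that parent);
# this layout is inferred from the table of paths, not given as code anywhere.
parents = {"B": ("A", 2), "D": ("A", 1), "C": ("B", 3),
           "E": ("B", 5), "G": ("C", 4), "F": ("E", 6)}

usage = Counter()
for site in parents:          # each site pulls the data over its full path to A
    node = site
    while node in parents:
        node, link = parents[node]
        usage[link] += 1

print(sum(usage.values()))    # 12 traversals in all
print(usage[2], usage[3], usage[5])  # link 2 carries the data 5 times; 3 and 5 twice
```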
Now, do the same thing with a "store, replicate, and forward" type
network such as netnews provides. The data now only has to flow over each
link once, i.e. 6 link traversals to distribute the data to the 6 sites.
Assuming cost of distribution is linear with the number of link traversals
(a big assumption, but not an unreasonable one, I think), the cost to
distribute the data using FTP to a central site is double what it is using
netnews's flooding algorithm. This is obviously a trivial example, but I
think you get the idea.
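The arithmetic of the comparison can be sketched in a few lines. Again the
parent/link map is my inference from the paths in the table: direct FTP
costs the sum of every site's path length back to A, while a flood crosses
each link exactly once.

```python
# Cost comparison on the example tree: direct FTP from A vs. a
# store-and-forward flood. The layout is inferred from the path table.
parents = {"B": ("A", 2), "D": ("A", 1), "C": ("B", 3),
           "E": ("B", 5), "G": ("C", 4), "F": ("E", 6)}

def path_length(node):
    """Number of links between A and `node`."""
    hops = 0
    while node in parents:
        node, _link = parents[node]
        hops += 1
    return hops

ftp_cost = sum(path_length(site) for site in parents)  # each site pulls its whole path
flood_cost = len(parents)                              # one traversal per link
print(ftp_cost, flood_cost)  # 12 vs. 6
```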
There is also the reliability issue. For F to get the data using
FTP, links 2, 5, and 6, and nodes A, B, E, and F all need to be up at the
same time. Using usenet, all that you need is for (A,2,B), (B,5,E), and
(E,6,F) to each be up sequentially.
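A toy calculation makes the reliability point concrete. The probability and
the retry window below are my own illustrative numbers, not anything
measured: assume each hop (node pair plus link) is independently usable in a
given hour with probability p.

```python
# Illustrative reliability comparison (p and k are assumed numbers, not
# data from the post). FTP needs all three hops A-B, B-E, E-F up at the
# same moment; store-and-forward only needs each hop up at some point.
p = 0.9          # chance a single hop is usable in a given hour (assumed)
k = 3            # hours the store-and-forward network gets to retry (assumed)

ftp_per_try = p ** 3                 # all three hops simultaneously
hop_eventually = 1 - (1 - p) ** k    # a hop comes up in at least one hour
relay = hop_eventually ** 3          # all three hops, each in its own time

print(round(ftp_per_try, 3))  # 0.729
print(round(relay, 3))        # 0.997
```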
Enough theory. As I read recently somewhere, "Theory and practice
are the same in theory, but different in practice". I think that applies
very well here.
Somewhere along the line, one of my main underlying assumptions was
proven false. I assumed that, given that this was not being done in one of
the "big 6" mainline hierarchies (rec, soc, sci, news, talk, and comp), and
especially due to the volume of data being transmitted, only those sites
that were actually interested in getting the data would subscribe to the
group. Alas, as it turned out, many (if not most) news sites seem to be
configured to automatically get full news feeds. As b.m.g.u got rolling, I
watched with some amusement as it quickly rose to the top of the arbitron
rating lists for "cost as a function of kbytes posted divided by number of
readers". I didn't let this bother me since I knew that arbitron was a
statistical survey and the extremely limited distribution of the group was a
statistical anomaly that the arbitron algorithms just weren't designed to
handle.
Then, I noticed that it was also doing quite respectably on the
statistical postings showing the number of sites receiving the group. This
I viewed with rising alarm. Apparently, even though I couldn't conceive of
more than a few hundred sites on the net being interested, something like
50% of the sites on the net (which must have been tens of thousands even
back then) were getting the genbank updates.
Based on email queries I made at the time, this seemed to be because
of several things. First, many news administrators just couldn't be
bothered to worry about exactly which groups they got and just got them all.
Some actually told me that even though they didn't want the group, they felt
obligated to get it just in case some downstream site they fed wanted it.
I'm not sure what to make of that, but it did indeed seem to be the case.
So, while it may indeed be true that netnews can be a much more
efficient way to distribute data, in practice the gross inefficiency of
many, many sites getting the group even though they didn't actually want the
data far outweighed any possible efficiency gain. By orders of magnitude.
Sometimes I wondered if I had unleashed some sort of Frankenstein monster
which had run amok, wreaking network havoc throughout the world, except that
nobody noticed but me.
Well, with that somewhat long-winded explanation, I really do have
to say that if Dave's figures are anywhere near accurate, and there really
are only 30-odd people using the data, it might very well be time to kill the
group.
Before all the involved parties were actually convinced to get
b.m.g.u going, we came to an agreement that it was to be an experimental
service only. All the principals wrote a nice little paper, full of happy
predictions, got it published in CABIOS, and life went on. Unfortunately,
all research papers need to have data, and conclusions based on that data.
In this case, I'd say the conclusion has to be that while the experiment was
a theoretical success (i.e. we proved that one can indeed use usenet as a
transport layer to maintain a distributed database), it was a practical
failure.
Boy, I should have written all this network theory stuff a few years
ago. Would have made for a much longer paper :-)
Roy Smith <roy at nyu.edu>
Hippocrates Project, Department of Microbiology, Coles 202
NYU School of Medicine, 550 First Avenue, New York, NY 10016
"This never happened to Bart Simpson."