PC/Mainframe - data vs. storage

Charles Bailey bailey at hmivax.humgen.upenn.edu
Mon Jun 22 16:21:20 EST 1992

In article <1992Jun19.122907.12041 at athena.cs.uga.edu>, russell at dogwood.botany.uga.edu writes:
> 1.  The amount of data in GenBank, EMBL etc. must be going up rapidly,
>     and someone must be projecting this into the future.

Last I heard, GenBank was increasing at ~25% of total size per quarterly
release.  The specific figure for the current release of EMBL (31) is 14%.
(Both GenBank and EMBL should grow by the same amount, with a difference in
rate determined only by the difference in size of the base pool of sequences
which haven't been exchanged with each other yet.)  I don't have figures at my
fingertips for protein sequence databases, but I expect that the amounts of
growth are similar, especially if one considers predicted sequences derived
from nucleic acid sequences.
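
As a rough illustration of what those per-release rates imply when compounded,
here is a minimal sketch.  The function names are my own, and it assumes the
rate stays constant from release to release, which is almost certainly too
simple:

```python
import math

def project_size(current_size, quarterly_growth, quarters):
    """Size after `quarters` releases, compounding a constant growth rate."""
    return current_size * (1.0 + quarterly_growth) ** quarters

def quarters_to_double(quarterly_growth):
    """Number of quarterly releases needed for the database to double."""
    return math.log(2.0) / math.log(1.0 + quarterly_growth)

# Relative growth over one year (four releases) at the two rates quoted above.
genbank_year = project_size(1.0, 0.25, 4)   # 1.25**4, about 2.44x
embl_year = project_size(1.0, 0.14, 4)      # 1.14**4, about 1.69x
```

At 25% per release the doubling time works out to a little over three
quarters; at 14% it is a bit more than five.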

> 2.  The capacity of storage media is going up, and the price per
>     capacity is going down.

True enough.  I expect that the immediate limit here may prove to be not the
size of the physical device, but the ability of the OS or the analysis sw to
deal with datasets spread over multiple devices.  For instance, the
GenBank+EMBL+PIR+SwissProt+littlestuff sets which many packages use will
probably exceed 600 Mbytes in the very near future, meaning that if redundant
entries are not eliminated (e.g. identical entries in GenBank and EMBL; I'm not
talking about restructuring within a db here), it will not fit onto a single
CD-ROM.  This means sites which receive updates by CD will have to mount
multiple discs, or will have to transfer data from CD to a large disk pack. 
This may be a point in favor of 'big' (i.e. workstation or larger) systems,
since they handle multiple volume sets more easily (yes, this is a gross
generalization) and because sites with money to invest in large disks are also
likely to have money to invest in fast CPUs (see below).  (Of course the
problem of maintaining accurate data locally will intensify, but that's already
been discussed.)
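
To put numbers on the CD-ROM limit, here is a small sketch, assuming the
600 Mbyte disc mentioned above and a constant quarterly growth rate.  The
function names are hypothetical, and any starting size you feed it is your own
estimate; I am not quoting a measured size for the combined set.

```python
import math

CD_CAPACITY_MB = 600  # per-disc figure used in the text above

def discs_needed(dataset_mb, capacity_mb=CD_CAPACITY_MB):
    """Number of discs required to hold a dataset of the given size."""
    return math.ceil(dataset_mb / capacity_mb)

def quarters_until_exceeds(current_mb, capacity_mb, quarterly_growth):
    """Count quarterly releases until the dataset no longer fits on one disc.

    Assumes a constant, positive growth rate; returns 0 if it already
    exceeds the capacity.
    """
    if quarterly_growth <= 0:
        raise ValueError("growth rate must be positive")
    size = current_mb
    quarters = 0
    while size <= capacity_mb:
        size *= 1.0 + quarterly_growth
        quarters += 1
    return quarters
```

For example, with a purely illustrative starting size of 500 Mbytes,
quarters_until_exceeds(500, 600, 0.25) gives 1 and
quarters_until_exceeds(500, 600, 0.14) gives 2, so the single-disc limit is
reached within a release or two at either of the growth rates quoted earlier.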

> These two trends could resolve in several ways -
>     There might be so much sequence information that local computers
>     would not be able to handle the storage, and we would all have
>     to rely on the big facilities for searching, homologies, etc.
>     The improvement in data storage and its lowered price might
>     reach the point where PCs and Macs could reasonably handle
>     all the required tasks, including storage of all sequence data.
> Has anyone projected these kind of thoughts for 5 years from now, as
> opposed to what is the optimum system for the size of GenBank right
> now.

Actually, I think that the limiting factor in most cases will be cycles, not
storage space.  This is especially true as smallish minis and largeish micros
use more and more of the same mass storage technology (e.g. a SCSI chain filled
with 1-2 Gbyte disks will be adequate for a while :-)).  As the data expands,
however, and as techniques for analysis become more sophisticated,  micros like
the Mac and IBM PC will suffer more severely from CPU performance limitations. 
I expect that for the near future they will perform well for tasks like
restriction mapping, contig assembly, and perhaps simple pattern searches or
alignments, and may in fact have an advantage in these areas since most of the
packages I've seen have nicer interfaces than the mini/mainframe sw.  The
faster processors in the minis, however, will substantially outperform micros
in tasks like database searching, multiple alignments, etc.  (For the *real*
processing nuts, there's always parallel machines, but I don't expect to see
them in general-use analysis facilities for a while.)  Particularly as the
prices for small minis and workstations drop, I'd recommend that any site which
plans to do significant database searching or complex alignment locally buy as
much CPU as they can reasonably afford.

Eventually, I expect the line between 'PC's and workstations will blur
sufficiently that the distinction will not be useful, but the basic
observations I've made here will likely hold true.  I'd be interested to hear
others' thoughts, especially if they think I'm way off the mark.


					Charles Bailey

!          Dept. of Human Genetics / Howard Hughes Medical Institute
! University of Pennsylvania School of Medicine  Rm. 430 Clinical Research Bldg.
!     422 Curie Blvd.  Philadelphia, PA 19104 USA      Tel. (215) 898-1699
!          Internet: bailey at hmivax.humgen.upenn.edu  (IN
