Keith Bradnam said:
: I have a simple 'Species' class in my database which mostly contains
: details of Proteins and Sequences for each species represented in the
: database. For 'Arabidopsis thaliana' there is (as you would expect) a lot
: of sequences (over 183,000).
Many of us have a similar problem, if not so extreme. A few records get
filled with a huge number of links, too many to be conceivably useful.
Browsing such a record is a ridiculous notion, and even an efficient
"follow" query ("find species 'Arabidopsis thaliana'; follow sequence") has
no utility: there is no reason you would want a result set that large. Even
when such records aren't big enough to actually crash the system, they are
a performance problem and ugly for the user.
On the other hand, the record for Species 'Arabidopsis arenosa' has only
four sequence entries, and these might be quite interesting to somebody.
Such unbalanced situations almost always arise from loading sequences into
a model like Keith's:
?Sequence ...
Species ?Species XREF Sequence
This is a natural way to process input from GenBank/EMBL records. Of
course the input could be balanced better by parsing it differently,
dropping the XREF and loading the data from the other side:
Species : X
Sequence Y
But this would still require additional code to decide which Species
should be excluded.
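
As a rough illustration of that additional code, here is a minimal Python
sketch. It assumes the parser has already produced (species, sequence) name
pairs; the function name, the input shape and the default cutoff are all
hypothetical, and names are assumed to contain no double quotes.

```python
from collections import defaultdict

def species_side_ace(pairs, max_links=10_000):
    """Build .ace text that loads Sequence links from the Species side.

    pairs: iterable of (species_name, sequence_name) tuples, a hypothetical
    intermediate form produced by whatever parses the GenBank/EMBL input.
    Species with more than max_links sequences are left out entirely.
    """
    by_species = defaultdict(list)
    for species, sequence in pairs:
        by_species[species].append(sequence)

    blocks = []
    for species in sorted(by_species):
        sequences = by_species[species]
        if len(sequences) > max_links:
            continue  # too many links to be useful, so do not load them
        lines = [f'Species : "{species}"']
        lines += [f'Sequence "{seq}"' for seq in sequences]
        blocks.append("\n".join(lines))

    # .ace objects are separated by blank lines.
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

The cutoff is applied per Species, so a small record like 'Arabidopsis
arenosa' keeps its links while 'Arabidopsis thaliana' gets none.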
The ACEDB code already has a trigger that detects when loading such XREFs
produces an object too large to be useful. You can read its warning in
database/log.wrm:
: 2001-01-20_00:30:20 genome 20042 Class Sequence, object
: BCD421_WHE1B0042 has 4280 > 3000 cells. This is just a warning, acedb
: has no hard limits on the mumber of cells per object, but the
: performances degrade on very large objects
: Either, you are cross referencing many entries into a single object, it
: may not be useful, and you could drop the XREF in the model and get the
: same info via an occasional query or, continually, via a subclass, or
: This object is Class:?Text, and you should rather use plain Text in the
: model or define a controlled vocabulary by giving an explicit list of
: tags
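
To find which objects have tripped this warning, something like the
following Python sketch could scan the log. The pattern is inferred only from
the one message quoted above, so treat it as an assumption about the log
format; it also assumes class and object names contain no spaces.

```python
import re

# Pattern inferred from the single warning quoted above (an assumption,
# not a documented log format).
WARNING = re.compile(r"Class (\S+), object (\S+) has (\d+) > (\d+) cells")

def oversized_objects(log_text):
    """Return (class, object, cells, threshold) for each warning found."""
    # Collapse all whitespace so a warning wrapped over lines still matches.
    flat = " ".join(log_text.split())
    return [(cls, obj, int(cells), int(limit))
            for cls, obj, cells, limit in WARNING.findall(flat)]
```

Reading the file is left to the caller, e.g.
oversized_objects(open("database/log.wrm").read()).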
REQUEST
My request is: can the code be extended to allow the database owner to
set a hard limit on XREFs, perhaps through a setting in a wspec/ file?
When the limit is exceeded, all the links could be deleted and replaced
with a single placeholder object like
Sequence : "More than 10,000, not shown".
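
The behaviour I have in mind can be sketched in a few lines of Python. This
only illustrates the request, it is not anything ACEDB does today; the
function name and the default limit are made up.

```python
def cap_links(links, limit=10_000):
    """Return the links unchanged, or one placeholder if there are too many.

    limit stands in for the value a database owner would configure
    (hypothetical; no such setting exists in wspec/ as far as I know).
    """
    if len(links) > limit:
        # Replace every link with a single explanatory placeholder name.
        return [f"More than {limit:,}, not shown"]
    return list(links)
```

A record with four links comes back untouched, while one over the limit
collapses to the single placeholder entry.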
APPRECIATION
One of ACEDB's strengths is its ability to efficiently handle large numbers
of sparse many-to-many relations. My database has hundreds of these, yours
probably does too. In a relational DBMS any of these with a sparseness
above about 10:1 would need to be implemented with a separate relationship
table, each of which would itself have a cost in performance, data size
and, mainly, complexity. In ACEDB I don't see any cost up to at least 100:1.
I estimate about half my records fall in this range, and nearly all the
rest are below it. This is a big win for the ACEDB data structure.
The place where we're having problems is beyond 1000:1. Here a properly
structured RDBMS could be quite efficient whereas ACEDB isn't. The question
is, how often is a relation that dense actually useful? If we could just
truncate such records, wouldn't that be what we really want?
- Dave
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David E. Matthews USDA-ARS Plant Genome Database Curator
Adjunct Associate Professor
Department of Plant Breeding Email: matthews at greengenes.cit.cornell.edu
Cornell University Phone: 607-255-9951
Ithaca, New York 14853, USA Fax: 607-255-6683
---