timc at chiark.greenend.org.uk
Wed Nov 7 08:27:20 EST 2001
In article <200111052112.QAA00881 at mailbox.nlm.nih.gov>,
Jean Thierry-Mieg <mieg at ncbi.nlm.nih.gov> wrote:
>in our view there exist
> Gene :: which is a set of alternatively spliced mRNAs
> mRNA :: which has 3' and 5' UTR and exons and introns and is supported by
> Product :: which corresponds to choosing the reading frame from Met to Stop
You state that an exon is not a biological entity. That's really just
an opinion. Whether it's true or not, though, doesn't matter. I and
many others are faced with a practical problem of modelling transcripts
in a relational database. A transcript is too complicated to be
represented by a single entity in a relational database. Single
contiguous spans of DNA are simple enough.
Therefore, the obvious way to model a transcript is as an ordered list
of simpler UTR and exon entities.
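A minimal sketch of what such a model could look like, using Python's
sqlite3 module. Every table and column name here (transcript, span,
transcript_span, span_order and so on) is made up for illustration and
is not taken from any real schema; coordinates are assumed to be
1-based and inclusive on the parent sequence.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE transcript (
    transcript_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE
);
-- One row per contiguous span of DNA (an exon or a UTR).
CREATE TABLE span (
    span_id   INTEGER PRIMARY KEY,
    parent    TEXT    NOT NULL,   -- name of the parent sequence
    seq_start INTEGER NOT NULL,   -- assumed 1-based, inclusive
    seq_end   INTEGER NOT NULL,
    kind      TEXT    NOT NULL CHECK (kind IN ('exon', 'utr5', 'utr3'))
);
-- The ordered list: which spans make up which transcript, and in what order.
CREATE TABLE transcript_span (
    transcript_id INTEGER NOT NULL REFERENCES transcript(transcript_id),
    span_id       INTEGER NOT NULL REFERENCES span(span_id),
    span_order    INTEGER NOT NULL,
    PRIMARY KEY (transcript_id, span_order)
);
"""


def make_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def ordered_spans(conn, transcript_name):
    """Return the spans of one transcript, in transcript order."""
    rows = conn.execute(
        """
        SELECT s.span_id, s.kind, s.seq_start, s.seq_end
        FROM transcript t
        JOIN transcript_span ts ON ts.transcript_id = t.transcript_id
        JOIN span s ON s.span_id = ts.span_id
        WHERE t.name = ?
        ORDER BY ts.span_order
        """,
        (transcript_name,),
    )
    return rows.fetchall()
```

Because the link table refers to spans by ID, two splice variants that
share an exon can both point at the same span row.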
Converting this relational representation into the current ACeDB
representation works very well. However, converting the ACeDB
representation to a relational representation is much more awkward. If
someone changes a transcript in ACeDB, and I want to check in the
relational database what needs to be updated, I have to jump through
hoops checking that the coordinates of each exon in the ACeDB transcript
are the same as they are in the relational database. This is very slow
indeed, because I have no quick way of identifying the particular exon
entities I need to check.
If an Exon class were added to the ACeDB model, I could store the
relational database's unique ID for that exon there.
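To illustrate why a stored ID would help, here is a rough sketch of the
update check once each exon on the ACeDB side carries the relational
database's ID. The Exon tuple, the ID-keyed dictionaries and the function
name are all hypothetical; no such ID field exists in the current ACeDB
model, which is the point of the request.

```python
from typing import Dict, List, NamedTuple


class Exon(NamedTuple):
    """Hypothetical exon record: parent sequence name plus coordinates."""
    parent: str
    start: int
    end: int


def exons_needing_update(
    acedb_exons: Dict[int, Exon],
    relational_exons: Dict[int, Exon],
) -> List[int]:
    """Return the IDs whose ACeDB record differs from the relational one.

    Both mappings are keyed by the relational database's unique exon ID
    (assumed to be stored on the ACeDB side as well), so each check is a
    direct lookup rather than a search by coordinates.
    """
    stale = []
    for db_id, exon in acedb_exons.items():
        # A missing ID also counts as needing attention.
        if relational_exons.get(db_id) != exon:
            stale.append(db_id)
    return sorted(stale)
```

Without the ID, the same check has to match exons by their coordinates,
which is exactly what breaks down when the coordinates are what changed.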
If the ACeDB team are serious about binding FMAP to relational backends,
this sort of thing is going to become an important consideration, in my
opinion.
I would rather use distinct objects than your suggestion (in your other
posting) to add it in the constructed type in the sequence object, for
the following reasons:
1) The unique identifier is a property of the exon, not of the
transcript. Object-oriented principles therefore suggest it
should be stored in the exon object.
2) If an exon is shared between multiple transcripts, links to an ?Exon
object help ensure that the data is consistent between the multiple
transcripts.
3) Use of constructed types will result in duplication of data in
multiple splice variants, with resulting wasted storage and risks
of loss of data integrity.
And while my idea of an XREF back to the transcript was made rather
flippantly, it's not that useless. The link already exists in the
relational representation, and is very useful. For example:
I have a SNP, and I want to predict its consequences with respect to any
local transcripts. Since a transcript is not a simple entity in the
relational schema, I look for exons which map to the same region as my
SNP. Instantly, I know whether this SNP is in a coding region or not.
I can trivially follow the Exon<->Transcript relation to find out which
splice variants are involved.
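The lookup just described can be sketched roughly as follows, again with
sqlite3. The exon and transcript_exon tables, their columns and the
function name are assumptions made for this example only, with exon
coordinates taken to be 1-based and inclusive on the parent sequence.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE exon (
    exon_id   INTEGER PRIMARY KEY,
    parent    TEXT    NOT NULL,
    seq_start INTEGER NOT NULL,
    seq_end   INTEGER NOT NULL
);
-- The Exon<->Transcript relation; a shared exon has several rows here.
CREATE TABLE transcript_exon (
    transcript TEXT    NOT NULL,
    exon_id    INTEGER NOT NULL REFERENCES exon(exon_id),
    PRIMARY KEY (transcript, exon_id)
);
CREATE INDEX exon_location ON exon (parent, seq_start, seq_end);
"""


def make_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def transcripts_hit_by_snp(conn, parent, position):
    """Return names of transcripts with an exon covering the SNP position."""
    rows = conn.execute(
        """
        SELECT DISTINCT te.transcript
        FROM exon e
        JOIN transcript_exon te ON te.exon_id = e.exon_id
        WHERE e.parent = ?
          AND ? BETWEEN e.seq_start AND e.seq_end
        ORDER BY te.transcript
        """,
        (parent, position),
    )
    return [name for (name,) in rows]
```

An empty result means no exon entity overlaps the SNP; a non-empty one
lists the splice variants involved in a single query.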
Compare the way I have to do that with ACeDB. I have to find all the
transcript sequences to which the SNP maps. That bit's easy enough.
I don't yet know whether it's coding or not. To do that, I have to
parse the ?Sequence objects to find out where the Exons are, and do some
really nasty arithmetic (involving some data which I have to obtain from
the transcript's parent sequence, because that's where its orientation
is stored), and finally iterate through those coordinates looking to see
whether an exon maps to the same location. This is *really* messy!
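For a flavour of the arithmetic involved, here is a sketch under stated
assumptions: exon coordinates are stored relative to the start of the
transcript (1-based), the transcript's placement is recorded in the
parent sequence as a start/end pair, and a start greater than the end
means the transcript lies on the reverse strand. The function names are
invented for this example.

```python
def exon_to_parent(sub_start, sub_end, exon_start, exon_end):
    """Convert transcript-relative exon coordinates to parent coordinates.

    sub_start/sub_end: the transcript's placement in the parent sequence
    (assumption: sub_start > sub_end indicates the reverse strand).
    exon_start/exon_end: 1-based positions relative to the transcript start.
    Returns (low, high) in parent coordinates, low <= high.
    """
    if sub_start <= sub_end:
        # Forward strand: offsets count upwards from the transcript start.
        low = sub_start + exon_start - 1
        high = sub_start + exon_end - 1
    else:
        # Reverse strand: offsets count downwards from the transcript start.
        high = sub_start - exon_start + 1
        low = sub_start - exon_end + 1
    return low, high


def snp_in_exon(snp_pos, sub_start, sub_end, source_exons):
    """True if snp_pos (parent coordinates) falls inside any listed exon."""
    for exon_start, exon_end in source_exons:
        low, high = exon_to_parent(sub_start, sub_end, exon_start, exon_end)
        if low <= snp_pos <= high:
            return True
    return False
```

This has to be repeated for every transcript the SNP maps to, and it
needs the placement data from the parent sequence before it can start.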
If the exons were real objects mapped directly to parent sequences, this
sort of analysis would be much easier.
Computer modelling of biological systems, in my opinion, does not have
to follow the biological reality at the implementation level (which is
what we are talking about). As long as it doesn't result in
demonstrably false results, the model is valid, surely?