Does anyone using acedb have cytogenetic data?
If so, do you find it handles this data adequately/some of it/not at all?
Are there any improvements that you would like to see?
Having tried to convert my data into acedb3 format, I still find it
extremely frustrating that, yet again, it is impossible to enter cytogenetic
data and get it to display on appropriate maps. It seems to me that very
little thought has actually been given to the representation of this data.
With the next workshop approaching (July 1994), this would be a good time
to get some improvements made that will be useful to everyone and, more
importantly, get them incorporated into the main version of acedb.
The following are based mainly on my own requirements. If these need to be
more generic, or you can see a better way to do the same thing, then please
comment.
Data
----
The acedb philosophy is that "objects" displayed on a map are defined
either as a single Locus or as an Interval which is composed of two end
points or Loci. (Boston workshop) In practice, map locations are defined
in the ?map_location model and are defined numerically.
Mouse cytogenetic data is of two types: genes with locations derived mainly
from hybridisation in situ (HIS) and somatic cell (SC) techniques, and
chromosomal anomalies (or rearrangements) with breakpoint data, eg insertions,
translocations, etc. Breakpoints can be regarded as Loci in acedb. If an
anomaly has two breakpoints on the same chromosome then the intervening
region forms a segment or interval. In all cases the location is given as
chromosome bands.
Improvements to Existing Models and Code Wanted
-----------------------------------------------
1. The ?map_location model should allow
- locations to be given as chromosome bands.
- end points to be specified to indicate ranges.
Eg human Xq21.1-Xq27.3
This means the computer works out the numeric coordinates required for
placing loci and intervals on the map(s) from the Chrom_Band definitions.
If the band is defined in terms of the tags Contains and Contained_in
then the algorithm will have to recurse until it can either get a
numeric definition or fail.
2. The computer should find the missing bands in that range if the full
complement of bands is needed elsewhere. (They usually are if the bands in
the Chromosome column on the map(s) are to be highlighted correctly.)
3. If a set of ?Chrom_Bands forms an idiogram for a chromosome, is there
any way the same idiogram can be used on several maps without having
to redefine the Chrom_Bands for each map?
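To make points 1 and 2 concrete, here is a rough sketch in Python of the kind of lookup I have in mind. It is not acedb code. The Band class, its field names, the band names and all the coordinates are invented for illustration, and a band with no numeric definition of its own simply falls back to the coarser extent of the band that contains it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[float, float]


@dataclass
class Band:
    """Hypothetical stand-in for a ?Chrom_Band object (not an acedb type)."""
    coords: Optional[Span] = None          # numeric extent, if one was entered
    contained_in: Optional[str] = None     # name of the enclosing band
    contains: List[str] = field(default_factory=list)  # names of sub-bands


def _span_from_subbands(name: str, bands: Dict[str, Band],
                        seen: Set[str]) -> Optional[Span]:
    """Use the band's own coordinates, else the union of ALL its sub-bands."""
    band = bands.get(name)
    if band is None or name in seen:       # unknown band, or a Contains loop
        return None
    seen.add(name)
    if band.coords is not None:
        return band.coords
    if not band.contains:
        return None
    spans = [_span_from_subbands(c, bands, seen) for c in band.contains]
    if any(s is None for s in spans):      # an undefined sub-band: give up here
        return None
    return (min(s[0] for s in spans), max(s[1] for s in spans))


def resolve(name: str, bands: Dict[str, Band]) -> Optional[Span]:
    """Recurse until a numeric definition is found, or fail with None."""
    visited: Set[str] = set()
    current: Optional[str] = name
    while current is not None and current not in visited:
        visited.add(current)
        span = _span_from_subbands(current, bands, set())
        if span is not None:
            return span
        band = bands.get(current)
        current = band.contained_in if band is not None else None
    return None


def resolve_range(first: str, last: str,
                  bands: Dict[str, Band]) -> Optional[Span]:
    """Numeric extent of a band range such as 'first-last'."""
    a, b = resolve(first, bands), resolve(last, bands)
    if a is None or b is None:
        return None
    return (min(a[0], b[0]), max(a[1], b[1]))


def bands_in_range(first: str, last: str, idiogram: List[str]) -> List[str]:
    """All bands from first to last inclusive; raises ValueError if unknown."""
    i, j = sorted((idiogram.index(first), idiogram.index(last)))
    return idiogram[i:j + 1]


# Invented example data: names and coordinates are NOT real measurements.
BANDS = {
    "q2": Band(coords=(10.0, 30.0), contains=["q21", "q22"]),
    "q21": Band(contained_in="q2", contains=["q21.1", "q21.2"]),
    "q21.1": Band(coords=(10.0, 12.0), contained_in="q21"),
    "q21.2": Band(coords=(12.0, 15.0), contained_in="q21"),
    "q22": Band(contained_in="q2"),
}
IDIOGRAM = ["q21.1", "q21.2", "q22"]
```

With this toy data, resolve("q21", BANDS) is built up from its two sub-bands, while resolve("q22", BANDS) has to climb to the enclosing band. A single ordered band list per chromosome, as in IDIOGRAM, is also one possible answer to point 3, since several maps could share it.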
At the very least then, the ?map_location model would be:
?map_location  UNIQUE  Position        UNIQUE Float Float
                       Multi_Position  Float Float
                       Ends            Left   UNIQUE Float Float
                                       Right  UNIQUE Float Float
                       Chrom_Band      ?Chrom_Band ?Chrom_Band
                       Junction        ?Chrom_Band ?Chrom_Band
where Chrom_Band refers to a location within a band or a range of bands
and Junction refers to a point at the junction of two adjacent bands.
(Yes, we do have this data.)
Except that Left and Right currently define the Interval end points and
should have Chrom_Band definitions as well. (Also, I am for ever being
told by the scientists at Harwell that Left and Right are meaningless and
to use the proper terms instead, ie Proximal and Distal.)
Problems with ?map_location Model
---------------------------------
1. Locations are numeric and have to be specified fairly accurately.
Cytogenetic data is given as chromosome bands and is often imprecise.
For example, a deletion is from mouse band 4E2 to the end of the
chromosome (4E2-4ter).
2. The data in the ?map_location model are separated from any supporting
data in the rest of the parent model (eg Locus). This is not a problem
if there is only one position in the map_location model and the parent
model refers to only one location. However, nearly all chromosome
anomalies have two or more breakpoints. Some of these will be listed in
the same data structure, eg Translocations. Some have BOTH genetic and
cytogenetic data which has to be linked to the correct location.
3. Most of the confusion and problems that I have experienced stem
from the fact that ?map_location is used to define BOTH single points
on a map and intervals (two points) on a map.
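As an illustration of problem 1, the sketch below (Python, not acedb code) shows how an imprecise location such as 4E2-4ter might be normalised before any numeric lookup. The function names are hypothetical, the band list is only an illustrative tail of a chromosome and not a checked idiogram, and the handling of "ter" assumes the mouse case where there is a single terminus to run to. Human pter/qter would need more care.

```python
from typing import List, Tuple


def parse_band_range(text: str) -> Tuple[str, str]:
    """Split 'A-B' into its two end points; a single band is its own range."""
    parts = [p.strip() for p in text.split("-")]
    if len(parts) == 1 and parts[0]:
        return parts[0], parts[0]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    raise ValueError(f"cannot parse band range: {text!r}")


def expand_terminus(endpoint: str, idiogram: List[str]) -> str:
    """Replace an end point like '4ter' by the last band of the idiogram."""
    return idiogram[-1] if endpoint.endswith("ter") else endpoint


def normalise(text: str, idiogram: List[str]) -> Tuple[str, str]:
    """Parse a range and make any 'ter' end point explicit."""
    first, last = parse_band_range(text)
    return expand_terminus(first, idiogram), expand_terminus(last, idiogram)


# Illustrative tail of a band list, ordered towards the terminus (assumption).
MOUSE_CHR4_TAIL = ["4D3", "4E1", "4E2"]
```

The point is only that "to the end of the chromosome" cannot be resolved without an ordered set of bands for that chromosome to look it up in.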
Proposal
--------
If this proposal is going to cause immense work by changing some of the
fundamental data definitions, at least consider the possibility of creating
new models for cytogenetic data where these changes can be implemented.
For instance, instead of the model Interval there could be a CytoInterval.
The suggestion is that:
i. ?map_location be used for SINGLE locations only.
ii. All locations are defined by two points.
- If they are the same, the location is a single point
- If different, the location is a range
(Apparently this would preempt the time when molecular
information will be available and positions will be ranges
on the physical map.)
iii. All Intervals are defined by pairs of double points.
- If the end point is a range then either the mid point of the
range would be used to draw the Interval, or, an error bar
could be drawn to indicate the extent of the uncertainty.
- If only ONE point is given then the end of the Interval is from
the single point to the end of the chromosome. A dotted line
continuing on from the main part of the Interval (solid line)
could indicate the uncertainty at that end on the map.
iv. Intervals (or CytoIntervals) are composed of two Locus definitions
which in turn access the ?map_location model.
v. All anomaly breakpoints be treated as Loci (ie use the model Locus).
vi. Can we agree on a common definition for Chromosome Anomalies
(Rearrangements)?
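Points ii and iii can be sketched as follows. This is Python for illustration only, not acedb code: the class and function names are hypothetical, and I have assumed that when only one point is given it is the distal end that is missing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """A location defined by two points; equal points mean a single point."""
    start: float
    end: float

    @property
    def is_point(self) -> bool:
        return self.start == self.end

    @property
    def midpoint(self) -> float:
        return (self.start + self.end) / 2.0

    @property
    def uncertainty(self) -> float:
        """Half-width of the range, ie the length of an error bar."""
        return abs(self.end - self.start) / 2.0


@dataclass(frozen=True)
class DrawnInterval:
    """What a map display would need in order to draw one Interval."""
    proximal: float
    distal: float
    proximal_error: float
    distal_error: float
    open_ended: bool   # True: draw a dotted continuation to the chromosome end


def layout_interval(proximal: Location, distal: Optional[Location],
                    chrom_end: float) -> DrawnInterval:
    """Draw from range mid points; a missing end runs to the chromosome end."""
    if distal is None:
        return DrawnInterval(proximal.midpoint, chrom_end,
                             proximal.uncertainty, 0.0, True)
    return DrawnInterval(proximal.midpoint, distal.midpoint,
                         proximal.uncertainty, distal.uncertainty, False)
```

For example, layout_interval(Location(10.0, 14.0), None, 100.0) gives a line starting at 12.0 with an error bar of 2.0 and an open end at 100.0. The chromosome end itself would have to come from the idiogram.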
The model definitions would thus become:
?map_location  Position    Float Float
               Chrom_Band  ?Chrom_Band ?Chrom_Band
               Junction    ?Chrom_Band ?Chrom_Band

?Locus  Identifier UNIQUE Text
        Location  Map ?Map XREF Locus #map_location
        ... // rest of model definition

?Interval  Identifier UNIQUE Text
           Type ?Text // anomaly type, eg deletion, etc.
           Location  Map ?Map  Proximal #Locus
                               Distal #Locus
           Contains  Locus ?Locus
                     Fragment ?Interval
           Contained_in ?Chrom_Anomaly XREF Interval
1. The new ?map_location model does not have to affect numeric positions
for single Loci. The existing arrangement could continue. However, any
chromosome band information would have to conform to the above schema
and changes would be required to implement the new Interval or CytoInterval.
2. #Locus is preferable to ?Locus because in the majority of cases the Locus
model will only be used for the map location data. For anomaly breakpoints
which may have need of the rest of the Locus model it may still be
preferable. It is easier to see all the data displayed together under one
tree (model) rather than have to go to other trees to see different parts
of the data. Besides I object to having to define x number of loci for the
same anomaly (given that there could be many breakpoints involved) and
don't see why they should be lumped in with genes and given identities of
their own.
HOWEVER, being in sub objects makes it difficult to get at the location
data and there is an efficiency tradeoff. What will happen when the linkage
data is added to each breakpoint?
3. Treating a set of Chrom_Bands as an idiogram will be necessary in order
to find the end of the chromosome when the location is very uncertain.
4. For human chromosome anomalies which contain large repeats of data
(such as Simon Mercer's data) Intervals can be defined as belonging to
or containing other Intervals.
5. Chromosome Anomaly Definitions.
Any definition for chromosomal anomalies needs to be able to cope with
both the Long and Short nomenclature forms. The simplest model
would be:
?Chrom_Anomaly  Identifier UNIQUE Text
                Type ?Text // eg deletion, etc.
                Chromosome Text
                Location  Cytogenetic_location Text
                Contains  Fragment ?Interval XREF Chrom_Anomaly
                          Bkpt #Locus
                          Foreign_gene ?Locus XREF Chrom_Anomaly
                ... // rest of model definition
- With this model it is relatively easy for the computer to pick up the
Intervals and Loci and display them on the map but some clarity
regarding which breakpoints belong to which Type is lost.
- For complex anomalies involving several different types, how useful would
it be to explicitly identify these? Part of a definition that I have been
using does this:
?Chrom_Anomaly  Identifier UNIQUE Text
                Type  Deletion ?Interval
                      Duplication ?Interval
                      Insertion  Donated_segment ?Interval
                                 Recipient_bkpt #Locus
                      Inversion ?Interval
                      Translocation  Bkpt1 #Locus
                                     Bkpt2 #Locus
                      HSR #Locus // Homogeneously Staining Region
                      Robertsonian
                      Tertiary_nullisomic  Extra_seg1 ?Interval
                                           Extra_seg2 ?Interval
                      Tertiary_trisomic  Missing_seg1 ?Interval
                                         Missing_seg2 ?Interval
                      Transgene #Locus
                      Foreign_gene ?Locus
                      Transposition ?Interval
                      Monosomy ?Text
                      Trisomy ?Text
                ... // rest of model definition
- The disadvantage with this definition is that the computer has a lot
more work to do to find the map locations.
- Heterochromatin or Homogeneously Staining Regions (HSRs) are
unidentifiable regions of foreign DNA whose size is often specified
by + or ++ only. Would it be adequate to define their endpoints by
those of the regions into which they have been inserted on the host
chromosome? Please comment.
- Representing circular chromosomes may be more difficult.
6. The Location subtree is often used by several models, eg Locus,
Interval, Chrom_Anomaly and in the Homology model developed by
Jo Dicks and myself. Can we agree on a common definition for it?
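On the extra work mentioned under point 5 for the typed Chrom_Anomaly definition: in practice it amounts to a walk over the Type subtree, collecting every Interval or breakpoint together with the tags that lead to it. A minimal sketch in Python follows. The nested dictionary is only a stand-in for an acedb object, and the object names in it (Seg_1, Bkpt_A and so on) are made up.

```python
from typing import Any, Iterator, Tuple

Path = Tuple[str, ...]


def iter_locations(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (tag path, object name) for every leaf under a tag subtree."""
    if isinstance(node, dict):
        for tag, child in node.items():
            yield from iter_locations(child, path + (tag,))
    elif isinstance(node, list):
        for child in node:
            yield from iter_locations(child, path)
    elif node is not None:
        yield path, node


# Hypothetical anomaly with an insertion and a translocation component.
ANOMALY = {
    "Identifier": "Example_anomaly",
    "Type": {
        "Insertion": {
            "Donated_segment": "Seg_1",
            "Recipient_bkpt": "Bkpt_A",
        },
        "Translocation": {
            "Bkpt1": "Bkpt_B",
            "Bkpt2": "Bkpt_C",
        },
    },
}

LOCATIONS = list(iter_locations(ANOMALY["Type"]))
```

Each entry in LOCATIONS keeps its tag path, eg ("Insertion", "Recipient_bkpt"), so the link between a breakpoint and its anomaly type is preserved, which is exactly what the simpler model loses.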
I apologise for sending such a long letter to the bulletin board but it is
difficult to convey the complexities of dealing with chromosomal anomalies
in acedb. In defence against the charge that all of this should have been
discussed last November when Jean first put forward his proposals: well,
I did send a detailed reply to the bulletin board (Nov 24 1993) but no one
took any notice. If any of the above is useful to other people please say so.
I have code that implements a small part of the above proposal but it needs
a significant amount of work to extend it to do all of the above. A stable
model definition would help. I would be happy to work on this but suspect
that it will also involve substantial changes to parts of the acedb code.
Merging new code into the main acedb code each time there is a new version
released is a real pain and I don't want to waste any more of my time on it.
It would be better if the authors of acedb were involved.
Many thanks to Simon Mercer for the recent helpful discussion.
Michelle Kirby (email kirbym at har-rbu.mrc.ac.uk)